Tuesday, February 24, 2009

Announcing MrTaggy.com: a Tag-based Exploration and Search System


I'm pleased to announce MrTaggy.com, a tag-based exploration and search system for bookmarked content on the Web. The tagline for the project is "An interactive guide to what's useful on the Web", since all of the content has been socially vetted (i.e., someone found it useful enough to bookmark).

MrTaggy is an experiment in web search and exploration built on top of a PARC algorithm called TagSearch. Think of MrTaggy as a cross between a search engine and a recommendation engine: it’s a web browsing guide constructed from social tagging data. We have collected about 150 million bookmarks from around the Web.

Unlike most search engines, MrTaggy doesn’t index the text on a web page. Instead, it leverages the knowledge contained in the tags that people add to web pages when using social bookmarking services. Tags describe both the content and context of a web page, and we use that information to deliver relevant content.

The problem with using social tags is that they contain a lot of noise, because people often use different words to mean the same thing or the same words to mean different things. The TagSearch algorithm is part of our ongoing research to reduce the noise while amplifying the information signal from social tags.

We also designed a novel search UI for exploring the tag space. The Related Tags sidebar outlines the content landscape to help you understand the space. The relevance feedback capabilities let you give the system both positive and negative cues about the direction you want to go. Try clicking Thumbs Up and Thumbs Down to tell MrTaggy which tags or results you liked, and see how your ratings change the result set on the fly. At the top of the result set, we also provide top search results from Yahoo's search engine when we think they might help you.
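
To make the thumbs-up/thumbs-down idea concrete, here is a minimal sketch in Python of how tag-level feedback could re-rank results. This is not the TagSearch algorithm, which is not described in this post; the toy bookmark data, the `score` and `rank` functions, and the simple +1/-1 weighting are all assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical toy data: URL -> how many users applied each tag.
BOOKMARKS = {
    "example.com/python-tutorial": Counter({"python": 5, "tutorial": 3, "programming": 2}),
    "example.com/python-snakes": Counter({"python": 4, "snakes": 4, "animals": 2}),
    "example.com/java-guide": Counter({"java": 5, "programming": 3}),
}


def score(tag_counts, weights):
    """Weighted share of a page's tags that match the weighted tags."""
    total = sum(tag_counts.values())
    # Counter returns 0 for tags the page does not have.
    return sum(w * tag_counts[tag] / total for tag, w in weights.items())


def rank(query_tags, liked=(), disliked=()):
    """Rank URLs for the query, nudged by thumbs-up / thumbs-down tags."""
    weights = {tag: 1.0 for tag in query_tags}
    for tag in liked:        # thumbs up: assumed +1 weight
        weights[tag] = weights.get(tag, 0.0) + 1.0
    for tag in disliked:     # thumbs down: assumed -1 weight
        weights[tag] = weights.get(tag, 0.0) - 1.0
    scored = [(score(counts, weights), url) for url, counts in BOOKMARKS.items()]
    # Keep only pages with a positive score, best first.
    return [url for s, url in sorted(scored, reverse=True) if s > 0]


print(rank(["python"]))
print(rank(["python"], liked=["animals"], disliked=["tutorial"]))
```

In this toy example, a plain search for "python" puts the programming tutorial first; giving a thumbs up to "animals" and a thumbs down to "tutorial" moves the page about snakes to the top. A real system would need to handle the tag noise discussed above, which this sketch ignores.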

Enterprise Use

In addition to exploring TagSearch in the consumer space, we have also explored its use in enterprise social tagging and intranet search systems. Surprisingly, the algorithm worked well even with a small amount of data (<50,000 bookmarks). For enterprise licensing of the underlying technology and API, contact Lawrence Lee, Director of Business Development, at lawrence.lee [at] parc [dot] com.

We would appreciate your feedback: comment on the blog here, send it to mrtaggy [at] parc [dot] com, or submit it at mrtaggy.uservoice.com.

Click here to try MrTaggy.com

6 comments:

Brendan O'Connor said...

"Unlike most search engines, MrTaggy doesn’t index the text on a web page. Instead, it leverages the knowledge contained in the tags that people add to web pages when using social bookmarking services."

Search engines rely heavily on indexing anchor text, which is sort of social/tag-like. The interesting question is whether delicious-style tags improve search quality beyond anchor text matching.

"The problem with using social tags is that they contain a lot of noise, because people often use different words to mean the same thing or the same words to mean different things."

The same things are true of page text, title text, h1 tag text, etc. Is there evidence that social tags have less noise, or a different type of noise, compared to original content? Compared to anchor text?

Ed H. Chi said...

Anchor text is authored by the original page creator. This means she will only think of words or concepts that she believes are related. For example, maybe the page is about the "iraq war", and the anchor text for the link is the same, but others might think that it is in fact quite related to "troops allocation".

Social tagging relies on others to find all of the concepts that might be related.

Brendan O'Connor said...

No. By anchor text I mean, *inbound* anchor text from pages on the web that link to the page in question. So it's what *other* people on the web think about the page. This information is awfully similar to social tagging.

This is the really useful feature for commercial web search engines; supposedly, more useful than pagerank or other doc authority measures. The story is, this is what made Inktomi and Google so much better than AltaVista circa 2001 or whenever it was they achieved big relevance gains.

The apocryphal story I like is, early Google was going to all the trouble of finding all inbound links per page in order to run pagerank, then someone noticed they had the inbound anchor text data lying there and started using it, and it was more useful than that overhyped pagerank algorithm their company was supposedly based on.

Rowan Nairn said...

Ed, in this case anchor text is referring to text on *other* pages that is marked up as a link to the current page. Brendan is right that it is very analogous to bookmark tags, and it's not clear what the relative benefits are. IIRC Paul Heymann's work didn't do the comparison to anchor text, just page text. This was a question I had from the start of the Mr Taggy project but we never managed to address it.

Part of the reason Mr Taggy is based on tags is the ease of collecting the corpus. So far I haven't felt up to collecting a large-scale web corpus. But there are a few organizations now that may take care of a lot of that work for us. I've been thinking about this a bunch, even how we could go beyond anchor text.

Brendan O'Connor said...

Hey Rowan. Indeed, web data is a *huge* pain. If you're interested in it, make sure to check out Jamie Callan's new ClueWeb09 dataset (from here, I'm at CMU now) ... http://boston.lti.cs.cmu.edu/Data/clueweb09/

Ed H. Chi said...

Ah. Since you had talked about the title text and h1 tag text, I had misunderstood your original intended meaning about *incoming* anchor texts.

Absolutely, incoming anchor text is probably very comparable to social tags. That's a great research question, and definitely worth looking into.

I knew about CMU's data set, but haven't dug into it. Thanks for the pointer. I'll talk to Rowan to see if it's worth pursuing.