Friday, July 17, 2009

Historical Roots behind TagSearch and MrTaggy

Boing Boing recently covered our work on the TagSearch algorithm and the MrTaggy prototype. We built the prototype to show how earlier ideas from search research are relevant in the new world of "social search".

Boing Boing published the following response after asking us about the historical roots of the TagSearch algorithm and the MrTaggy UI:

Several important pieces of earlier PARC work inspired the TagSearch algorithm and MrTaggy's user interface and experience.

First, one of the most efficient ways of browsing and navigating toward a desired information space was illustrated by the pioneering research on Scatter/Gather, a collaborative project on large-scale document space navigation among amazing researchers including Doug Cutting (of Lucene and Hadoop fame) and Jan Pedersen (chief scientist for search at AltaVista, Yahoo, and Microsoft). The research, done in the early to mid-90s, showed how a textual clustering algorithm can be used to quickly divide up an information space (the scatter step) and then ask the user to specify which subspaces they're interested in (the gather step). By iterating over this process, users can very quickly narrow down to just the subset of information items they're interested in. Think of it as playing 20 questions with the computer.
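To make the scatter and gather steps concrete, here is a minimal sketch in Python. It is not the original Scatter/Gather implementation: the greedy seed selection, the Jaccard word-overlap similarity, and the function names `scatter`, `summarize`, and `gather` are all assumptions chosen to keep the example short.

```python
from collections import Counter


def tokenize(text):
    """Lowercase a document and return its set of words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Word-overlap similarity between two token sets (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def scatter(docs, k):
    """Scatter step: split docs into at most k clusters.

    Assumption: a simple stand-in clustering. Seeds are chosen greedily
    (each new seed is the document least similar to the seeds so far),
    then every document joins its most similar seed.
    """
    tokens = [tokenize(d) for d in docs]
    seeds = [0]
    while len(seeds) < min(k, len(docs)):
        candidates = [i for i in range(len(docs)) if i not in seeds]
        farthest = min(
            candidates,
            key=lambda i: max(jaccard(tokens[i], tokens[s]) for s in seeds),
        )
        seeds.append(farthest)
    clusters = [[] for _ in seeds]
    for i, toks in enumerate(tokens):
        best = max(range(len(seeds)), key=lambda c: jaccard(toks, tokens[seeds[c]]))
        clusters[best].append(docs[i])
    return clusters


def summarize(cluster, n=3):
    """Show the user the n most frequent words as a cue for the cluster."""
    counts = Counter(word for doc in cluster for word in tokenize(doc))
    return [word for word, _ in counts.most_common(n)]


def gather(clusters, chosen):
    """Gather step: merge the clusters the user picked into a new, smaller collection."""
    return [doc for index in chosen for doc in clusters[index]]


docs = [
    "python snake reptile habitat",
    "python programming language code",
    "snake reptile venom habitat",
    "programming language compiler code",
]
clusters = scatter(docs, k=2)
for index, cluster in enumerate(clusters):
    print(index, summarize(cluster))
narrowed = gather(clusters, chosen=[1])  # the user picks the programming cluster
```

In a real session the gathered subset would be scattered again, and the loop would repeat until the collection is small enough to read directly.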

Second, also around the mid-90s, our research group at PARC was developing an important theory of information access called Information Foraging, which showed that you can mathematically model the way people seek information using the same ecological equations used to model how animals forage for food. We noticed that we could use information foraging ideas to model how people used Scatter/Gather to browse for information. It was possible to predict how people use the information cues in each cluster (which we called 'information scent') to determine whether they were interested in the contents of the cluster. Scatter/Gather can also be shown to be a very efficient way to communicate the topic structure of a very large document collection to the user. In other words, people learned the structure of the information space much more efficiently using Scatter/Gather interfaces.

I hope it is quite clear that MrTaggy's relevance feedback mechanisms are very much inspired by Scatter/Gather. The related tags communicate the topic structure of what's available in the collection. We designed MrTaggy around this process, hoping that it would be just as efficient as Scatter/Gather in communicating the topic structure of the space.

Third, our group had developed Information Scent algorithms and concepts to build real search and recommendation systems. These algorithms build upon earlier work on a human memory model called Spreading Activation. The TagSearch algorithm uses similar concepts. It constructs a kind of Bayesian model of the topic space using tag co-occurrence patterns. TagSearch owes its heart and soul to concepts from Spreading Activation, which help us find documents that are related to certain tags, and vice versa.
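The idea of finding documents from tags, and related tags from documents, can be illustrated with a small sketch. This is not the TagSearch algorithm itself and it leaves out the Bayesian modeling entirely. It only shows one round of activation spreading over a tag-document graph, and the names `build_graph` and `spread_from_tags`, the equal splitting of activation, and the sample bookmarks are all assumptions made for illustration.

```python
from collections import defaultdict


def build_graph(bookmarks):
    """Turn (document, tag) pairs into two lookup tables, one per direction."""
    doc_tags = defaultdict(set)
    tag_docs = defaultdict(set)
    for doc, tag in bookmarks:
        doc_tags[doc].add(tag)
        tag_docs[tag].add(doc)
    return doc_tags, tag_docs


def spread_from_tags(query_tags, doc_tags, tag_docs):
    """Spread activation from query tags to documents, then on to other tags.

    Assumption: each node splits its activation equally among its neighbors.
    """
    query = set(query_tags)

    # Step 1: each query tag starts with activation 1.0 and shares it among its documents.
    doc_score = defaultdict(float)
    for tag in query:
        docs = tag_docs.get(tag, ())
        for doc in docs:
            doc_score[doc] += 1.0 / len(docs)

    # Step 2: each activated document shares its activation among its tags,
    # which surfaces tags that co-occur with the query tags.
    tag_score = defaultdict(float)
    for doc, score in doc_score.items():
        for tag in doc_tags[doc]:
            if tag not in query:
                tag_score[tag] += score / len(doc_tags[doc])

    return dict(doc_score), dict(tag_score)


# Hypothetical bookmark data: (document, tag) pairs.
bookmarks = [
    ("d1", "python"), ("d1", "code"),
    ("d2", "python"), ("d2", "snake"),
    ("d3", "snake"),
]
doc_tags, tag_docs = build_graph(bookmarks)
docs, related_tags = spread_from_tags(["python"], doc_tags, tag_docs)
print(docs)          # documents reached from the tag "python"
print(related_tags)  # tags that co-occur with "python"
```

Repeating the two steps would push activation further out, so that documents such as `d3`, which never carry the query tag, could still be reached through a shared related tag.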

2 comments:

Jodi Schneider said...

Really interesting, Ed! This reminds me of Marcia Bates' berrypicking paper: http://www.gseis.ucla.edu/faculty/bates/berrypicking.html I'd love to see this history and TagSearch/MrTaggy written up & published for a library audience.

Ed H. Chi said...

Jodi:

I would love to do this, but it is hard since I don't know what the appropriate publication forum is for this kind of historical point-of-view article. Do you have any suggestions?

Bates' work is very influential, indeed. Our work somewhat connects with it for sure.

Ed