Augmented Social Cognition Research Blog from PARC: collaborative tagging

Tuesday, February 24, 2009

Announcing MrTaggy.com: a Tag-based Exploration and Search System

I'm pleased to announce MrTaggy.com, a tag-based exploration and search system for bookmarked content on the Web. The tagline for the project is "An interactive guide to what's useful on the Web", since all of the content has been socially vetted (i.e., someone found it useful enough to bookmark it).

MrTaggy is an experiment in web search and exploration built on top of a PARC algorithm called TagSearch. Think of MrTaggy as a cross between a search engine and a recommendation engine: it’s a web browsing guide constructed from social tagging data. We have collected about 150 million bookmarks from around the Web.

Unlike most search engines, MrTaggy doesn’t index the text on a web page. Instead, it leverages the knowledge contained in the tags that people add to web pages when using social bookmarking services. Tags describe both the content and context of a web page, and we use that information to deliver relevant content.

The problem with using social tags is that they contain a lot of noise, because people often use different words to mean the same thing or the same words to mean different things. The TagSearch algorithm is part of our ongoing research to reduce the noise while amplifying the information signal from social tags.

We also designed a novel search UI for exploring the tag space. The Related Tags sidebar outlines the content landscape to help you understand the space. The relevance feedback capabilities let you give the system both positive and negative cues about the direction you want to go. Try clicking on Thumbs Up and Thumbs Down to give MrTaggy feedback about the tags or results that you liked or disliked, and see how your ratings change the result set on the fly. At the top of the result set, we also provide top search results from Yahoo's search engine when we think the results there might help you.

Enterprise Use

In addition to exploring TagSearch in the consumer space, we have also explored its use in enterprise social tagging and intranet search systems. Surprisingly, the algorithm worked well even with a small amount of data (<50,000 bookmarks). For enterprise licensing of the underlying technology and API, contact Lawrence Lee, Director of Business Development, at lawrence.lee [at] parc [dot] com.

We would appreciate your feedback: comment here on the blog, send email to mrtaggy [at] parc [dot] com, or submit it at mrtaggy.uservoice.com.

Click here to try MrTaggy.com

Monday, March 31, 2008

Understanding the Efficiency of Social Tagging Systems using Information Theory

Given the rise in popularity of social tagging systems, it seems only natural to ask: how efficient is the organically evolved tagging vocabulary at describing the underlying objects?

The accumulation of human knowledge relies on innovations in methods of organizing information. Subject indexes, ontologies, library catalogs, and the Dewey Decimal system are just a few examples of how curators and users of information environments have attempted to organize knowledge. Recently, tagging has exploded as a fad in information systems for categorizing and clustering information objects. Shirky argues that since tagging systems do not use a controlled vocabulary, they can easily respond to changes in the consensus of how things should be classified.

Social navigation as enabled by social tagging systems can be studied by how well the tags form a vocabulary to describe the contents being tagged. At the ICWSM conference today, as well as at the Hypertext 2008 conference coming up in June, we are reporting research on using information theory to understand social tagging datasets.

For most tagging systems, the total number of tags in the collective vocabulary is much smaller than the total number of objects being tagged. We collected del.icio.us bookmarking data using a custom web crawler and screen scraper in late summer 2006. Our dataset contains 9,853,345 distinct documents, 140,182 users, and 118,456 unique tags, for a total of roughly 35 million bookmarks. The ratio of unique documents to unique tags is almost 84. Given this multiplicity of documents per tag, a question remains: how effective are the tags at isolating any single document? Naively, specifying a single tag in this system would narrow the search only to about 84 documents on average, so the answer to our question is "not very well!". However, this method carries a faulty assumption: that every document is equal. Some documents are more popular and important than others, and this importance is conveyed by the number of bookmarks per document. Thus, we can reformulate the above question as: how much information about the distribution of the documents does the mapping of tags to documents retain?
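To see why popularity matters, here is a minimal Python sketch comparing the uncertainty within a bucket of 84 documents when every document is equally likely versus when one document dominates. The bookmark counts are made up for illustration and do not come from our dataset; `entropy_bits` is a hypothetical helper name.

```python
import math

def entropy_bits(counts):
    """Shannon entropy (in bits) of the distribution implied by raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Hypothetical bucket of 84 documents sharing one tag.
uniform_counts = [1] * 84            # every document bookmarked once
skewed_counts = [1000] + [1] * 83    # one very popular document, 83 obscure ones

uniform_entropy = entropy_bits(uniform_counts)  # equals log2(84)
skewed_entropy = entropy_bits(skewed_counts)    # much smaller than log2(84)

print(f"uniform: {uniform_entropy:.2f} bits, skewed: {skewed_entropy:.2f} bits")
```

When bookmarks concentrate on a few popular documents, a tag leaves far less uncertainty than the naive count of 84 suggests, which is why the question needs to be asked in terms of distributions.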

This is where Information Theory comes in. Information theory provides a natural framework to understand the amount of shared information between two random variables. The conditional entropy measures the amount of entropy remaining in one random variable when we know the value of a second random variable.
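As a rough illustration of conditional entropy, the following Python sketch estimates H(D|T) = -Σ p(t,d) log2 p(d|t) from a list of (tag, document) pairs. The function name, the toy tags, and the toy documents are assumptions made for this example; it is not the code used in the study.

```python
import math
from collections import Counter

def conditional_entropy(pairs):
    """Estimate H(D|T) in bits from an iterable of (tag, document) pairs."""
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)                          # counts of each (tag, doc)
    tag_counts = Counter(tag for tag, _ in pairs)   # counts of each tag
    h = 0.0
    for (tag, _doc), count in joint.items():
        p_joint = count / n                         # p(t, d)
        p_doc_given_tag = count / tag_counts[tag]   # p(d | t)
        h -= p_joint * math.log2(p_doc_given_tag)
    return h

# Hypothetical toy data: tag "a" is ambiguous, tag "b" pins down one document.
toy_pairs = [("a", "doc1"), ("a", "doc2"), ("b", "doc3"), ("b", "doc3")]
print(conditional_entropy(toy_pairs))  # 0.5
```

Tag "a" leaves one bit of uncertainty and tag "b" leaves none; since each tag accounts for half of the pairs, the average remaining uncertainty is half a bit.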

Conditional entropy asks the question: "Given that I know a set of tags, how much uncertainty remains regarding the document set that I was referencing with those tags?" The entropy of documents conditional on tags, H(D|T), is increasing rapidly. What this means is that, even when the value of a tag is completely known, the remaining uncertainty about the set of documents keeps growing over time.

The fact that this curve is strictly increasing suggests that the specificity of any given tag is decreasing. That is to say, as a navigation aid, tags are becoming harder and harder to use. We are moving closer and closer to the proverbial "needle in a haystack", where any single tag references too many documents to be considered useful. "Aha!" you say, because users can respond to this by using more tags per bookmark. This way, they can specify several tags (instead of just a single one) to retrieve exactly the document they want. If you thought that, you'd be right.

The plot here shows that the average number of tags per bookmark is around 2.8 as of late summer 2006. We have seen a similar upward trend in the number of query terms in search engine query logs. As the size of the web increases, users have to specify more keywords in order to find specific facts, items, and content. The same evolutionary pressure appears to be at work here in the tagging behavior of users.

Another way to look at the data is to think about Mutual Information, which is a measure of the dependence between the two variables. Full independence is reached when I(D;T) = 0. As seen here, the trend is steep and quickly decreasing. As a measure of the usefulness of the tags and their encoding, this suggests a worsening trend in the ability of users to specify and find tags and documents.
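For readers who want to experiment, here is a small Python sketch that estimates I(D;T) = Σ p(t,d) log2( p(t,d) / (p(t) p(d)) ) from (tag, document) pairs. The function name and the toy data are assumptions for illustration only, not the analysis code behind the plot.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Estimate I(D;T) in bits from an iterable of (tag, document) pairs."""
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)                         # counts of each (tag, doc)
    tag_counts = Counter(tag for tag, _ in pairs)  # marginal counts of tags
    doc_counts = Counter(doc for _, doc in pairs)  # marginal counts of docs
    mi = 0.0
    for (tag, doc), count in joint.items():
        # p(t,d) / (p(t) * p(d)) simplifies to count * n / (tag_count * doc_count)
        ratio = count * n / (tag_counts[tag] * doc_counts[doc])
        mi += (count / n) * math.log2(ratio)
    return mi

# Hypothetical toy data.
# Tags that say nothing about the document: every tag is applied to every document.
independent_pairs = [("a", "doc1"), ("a", "doc2"), ("b", "doc1"), ("b", "doc2")]
# Tags that identify the document exactly.
dependent_pairs = [("a", "doc1"), ("b", "doc2")]

print(mutual_information(independent_pairs))  # 0.0
print(mutual_information(dependent_pairs))    # 1.0
```

The first toy set gives zero bits because knowing the tag does not narrow down the document at all; the second gives one full bit because each tag maps to exactly one of two documents.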

While our crawl at the time is probably incomplete, this could be a reasonable method for looking at the evolutionary trends of a social tagging system. More importantly, it suggests that we need to build search and recommendation systems that help users sift through resources in social tagging systems.

The references are:
Ed H. Chi, Todd Mytkowicz. Understanding the Efficiency of Social Tagging Systems using Information Theory. In Proc. of ACM Conference on Hypertext 2008. (to appear). ACM Press, 2008. Pittsburgh, PA.

(poster) Ed H. Chi, Todd Mytkowicz. Understanding the Efficiency of Social Tagging Systems using Information Theory. In Proc. of the Second International Conference on Weblogs and Social Media (ICWSM2008). Seattle, WA.

Monday, October 29, 2007

Differences between Social Tagging and Collaborative Tagging

I'm here at the InfoVis conference in Sacramento, and a conversation with Marti Hearst from UC Berkeley just reminded me why I have been bothered for quite some time by the 'confusion' between the phrases "social tagging" and "collaborative tagging". In fact, Wikipedia redirects "Social Tagging" to "Collaborative Tagging" (see http://en.wikipedia.org/w/index.php?title=Social_tagging&redirect=no). This, I would argue, is wrong. Why?

'Collaborate', according to the American Heritage Dictionary, is "to work together, especially in a joint intellectual effort." The problem is that the tagging features in many of the popular Web2.0 tools such as Flickr and YouTube are not really 'collaborative', since users aren't really working together per se. In YouTube, for example, only the uploader of the original video clip can specify and edit the tags for a video. Most of the time, in Flickr, people only tag their own photos. However, Flickr is somewhat more collaborative than YouTube because the default setting for any account is to allow contacts such as friends and family to also tag the photos.

Neither of these two systems seems that 'collaborative', because, to me, collaboration implies a shared artifact, a shared workspace, and shared work. On the other hand, 'social' is "living or disposed to live in companionship with others or in a community, rather than in isolation". In other words, simply existing and having some relation to others in a community. So, for example, I would argue that in YouTube we have social tagging but not collaborative tagging, because while users tag their uploaded videos in the context of an online social community, they do not collaborate to converge on a set of tags appropriate for that video.

The use of the term 'collaborative' in the Computer-Supported Cooperative Work (CSCW) field has especially come to imply a shared workspace. With shared workspaces, there are often elements of coordination and conflict involved (and, hopefully, conflict resolution as well). So in contrast to YouTube, the most 'collaborative' tagging system I know is the category tagging system in Wikipedia. Anyone can edit the category tags for an article. They can remove, add, discuss, and revert the use of any tag. In this case, the category tags are shared artifacts that anyone can edit inside a shared workspace. The work of tagging all 2 Million+ articles in Wikipedia is shared work among the community.

It's interesting to note that somewhere in between YouTube and Wikipedia tagging is the bookmarking system del.icio.us. In del.icio.us, there is a shared artifact (the tagged sites or URLs), and there is the shared work of tagging all of the websites and pages out there on the Web. However, there is less of a notion of a shared workspace. My tags for a URL could be, and probably are, different from someone else's tags for the same URL. I also have the capability of searching within just my own del.icio.us space. So from least collaborative to most collaborative, we have YouTube, then del.icio.us, and finally the category tagging system in Wikipedia.

A simple way to explain this is that one must be social in order to collaborate, but one need not be collaborative to be social. So in summary, I would argue that social tagging is a superset of collaborative tagging. But a social tagging system may not necessarily be a collaborative tagging system. We should change the definitions in Wikipedia to distinguish between these two types of systems.