Monday, March 31, 2008

Understanding the Efficiency of Social Tagging Systems using Information Theory

Given the rise in popularity of social tagging systems, it seems only natural to ask: how efficient is the organically evolved tagging vocabulary at describing the underlying objects?

The accumulation of human knowledge relies on novel methods of organizing information. Subject indexes, ontologies, library catalogs, and the Dewey Decimal System are just a few examples of how curators and users of information environments have attempted to organize knowledge. Recently, tagging has exploded as a fad in information systems for categorizing and clustering information objects. Shirky argues that since tagging systems do not use a controlled vocabulary, they can easily respond to changes in the consensus of how things should be classified.

Social navigation, as enabled by social tagging systems, can be studied by examining how well the tags form a vocabulary for describing the contents being tagged. At the ICWSM conference today, as well as at the Hypertext 2008 conference coming up in June, we are reporting research that uses information theory to understand social tagging datasets.

For most tagging systems, the total number of tags in the collective vocabulary is much smaller than the total number of objects being tagged. We collected del.icio.us bookmarking data using a custom web crawler and screen scraper in late summer 2006. Our dataset contains 9,853,345 distinct documents, 140,182 users, and 118,456 unique tags, for a total of roughly 35 million bookmarks. The ratio of unique documents to unique tags is roughly 83. Given this multiplicity of documents to tags, a question remains: how effective are the tags at isolating any single document? Naively, if we specify a single tag in this system, we would identify about 83 documents on average, so the answer to our question is "not very well!". However, this reasoning carries a faulty assumption: not every document is equal. Some documents are more popular and important than others, and this importance is conveyed by the number of bookmarks per document. Thus, we can reformulate the question: how much information does the mapping of tags to documents retain about the distribution of the documents?
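As a back-of-the-envelope check, the naive "documents per tag" figure is just the ratio of the two counts quoted above (a sketch in Python; the variable names are ours):

    documents = 9853345      # distinct documents in our crawl
    tags = 118456            # distinct tags in our crawl
    print(documents / tags)  # ~83.2 documents per tag, on average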

This is where information theory comes in. Information theory provides a natural framework to understand the amount of shared information between two random variables. The conditional entropy measures the amount of entropy remaining in one random variable when we know the value of a second random variable.
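To make this concrete, here is a minimal sketch of how H(D|T) can be estimated from a list of (tag, document) bookmark pairs. The function and the toy data are illustrative, not from our actual analysis code:

    import math
    from collections import Counter

    def conditional_entropy(pairs):
        """Estimate H(D|T) in bits from (tag, document) bookmark pairs:
        H(D|T) = - sum over (t, d) of p(t, d) * log2 p(d | t)."""
        n = len(pairs)
        joint = Counter(pairs)                     # counts of (tag, document) pairs
        tag_counts = Counter(t for t, _ in pairs)  # counts of each tag
        h = 0.0
        for (t, d), count in joint.items():
            p_td = count / n                       # empirical joint probability p(t, d)
            p_d_given_t = count / tag_counts[t]    # empirical conditional p(d | t)
            h -= p_td * math.log2(p_d_given_t)
        return h

    # Toy data: the tag "python" is spread over two documents, "java" over one.
    bookmarks = [("python", "doc1"), ("python", "doc2"), ("java", "doc3")]
    print(conditional_entropy(bookmarks))  # ~0.67 bits of uncertainty remain

If every tag pointed at exactly one document, this estimator would return 0; as tags get spread over more and more documents, it grows.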



The entropy of documents conditional on tags, H(D|T), is increasing rapidly. Conditional entropy asks the question: "Given that I know a set of tags, how much uncertainty remains about the document set I was referencing with those tags?" What the rising curve means is that, even after we completely know the value of a tag, the entropy of the set of documents keeps increasing over time.

The fact that this curve is strictly increasing suggests that the specificity of any given tag is decreasing. That is to say, as a navigation aid, tags are becoming harder and harder to use. We are moving closer and closer to the proverbial "needle in a haystack," where any single tag references too many documents to be considered useful. "Aha!" you say, because users can respond to this by using more tags per bookmark. This way, they can specify several tags (instead of just a single one) to retrieve exactly the documents they want. If you thought that, you'd be right.
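One way to picture that response is to treat each tag as a posting list of documents and intersect the lists, so each extra tag narrows the candidate set. A hypothetical sketch (the index contents here are invented for illustration):

    # Hypothetical inverted index from tags to the documents that carry them.
    index = {
        "python":   {"doc1", "doc2", "doc5"},
        "tutorial": {"doc2", "doc3", "doc5"},
        "web":      {"doc2", "doc4"},
    }

    def lookup(index, *tags):
        """Return the documents tagged with all of the given tags (an AND query)."""
        result = None
        for tag in tags:
            docs = index.get(tag, set())
            result = docs if result is None else result & docs
        return result if result is not None else set()

    print(lookup(index, "python"))                     # 3 candidate documents
    print(lookup(index, "python", "tutorial"))         # narrowed to 2
    print(lookup(index, "python", "tutorial", "web"))  # pinned down to {'doc2'}

Each additional tag can only shrink the result set, which is exactly why users under this pressure add more tags per bookmark.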



The plot here shows that the average number of tags per bookmark was around 2.8 as of late summer 2006. We have seen a similar trend in search engine query logs, where the number of query terms has been increasing. As the size of the web grows, users have to specify more keywords in order to find specific facts and items. The same evolutionary pressure appears to be at work in the tagging behavior of users.

Another way to look at the data is through mutual information, which measures the dependence between the two variables; full independence is reached when I(D;T) = 0. As seen here, the trend is steep and quickly decreasing. As a measure of the usefulness of the tag encoding, this suggests a worsening trend in users' ability to specify and find tags and documents.
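Mutual information can be estimated from the same empirical counts as in the earlier sketch, using the standard identity I(D;T) = sum over (t, d) of p(t, d) * log2( p(t, d) / (p(t) * p(d)) ), which equals H(D) - H(D|T). Again, this is a toy illustration rather than our analysis code:

    import math
    from collections import Counter

    def mutual_information(pairs):
        """Estimate I(D;T) in bits from (tag, document) bookmark pairs:
        I(D;T) = sum over (t, d) of p(t, d) * log2( p(t, d) / (p(t) * p(d)) )."""
        n = len(pairs)
        joint = Counter(pairs)
        tag_counts = Counter(t for t, _ in pairs)
        doc_counts = Counter(d for _, d in pairs)
        mi = 0.0
        for (t, d), count in joint.items():
            p_td = count / n
            p_t = tag_counts[t] / n
            p_d = doc_counts[d] / n
            mi += p_td * math.log2(p_td / (p_t * p_d))
        return mi

    bookmarks = [("python", "doc1"), ("python", "doc2"), ("java", "doc3")]
    print(mutual_information(bookmarks))  # 0.0 would mean tags say nothing about documents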



While our crawl at the time was probably incomplete, this could be a reasonable method for examining the evolutionary trends of a social tagging system. More importantly, it suggests that we need to build search and recommendation systems that help users sift through resources in social tagging systems.

The references are:
Ed H. Chi, Todd Mytkowicz. Understanding the Efficiency of Social Tagging Systems using Information Theory. In Proc. of the ACM Conference on Hypertext 2008 (to appear). ACM Press, Pittsburgh, PA, 2008.

(Poster) Ed H. Chi, Todd Mytkowicz. Understanding the Efficiency of Social Tagging Systems using Information Theory. In Proc. of the Second International Conference on Weblogs and Social Media (ICWSM 2008). Seattle, WA.

2 comments:

Brendan O'Connor said...

I guess I'm getting to this late but --

Why is tag-document mutual information or conditional entropy a good measure of the usefulness of the tag system?

The best way to optimize the measure is to have a small number of tags that each apply to only a few documents and rarely overlap. Then when you search on a given tag you get a small set of documents, and when you search on a combination of tags the intersection is even smaller. (Low entropy.)

But then you would miss lots of documents that are potentially useful.

Here's a different interpretation: In reality, documents are diffuse and many different ones might be relevant to a given topic. With more tags from more users, the space of tags is converging on a truer description of the information -- which happens to be higher conditional entropy. The low mutual information of sparse, incomplete tags is an illusion of relevance.

I have no idea whether this interpretation is true. But it's just as consistent with you folks' MI results as the interpretation that high conditional entropy is less "useful" for users.