Monday, March 31, 2008

Understanding the Efficiency of Social Tagging Systems using Information Theory

Given the rise in popularity of social tagging systems, it seems only natural to ask how efficient is the organically evolved tagging vocabulary in describing any underlying objects?

The accumulation of human knowledge relies on innovations in novel methods of organizing information. Subject indexes, ontologies, library catalogs, Dewey decimal systems are just a few examples of how curators and users of information environments have attempted to organize knowledge. Recently, tagging has exploded as a fad in information systems to categorize and cluster information objects. Shirky argues that since tagging systems does not use a controlled vocabulary, it can easily respond to changes in the consensus of how things should be classified.

Social navigation as enabled by social tagging systems can be studied by how well the tags form a vocabulary to describe the contents being tagged. At ICWSM conference today as well as Hypertext 2008 conference coming up in June, we are reporting research on using information theory to understand social tagging datasets.

For most tagging systems the total number of tags in the collective vocabulary is much less than the total number of objects being tagged. We collected bookmarking data using a custom web crawler and screen scraper in late-summer 2006. We collected 9,853,345 distinct documents 140,182 users, and 118,456 users in our dataset for a total of roughly 35 milliion bookmarks. The ratio of unique documents to unique tags is almost 84. Given this multiplicity of tags to documents, a question remains: how effective are the tags at isolating any single document? Naively, if we specify a single tag in this system we would uniquely identify 84 documents--- thus the answer to our question is ``not very well!''. However this method carries a faulty assumption; not every document is equal. Some documents are more popular and important than others, and this importance is conveyed by the number bookmarks per document. Thus, we can reformulate the above question to be: how well does the mapping of tags to documents retain about the distribution of the documents?

This is where Information Theory comes in. Information theory provides a natural framework to understand the
amount of shared information between two random variables. The conditional entropy measures the amount of
entropy remaining in one random variable when we know the value of a second random variable.

The entropy of documents conditional on tags, H(D|T), is increasing rapidly. What this means is that, even after knowing completely the value of a tag, the entropy of the set of documents is increasing over time. Conditional Entropy asks the question: "Given that I know a set of tags, how much uncertainty regarding the document set that I was referencing with those tags remains?"

The fact that this curve is strictly increasing suggests that the speci city of any given tag is decreasing. That is to say, as a navigation aid, tags are becoming harder and harder to use. We are moving closer and closer to the proverbial "needle in a haystack" where any single tag references too many documents to be considered useful. "Aha!" you say, because users can respond to this by using more tags per bookmark. This way, they can specify several tags (instead of just a single one) to retrieve the exactly the document they want. If you thought that, you'd be right.

The plot here shows that the average number of tags per bookmark is around 2.8 as of late summer 2006. We have seen a similar trend in the number of query terms in search engine query logs increasing. As the size of the web increases, in order to find specific facts and items, users have to specify more keywords in order to find a specific content. The same evolutionary pressure appears to be at work here in the tagging behavior of users.

Another way to look at the data is to think about Mutual Information, which is a measure of independence between the two variables. Full independence is reached when I(D;T) = 0. As seen in here the trend is steep and quickly decreasing. As a measure of usefulness of the tags and their encoding, this suggests a worsening trend in the ability of users to specify and find tags and documents.

While our crawl at the time is probably incomplete, but this could be a reasonable method to look at the evolutionary trends of a social tagging system. More importantly, it suggests that we need to build search and recommendation systems that help users sift through resources in social tagging systems.

The references are:
Ed H. Chi, Todd Mytkowicz. Understanding the Efficiency of Social Tagging Systems using Information Theory. In Proc. of ACM Conference on Hypertext 2008. (to appear). ACM Press, 2008. Pittsburgh, PA.

(poster) Ed H. Chi, Todd Mytkowicz. Understanding the Efficiency of Social Tagging Systems using Information Theory. In Proc. of the Second International Conference on Weblogs and Social Media (ICWSM2008). Seattle, WA.

Wednesday, March 12, 2008

Charlene Li writes and talks about the groundswell and the power of social applications

Charlene Li of Forrester recently came to PARC and gave an excellent talk on the transforming power of social applications on businesses.

She defined the "groundswell" as the "social trend in which people use technologies to get the things they need from each other, rather than from traditional institutions like corporations". She illustrates this idea with examples. She writes in the MIT Sloan Management Review recently: Brian Finkelstein recently made an YouTube video of a Comcast repairman who was put on hold with a call back to the home office for so long that he fell asleep on Brian's couch. This video was an instant hit with over 1 million viewings and counting.

Take the opposite example that happened with the CBS tv series "Jericho". After CBS canceled the show, fans organized themselves and sent 20 tons of peanuts to the president of CBS entertainment, taking the cue from a character in the show who loved using the phrase "nuts!" CBS listened and engaged with the online fans and asked them to watch the re-launched show to help boost its ratings.

In the talk, she also mentioned examples of companies such as Dell using idea markets to engage customers directly. But of course, there are risks. She mentions the loss of control as the major risk. As an example, when Walmart created a social application on Facebook, it became a magnet for anti-Walmart comments and discussions. For ways to mitigate these risks, you will have to read the MIT Sloan article and watch the video:

Monday, March 10, 2008

How to reduce the cost of doing user studies with Crowdsourcing

One problem we have been facing as HCI researchers is how to get user data such as their opinions or relevance judgements quickly and cheaply. I think we may have a good way of doing this with Amazon Mechanical Turk with crowdsourcing that we're about to report in the CHI2008 conference.

User studies are important for many aspects of the design process and involve techniques ranging from informal surveys to rigorous laboratory studies. However, the costs involved in engaging users often requires practitioners to trade off between sample size, time requirements, and monetary costs. In particular, collecting input from only a small set of participants is problematic in many design situations. In usability testing, many issues and errors (even large ones) are not easily caught with a small number of participants, as we have learned from people like Jared Spool at UIE.

Recently, we investigate the utility of a micro-task market for collecting user measurements. Micro-task markets, such as Amazon’s Mechanical Turk, offer a potential paradigm for engaging a large number of users for low time and monetary costs. Although micro-task markets have great potential for rapidly collecting user measurements at low costs, we found that special care is needed in formulating tasks in order to harness the capabilities of the approach. The special care turns out to mirror the game theoretic issues somewhat reminiscent of Luis von Ahn's work on ESP Games.

We conducted two experiments to test the utility of Mechanical Turk as a user study platform. Here is a quick summary:

In both experiments, we used tasks that collected quantitative user ratings as well as qualitative feedback regarding the quality of Wikipedia articles. We had Mechanical Turk users rate a set of 14 Wikipedia articles, and then compared their ratings to an expert group of Wikipedia administrators from a previous experiment. We had users rate articles on a 7-point Likert-scale according to a set of factors including how well written, factually accurate, neutral, well structured, and overall high quality the article was.

In one experiment, users were required to fill out a free-form text box describing what improvements they thought the article needed. 58 users provided 210 ratings for 14 articles (i.e., 15 ratings per article). User response was extremely fast, with 93 of the ratings received in the first 24 hours after the task was posted, and the remaining 117 received in the next 24 hours. However, in this first experiment, only about 41.4% of the responses appeared to be honest effort in rating the Wikipedia articles. An examination of the time taken to complete each rating also suggested gaming, with 64 ratings completed in less than 1 minute (less time than likely needed for reading the article, let along rating it). 123 (58.6%) ratings were flagged as potentially invalid based either on their comments or duration. However, many of the invalid responses were due to a small minority of users. So this appears to demonstrate the susceptibility of Mechanical Turk to malicious user behavior.

So in a second experiment, we tried a different design. The new design was intended to make creating believable invalid responses as effortful as completing the task in good faith. The task was also designed such that completing the known and verifiable portions would likely give the user sufficient familiarity with the content to accurately complete the subjective portion (the quality rating). For example, these questions required users to input how many references, images, and sections the article had. In addition, users were required to provide 4-6 keywords that would give someone a good summary of the contents of the article, which we can verify quickly later.

Instead of about 60% bad responses, there were dramatically fewer responses that appeared invalid. Only 7 responses had meaningless, incorrect, or copy-and-paste summaries, versus 102 in Experiment 1. We also had a positive correlation with the quality rating given to us by Wikipedia administrators, and was statistically significant (r=0.66, p=0.01).

These results suggest that micro-task markets may be useful as a crowdsourcing tool for other types of user study tasks that combine objective and subjective information gathering, but there are design considerations. While hundreds of users can be recruited for highly interactive tasks for marginal costs within a timeframe of days or even minutes, however, special care must be taken in the design of the task, especially for user measurements that are subjective or qualitative.

Reference is:
Kittur, A., Chi, E., Suh, B. Crowdsourcing User Studies With Mechanical Turk. In Proceedings of the ACM Conference on Human-factors in Computing Systems (CHI2008). ACM Press, 2008. Florence, Italy.

Paper is here.