Monday, March 10, 2008

How to reduce the cost of doing user studies with crowdsourcing


One problem we have been facing as HCI researchers is how to get user data, such as opinions or relevance judgments, quickly and cheaply. I think we may have a good way of doing this through crowdsourcing with Amazon Mechanical Turk, which we're about to report at the CHI 2008 conference.


User studies are important for many aspects of the design process and involve techniques ranging from informal surveys to rigorous laboratory studies. However, the costs involved in engaging users often require practitioners to trade off between sample size, time requirements, and monetary costs. In particular, collecting input from only a small set of participants is problematic in many design situations. In usability testing, many issues and errors (even large ones) are not easily caught with a small number of participants, as we have learned from people like Jared Spool at UIE.


Recently, we investigated the utility of a micro-task market for collecting user measurements. Micro-task markets, such as Amazon's Mechanical Turk, offer a potential paradigm for engaging a large number of users at low time and monetary cost. Although micro-task markets have great potential for rapidly collecting user measurements cheaply, we found that special care is needed in formulating tasks in order to harness the capabilities of the approach. The special care turns out to involve game-theoretic issues somewhat reminiscent of Luis von Ahn's work on the ESP Game.

We conducted two experiments to test the utility of Mechanical Turk as a user study platform. Here is a quick summary:

In both experiments, we used tasks that collected quantitative user ratings as well as qualitative feedback regarding the quality of Wikipedia articles. We had Mechanical Turk users rate a set of 14 Wikipedia articles and then compared their ratings to those of an expert group of Wikipedia administrators from a previous experiment. Users rated each article on a 7-point Likert scale according to a set of factors, including how well written, factually accurate, neutral, well structured, and of overall high quality the article was.
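For concreteness, here is a minimal sketch (in Python) of how such a rating instrument could be represented and validated. The dimension labels and field names are paraphrases of the factors listed above, not the exact HIT template we used.

```python
# Sketch of the rating instrument: five quality dimensions, each rated on a
# 7-point Likert scale. Names are illustrative, not the actual HIT fields.
QUALITY_DIMENSIONS = [
    "well_written",
    "factually_accurate",
    "neutral",
    "well_structured",
    "overall_quality",
]

LIKERT_MIN, LIKERT_MAX = 1, 7  # 7-point Likert scale

def is_complete_rating(response: dict) -> bool:
    """Check that a worker's response contains a 1-7 rating for every dimension."""
    return all(
        isinstance(response.get(dim), int) and LIKERT_MIN <= response[dim] <= LIKERT_MAX
        for dim in QUALITY_DIMENSIONS
    )
```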

In one experiment, users were required to fill out a free-form text box describing what improvements they thought the article needed. 58 users provided 210 ratings for the 14 articles (i.e., 15 ratings per article). User response was extremely fast, with 93 of the ratings received in the first 24 hours after the task was posted and the remaining 117 received in the next 24 hours. However, in this first experiment, only about 41.4% of the responses appeared to reflect honest effort in rating the Wikipedia articles. An examination of the time taken to complete each rating also suggested gaming, with 64 ratings completed in less than 1 minute (less time than likely needed for reading the article, let alone rating it). In total, 123 (58.6%) of the ratings were flagged as potentially invalid based either on their comments or their duration, though many of the invalid responses came from a small minority of users. This appears to demonstrate the susceptibility of Mechanical Turk to malicious user behavior.
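The screening heuristic itself is simple. Below is a hedged sketch of the kind of check involved: flag a rating if it was completed implausibly fast or if its free-form comment shows little effort. The thresholds and field names here are illustrative assumptions, not our exact criteria; in practice the comments were also examined by hand.

```python
# Sketch of response screening: flag ratings that were too fast or whose
# free-form comment is too short to be a real improvement suggestion.
# Threshold values and field names are assumptions for illustration.
MIN_SECONDS = 60          # under 1 minute is less than plausible reading time
MIN_COMMENT_CHARS = 20    # very short comments suggest little real effort

def flag_suspicious(response: dict) -> bool:
    too_fast = response["duration_seconds"] < MIN_SECONDS
    low_effort_comment = len(response.get("comment", "").strip()) < MIN_COMMENT_CHARS
    return too_fast or low_effort_comment

def screen(responses: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split responses into (kept, flagged) lists for manual review."""
    kept = [r for r in responses if not flag_suspicious(r)]
    flagged = [r for r in responses if flag_suspicious(r)]
    return kept, flagged
```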

So in a second experiment, we tried a different design, intended to make creating believable invalid responses as effortful as completing the task in good faith. The task was also designed so that completing the known and verifiable portions would likely give the user sufficient familiarity with the content to accurately complete the subjective portion (the quality rating). For example, these verifiable questions required users to report how many references, images, and sections the article had. In addition, users were required to provide 4-6 keywords that would give someone a good summary of the contents of the article, which we could quickly verify afterwards.
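To illustrate the idea, here is a rough sketch of how the verifiable portion of such a task could be checked against counts extracted from the article itself. The ground-truth values, tolerance, and function names are assumptions for illustration only, not our actual verification code.

```python
# Sketch of verifying the objective questions in the second task design:
# compare worker-reported counts of references, images, and sections against
# counts taken from the article, and check that 4-6 keywords were supplied.
GROUND_TRUTH = {
    # article_id: (num_references, num_images, num_sections) -- placeholder values
    "example_article": (12, 3, 8),
}

def verifiable_answers_ok(article_id: str, answer: dict, tolerance: int = 1) -> bool:
    refs, imgs, secs = GROUND_TRUTH[article_id]
    return (
        abs(answer["num_references"] - refs) <= tolerance
        and abs(answer["num_images"] - imgs) <= tolerance
        and abs(answer["num_sections"] - secs) <= tolerance
    )

def keywords_ok(answer: dict) -> bool:
    """Workers were asked for 4-6 summary keywords; check the count here,
    with the keywords themselves spot-checked by hand afterwards."""
    keywords = [k for k in answer.get("keywords", []) if k.strip()]
    return 4 <= len(keywords) <= 6
```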

Instead of the roughly 60% bad responses we saw before, dramatically fewer responses appeared invalid: only 7 had meaningless, incorrect, or copy-and-paste summaries, versus 102 in Experiment 1. The Turker ratings also correlated positively with the quality ratings given to us by the Wikipedia administrators, and the correlation was statistically significant (r=0.66, p=0.01).
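For readers who want to run this kind of check themselves, the comparison is just a Pearson correlation between the mean Turker rating per article and the administrators' rating for the same article. A minimal sketch using scipy follows; the numbers are placeholders, not our actual data.

```python
# Sketch of the expert-agreement check: Pearson correlation between mean
# Turker ratings and Wikipedia administrator ratings, per article.
from scipy.stats import pearsonr

turker_mean_ratings = [5.2, 4.8, 6.1, 3.9, 5.5]   # one mean per article (placeholder data)
admin_ratings       = [5.0, 4.5, 6.3, 3.5, 5.8]   # expert ratings for the same articles

r, p = pearsonr(turker_mean_ratings, admin_ratings)
print(f"Pearson r = {r:.2f}, p = {p:.3f}")
```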

These results suggest that micro-task markets may be useful as a crowdsourcing tool for other types of user study tasks that combine objective and subjective information gathering, but there are design considerations. Hundreds of users can be recruited for highly interactive tasks at marginal cost within a timeframe of days or even minutes, but special care must be taken in the design of the task, especially for user measurements that are subjective or qualitative.

Reference:
Kittur, A., Chi, E., and Suh, B. Crowdsourcing User Studies with Mechanical Turk. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI 2008). ACM Press, 2008. Florence, Italy.

Paper is here.
