Friday, September 3, 2010

Revamping WikiDashboard

I released WikiDashboard almost three years ago. Believe it or not, the server for WikiDashboard has been running under my desk for those three full years (the photo shows the actual machine). It was launched in a rush to meet a deadline for an academic paper we published at ACM SIGCHI 2008, and it has received only limited maintenance since.

The old Power Mac (http://en.wikipedia.org/wiki/Power_Mac_G5) has been pretty reliable, but it has grown increasingly untrustworthy lately. Frustrated with frequent crashes, hangs, and sluggishness, I finally decided to do something about it. While migrating the tool off the old machine, I've added a few new features. I hope you find them useful.


Faster and scalable infrastructure
The server is now running on Google App Engine. WikiDashboard is hosted as a web app on the same systems that power Google's own applications, so it should provide faster, more reliable, and more scalable service. I plan to keep the old server running for a bit, but it will eventually forward traffic to the new server.

Support for ten more languages
Thanks to everyone who showed interest in having WikiDashboard in their own language!

Bongwon Suh
http://www.parc.com/suh
@billsuh http://twitter.com/billsuh

Thursday, September 2, 2010

Open data manipulation and visualization: Challenges

I typically blog about research results here, but this post is more conversational and discussion-oriented. My good friend m.c. schraefel asked me a question via email: "What are 1 or 2 key priorities you think must be addressed that will aid citizen focused manipulation of open data sources for personal/social knowledge building?"


Here is my answer to her:

The issues you raised were precisely the inspiration for my Ph.D. thesis work on creating a visualization spreadsheet.  The idea, from over 10 years ago, was that if people can easily use spreadsheets, they ought to be able to take that model further and start creating visualizations with them; the thesis explored how to design such systems.  I think of ManyEyes and Jeff Heer's later work as moving in the same direction.

We have since learned a lot about user-contributed content on systems like Wikipedia, Delicious, and Twitter, and they show a very interesting participation architecture consisting of readers, contributors, and leaders.  Not all users want to be leaders, and not all users want to contribute.  We have sometimes used the derogatory term "lurkers" to describe "readers," which I think is a bit unfair.  Ronald Burt's work has shown that many of us would like to be brokers of information among social groups, but there is also a need for an audience, or followers, who might become brokers later, just not everyone all at once.

I believe that data manipulation of open data sources will follow the same curve.  Yes, some cancer patients will want to read all they can about their condition and do the analytical work themselves, while others (not necessarily because of tool limitations) would prefer to take a backseat and let others curate the information for them.  What's interesting is that the latter might want very simple interactions that enable basic sorting of the data, or maybe even services that interpret the data for them (e.g., doctors), but they would prefer that someone else do the bulk of the work (even if it becomes very easy, thanks to tool research and development).

Given that, what can we do?

First, it's quite clear that much of the hard work remains in data import and cleaning.  To democratize data analytics and manipulation, the bulk of the difficulty is in data acquisition.  Unfortunately, most of this is engineering rather than sexy research, so there isn't much innovative work in this area, though information extraction (AI-style algorithms and some machine learning techniques) is making some inroads.  I also believe that mixed-initiative research for data import is sorely needed.  We're doing a bit of this work in my lab at the moment.

Second, there is the issue of data literacy. What kind of visualization works with what kind of data? What analytic technique is appropriate?  Early work by Jock Mackinlay (from our old UIR research group) pointed to the possibility of automating some of these design choices in his Ph.D. research, and we haven't made a huge amount of progress in this area since then.  He is now at Tableau Software trying to solve some of these issues.  Wizards and try-visualize-refine loops have all been tried in research.  We need to stop inventing new visualizations and instead build actually usable tools for people here.  By going into vertical domains, we will learn how to solve this problem.


These two are the biggest problems, IMHO. Of course, there are other technical challenges, such as data scale and compute power, security, privacy, and social sharing, which are all fascinating, and research such as ManyEyes has done a lot to teach us about these issues.

Monday, August 16, 2010

Want to be Retweeted? Add URLs to Your Tweets!

In my previous post, I described a recent study [1] in which we found that including hashtags in a tweet may enhance the retweetability of the tweet. In this post, I will focus on another factor that might affect the retweetability: URL.

As reported in my previous post, we collected a random sample of public tweets from Twitter's Spritzer feed over a 7-week period, yielding about 74 million tweets. From these, we identified 8.24 million as retweets. That is, 11.1% of the 74 million tweets are retweets.

Next, we searched for those tweets and retweets that contain at least one URL. We found that 21.1% of tweets and 28.4% of retweets include URLs, suggesting that a tweet with URLs is more likely to get retweeted.

We further investigated whether the retweetability of a tweet has anything to do with the type of website it refers to. Since most of the URLs included in tweets are shortened URLs, we first expanded the abbreviated URLs into their original URLs, and then extracted the domain names from the original URLs. For example, given an abbreviated URL http://bit.ly/c1htE cited by a tweet, we first unshortened it to http://en.wikipedia.org/wiki/URL_shortening, and then extracted the domain name of en.wikipedia.org. The URL domains are indicative of the type of content sources visited and shared by Twitter users.
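As a rough sketch of that two-step process (the function names here are my own; the post does not describe the actual implementation), one can follow a short URL's HTTP redirects and then parse out the domain:

```python
import urllib.request
from urllib.parse import urlparse

def unshorten(url, timeout=10):
    """Follow HTTP redirects to recover the original URL.
    (A real pipeline would also cache results and rate-limit requests.)"""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.geturl()

def extract_domain(url):
    """Pull the domain name out of an (already expanded) URL."""
    return urlparse(url).netloc.lower()

# unshorten("http://bit.ly/c1htE") would resolve to the Wikipedia page; then:
print(extract_domain("http://en.wikipedia.org/wiki/URL_shortening"))  # → en.wikipedia.org
```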

Analyzing the 74 million tweets, we identified the 20 most popular URL domains referred to in our tweets and the number of tweets containing each URL domain:

Rank  URL Domain              Number of Tweets
1     twitpic.com             793,680
2     myloc.me                533,082
3     www.facebook.com        481,349
4     www.youtube.com         475,509
5     formspring.me           455,377
6     www.twitlonger.com      349,760
7     tweetphoto.com          258,049
8     youtu.be                196,557
9     twitcam.com             159,684
10    url4.eu                 145,656
11    twitter.com             144,002
12    www.plurk.com           127,037
13    fun140.com              113,153
14    www.formspring.me       100,111
15    bit.ly                  94,505
16    foursquare.com          90,328
17    www.ustream.tv          83,486
18    tinychat.com            80,406
19    blip.fm                 74,647
20    www.funwebsites.org     52,148

On the other hand, the following table shows the 20 most popular URL domains cited in our 8.24 million retweets and the number of retweets containing each URL domain:
Rank  URL Domain              Number of Retweets
1     www.twitlonger.com      236,435
2     twitpic.com             129,692
3     myloc.me                121,950
4     www.youtube.com         79,404
5     www.facebook.com        55,186
6     tweetphoto.com          49,676
7     twitter.com             39,127
8     mashable.com            17,778
9     bit.ly                  16,406
10    www.ustream.tv          9,638
11    www.nytimes.com         9,035
12    shar.es                 8,636
13    url4.eu                 8,213
14    dealspl.us              8,186
15    www.flickr.com          7,599
16    www.cnn.com             7,537
17    youtu.be                7,508
18    www.etsy.com            6,828
19    ax.itunes.apple.com     6,346
20    www.huffingtonpost.com  6,332

As can be seen, these two lists of URL domains do not match exactly. For example, formspring.me appears only in the first list, while mashable.com appears only in the second. That is, the fact that a website is frequently cited in tweets does not guarantee that it is also frequently referred to in retweets, and vice versa.

For each URL domain, we computed a retweet rate by dividing the number of retweets containing the domain by the number of tweets containing the domain. We then normalized the rate so that a value of 1.0 represents the average retweet rate of 11.1%. For example, for twitpic.com, the retweet rate of 1.47 was calculated as (129,692/793,680)*(74/8.24). A URL domain with a retweet rate higher than 1.0 indicates that, compared to the average case, the tweets containing this domain have a higher chance of getting retweeted. The following table shows the retweet rates for the 10 most popular URL domains cited in our tweets:
Rank  URL Domain          Retweet Rate
1     twitpic.com         1.47
2     myloc.me            2.05
3     www.facebook.com    1.03
4     www.youtube.com     1.50
5     formspring.me       0.05
6     www.twitlonger.com  6.07
7     tweetphoto.com      1.73
8     youtu.be            0.34
9     twitcam.com         0.12
10    url4.eu             0.51
As can be seen from the above table, the retweet rates vary greatly depending on the URL domains. For example, formspring.me, which is the 5th most popular domain, has a retweet rate of 0.05, suggesting that tweets containing that domain are very unlikely to be retweeted. On the other hand, the retweet rate of twitlonger.com is 6.07, suggesting that tweets containing that domain have high retweetability.
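For concreteness, the normalization described above can be sketched in a few lines (the numbers are taken from the tables in this post):

```python
TOTAL_TWEETS = 74_000_000     # tweets in the 7-week Spritzer sample
TOTAL_RETWEETS = 8_240_000    # retweets identified among them

def retweet_rate(retweets_with_domain, tweets_with_domain):
    """Normalized retweet rate: 1.0 equals the average retweet rate of 11.1%."""
    raw = retweets_with_domain / tweets_with_domain
    average = TOTAL_RETWEETS / TOTAL_TWEETS  # about 0.111
    return raw / average

# twitpic.com: 129,692 retweets vs. 793,680 tweets containing the domain
print(round(retweet_rate(129_692, 793_680), 2))  # → 1.47
```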

In the following plot, we show the retweet rates of the 50 most popular URL domains. The X-axis is the popularity rank of URL domains based on how many tweets contain each domain. The Y-axis represents the retweet rates of domains as computed above.


Overall, we see that not all popular URL domains in tweets are popular in retweets. The domain of URLs also matters.

References
[1] Suh, B., Hong, L., Pirolli, P., and Chi, E. H. Want to be Retweeted? Large Scale Analytics on Factors Impacting Retweet in Twitter Network. To appear in SocialCom'10.

Monday, August 9, 2010

Want to be Retweeted? Add Hashtags to Your Tweets!

In a recent study, Bongwon Suh, Peter Pirolli, Ed H. Chi, and I examined what factors might affect retweetability of a tweet. We will report our findings in an upcoming paper this August in the Second IEEE International Conference on Social Computing [1]. In this post, I focus on a factor that we found to correlate with retweetability: hashtag.

Before I dive into the details, here is some information about the dataset that we used in our study: From Twitter's Spritzer feed, we collected a random sample of public tweets from January 18, 2010 to March 8, 2010, yielding about 74 million tweets. That is, we collected about 1.5 million tweets per day, representing approximately 2-3% of 50 million tweets appearing on Twitter daily.

For each of these 74 million tweets, we scanned for a variety of retweet markers such as "RT @", "RT:@", "retweeting @", "retweet @", "via @", "thx @", "HT @", and "r @" [2]. We found that there are about 8.24 million retweets, accounting for 11.1% of all the tweets. Next, we searched for those tweets and retweets that contain at least one hashtag. We found that 10.1% of tweets and 20.8% of retweets include hashtags, suggesting that a tweet with hashtags is more likely to get retweeted.
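A simple way to scan for these markers is a case-insensitive regular expression. This is only a sketch; the actual matching rules used in the study may differ:

```python
import re

# Retweet markers from boyd et al. [2]; each is followed by a username,
# with an optional colon (as in "RT:@user").
RETWEET_RE = re.compile(
    r'\b(RT|retweeting|retweet|via|thx|HT|r)\s*:?\s*@\w+',
    re.IGNORECASE,
)

def is_retweet(tweet_text):
    """True if the tweet text contains any known retweet marker."""
    return RETWEET_RE.search(tweet_text) is not None

print(is_retweet("RT @billsuh WikiDashboard is moving to App Engine"))  # → True
print(is_retweet("Just had lunch"))                                     # → False
```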

We further investigated whether the retweetability of a tweet has anything to do with the type of hashtag it contains. Analyzing the 74 million tweets, we identified the 20 most popular hashtags used in our tweets and the number of tweets containing each hashtag:

Rank  Hashtag           Number of Tweets
1     #nowplaying       355,147
2     #ff               224,760
3     #jobs             124,728
4     #fb               87,959
5     #tinychat         67,225
6     #vouconfessarque  51,578
7     #fail             49,248
8     #tcot             47,394
9     #1                47,373
10    #followfriday     39,986
11    #news             38,573
12    #shoutout         30,633
13    #tweetmyjobs      30,594
14    #bbb              28,590
15    #haiti            28,563
16    #letsbehonest     27,926
17    #iranelection     27,611
18    #quote            27,541
19    #followmejp       25,940
20    #follow           24,166

On the other hand, the following table shows the 20 most popular hashtags used in our 8.24 million retweets and the number of retweets containing each hashtag:
Rank  Hashtag           Number of Retweets
1     #ff               62,331
2     #vouconfessarque  43,628
3     #nowplaying       29,846
4     #tcot             18,527
5     #idothat2         16,583
6     #ohjustlikeme     16,531
7     #jafizisso        15,564
8     #haiti            13,829
9     #retweetthisif    12,602
10    #iranelection     12,334
11    #quote            11,475
12    #followfriday     11,170
13    #fb               10,994
14    #ihatequotes      9,982
15    #fail             9,759
16    #omgthatssotrue   9,286
17    #1                9,124
18    #terremotochile   8,892
19    #p2               8,719
20    #follow           8,084

As can be seen, these two lists of hashtags do not match exactly. For example, #jobs appears only in the first list, while #idothat2 appears only in the second. That is, the fact that a hashtag is frequently used in tweets does not guarantee that it is also frequently used in retweets, and vice versa.

For each hashtag, we computed a retweet rate by dividing the number of retweets containing the hashtag by the number of tweets containing the hashtag. We then normalized the rate so that a value of 1.0 represents the average retweet rate of 11.1%. For example, for #nowplaying, the retweet rate of 0.75 was calculated as (29,846/355,147)*(74/8.24). A hashtag with a retweet rate higher than 1.0 indicates that, compared to the average case, the tweets containing this hashtag have a higher chance of getting retweeted. The following table shows the retweet rates for the 10 most popular hashtags used in our tweets:
Rank  Hashtag           Retweet Rate
1     #nowplaying       0.75
2     #ff               2.49
3     #jobs             0.16
4     #fb               1.12
5     #tinychat         0.04
6     #vouconfessarque  7.59
7     #fail             1.78
8     #tcot             3.51
9     #1                1.73
10    #followfriday     2.51

In the following plot, each point represents an individual hashtag. The X-axis is the popularity rank of hashtags based on how many tweets contain each hashtag. The Y-axis represents the retweet rates of hashtags as computed above.


From the figure, we see that the retweet rates vary greatly. Not all popular hashtags in tweets are popular in retweets. The type of hashtag does matter.

References
[1] Suh, B., Hong, L., Pirolli, P., and Chi, E. H. Want to be Retweeted? Large Scale Analytics on Factors Impacting Retweet in Twitter Network. To appear in SocialCom'10.
[2] boyd, d., Golder, S., and Lotan, G. Tweet, Tweet, Retweet: Conversational Aspects of Retweeting on Twitter. Proc. HICSS'10, 1-10.

Wednesday, August 4, 2010

Grounded research on Enterprise2.0: the creation of a social information stream browser


For a number of years, because of our research group's context within a large corporation like Xerox, we have been studying the effectiveness of Enterprise2.0 tools. As Web2.0 consumer tools have changed over time, so have Enterprise2.0 tools. We recently described one such tool, applicable to both consumer and enterprise users, at the AVI conference held in Italy in May.

The primary challenge in doing Enterprise2.0 research is the need to ground the research in real data, real user behaviors, and real practices. The class of knowledge workers that has emerged after the proliferation of Web2.0 and Enterprise2.0 tools is distinctly different from past knowledge workers.

We conducted field studies of two groups of senior professionals and found that their primary challenge goes far beyond information overload. Knowledge workers now face not just information overload but also channel overload. That is, they must understand the intricacies of different channels and how often their co-workers pay attention to each, and adjust their strategy accordingly so as to contribute the right content at the right time in the right places. They use these different channels to monitor status and progress updates of both individual and group activities, and to forage for and organize new information.

There is much detail in the research: we first characterized the user behaviors and challenges, then iterated on the design with paper prototypes, and finally built and evaluated a software prototype. To make a long story short, we found a number of important requirements for new Enterprise2.0 tools, summarized in the table below:


From this set of requirements, we decided to tackle the issue around channel overload via a faceted-search browser for social information streams. The figure below shows an example screenshot of our FeedWinnower system:


We previously blogged about FeedWinnower in April. As shown in the figure above, we extract several kinds of metadata from the social information stream, such as the author of a posting, its source and media type, its topic, and the time when it was made.




Of course, we are not the only ones to have realized these needs. Email is one of the oldest social information streams. Neustaedter et al. [1] found that sender, receiver, and time were the main attributes that people used to judge the importance of email. Whittaker et al. [2] noted that filing messages in folders is time-consuming and can be problematic if users’ focus changes frequently, suggesting the need for flexible interfaces that allow on-the-fly browsing of content. Hearst suggested that social tags provide an excellent basis for the formation of topic structures for faceted browsing [3], but stressed that acquisition of facet metadata is a problem remaining to be addressed. Related research also includes the design of blog search and browsing interfaces [4]. Hearst et al. [4] suggested design choices such as “the temporal/timelines aspect of blogging” and “automatic creation of a feed reader on the subtopics of interest”. Baumer and Fisher [5] proposed an interface for organizing blogs around a list of extracted topics. Probably the most closely related work is the tool by Dork et al. [6], which organizes RSS feeds along three dimensions: time, location, and tags. It also supports a faceted browsing interface. They assumed that feed items have titles and descriptions, times of creation, locations, and tags. A key difference is that we make no assumptions about the presence of tags or manually added metadata. Instead, we construct the topic facet from the content of the items.

In summary, studying two communities from a large IT enterprise, we characterized the work practices and information-management needs of a growing class of busy knowledge workers. We found that they need:
  1. information aggregated across multiple channels, including the combination of content and status updates,
  2. filters that help to easily find important content, and
  3. organization and sharing functions for individual and collaborative sensemaking.
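The second requirement, filtering, amounts to faceted filtering over stream items. A minimal sketch of the idea (all field names and data here are made up for illustration):

```python
# Toy feed items; in a real aggregator the topic facet would be
# derived automatically from each item's content.
feed = [
    {"author": "alice", "source": "twitter", "topic": "visualization", "time": "2010-08-04"},
    {"author": "bob",   "source": "rss",     "topic": "enterprise2.0", "time": "2010-08-03"},
    {"author": "alice", "source": "rss",     "topic": "enterprise2.0", "time": "2010-08-04"},
]

def filter_feed(items, **facets):
    """Keep items that match every selected facet value (AND across facets)."""
    return [it for it in items
            if all(it.get(k) == v for k, v in facets.items())]

for item in filter_feed(feed, author="alice", source="rss"):
    print(item["topic"])  # → enterprise2.0
```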
I have a feeling that building these types of information stream interfaces will be the subject of our tool research in the group for some time.


References
[1] Neustaedter, C., Brush, A., and Smith, M. Beyond “From” and “Received”: Exploring the Dynamic of Email Triage. Proc. CHI’05, 1977-1980.

[2] Whittaker, S. and Sidner, C. Email Overload: Exploring Personal Information Management of Email. Proc. CHI’96, 276-283.

[3] Hearst, M. UIs for Faceted Navigation: Recent Advances and Remaining Open Problems. Proc. 2008 Workshop on Human-Computer Interaction and Information Retrieval.

[4] Hearst, M, Hurst, M., and Dumais, S. What Should Blog Search Look Like? Proc. 2008 ACM Workshop on Search in Social Media, 95-98.

[5] Baumer, E. and Fisher, D. Smarter Blogroll: An Exploration of Social Topic Extraction for Manageable Blogrolls. Proc. HICSS’08.

[6] Dork, M., Carpendale, S., Collins, C., and Williamson, C. VisGets: Coordinated Visualizations for Web-based Information Exploration and Discovery. IEEE Trans on Vis. and Computer Graphics, 14(6), 1205-1212.

Monday, June 14, 2010

Model-Driven Research in Social Computing

I'm in Toronto attending the Hypertext 2010 conference, where I gave the keynote talk at the First Workshop on Modeling Social Media yesterday. I want to document a little bit of the points I made in the talk here.

The reason we seek to construct and derive models is to predict and explain what might be happening in social computing systems. For social media, we seek to understand how these systems evolve over time. Constructing these models should also enable us to generate new ideas and systems.

As an example, many have proposed a theory of influentials: by identifying a small group of individuals who are connected to the larger social network in just the right way, we can infect or reach the rest of the people in the network. This idea is probably best known in the press through Gladwell's popular book The Tipping Point. This model of how information diffuses in social networks is very attractive, not just for its simplicity, but also for the potential of applying the idea in areas such as marketing.

Models such as this are meant to be challenged and debated. They are always strawman proposals. Duncan Watts's network simulations have shown that the validity of this theory is somewhat suspect. Indeed, Eric Sun and Cameron Marlow's recent work, published at ICWSM 2009, showed that this theory of influentials might be wrong. They suggest that "diffusion chains are typically started by a substantial number of users. Large clusters emerge when hundreds or even thousands of short diffusion chains merge together."

Most, if not all, models are wrong. Some models are just more wrong than others. But models still serve important roles. They might be divided into several categories:

  1. Descriptive Models describe what is going on within the data. This might help us spot trends, such as the growth of number of contributors, or trending topics in a community.
  2. Explanatory Models help us explain what might be the mechanisms underlying processes in the system. For example, we might be able to explain why certain groups of people contribute more content than another group.
  3. Predictive Models help us engineer systems by predicting what users and groups might want, or how they might act in systems. Here we might build probabilistic models of whether a user will use a particular tag on a particular item in a social tagging system.
  4. Prescriptive Models are sets of design rules or processes that help practitioners generate useful or practical systems. Yahoo's Social Design Patterns Library on Reputation is a very good example of a prescriptive model.
  5. "Generative Models" actually have two meanings depending on who you're talking to. In statistical circles, "generative models" are models that help generate data that look like real user data and are often probabilistic models. Information Theory is a good example of this approach, in fact. Generative Models could also mean that they are models that help us generate ideas, novel techniques and systems. My work with Brynn Evans on building a social search model is an example of this approach.
In the talk, I illustrated how we have modeled the dynamics of the popular social bookmarking system Delicious using Information Theory. I also showed how, using equations from Evolutionary Dynamics, we were better able to explain what might be happening to Wikipedia’s contribution patterns.

Talk title: Model-driven Research for Augmenting Social Cognition

Friday, May 14, 2010

Ushahidi: A crowdsourcing site you probably have not heard of

We research scientists always go after the latest and greatest shiny cool thing to study on the Web (like us with Wikipedia), but of course, the real world is full of chaos, anger, fear, and all the unpleasant things we prefer to forget about. What could the Social Web, Collective Intelligence, and Utopia possibly have to do with all that?



Ushahidi is a "platform that allows anyone to gather distributed data via SMS, email, or the web, and visualize the data on a map or timeline." The goal is to "create the simplest way of aggregating information from the public for use in crisis response." In March, while I was on an around-the-world trip to Beijing and Amsterdam, I read an article in the NYTimes about Ushahidi, and thought about how work like this reaffirms my belief that the Social Web is changing how information is distributed and used in the world, and that it is revolutionary. Ushahidi (which means "testimony" in Swahili) has now been used in Kenya's disputed 2007 election (documenting the violence) as well as after the Haitian and Chilean earthquakes. "It collected more testimony with greater rapidity than any reporter or election monitor." "The site collected user-generated cellphone reports of riots, stranded refugees, rapes and deaths and plotted them on a map, using the locations given by informants."

Wow!

Let's think for a second about what happened. Someone who cared about what was happening in Kenya (Ory Okolloh) blogged about it, and thought about how a web application could make the events transparent and visible to the world; then tech geeks read her post and built the web site over a long weekend. Then boom! The world changes.

Shocking.

Why did it work? What's the participation architecture? And what role did technology play? Clearly, attention around an event was aggregated, and this came as a result of a confluence of factors. The participation architecture relied on people caring enough about what was happening to build the system, on people on the ground having the right technology to report events to the website, and on technology enabling the plotting of those events on a map. Voilà! Mass data visualization results.

Amazing.

Amazing because this happened in such a distributed fashion. No government agency got involved, and no centralized authority coordinated the work over a multi-year government grant. The idea has since been exported back to the USA: in Washington, D.C., the system was used to warn about dangerous roads during the big snowstorm.


The same snowstorm, incidentally, forced the Technology-Mediated Social Participation workshop to move the date of its 2nd East Coast Workshop. Ironic, isn't it?

Ironic also because this was almost precisely what I had proposed to a government funding agency program manager interested in disaster response who visited PARC about three years ago (Aug 2007). She never followed up, and we weren't funded on the idea. But here are a few slides from that presentation:

I think Ushahidi is awesome. Ushahidi happened because people believed in and cared about what is happening in the world. That, to me, is the power of the social web.

Wednesday, April 21, 2010

Short and Tweet: Experiments on Recommending Content from Information Streams (Specifically, Twitter)

Information streams have recently emerged as a popular means of information awareness. By information streams we mean the general set of Web 2.0 feeds, such as status updates on Twitter and Facebook, and news and entertainment in Google Reader or other RSS readers. More and more web users keep up with the newest information through information streams. At the CHI2010 conference, we presented a new system called Zerozero88.com that recommends content (specifically, URLs that have been posted on Twitter) to users based on their Twitter profile. Through recommender systems, we hope to better direct user attention to the most interesting URLs posted on Twitter.

As a domain for recommendation, information streams have three interesting properties that distinguish them from other well-studied domains:
  1. Recency of content: Content in the stream is often considered interesting only within a short time of first being published. As a result, the recommender may always be in a “cold start” situation, i.e. there is not enough data to generate a good recommendation.
  2. Explicit interaction among users: Unlike other domains, where users interact with the system as isolated individuals, in information streams users interact explicitly by subscribing to others’ streams or by sharing items.
  3. User-generated content: Users are not passive consumers of content in information streams. People are often content producers as well as consumers.

In a modular approach, we explored three separate dimensions in designing such a recommender: content sources, topic interest models for users, and social voting:
  1. Content Sources: Given limited access to tweets and processing capabilities, our first design question is how to select the most promising candidate set of URLs to consider for recommendations. We chose two strategies: First, Sarwar et al. [1] have shown that by considering only a small neighborhood of people around the end user, we can reduce the set of items to consider, and at the same time expect recommendations of similar or higher quality.

    Second, we also considered a popularity-based URL selection scheme. URLs that are posted all over Twitter are probably more interesting than those rarely mentioned by anyone.

  2. Topic Modeling: Using topic relevance is an established approach to compute recommendations. The topic interest of a user is modeled from text content the user has interacted with before, and candidate items are ranked by how well they match the topic interest profile of the user. Another way to model the user's interest is by modeling the topics of the tweets made by the people she follows.
  3. Social Voting: Assuming the user has a stable interest and follows people according to that interest, people in the neighborhood should be like-minded enough that voting over the neighborhood can function effectively. However, the “one person, one vote” basis of the approach above may not be the best design choice on Twitter, because some people may be more trustworthy than others as information sources. Andersen et al. discussed several key insights in their theory of trust-based recommender systems [2], one of which is trust propagation. Intuitively, trust propagation means my trust in Alice will increase when the people whom I trust also show trust in Alice. Following this argument, a person who is followed by many of a user’s followees is more trustworthy as an information source, and thus should be granted more power in the voting process.
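That trust-propagation weighting can be caricatured in a few lines (the data and scoring here are made up for illustration; the system's actual voting model may differ): each voter's vote is weighted by how many of the end user's followees also follow that voter.

```python
# Who each person follows (a tiny, made-up Twitter neighborhood).
follows = {
    "me":    {"alice", "bob", "carol"},
    "alice": {"dave", "erin"},
    "bob":   {"dave"},
    "carol": {"dave", "erin"},
}

def vote_weight(user, voter):
    """1 base vote, plus 1 for each of user's followees who follows the voter."""
    followees = follows.get(user, set())
    return 1 + sum(voter in follows.get(f, set()) for f in followees)

def score_url(user, voters):
    """Total trust-weighted votes for a URL posted by `voters`."""
    return sum(vote_weight(user, v) for v in voters)

# dave is followed by all three of my followees, erin by only two:
print(score_url("me", ["dave"]))  # → 4
print(score_url("me", ["erin"]))  # → 3
```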


The figure below describes the overall design of the system. The URL Source selectors from the lower left are content items that feed into the system to be ranked. The left side of the system does the topic modeling, which can come from either the user's own tweets, or the followee's tweets. The social voting model is implemented using modules on the right.
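The topic-modeling side can likewise be caricatured as a term-frequency profile matched by cosine similarity (a deliberate simplification for illustration; the deployed system's profiles are certainly richer):

```python
import math
from collections import Counter

def profile(texts):
    """Term-frequency profile built from a user's (or followees') tweets."""
    words = Counter()
    for t in texts:
        words.update(t.lower().split())
    return words

def cosine(p, q):
    """Cosine similarity between two term-frequency profiles."""
    dot = sum(p[w] * q[w] for w in p)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

user_profile = profile(["visualization of social data",
                        "social network analysis"])
candidate = profile(["new social network visualization tool"])
print(cosine(user_profile, candidate) > 0.5)  # → True
```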



We implemented 12 recommendation engines in the design space we formulated above, and deployed them to a recommender service on the web to gather feedback from real Twitter users. The best performing algorithm improved the percentage of interesting content to 72% from a baseline of 33%.

Overall, we found that:
  1. The social voting process seems to contribute the most to the recommender accuracy.
  2. The topic models also contribute to the accuracy, but modeling using the user's own tweets is more accurate (with the caveat that the user actually tweets, rather than merely listening by following people).
  3. Selecting URLs based on the neighborhood seems to work better than globally popular URLs, but the results are not yet statistically significant.
  4. The best performing algorithm is FoF-Self-Vote (that is, using the neighborhood for URL content sources, self-tweets for topic modeling, and social voting.)



You can try out the beta system at http://zerozero88.com, but since it is still in beta, we can probably only enable the accounts of a limited number of people who sign up.

You can also read more about our results in the published paper [3].

Update 2010-08-23: Slides available here.

References

[1] Sarwar, B.M., Karypis, G., Konstan, J.A., Riedl, J. 2002. Recommender systems for large-scale ECommerce: Scalable neighborhood formation using clustering. In Proc of ICCIT 2002.

[2] Andersen, R., Borgs, C., Chayes, J., Feige, U., Flaxman, A., Kalai, A., Mirrokni, V., and Tennenholtz, M. 2008. Trust-based recommendation systems: an axiomatic approach. In Proc of WWW ‘08.

[3] Chen, J., Nairn, R., Nelson, L., Bernstein, M., and Chi, E. 2010. Short and tweet: experiments on recommending content from information streams. In Proceedings of the 28th international Conference on Human Factors in Computing Systems (Atlanta, Georgia, USA, April 10 - 15, 2010). CHI '10. ACM, New York, NY, 1185-1194. DOI= http://doi.acm.org/10.1145/1753326.1753503

Monday, April 12, 2010

Information Stream Overload

Information overload is a growing threat to the productivity of today’s knowledge workers, who need to keep track of multiple streams of information from various sources. RSS feed readers are a popular choice for syndicating information streams, but current tools tend to contribute to the overload problem instead of solving it.  Ironic, isn't it?

A significant portion of the ASC team is here in Atlanta to present work related to this information overload problem, and I will blog about it in the next week or so.

Tomorrow, we will be presenting a paper on FeedWinnower, an enhanced feed aggregator that helps readers filter feed items by four facets (topic, people, source, and time), thus facilitating feed triage. The four facets correspond to the What, Who, Where, and When questions that govern much information architecture design.  The combination of the four facets provides a powerful way for users to slice and dice their personal feeds.

First, a topic panel allows users to drill down into the specific topics that they might be interested in:


Second, a people panel allows filtering by the person who created the information item in the stream:


Third, a source panel allows filtering of the type of information stream the item came from:


And finally, a time panel allows filtering to a particular time period of interest within the information stream:



Usage Scenarios
By combining the four facets, users can examine and navigate their feeds, deciding what items to skip and what to read. Here we give two illustrative real-world scenarios.

Scenario 1: At the end of a workday, Mary opens FeedWinnower to get a sense of what has been happening around her. Using the time facet, she finds out that 507 items came into her account earlier in the day. Glancing at the topic facet, she sees “iphone” and a few other topics being talked about. As she clicks on “iphone”, the right screen shows only 7 items after filtering out other items. In the people facet, she identifies that these 7 items came from 4 of her friends and decides to read those items in detail.

Scenario 2: John wants to find out what his friends have been chatting about on Twitter lately. He selects “Twitter” in the source facet and chooses “yesterday” in the time facet. This yields 425 items. In the people facet, he then excludes those creators that he wants to ignore, filtering down to 324 items. Looking at the topic facet, he sees “betacup” and wonders what it is about. After clicking on “betacup” and reading the remaining 7 items, he now has a fair understanding about the term “betacup”.

In these two scenarios, we see how the four facets enable users to construct simple queries to accomplish their needs. We also see how the topic facet is essential in obtaining an overview of the topical trends in the feeds and helping users to decide what is worth reading in depth.
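The faceted triage in the scenarios above amounts to intersecting filters: each facet selection keeps only the matching items, and the remaining facet counts update accordingly. A minimal sketch follows; the item fields are assumptions for illustration, not FeedWinnower's actual data model.

```python
def filter_items(items, topic=None, person=None, source=None, day=None):
    """Apply conjunctive facet filters to a list of feed items.

    Each item is a dict with 'topics' (a set), 'person', 'source',
    and 'day' keys. A facet left as None is unconstrained.
    """
    result = []
    for item in items:
        if topic and topic not in item["topics"]:
            continue
        if person and item["person"] != person:
            continue
        if source and item["source"] != source:
            continue
        if day and item["day"] != day:
            continue
        result.append(item)
    return result
```

Scenario 1 then corresponds to something like `filter_items(items, topic="iphone", day="today")`, with the people facet computed from the survivors.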

The paper reference is:
Hong, L., Convertino, G., Suh, B., Chi, E. H., and Kairam, S. 2010. FeedWinnower: layering structures over collections of information streams. In Proceedings of the 28th international Conference on Human Factors in Computing Systems (Atlanta, Georgia, USA, April 10 - 15, 2010). CHI '10. ACM, New York, NY, 947-950. DOI= http://doi.acm.org/10.1145/1753326.1753466

Monday, March 8, 2010

Wikipedia's People-Ware Problem

Last week, we hosted a visit from the Wikimedia Foundation on issues relating to our work on community analytics, and what it tells us about Wikipedia's problems and possible solutions. Naoko Komura (pictured at right) of the Wikimedia Usability Initiative, as well as Erik Zachte, the staff data analyst (also pictured at right), spoke eloquently about how we can create social tools that direct the best social attention to the parts of Wikipedia that need it.



Fundamentally, Wikipedia has always had a "people-ware" problem: distributing the freely donated expertise to the right places.  It has been, and will always remain, its greatest challenge. The amazing thing about Wikipedia is that it managed to do this for so long, building up a valuable knowledge repository as a result.  At first, people simply came because it was the place to be.  Now, we have to work a little harder.

We spent a lot of time talking about the best way to model this people-ware problem, either using biological metaphors (evolutionary systems with various forces) or economic models (see last post here).  However, one thing to be aware of is the danger of "analysis paralysis", where you spend so much time analyzing the problem that you forget many ideas have already been generated for moving the great experiment forward.

For example, there are many places in Wikipedia that are not well populated. It's well known that many scientific and math articles, for example, could use an expert eye to catch errors and explain the concepts better. How can we build an expertise finder that would actually invite people to fix problems that we know exist in Wikipedia?

Another idea might be to have the whole system be more social. Chris Grams blogs about a part of this idea here. We suggested some time ago to have a system like WikiDashboard, where you actually show the readers what the social dynamics have been for a particular article.

Wikipedia was created in 2001, when the social web was still in its infancy. During the ensuing 9 years, it has changed very little, and I would argue Wikipedia has not kept up with the times. Lots of "Social Web" systems and new cultural norms have been built up already.  For example, I suspect that many of us would not mind at all revealing our identities on Wikipedia; we might like to log in with our OpenIDs and even have verified email addresses, so that the system can send us verification/clarification/notification messages. The system perhaps should connect with Facebook, so that my activities (editing an article on "Windburn", say) are automatically sent to my stream there. My friends, upon seeing that I have been editing that article, might even join in.

I think that Wikipedia is about to change, and it is going to become a much more socially-aware place. I certainly hope that they will tackle the People-Ware (instead of the Tool-Ware) problems, and we will see it become an exciting place again.

Thursday, March 4, 2010

The problem of matching social attention and products...


Many people have already borrowed the attention-scarcity ideas of Herb Simon and said that the most important problem in our information-overloaded society is the efficient distribution of attention. What some have called the "attention economy" is nothing more than a re-packaging of this idea.

In business, of course, getting consumers' attention is quickly becoming an important aspect of being successful. The traditional way of getting people's attention is through advertising, and we have witnessed a dramatic transformation of how advertisements work in the online world over the last decade: from display advertising to search advertising and, more recently, to action advertising. Increasingly, we can tie advertising dollars to direct consumer action.

For us, it was not a stretch, then, to start thinking about how consumer actions are starting to feed back quickly into product design. Thus, we now have people talking about crowdsourced product design. The most agile companies now listen to consumers via channels such as Facebook, Twitter, and blog analytics. They do this via services such as brand-management consultants and sentiment-analysis tools, so much so that they can discern tiny changes in consumer awareness of product issues and desires.

We also know that traditional economic models serve to optimize the distribution of products to the people who want them. But these models have also recently been used to optimize the distribution of people's attention to the products that might serve their needs. The two usages obviously go hand in hand.

If we can help companies fill people's attention spots just in time with the best products, we would have a highly optimized economy that wastes little energy distributing worthless advertisements (or spam). In fact, the existence of spam points to inefficiencies in the economic system.

It turns out that versions of this problem exist everywhere in the Web 2.0 world:
  • The problem of efficiently distributing the best tweets to the people who want to view them is a version of this attention distribution problem. Any time you see a tweet that was worthless to you is an opportunity for optimization.
  • The problem of pointing experts to the most valuable articles that they can contribute to in Wikipedia is another version.
Solutions to these problems might take the form of recommendation or filtering systems, but might also be efficient interactive browsing systems (for products in an online store like Amazon, or articles in Wikipedia). Some thought experiments:
  • What if we can design an expertise finding system that recommends the best articles for you to contribute to in Wikipedia? Would it increase participation rates?
  • What if we analyze your social network everyday and tell you the best tweets that you should spend five minutes on? Would more people retweet more often?
  • What if product designers were better tuned to trending topics and needs? Would that enable companies to succeed more often? Are companies like Zazzle and CafePress prototype examples of lubricating this path?

Your thoughts?

Wednesday, January 20, 2010

What are big research problems in Social Web technologies?

Just finished reading Dion Hinchcliffe's piece over at ZDNet on emerging technologies for the Social Web in 2010. I have been reading all these different predictions to see how they relate to our research agenda. Dion's piece is long, but several points resonated with what we have been doing:

First, he said that one problem we have is
"Poor integration between social media and location services. Again, while there’s already some location awareness in social networking services today, there’s a long way to go before it’s integrated meaningfully into the social experience to provide real utility."
I agree wholeheartedly. Not too long ago, I participated in a research project here at PARC called Magitti, an activity recommender that modeled your content interests, your schedule, your location, as well as your personal history on the mobile device [1]. The integration of personalization and social features with location-aware services will be a significant trend in 2010, and there will be a lot of good research and products in this area.

Second, he said that people are having difficulties in
"coherently engaging in social activity across many channels. Tired of the day-long round-robin between your e-mail, SMS, Twitter, Facebook, and any other services you use to keep up with what’s going on? You’re not the only one. While aggregation services such as Friendfeed potentially cut down on the manual effort of using the social Web, it’s still not mainstream despite being a good example of what’s possible. Notably it’s often the big (and closed) social silos that are causing the problem."
Our group was an early adopter of FriendFeed, and realized that many of the issues relating to social annotation, commenting, and other interactions were due to the distributed nature of social media. It is hard to keep track of who said what, and the aggregate reactions to content. Our research group has some investments in this research problem, which relates to aggregation and the ability to browse and filter the feeds. We are about to publish a paper in CHI2010 about how to use faceted browsing techniques to partially solve this problem [2].

Finally, the most important point he made concerned our need for
"Coping with and getting value from the expanding information volume of social media. We’re all learning how to deal with the firehose of information that flows out of social media on a minute-by-minute basis. Sometimes it’s hard to remember that this flow of transparent and open information is actually good and often useful and creates important conversations. But the simple fact is that much of it isn’t meant for non-stop, instantaneous consumption [emphasis added]; it simply isn’t practical. Rather, social media leaves behind artifacts and information that we can find and use later when we need them. But at the moment the process of sorting through, aggregating, and filtering the vast volume of information cascading through social media today remains a real and growing challenge. I also began to get the first real reports that this is happening in the enterprise last year as social media begins to grow there as well."

Here, the ASC group's investments in summarization, recommendation, personalization, and related techniques will hopefully pay off. Our investments have been in understanding how to apply these techniques in social media, with the added social context and new data-mining techniques around social streams. Research-wise, we will push hardest on this last point, and I believe it is also the area where we can most likely extract user value. We are about to publish a paper at CHI2010 on how to do recommendations on the Twitter network [3].

I will blog about these research efforts soon.

----
[1] Victoria Bellotti, James Bo Begole, Ed H. Chi, Nicolas Ducheneaut, Ji Fang, Ellen Isaacs, Tracy King, Mark Newman, Kurt Partridge, Bob Price, Paul Rasmussen, Michael Roberts, Diane J. Schiano, Alan Walendowski. Activity-Based Serendipitous Recommendations with the Magitti Mobile Leisure Guide. In Proceedings of the ACM Conference on Human-factors in Computing Systems (CHI2008), pp. 1157-1166. ACM Press, 2008. Florence, Italy.

[2] Hong, L.; Convertino, G.; Suh, B.; Chi, E. H.; Kairam, S. FeedWinnower: layering structures over collections of information streams. Submitted and accepted to ACM CHI2010.

[3] Chen, J., Nairn, R., Nelson, L., Chi, E. H. Short and Tweet: Experiments on Recommending Content from Information Streams. Submitted and Accepted to ACM CHI2010.