Augmented Social Cognition Research Blog from PARC

ASC blog deprecated (moved to parc.com)

2011-03-28T23:36:00.000-07:00

This entry serves as a marker that the ASC Team blog is no longer active.

From now on, PARC's social computing researchers will blog at: http://blogs.parc.com/blog/topics/social-computing/

Ed H. Chi has left PARC to become a Research Scientist at Google, and will continue to blog at his personal blog and as an ACM blogger.

Thanks all for listening for all these years!

--Ed

Further details on 'Location' field behavior on Twitter

2011-01-25T15:56:00.000-08:00

There are of course many more details to the 'Location' field study described in the previous post, which was covered by various press outlets (Seattle PI, AllThingsD, ReadWriteWeb, NYTimes). Several of them are worth pondering:

First thing is on geo-information scale. Out of the 66% of users with any valid geographic information, those that were judged to be outside of the United States were excluded from our study of scale. Users who indicated multiple locations (see below) were also filtered out. This left us with 3,149 users who were determined by both coders to have entered valid geographic information that indicated they were located in the United States.

When examining the scale of the location entered by these 3,149 users, an obvious city-oriented trend emerges (see the figure below). Left to their own devices, users by and large choose to disclose their location at exactly the city scale, no more and no less. As shown in the figure below, approximately 64% of users specified their location down to the city scale. The next most popular scale was state-level (20%).

When users specified intrastate regions or neighborhoods, they tended to be regions or neighborhoods that engendered significant place-based identity. For example, “Orange County” and the “San Francisco Bay Area” were common entries, as were “Harlem” and “Hollywood”. Interestingly, studying the location field behavior of users located within a region could be a good way to measure the extent to which people identify with these places.

This might not have been a surprise. What's perhaps more interesting is the behavior around specifying multiple locations. 2.6% of the users (4% of the users who entered any valid geographic information) entered multiple locations. Most of these users entered two locations, but 16.4% of them entered three or more locations. Qualitatively, it appears many of these users either spent a great deal of time in all locations mentioned, or called one location home and another their current residence. An example of the former is the user who wrote “Columbia, SC. [atl on weekends]” (referring to Columbia, South Carolina and Atlanta, Georgia). An example of the latter is the user who entered that he is a “CALi b0Y $TuCC iN V3Ga$” (A male from California “stuck” in Las Vegas).

Looking at the 10,000 profiles we examined, the most categorically distinct entries we encountered were the automatically populated latitude and longitude tags that were seen in many users’ location fields. After much investigation, we discovered that Twitter clients such as ÜberTwitter for BlackBerry smartphones entered this information. Approximately 11.5% of the 10,000 users we examined had these latitude and longitude tags in their location field. The vast majority of the machine-entered latitude and longitude coordinates had six digits after the decimal point, which is well beyond the precision of current geolocation technologies such as GPS. While it depends somewhat on the latitude, six decimal places result in geographic precision at well under a meter. This precision is in marked contrast with the city-level organic disclosure behavior of users.
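To make the "well under a meter" claim concrete, here is a minimal sketch that estimates the ground distance spanned by one step in the last decimal place of a coordinate. It assumes a spherical Earth with a mean radius of 6,371 km, and the function name and the example latitude of 40 degrees are illustrative choices, not part of the study.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumption: spherical Earth, mean radius in meters


def decimal_degree_resolution_m(decimals, latitude_deg):
    """Return (north_south_m, east_west_m): the ground distance covered by
    one unit in the last decimal place of a latitude/longitude value."""
    step_deg = 10 ** -decimals
    meters_per_degree = 2 * math.pi * EARTH_RADIUS_M / 360
    north_south = step_deg * meters_per_degree
    # Meridians converge toward the poles, so east-west distance shrinks
    # with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


# Hypothetical example: six decimal places at 40 degrees latitude.
ns, ew = decimal_degree_resolution_m(6, 40.0)
print(f"{ns:.3f} m north-south, {ew:.3f} m east-west")
```

Under these assumptions both values come out on the order of a tenth of a meter, which is far finer than the city scale that users choose on their own.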

This mismatch leads us to a fairly obvious but important implication for design. Any system automatically populating a location field should do so, not with the exact latitude and longitude, but with an administrative district or vernacular region that contains the latitude and longitude coordinate. It is likely that users would prefer not to reveal their location to such precise coordinates if they had the choice to specify the granularity.
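As a rough illustration of this design implication, the sketch below replaces an exact coordinate with the name of the nearest city from a tiny, hypothetical gazetteer. Nearest-centroid matching is a simplification standing in for a real lookup of the administrative district that contains the point; the place list, the approximate coordinates, and the function names are all assumptions made for the example.

```python
import math

# Hypothetical mini-gazetteer: place name -> approximate (lat, lon) of the city center.
PLACES = {
    "San Francisco, CA": (37.77, -122.42),
    "Kansas City, MO": (39.10, -94.58),
    "Las Vegas, NV": (36.17, -115.14),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers on a spherical Earth."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def coarsen_to_place(lat, lon, places=PLACES):
    """Return the name of the gazetteer place whose center is closest to (lat, lon)."""
    return min(places, key=lambda name: haversine_km(lat, lon, *places[name]))


# A six-decimal coordinate is shown to other users only as a city name.
print(coarsen_to_place(39.099727, -94.578567))  # Kansas City, MO
```

A production system would more likely test which boundary polygon contains the point, and could let the user pick the granularity before anything is displayed.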

Overall, the picture that this data paints suggests a wide variety of ways in which people want to communicate their location to others. Some are often at multiple locations, while others want to express a cultural or neighborhood identity through their location. Users often want to have the ability to express sarcasm, humor, or elements of their personality through their location field. In many ways, this is not a surprise; people’s geographic past and present have always been a part of their identity. We are particularly interested in the large number of users who expressed real geographic information in highly vernacular and personalized forms. Designers may want to invite users to choose a location via a typical map interface and then allow them to customize the place name that is displayed on their profile. This would allow users who enter their location in the form of “KC N IT GETS NO BETTA!!” (a real location field entry in our study) to both express their passion for their city and receive the benefits of having a machine-readable location, if they so desire.

Our findings also suggest that Web 2.0 system designers who wish to engender higher rates of machine-readable geographic information in users’ location fields may want to force users to select from a precompiled list of places.

People who entered multiple locations motivate an additional important implication for design: give users the ability to specify their activities in various locations, such as home, work, current, visiting city, favorite bar, etc. Other directions of future work include examining per-tweet location disclosure, as well as evaluating location disclosure on social network sites such as Facebook.

"Location" Field in Twitter User Profiles (and an interesting fact about Justin Bieber)

2011-01-18T13:56:00.000-08:00

Interest in geographic information has intensified in the last year or two. One way people obtain geolocation data is by decoding the "Location" field that users fill in during account sign-up. Many researchers have used this field to analyze where the users of a service might be coming from. For example, Mashable has a nice write-up of services that depend on geolocation data for Twitter. But little research exists on one of the most common, oldest, and most utilized forms of online social geographic information: the “location” field found in most virtual community user profiles.

Recently, our summer intern Brent Hecht, who was visiting us from Northwestern University, performed the first in-depth study of user behavior around the 'location' field in Twitter user profiles. Here is what we found.

From April 18 to May 28, 2010, we collected about 32 million English tweets from the Spritzer sample feed. Our 32 million English tweets were created by 5,282,657 unique users. Out of these users, we randomly selected 10,000 “active” users for our first study. We defined “active” as having more than five tweets in our dataset, which reduced our sampling frame to 1,136,952 users (or 22% of all users). We then extracted the contents of these 10,000 users’ location fields and placed them in a coding spreadsheet. Two coders examined the 10,000 location field entries. Coders were asked to use any information at their disposal, from their cultural knowledge and human intuition to search engines and online mapping sites.
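The sampling step described above can be sketched as follows. This is only an illustration of "keep users with more than five tweets, then draw a random sample"; the function name, the parameters, the fixed seed, and the assumption that the input is one author ID per collected tweet are hypothetical.

```python
import random
from collections import Counter


def sample_active_users(tweet_authors, min_tweets_exclusive=5, sample_size=10_000, seed=0):
    """Pick a random sample of users who have more than `min_tweets_exclusive` tweets.

    `tweet_authors` is an iterable with one author ID per collected tweet.
    """
    counts = Counter(tweet_authors)
    # Sampling frame: only "active" users (more than five tweets by default).
    # Sorting makes the result reproducible for a given seed.
    active = sorted(user for user, n in counts.items() if n > min_tweets_exclusive)
    rng = random.Random(seed)
    return rng.sample(active, min(sample_size, len(active)))


# Toy data: "a" has 6 tweets, "b" has exactly 5, "c" has 10.
authors = ["a"] * 6 + ["b"] * 5 + ["c"] * 10
print(sorted(sample_active_users(authors, sample_size=10)))  # ['a', 'c']
```

Note that a user with exactly five tweets is excluded, matching the "more than five" definition of active.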

As shown in the figure below, only 66% of users manually entered any sort of valid geographic information into the location field. This means that although the location field is usually assumed by practitioners and researchers to be a field that is as associated with geographic information as a date field is with temporal information, this is definitely not the case in our sample.

We found that 34% of users did not provide real location information, frequently incorporating fake locations or sarcastic comments that can fool traditional geographic information tools. This one-third of users was roughly split between those who did not enter any information and those who entered either non-real locations, obviously non-geographic information, or locations that did not have specific geographic footprints. When users did input their location, they almost never specified it at a scale any more detailed than their city.

An analysis of the non-geographic information entered into the location field revealed it to be highly unpredictable in nature. A striking trend was the theme of Justin Bieber, who is a teenage singer. A surprising 61 users (more than 1 in 200 users) co-opted the location field to express their appreciation of the pop star. For instance, a user wrote that s/he is located in “Justin Biebers heart” and another user indicated s/he is from “Bieberacademy”. Justin Bieber was not the only pop star that received plaudits from within the location field; United Kingdom “singing” duo Jedward, Britney Spears, and the Jonas Brothers were also turned into popular “locations”.

Another common theme involved users co-opting the location field to express their desire to keep their location private. One user wrote “not telling you” in the location field and another populated the field with “NON YA BISNESS!!” Sexual content was also quite frequent, as were “locations” that were insulting or threatening to the reader (e.g. “looking down on u people”). Additionally, there was a prevalent trend of users entering non-Earth locations such as “OUTTA SPACE” and “Jupiter”.

A relatively large number of users leveraged the location field to express their displeasure about their current location. For instance, one user wrote “preferably anywhere but here” and another entered “redneck hell”.

Entering non-real geographic information into the location field was so prevalent that it even inspired some users in our sample to make jokes about the practice. For instance, one user populated the location field with “(insert clever phrase here)”.

Note that, in the 66% of users who did enter real geographic information, we included all users who wrote any inkling of real geographic information. This includes those who merely entered their continent and, more commonly, those who entered geographic information in highly vernacular forms. For example, one user wrote that s/he is from “kcmo--call da po po”. Our coders were able to determine this user meant “Kansas City, Missouri”, and thus this entry was rated as valid geographic information (indicating a location at a city scale). Similarly, a user who entered “Bieberville, California” as her/his location was rated as having included geographic information at the state scale, even though the city is not real.

Our study on the information quality has vital implications for leveraging data in the location field on Twitter (and likely other websites). Namely, many researchers have assumed that location fields contain strongly typed geographic information, but our findings show this is demonstrably false. To determine the effect of treating Twitter’s location field as strongly-typed geographic information, we took each of the location field entries that were coded as not having any valid geographic information (the 16% slice of the pie chart) and entered them into Yahoo! Geocoder. This is the same process used by Java et al. [1] A geocoder is a traditional geographic information tool that converts place names and addresses into a machine-readable spatial representation, usually latitude and longitude coordinates [2].

Of the 1,380 non-geographic location field entries, Yahoo! Geocoder determined 82.1% to have a latitude and longitude coordinate. Because our coders judged all of these entries to contain either no geographic information or only highly ambiguous geographic information, this number should be zero (assuming no coding error). Some examples of these errors are quite dramatic. “Middle Earth” returned (34.232945, -102.410204), which is north of Lubbock, Texas. Similarly, “BieberTown” was identified as being in Missouri and “somewhere ova the rainbow” in northern Maine. Even “Wherever yo mama at” received an actual spatial footprint: in southwest Siberia.

(Map: the location Yahoo! Geocoder returned for “Middle Earth”.)

Since Yahoo! Geocoder assumes that all input information is geographic in nature, the above results are not entirely unexpected. The findings here suggest that geocoders alone are not sufficient for the processing of data in location fields. Instead, data should be preprocessed with a geoparser, which disambiguates geographic information from non-geographic information [2]. However, geoparsers tend to require a lot of context to perform accurately. Adapting geoparsers to work with location field entries is an area of future work.

References
[1] Java, A., Song, X., Finin, T. and Tseng, B. Why We Twitter: Understanding Microblogging Usage and Communities. Joint 9th WEBKDD and 1st SNA-KDD Workshop ’07, San Jose, CA, 56-65.

[2] Hecht, B. and Gergle, D. A Beginner’s Guide to Geographic Virtual Communities Research. Handbook of Research on Methods and Techniques for Studying Virtual Communities, IGI, 2010.

Revamping WikiDashboard

2010-09-03T14:44:00.000-07:00

I released WikiDashboard almost three years ago. Believe it or not, the server for WikiDashboard has been running under my desk for three full years (the photo shows the actual server). It was launched in a rush to meet a deadline for an academic paper that we published at a conference (ACM SIGCHI 2008) and limited maintenance has been done so far.

The old Power Mac (http://en.wikipedia.org/wiki/Power_Mac_G5 ) has been pretty reliable but it is becoming increasingly untrustworthy lately. Frustrated with frequent crashes, hangs, and sluggishness, I finally decided to do something. As I’m migrating the tool out of the old machine, I’ve added a few new features. I hope you find it useful.

http://wikidashboard.appspot.com

Faster and scalable infrastructure
The server is now running on Google App Engine. WikiDashboard is hosted as a web app on the same systems that power Google applications. WikiDashboard should now provide a faster, more reliable, and more scalable service to you. I plan to keep the old server running for a bit, but it will eventually forward the traffic to the new server.

Support ten more languages
Thank you to everyone who showed interest in having WikiDashboard in your own language version!

English http://wikidashboard.appspot.com/enwiki
Japanese (日本語) http://wikidashboard.appspot.com/jawiki
German (Deutsch) http://wikidashboard.appspot.com/dewiki
Spanish (Español) http://wikidashboard.appspot.com/eswiki
French (Français) http://wikidashboard.appspot.com/frwiki
Russian (Русский) http://wikidashboard.appspot.com/ruwiki
Italian (Italiano) http://wikidashboard.appspot.com/itwiki
Portuguese (Português) http://wikidashboard.appspot.com/ptwiki
Polish (Polski) http://wikidashboard.appspot.com/plwiki
Dutch (Nederlands) http://wikidashboard.appspot.com/nlwiki
Korean (한국어) http://wikidashboard.appspot.com/kowiki

Bongwon Suh
http://www.parc.com/suh
@billsuh http://twitter.com/billsuh

Open data manipulation and visualization: Challenges

2010-09-02T11:16:00.000-07:00

I typically blog about research results here, but here is one post that's more conversational and discussion-oriented. My good friend m.c. schraefel asked me a question via email: "What are 1 or 2 key priorities you think must be addressed that will aid citizen focused manipulation of open data sources for personal/social knowledge building?"

Here is my answer to her:

The issue you raised was precisely the inspiration for my Ph.D. thesis work on creating a visualization spreadsheet. More than 10 years ago, the idea was that if people can easily use spreadsheets, then they ought to be able to take that model further and start creating visualizations using them, and the thesis was an exploration of how to design such systems. I consider ManyEyes and Jeff Heer's later work to be in the same direction.

We have since learned a lot about user-contributed content on systems like Wikipedia, Delicious, and Twitter, which show a very interesting participation architecture that consists of readers, contributors, and leaders. Not all users want to be leaders, and not all users want to contribute. We have sometimes used the derogatory term "lurkers" to describe "readers", which I think is a bit unfair. Ronald Burt's work has shown that a lot of us would like to be brokers of information among social groups, but there is also a need for an audience, or followers, who might become brokers later, but not everyone all at once.

I believe that data manipulation of open data sources will follow the same curve. Yes, some cancer patients will want to read all they can about their condition and do the analytical work, while others (not necessarily because of tool limitations) would prefer to take a backseat and let others curate the information for them. What's interesting is that they might want very simple interactions that enable basic sorting of data, or maybe even services that interpret the data for them (e.g. doctors), but they would prefer that someone else do the bulk of the work (even if it becomes very easy, due to tool research and development).

Given that, what can we do?

First, it's quite clear that much of the hard work remains in data import and cleaning. To democratize data analytics and manipulation, the bulk of the difficulty is dealing with data acquisition. Unfortunately, most of this is engineering and not sexy research, so there isn't much innovative work in this area, but some information extraction techniques (AI-style algorithms and some machine learning techniques) are making inroads. I also believe that mixed-initiative research for data import is sorely needed. We're doing a bit of this work in my lab at the moment.

Second, there is the issue of data literacy. What kind of visualization works with what kind of data? What analytic technique is appropriate? Early work by Jock Mackinlay (from our old UIR research group) pointed to the possibility of automating some of these design choices in his Ph.D. research, and we haven't made a huge amount of progress in this area since then. He is now at Tableau Software trying to solve some of these issues. Wizards and try-visualization-refine loops have all been tried in research. We need to stop inventing new visualizations and instead build actual usable tools for people. By going to vertical domains, we will learn how to solve this problem.

These two are the biggest problems, IMHO. Of course, there are other technical challenges such as data scale and compute power, security, privacy, and social sharing, which are all fascinating, and research such as ManyEyes has done a lot to teach us about these issues.

Want to be Retweeted? Add URLs to Your Tweets!

2010-08-16T13:54:00.000-07:00

In my previous post, I described a recent study [1] in which we found that including hashtags in a tweet may enhance the retweetability of the tweet. In this post, I will focus on another factor that might affect retweetability: URLs.

As reported in my previous post, we collected a random sample of public tweets from Twitter's Spritzer feed over a 7-week period, yielding about 74 million tweets. From these tweets, we identified 8.24 million of them as retweets. That is, 11.1% of the 74 million tweets are retweets.

Next, we searched for those tweets and retweets that contain at least one URL. We found that 21.1% of tweets and 28.4% of retweets include URLs, suggesting that a tweet with URLs is more likely to get retweeted.

We further investigated whether the retweetability of a tweet has anything to do with the type of website it refers to. Since most of the URLs included in tweets are shortened URLs, we first expanded the abbreviated URLs into their original URLs, and then extracted the domain names from the original URLs. For example, given an abbreviated URL http://bit.ly/c1htE cited by a tweet, we first unshortened it to http://en.wikipedia.org/wiki/URL_shortening, and then extracted the domain name of en.wikipedia.org. The URL domains are indicative of the type of content sources visited and shared by Twitter users.
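The domain-extraction step can be sketched with Python's standard library. This is a minimal illustration that assumes the shortened URL has already been expanded (which requires following HTTP redirects and is not shown here); the function name is hypothetical.

```python
from urllib.parse import urlparse


def url_domain(expanded_url):
    """Return the host part of an already-expanded URL, or '' if there is none."""
    return urlparse(expanded_url).hostname or ""


print(url_domain("http://en.wikipedia.org/wiki/URL_shortening"))  # en.wikipedia.org
```

The sketch keeps any "www." prefix as-is, so www.formspring.me and formspring.me would count as separate domains, which is consistent with how both appear in the tables in this post.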

Analyzing the 74 million tweets, we identified the 20 most popular URL domains referred to in our tweets and the number of tweets containing each URL domain:

Rank	URL Domain	Number of Tweets
1	twitpic.com	793,680
2	myloc.me	533,082
3	www.facebook.com	481,349
4	www.youtube.com	475,509
5	formspring.me	455,377
6	www.twitlonger.com	349,760
7	tweetphoto.com	258,049
8	youtu.be	196,557
9	twitcam.com	159,684
10	url4.eu	145,656
11	twitter.com	144,002
12	www.plurk.com	127,037
13	fun140.com	113,153
14	www.formspring.me	100,111
15	bit.ly	94,505
16	foursquare.com	90,328
17	www.ustream.tv	83,486
18	tinychat.com	80,406
19	blip.fm	74,647
20	www.funwebsites.org	52,148

On the other hand, the following table shows the 20 most popular URL domains cited in our 8.24 million retweets and the number of retweets containing each URL domain:

Rank	URL Domain	Number of Retweets
1	www.twitlonger.com	236,435
2	twitpic.com	129,692
3	myloc.me	121,950
4	www.youtube.com	79,404
5	www.facebook.com	55,186
6	tweetphoto.com	49,676
7	twitter.com	39,127
8	mashable.com	17,778
9	bit.ly	16,406
10	www.ustream.tv	9,638
11	www.nytimes.com	9,035
12	shar.es	8,636
13	url4.eu	8,213
14	dealspl.us	8,186
15	www.flickr.com	7,599
16	www.cnn.com	7,537
17	youtu.be	7,508
18	www.etsy.com	6,828
19	ax.itunes.apple.com	6,346
20	www.huffingtonpost.com	6,332

As can be seen, these two lists of URL domains do not match each other exactly. For example, formspring.me appears only in the first list, while mashable.com appears only in the second list. That is, the fact that a website is frequently cited in the tweets does not guarantee that it is also frequently referred to in the retweets, and vice versa.

For each URL domain, we computed a retweet rate by dividing the number of retweets containing the domain by the number of tweets containing the domain. We then normalized the rate so that a value of 1.0 represents the average retweet rate of 11.1%. For example, for twitpic.com, the retweet rate of 1.47 was calculated as (129,692/793,680)*(74/8.24). A URL domain with a retweet rate higher than 1.0 indicates that, compared to the average case, the tweets containing this domain have a higher chance of getting retweeted. The following table shows the retweet rates for the 10 most popular URL domains cited in our tweets:
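The normalization described above can be written as a short function. This sketch uses the rounded totals quoted in this post (74 million tweets, 8.24 million retweets); the function and constant names are illustrative.

```python
TOTAL_TWEETS = 74_000_000    # rounded total from this post
TOTAL_RETWEETS = 8_240_000   # rounded total from this post


def normalized_retweet_rate(retweets_with_item, tweets_with_item,
                            total_retweets=TOTAL_RETWEETS, total_tweets=TOTAL_TWEETS):
    """Retweet rate of one domain, scaled so that 1.0 equals the overall average."""
    raw_rate = retweets_with_item / tweets_with_item
    average_rate = total_retweets / total_tweets  # about 11.1%
    return raw_rate / average_rate


# Counts taken from the two tables above.
print(round(normalized_retweet_rate(129_692, 793_680), 2))  # twitpic.com -> 1.47
print(round(normalized_retweet_rate(236_435, 349_760), 2))  # www.twitlonger.com -> 6.07
```

Dividing by the average rate is the same as multiplying by 74/8.24, as in the twitpic.com example.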

Rank	URL Domain	Retweet Rate
1	twitpic.com	1.47
2	myloc.me	2.05
3	www.facebook.com	1.03
4	www.youtube.com	1.50
5	formspring.me	0.05
6	www.twitlonger.com	6.07
7	tweetphoto.com	1.73
8	youtu.be	0.34
9	twitcam.com	0.12
10	url4.eu	0.51

As can be seen from the above table, the retweet rates vary greatly depending on the URL domains. For example, formspring.me, which is the 5th most popular domain, has a retweet rate of 0.05, suggesting that tweets containing that domain are very unlikely to be retweeted. On the other hand, the retweet rate of twitlonger.com is 6.07, suggesting that tweets containing that domain have high retweetability.

In the following plot, we show the retweet rates of the 50 most popular URL domains. The X-axis is the popularity rank of URL domains based on how many tweets contain each domain. The Y-axis represents the retweet rates of domains as computed above.

Overall, we see that not all popular URL domains in tweets are popular in retweets. The domain of URLs also matters.

References
[1] Suh, B., Hong, L., Pirolli, P., and Chi, E. H. Want to be Retweeted? Large Scale Analytics on Factors Impacting Retweet in Twitter Network. To appear in SocialCom'10.

Want to be Retweeted? Add Hashtags to Your Tweets!

2010-08-09T13:25:00.000-07:00

In a recent study, Bongwon Suh, Peter Pirolli, Ed H. Chi, and I examined what factors might affect the retweetability of a tweet. We will report our findings in an upcoming paper this August at the Second IEEE International Conference on Social Computing [1]. In this post, I focus on a factor that we found to correlate with retweetability: hashtags.

Before I dive into the details, here is some information about the dataset that we used in our study: From Twitter's Spritzer feed, we collected a random sample of public tweets from January 18, 2010 to March 8, 2010, yielding about 74 million tweets. That is, we collected about 1.5 million tweets per day, representing approximately 2-3% of 50 million tweets appearing on Twitter daily.

For each of these 74 million tweets, we scanned for a variety of retweet markers such as "RT @", "RT:@", "retweeting @", "retweet @", "via @", "thx @", "HT @", and "r @" [2]. We found that there are about 8.24 million retweets, accounting for 11.1% of all the tweets. Next, we searched for those tweets and retweets that contain at least one hashtag. We found that 10.1% of tweets and 20.8% of retweets include hashtags, suggesting that a tweet with hashtags is more likely to get retweeted.
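A marker scan of this kind can be sketched with a regular expression. The marker list is the one quoted above; matching case-insensitively and requiring that a marker not be glued to a preceding letter or digit (so that "her @bob" does not match "r @") are assumptions of this sketch, not details confirmed by the study.

```python
import re

# Retweet markers listed in the post.
RETWEET_MARKERS = ["RT @", "RT:@", "retweeting @", "retweet @",
                   "via @", "thx @", "HT @", "r @"]

# Assumptions: case-insensitive matching, and a marker must not directly
# follow a letter, digit, or underscore.
_MARKER_RE = re.compile(
    r"(?<![A-Za-z0-9_])(?:" + "|".join(re.escape(m) for m in RETWEET_MARKERS) + r")",
    re.IGNORECASE,
)


def looks_like_retweet(text):
    """Return True if the tweet text contains any of the retweet markers."""
    return _MARKER_RE.search(text) is not None


print(looks_like_retweet("RT @alice: hello"))           # True
print(looks_like_retweet("talking to her @bob today"))  # False
```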

We further investigated whether the retweetability of a tweet has anything to do with the type of hashtag it contains. Analyzing the 74 million tweets, we identified the 20 most popular hashtags used in our tweets and the number of tweets containing each hashtag:

Rank	Hashtag	Number of Tweets
1	#nowplaying	355,147
2	#ff	224,760
3	#jobs	124,728
4	#fb	87,959
5	#tinychat	67,225
6	#vouconfessarque	51,578
7	#fail	49,248
8	#tcot	47,394
9	#1	47,373
10	#followfriday	39,986
11	#news	38,573
12	#shoutout	30,633
13	#tweetmyjobs	30,594
14	#bbb	28,590
15	#haiti	28,563
16	#letsbehonest	27,926
17	#iranelection	27,611
18	#quote	27,541
19	#followmejp	25,940
20	#follow	24,166

On the other hand, the following table shows the 20 most popular hashtags used in our 8.24 million retweets and the number of retweets containing each hashtag:

Rank	Hashtag	Number of Retweets
1	#ff	62,331
2	#vouconfessarque	43,628
3	#nowplaying	29,846
4	#tcot	18,527
5	#idothat2	16,583
6	#ohjustlikeme	16,531
7	#jafizisso	15,564
8	#haiti	13,829
9	#retweetthisif	12,602
10	#iranelection	12,334
11	#quote	11,475
12	#followfriday	11,170
13	#fb	10,994
14	#ihatequotes	9,982
15	#fail	9,759
16	#omgthatssotrue	9,286
17	#1	9,124
18	#terremotochile	8,892
19	#p2	8,719
20	#follow	8,084

As can be seen, these two lists of hashtags do not match each other exactly. For example, #jobs appears only in the first list, while #idothat2 appears only in the second list. That is, the fact that a hashtag is frequently used in the tweets does not guarantee that it is also frequently used in the retweets, and vice versa.

For each hashtag, we computed a retweet rate by dividing the number of retweets containing the hashtag by the number of tweets containing the hashtag. We then normalized the rate so that a value of 1.0 represents the average retweet rate of 11.1%. For example, for #nowplaying, the retweet rate of 0.75 was calculated as (29,846/355,147)*(74/8.24). A hashtag with a retweet rate higher than 1.0 indicates that, compared to the average case, the tweets containing this hashtag have a higher chance of getting retweeted. The following table shows the retweet rates for the 10 most popular hashtags used in our tweets:
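Written as a formula, the normalization above looks like this. The symbols are introduced here only for illustration: R_h and T_h are the numbers of retweets and tweets containing hashtag h, and R and T are the totals of about 8.24 million retweets and 74 million tweets.

```latex
\[
  \text{rate}(h) \;=\; \frac{R_h / T_h}{R / T} \;=\; \frac{R_h}{T_h}\cdot\frac{T}{R},
  \qquad
  \text{rate}(\#\text{nowplaying}) \;=\; \frac{29{,}846}{355{,}147}\cdot\frac{74}{8.24}
  \;\approx\; 0.0840 \times 8.98 \;\approx\; 0.75 .
\]
```

A value of exactly 1.0 means R_h / T_h equals R / T, that is, the hashtag is retweeted at the overall average rate of 11.1%.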

Rank	Hashtag	Retweet Rate
1	#nowplaying	0.75
2	#ff	2.49
3	#jobs	0.16
4	#fb	1.12
5	#tinychat	0.04
6	#vouconfessarque	7.59
7	#fail	1.78
8	#tcot	3.51
9	#1	1.73
10	#followfriday	2.51

In the following plot, each point represents an individual hashtag. The X-axis is the popularity rank of hashtags based on how many tweets contain each hashtag. The Y-axis represents the retweet rates of hashtags as computed above.

From the figure, we see that the retweet rates vary greatly. Not all popular hashtags in tweets are popular in retweets. The type of hashtag does matter.

References
[1] Suh, B., Hong, L., Pirolli, P., and Chi, E. H. Want to be Retweeted? Large Scale Analytics on Factors Impacting Retweet in Twitter Network. To appear in SocialCom'10.
[2] boyd, d., Golder, S., and Lotan, G. Tweet, Tweet, Retweet: Conversational Aspects of Retweeting on Twitter. Proc. HICSS'10, 1-10.

Grounded research on Enterprise2.0: the creation of a social information stream browser

2010-08-04T16:09:00.000-07:00

For a number of years, because of our research group's context within a large corporation like Xerox, we have been studying the effectiveness of Enterprise2.0 tools. As Web2.0 consumer tools have changed over time, so have Enterprise2.0 tools. We recently described one such tool that is applicable to both consumer and enterprise users at the AVI conference that was held in Italy in May.

The primary challenge in doing Enterprise2.0 research is the need to ground the research in real data, real user behaviors, and real practices. The class of knowledge workers that has emerged after the proliferation of Web2.0 and Enterprise2.0 tools is distinctly different from past knowledge workers.

We conducted field studies of two groups of senior professionals, and found that the primary challenge for them was far beyond information overload. The knowledge workers now face not just information overload, but also channel overload. That is, they must understand the intricacies of different channels and how often their co-workers pay attention to those channels, and then adjust their strategy for contributing the right content at the right time in the right places. They use these different channels to monitor status and progress updates of both individual and group activities, and they use these tools to forage for and organize new information.

There is much detail in the research: we first characterized the user behaviors and challenges, then did a design iteration with paper prototypes, and finally built and evaluated a software prototype. To make a long story short, we found a number of important requirements for new Enterprise2.0 tools, which we've summarized in the table below:

From this set of requirements, we decided to tackle the issue around channel overload via a faceted-search browser for social information streams. The figure below shows an example screenshot of our FeedWinnower system:

We previously blogged about FeedWinnower in April. As shown in the figure above, we extract several kinds of metadata from the social information stream, such as the author of each posting, the source and media type, the topic of the posting, and the time when the posting was made.
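To give a feel for what faceted filtering over such metadata means, here is a minimal sketch. It is not FeedWinnower's implementation; the `StreamItem` class, its field names, and the `filter_items` helper are hypothetical and simply mirror the facets mentioned in this post.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class StreamItem:
    """One posting in a social information stream (hypothetical structure)."""
    author: str
    source: str
    media_type: str
    topic: str
    posted_at: datetime


def filter_items(items, **facets):
    """Keep the items whose attributes equal every requested facet value."""
    return [item for item in items
            if all(getattr(item, name) == value for name, value in facets.items())]


items = [
    StreamItem("ann", "twitter", "text", "design", datetime(2010, 4, 1, 9, 0)),
    StreamItem("bob", "blog", "text", "design", datetime(2010, 4, 2, 10, 0)),
    StreamItem("ann", "twitter", "photo", "travel", datetime(2010, 4, 3, 11, 0)),
]
print(len(filter_items(items, author="ann", source="twitter")))  # 2
```

Each keyword argument narrows the result set further, which is the basic interaction a faceted browser offers.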

Of course, we are not the only ones to have realized these needs. Email is one of the oldest social information streams. Neustaedter et al. [1] found that sender, receiver, and time were the main attributes that people used to judge the importance of email. Whittaker et al. [2] noted that filing messages in folders is time-consuming and can be problematic if users’ focus changes frequently, suggesting the need for flexible interfaces to allow on-the-fly browsing of content. Hearst suggested that social tags provide an excellent basis for the formation of topic structures for faceted browsing [3], but stressed that acquisition of facet metadata is a problem remaining to be addressed. Related research also includes the design of blog search and browsing interfaces [4]. Hearst et al. [4] suggested design choices such as “the temporal/timelines aspect of blogging” and “automatic creation of a feed reader on the subtopics of interest”. Baumer and Fisher [5] proposed an interface for organizing blogs around a list of extracted topics. Probably the most closely related work is the tool by Dork et al. [6], which organizes RSS feeds along three dimensions (time, location, and tags) and also supports a faceted browsing interface. They assumed that feed items have titles and descriptions, creation times, locations, and tags. A key difference is that we make no assumptions about the presence of tags or manually added metadata. Instead, we construct the topic facet from the content of the items.

In summary, studying two communities from a large IT enterprise, we characterized the work practices and information-management needs of a growing class of busy knowledge workers. We found that they need:

information aggregated across multiple channels, including the combination of content and status updates,
filters that help to easily find important content, and
organization and sharing functions for individual and collaborative sensemaking.

I have a feeling that building these types of information stream interfaces will be the subject of our tool research in the group for some time.

References
[1] Neustaedter, C., Brush, A., and Smith, M. Beyond “From” and “Received”: Exploring the Dynamic of Email Triage. Proc. CHI’05, 1977-1980.

[2] Whittaker, S. and Sidner, C. Email Overload: Exploring Personal Information Management of Email. Proc. CHI’96, 276-283.

[3] Hearst, M. UIs for Faceted Navigation: Recent Advances and Remaining Open Problems. Proc. 2008 Workshop on Human-Computer Interaction and Information Retrieval.

[4] Hearst, M., Hurst, M., and Dumais, S. What Should Blog Search Look Like? Proc. 2008 ACM Workshop on Search in Social Media, 95-98.

[5] Baumer, E. and Fisher, D. Smarter Blogroll: An Exploration of Social Topic Extraction for Manageable Blogrolls. Proc. HICSS’08.

[6] Dork, M., Carpendale, S., Collins, C., and Williamson, C. VisGets: Coordinated Visualizations for Web-based Information Exploration and Discovery. IEEE Trans on Vis. and Computer Graphics, 14(6), 1205-1212.

Model-Driven Research in Social Computing

2010-06-14T11:50:00.000-07:00

I'm in Toronto attending the Hypertext 2010 conference, where I gave the keynote talk at the First Workshop on Modeling Social Media yesterday. I want to document a few of the points I made in the talk here.

The reason we seek to construct and derive models is to predict and explain what might be happening in social computing systems. For social media, we seek to understand how these systems evolve over time. Constructing these models should also enable us to generate new ideas and systems.

As an example, many have proposed a theory of influentials: by identifying a small group of individuals who are connected to the larger social network in just the right way, we can infect or reach the rest of the people in the network. This idea is probably best known in the press through the popular book The Tipping Point by Gladwell. This model of how information diffuses in social networks is very attractive, not just due to its simplicity, but also because of the potential for applying this idea in areas such as marketing.

Models such as this are meant to be challenged and debated. They are always strawman proposals. Duncan Watts' simulations on networks have shown that the validity of this theory is somewhat suspect. Indeed, recently, Eric Sun and Cameron Marlow's work, published in ICWSM2009, showed that this theory of influentials might be wrong. They suggest that "diffusion chains are typically started by a substantial number of users. Large clusters emerge when hundreds or even thousands of short diffusion chains merge together."

Most, if not all, models are wrong. Some models are just more wrong than others. But models still serve important roles. They might be divided into several categories:

Descriptive Models describe what is going on within the data. This might help us spot trends, such as the growth of number of contributors, or trending topics in a community.
Explanatory Models help us explain what might be the mechanisms underlying processes in the system. For example, we might be able to explain why certain groups of people contribute more content than another group.
Predictive Models help us engineer systems by predicting what users and groups might want, or how they might act in systems. Here we might build probabilistic models of whether a user will use a particular tag on a particular item in a social tagging system.
Prescriptive Models are sets of design rules or processes that help practitioners generate useful or practical systems. For example, Yahoo's Social Design Patterns Library on Reputation is a very good example of a prescriptive model.
"Generative Models" actually have two meanings depending on who you're talking to. In statistical circles, "generative models" are models that help generate data that look like real user data and are often probabilistic models. Information Theory is a good example of this approach, in fact. Generative Models could also mean that they are models that help us generate ideas, novel techniques and systems. My work with Brynn Evans on building a social search model is an example of this approach.

In the talk, I illustrated how we have modeled the dynamics in the popular social bookmarking system, Delicious, using Information Theory. I also showed how using equations from Evolutionary Dynamics we were better able to explain what might be happening to Wikipedia’s contribution patterns. Talk Title: Model-driven Research for Augmenting Social Cognition


Ushahidi: A crowdsourcing site you probably have not heard of

2010-05-14T19:43:00.000-07:00

We research scientists always go after the latest and greatest shiny cool thing to study on the Web (like us with Wikipedia), but of course, the real world is full of chaos, anger, fear, and all the unpleasant things we all prefer to forget about. What could the Social Web, Collective Intelligence, and Utopia possibly have to do with all that?

Ushahidi is a "platform that allows anyone to gather distributed data via SMS, email, or the web, and visualize the data on a map or timeline." The goal is to "create the simplest way of aggregating information from the public for use in crisis response." In March, while I was on an around-the-world trip to Beijing and Amsterdam, I read this article in the NYTimes about Ushahidi, and thought about how work like this reaffirms my belief that the Social Web is changing how information is distributed and used in the world, and that it is revolutionary. Ushahidi (which means testimony in Swahili) has now been used in Kenya's disputed election in 2007 (documenting the violence) as well as in the Haitian and Chilean earthquakes. "It collected more testimony with greater rapidity than any reporter or election monitor." "The site collected user-generated cellphone reports of riots, stranded refugees, rapes and deaths and plotted them on a map, using the locations given by informants."

Wow!

Let's think for a second about what happened. Someone (Ory Okolloh) who cared about what was happening in Kenya blogged about it, and thought about how a web application could make the events visible to the world, and then tech geeks read her post and built the web site over a long weekend. Then boom! The world changes.

Shocking.

Why did it work? What's the participation architecture? And what role did technology play in this? Clearly, attention around an event was aggregated, and this came as a result of a confluence of events. The participation architecture relied on people caring enough about what was happening to build the system, on people on the ground having the right technology to report events to the website, and on technology that enabled the plotting of these events on a map. Voilà! Mass data visualization results.

Amazing.

Amazing because this happened in such a distributed fashion. No government agency got involved, and no centralized authority coordinated the work over a multi-year government grant. Now this has been exported back to the USA and in Washington D.C., the system was used to warn about dangerous roads during the big snowstorm.

The same snowstorm that caused the Technology-Mediated Social Participation workshop to move the date of the 2nd East Coast Workshop. Ironic, isn't it?

Ironic also because this was almost precisely what I had proposed to a Gov't funding agency program manager who visited PARC about 3 years ago (Aug 2007) who was interested in disaster response. She never followed up and we weren't funded on the idea. But here are a few slides from that presentation:

ASC Proposal for Disaster Response Research in Aug. 2007


I think Ushahidi is awesome. Ushahidi happened because people believed in and cared about what is happening in the world. That, to me, is the power of the social web.

Short and Tweet: Experiments on Recommending Content from Information Streams (Specifically, Twitter)

2010-04-21T18:22:00.000-07:00

Information streams have recently emerged as a popular means for information awareness. By information streams we are referring to the general set of Web 2.0 feeds such as status updates on Twitter and Facebook, and news and entertainment in Google Reader or other RSS readers. More and more web users keep up with the newest information through information streams. At the CHI2010 conference, we presented a new system called Zerozero88.com that recommends content (particularly URLs that have been posted on Twitter) to users based on their Twitter profiles. Through recommender systems, we hope to better direct user attention to the most interesting URLs posted on Twitter.

As a domain for recommendation, information streams have three interesting properties that distinguish them from other well-studied domains:

Recency of content: Content in the stream is often considered interesting only within a short time of first being published. As a result, the recommender may always be in a “cold start” situation, i.e. there is not enough data to generate a good recommendation.
Explicit interaction among users: Unlike other domains where users interact with the system as isolated individuals, with information stream users explicitly interact by subscribing to others’ streams or by sharing items.
User-generated content: Users are not passive consumers of content in information streams. People are often content producers as well as consumers.

In a modular approach, we explored three separate dimensions in designing such a recommender: content sources, topic interest models for users, and social voting:

Content Sources: Given limited access to tweets and processing capabilities, our first design question is how to select the most promising candidate set of URLs to consider for recommendations. We chose two strategies: First, Sarwar et al. [1] have shown that by considering only a small neighborhood of people around the end user, we can reduce the set of items to consider, and at the same time expect recommendations of similar or higher quality.

Second, we also considered a popularity-based URL selection scheme. URLs that are posted all over Twitter are probably more interesting than those rarely mentioned by anyone.
Topic Modeling: Using topic relevance is an established approach to compute recommendations. The topic interest of a user is modeled from text content the user has interacted with before, and candidate items are ranked by how well they match the topic interest profile of the user. Another way to model the user's interest is by modeling the topics of the tweets made by the people she follows.
Social Voting: Assuming the user has a stable interest and follows people according to that interest, people in the neighborhood should be similar minded enough so that voting on the neighborhood can function effectively. However, the “one person, one vote” basis in the approach above may not be the best design choice in Twitter, because some people may be more trustworthy than others as information sources. Andersen et al. discussed several key insights in their theory of trust-based recommender systems [2], one of which is trust propagation. Intuitively, trust propagation means my trust in Alice will increase when the people whom I trust also show trust in Alice. Following this argument, a person who is followed by many of a user’s followees is more trustworthy as an information source, and thus should be granted more power in the voting process.
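To make the social voting idea concrete, here is a minimal sketch in Python. It is not the Zerozero88 implementation: the names (`follows`, `posts`, `vote_power`) and the exact weighting (one base vote plus one per followee of the user who also follows the poster) are assumptions chosen only to illustrate how a person followed by many of a user's followees can be given more power in the vote.

```python
from collections import defaultdict

def followees_of_followees(user, follows):
    """Return the user's neighborhood: followees plus their followees."""
    direct = follows.get(user, set())
    neighborhood = set(direct)
    for followee in direct:
        neighborhood |= follows.get(followee, set())
    neighborhood.discard(user)
    return neighborhood

def vote_power(user, person, follows):
    """Hypothetical weight: 1 plus the number of the user's followees who follow `person`."""
    direct = follows.get(user, set())
    return 1 + sum(1 for f in direct if person in follows.get(f, set()))

def rank_urls(user, follows, posts):
    """Score each URL posted in the neighborhood by the summed vote power of its posters."""
    scores = defaultdict(int)
    for person in followees_of_followees(user, follows):
        power = vote_power(user, person, follows)
        for url in posts.get(person, set()):
            scores[url] += power
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda u: (-scores[u], u))

# Toy data (made up): "me" follows alice and bob; both follow carol, only alice follows dave.
follows = {
    "me": {"alice", "bob"},
    "alice": {"carol", "dave"},
    "bob": {"carol"},
}
posts = {
    "carol": {"http://a.example/story"},
    "dave": {"http://b.example/story"},
}
ranked = rank_urls("me", follows, posts)
```

Under plain “one person, one vote” the two URLs would tie with one vote each. With the trust-style weighting, carol's URL ranks first because two of the user's followees follow her.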

The figure below describes the overall design of the system. The URL Source selectors from the lower left are content items that feed into the system to be ranked. The left side of the system does the topic modeling, which can come from either the user's own tweets, or the followee's tweets. The social voting model is implemented using modules on the right.

We implemented 12 recommendation engines in the design space we formulated above, and deployed them to a recommender service on the web to gather feedback from real Twitter users. The best performing algorithm improved the percentage of interesting content to 72% from a baseline of 33%.

Overall, we found that:

The social voting process seems to contribute the most to the recommender accuracy.
The topic models also contribute to the accuracy, but modeling using the user's self tweets is more accurate (with the caveat that the user actually tweets, rather than merely listening by following people).
Selecting URLs based on the neighborhood seems to work better than globally popular URLs, but the results are not yet statistically significant.
The best performing algorithm is FoF-Self-Vote (that is, using the neighborhood for URL content sources, self-tweets for topic modeling, and social voting.)
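As a rough illustration of what a self-tweet topic profile might look like, the sketch below builds a bag-of-words vector from a user's own tweets and ranks candidate URLs by cosine similarity with the text associated with each URL. The tokenizer, the raw term counts, and the function names are all assumptions made for this example. The paper [3] describes the actual models.

```python
import math
import re
from collections import Counter

def term_vector(texts):
    """Bag-of-words counts over a list of strings (lowercased word tokens)."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    return counts

def cosine(a, b):
    """Cosine similarity between two term-count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_by_topic(self_tweets, candidates):
    """Rank candidate URLs by similarity between the user's profile and each URL's text."""
    profile = term_vector(self_tweets)
    scored = {url: cosine(profile, term_vector([text])) for url, text in candidates.items()}
    return sorted(scored, key=lambda u: (-scored[u], u))

# Toy data (made up for illustration).
self_tweets = ["reading about faceted search", "faceted browsing for feeds"]
candidates = {
    "http://a.example/story": "new faceted search interface",
    "http://b.example/story": "celebrity gossip roundup",
}
ranked_by_topic = rank_by_topic(self_tweets, candidates)
```

A followee-based profile would use the same functions, with the tweets of the people the user follows passed in place of `self_tweets`.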

You can try out the beta system at http://zerozero88.com, but since it is still in beta, we can probably only enable the accounts of a limited number of people who sign up.

You can also read more about our results in the published paper [3].

Update 2010-08-23: Slides available here.

References

[1] Sarwar, B.M., Karypis, G., Konstan, J.A., Riedl, J. 2002. Recommender systems for large-scale ECommerce: Scalable neighborhood formation using clustering. In Proc of ICCIT 2002.

[2] Andersen, R., Borgs, C., Chayes, J., Feige, U., Flaxman, A., Kalai, A., Mirrokni, V., and Tennenholtz, M. 2008. Trust-based recommendation systems: an axiomatic approach. In Proc of WWW ‘08.

[3] Chen, J., Nairn, R., Nelson, L., Bernstein, M., and Chi, E. 2010. Short and tweet: experiments on recommending content from information streams. In Proceedings of the 28th international Conference on Human Factors in Computing Systems (Atlanta, Georgia, USA, April 10 - 15, 2010). CHI '10. ACM, New York, NY, 1185-1194. DOI= http://doi.acm.org/10.1145/1753326.1753503

Information Stream Overload

2010-04-12T22:33:00.000-07:00

Information overload is a growing threat to the productivity of today’s knowledge workers, who need to keep track of multiple streams of information from various sources. RSS feed readers are a popular choice for syndicating information streams, but current tools tend to contribute to the overload problem instead of solving it. Ironic, isn't it?

A significant portion of the ASC team is here in Atlanta to present work related to this information overload problem, and I will blog about it in the next week or so.

Tomorrow, we will be presenting a paper on FeedWinnower, an enhanced feed aggregator that helps readers to filter feed items by four facets (topic, people, source, and time), thus facilitating feed triage. The four facets correspond to the What, Who, Where, and When questions that govern much information architecture design. The combination of the four facets provides a powerful way for users to slice and dice their personal feeds.

First, a topic panel allows a user to drill down into the specific topics that she might be interested in:

Second, a people panel allows filtering by the person who created the information item in the stream:

Third, a source panel allows filtering of the type of information stream the item came from:

And finally, a time panel allows filtering for a particular time period that you might be interested in out of the information stream:

Usage Scenarios
By combining the four facets, users can examine and navigate their feeds, deciding what items to skip and what to read. Here we give two illustrative real-world scenarios.

Scenario 1: At the end of a workday, Mary opens FeedWinnower to get a sense of what has been happening around her. Using the time facet, she finds out that 507 items came into her account earlier in the day. Glancing at the topic facet, she sees “iphone” and a few other topics being talked about. As she clicks on “iphone”, the right screen shows only 7 items after filtering out other items. In the people facet, she identifies that these 7 items came from 4 of her friends and decides to read those items in detail.

Scenario 2: John wants to find out what his friends have been chatting about on Twitter lately. He selects “Twitter” in the source facet and chooses “yesterday” in the time facet. This yields 425 items. In the people facet, he then excludes those creators that he wants to ignore, filtering down to 324 items. Looking at the topic facet, he sees “betacup” and wonders what it is about. After clicking on “betacup” and reading the remaining 7 items, he now has a fair understanding about the term “betacup”.
In these two scenarios, we see how the four facets enable users to construct simple queries to accomplish their needs. We also see how the topic facet is essential in obtaining an overview of the topical trends in the feeds and helping users to decide what is worth reading in depth.
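The kind of query the scenarios describe can be sketched as a conjunction of facet filters. The snippet below is a toy illustration in Python, not FeedWinnower's code: `FeedItem`, `winnow`, and `facet_counts` are hypothetical names, and each item is assumed to carry exactly one topic for simplicity.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class FeedItem:
    topic: str    # What
    person: str   # Who
    source: str   # Where
    day: date     # When
    title: str

def winnow(items, topic=None, source=None, day=None, exclude_people=()):
    """Keep items matching every facet that is set; None means 'any value'."""
    return [
        item for item in items
        if (topic is None or item.topic == topic)
        and (source is None or item.source == source)
        and (day is None or item.day == day)
        and item.person not in exclude_people
    ]

def facet_counts(items, facet):
    """Count items per value of one facet, e.g. to show an overview of topics."""
    return Counter(getattr(item, facet) for item in items)

# Toy feed (made up), loosely following Scenario 2.
yesterday = date(2010, 4, 11)
items = [
    FeedItem("betacup", "ann", "Twitter", yesterday, "What is the betacup challenge?"),
    FeedItem("iphone", "ann", "Twitter", yesterday, "New iphone rumors"),
    FeedItem("betacup", "noisy", "Twitter", yesterday, "betacup betacup betacup"),
    FeedItem("betacup", "bob", "Google Reader", yesterday, "Betacup explained"),
    FeedItem("betacup", "bob", "Twitter", date(2010, 4, 10), "Old betacup post"),
]
on_twitter = winnow(items, source="Twitter", day=yesterday)
without_noise = winnow(items, source="Twitter", day=yesterday, exclude_people={"noisy"})
about_betacup = winnow(items, topic="betacup", source="Twitter", day=yesterday,
                       exclude_people={"noisy"})
topics_overview = facet_counts(without_noise, "topic")
```

Each step narrows the previous result, mirroring how John moves from the source and time facets to the people facet and finally the topic facet.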

The paper reference is:
Hong, L., Convertino, G., Suh, B., Chi, E. H., and Kairam, S. 2010. FeedWinnower: layering structures over collections of information streams. In Proceedings of the 28th international Conference on Human Factors in Computing Systems (Atlanta, Georgia, USA, April 10 - 15, 2010). CHI '10. ACM, New York, NY, 947-950. DOI= http://doi.acm.org/10.1145/1753326.1753466

Wikipedia's People-Ware Problem

2010-03-08T18:57:00.000-08:00

Last week, we hosted a visit from the Wikimedia Foundation on issues relating to our work on community analytics, and what it tells us about Wikipedia's problems and possible solutions. Naoko Komura (pictured at right) of the Wikimedia Usability Initiative, as well as Erik Zachte, the staff data analyst (also pictured at right), spoke very eloquently about how we can create social tools to direct the best social attention to the parts of Wikipedia that need it.

Fundamentally, Wikipedia has always had a "people-ware" problem: the distribution of the expertise that is freely donated to the right places. It has been and always will remain its greatest challenge. The amazing thing about Wikipedia is that it managed to do this for so long, such that a valuable knowledge repository can be built up as a result. At first, people simply came because it was the place to be. Now, we have to work a little harder.

We spent a lot of time talking about the best way to model this people-ware problem, either using biological metaphors (evolutionary systems with various forces), or economic models (see last post here). However, one thing to be aware of is the danger of "analysis paralysis", where you spend so much time analyzing the problem, and forget that there are already many ideas that have been generated for moving the great experiment forward.

For example, there are many places in Wikipedia that are not well populated. It's well known that many scientific and math concept articles could use an expert eye to catch the errors and explain the concepts better. How can we build an expertise finder that would actually invite people to fix problems that we know exist in Wikipedia?

Another idea might be to have the whole system be more social. Chris Grams blogs about a part of this idea here. We suggested some time ago to have a system like WikiDashboard, where you actually show the readers what the social dynamics have been for a particular article.

Wikipedia was created in 2001, when the social web was still in its infancy. During the ensuing 9 years, it has changed very little, and I would argue Wikipedia has not kept up with the times. Lots of "Social Web" systems and new cultural norms have been built up already. For example, I suspect that many of us would not mind at all revealing our identities on Wikipedia, and we might like to log in with our OpenIDs and even have verified email addresses so that the system can send us verification/clarification/notification messages. The system perhaps should connect with Facebook, so that my activities (such as editing an article on "Windburn") are automatically sent to my stream there. My friends, upon seeing that I have been editing that article, might even join in.

I think that Wikipedia is about to change, and it is going to become a much more socially-aware place. I certainly hope that they will tackle the People-Ware (instead of the Tool-Ware) problems, and we will see it become an exciting place again.

The problem of matching social attention and products...

2010-03-04T15:44:00.000-08:00

Many people have already stolen the attention-scarcity ideas from Herb Simon and said that the most important problem in our information overloaded society is the efficient distribution of attention. What some have called the "attention economy" is nothing more than a re-packaging of this idea.

In business, of course, getting the consumers' attention is quickly becoming an important aspect of being successful. The traditional way of getting people's attention is through advertising, and we have witnessed a dramatic transformation of how advertisements work in the online world in the last decade, from display advertising to search advertising and, more recently, further to action advertising. Increasingly, we can tie advertising dollars to direct consumer action.

For us, it was not a stretch, then, to start thinking about how consumer actions are quickly starting to feed back into product design. Thus, we now have people talking about crowdsourced product designs. The most agile companies now listen to consumers via channels such as Facebook, Twitter, and blog analytics. They do this via services such as brand management consultants and sentiment analysis tools, so much so that they are able to discern tiny changes in consumer awareness of product issues and in consumer desires.

We know also that traditional economic models serve to optimize the distribution of products to people who want them. But these models have also recently been used to optimize the distribution of people's attention to products that might serve their needs. The two usages obviously go hand-in-hand.

If we can help companies to serve people's attention spots just-in-time with the best products, we would have a highly optimized economy that wastes little energy in distributing worthless advertisements (or spam). In fact, the existence of spam points to the inefficiencies in the economic system.

It turns out that versions of this problem exist everywhere in the Web2.0 world:

The problem of efficiently distributing the best tweets to the people who want to view them is a version of this attention distribution problem. Every tweet you see that is worthless to you is an opportunity for optimization.
The problem of pointing experts to the most valuable articles that they can contribute to in Wikipedia is another version.

Solutions to these problems might take the form of recommendation systems or filtering systems, but might also be efficient interactive browsing systems (for products in an online store like Amazon, or articles in Wikipedia). Some thought experiments:

What if we can design an expertise finding system that recommends the best articles for you to contribute to in Wikipedia? Would it increase participation rates?
What if we analyze your social network everyday and tell you the best tweets that you should spend five minutes on? Would more people retweet more often?
What if product designers were better tuned to trending topics and needs? Would they enable companies to succeed more often? Are companies like Zazzle and Cafepress the prototypical examples of lubricating this path?

Your thoughts?

What are big research problems in Social Web technologies?

2010-01-20T17:09:00.000-08:00

Just finished reading Dion Hinchcliffe's piece over at ZDNet on emerging technologies for the Social Web in 2010. I have been reading all these different predictions to see how they relate to our research agenda. Dion's piece is long, but several points resonated with what we have been doing:

First, he said that one problem we have is

"Poor integration between social media and location services. Again, while there’s already some location awareness in social networking services today, there’s a long way to go before it’s integrated meaningfully into the social experience to provide real utility."

I agree wholeheartedly. Not too long ago, I participated in a research project here at PARC called Magitti, which was an activity recommender that modeled your content interests, your schedule, your location, as well as your personal history on the mobile device [1]. The integration of personalization and social features with location-aware services will be a significant trend in 2010, and there will be a lot of good research and products in this area.

Second, he said that people are having difficulties in

"coherently engaging in social activity across many channels. Tired of the day-long round-robin between your e-mail, SMS, Twitter, Facebook, and any other services you use to keep up with what’s going on? You’re not the only one. While aggregation services such as Friendfeed potentially cut down on the manual effort of using the social Web, it’s still not mainstream despite being a good example of what’s possible. Notably it’s often the big (and closed) social silos that are causing the problem."

Our group was an early adopter of FriendFeed, and realized that many of the issues relating to social annotation, commenting, and other interactions were due to the distributed nature of social media. It is hard to keep track of who said what, and the aggregate reactions to content. Our research group has some investments in this research problem, which relates to aggregation and the ability to browse and filter the feeds. We are about to publish a paper in CHI2010 about how to use faceted browsing techniques to partially solve this problem [2].

Finally, the most important point he made was our need in

"Coping with and getting value from the expanding information volume of social media. We’re all learning how to deal with the firehose of information that flows out of social media on a minute-by-minute basis. Sometimes it’s hard to remember that this flow of transparent and open information is actually good and often useful and creates important conversations. But the simple fact is that much of it isn’t meant for non-stop, instantaneous consumption [emphasis added]; it simply isn’t practical. Rather, social media leaves behind artifacts and information that we can find and use later when we need them. But at the moment the process of sorting through, aggregating, and filtering the vast volume of information cascading through social media today remains a real and growing challenge. I also began to get the first real reports that this is happening in the enterprise last year as social media begins to grow there as well."

Here the ASC group's investments in summarization, recommendation, personalization, and the like will hopefully pay off. Our investments have been particularly in understanding how to apply these techniques in social media, with the added social context and new data mining techniques around social streams. Research-wise, we will be pushing on this last point the most, and I believe it is also the area where we can most likely extract user value. We are about to publish a paper at CHI2010 on how to do recommendations on the Twitter network [3].

I will blog about these research efforts soon.

----
[1] Victoria Bellotti, James Bo Begole, Ed H. Chi, Nicolas Ducheneaut, Ji Fang, Ellen Isaacs, Tracy King, Mark Newman, Kurt Partridge, Bob Price, Paul Rasmussen, Michael Roberts, Diane J. Schiano, Alan Walendowski. Activity-Based Serendipitous Recommendations with the Magitti Mobile Leisure Guide. In Proceedings of the ACM Conference on Human-factors in Computing Systems (CHI2008), pp. 1157-1166. ACM Press, 2008. Florence, Italy.

[2] Hong, L.; Convertino, G.; Suh, B.; Chi, E. H.; Kairam, S. FeedWinnower: layering structures over collections of information streams. Submitted and accepted to ACM CHI2010.

[3] Chen, J., Nairn, R., Nelson, L., Chi, E. H. Short and Tweet: Experiments on Recommending Content from Information Streams. Submitted and Accepted to ACM CHI2010.

A Study on Efficient Diffusion of News in an Organization

2009-12-14T18:23:00.000-08:00

[joint work between Les Nelson, Rowan Nairn, Ed H. Chi]

In our knowledge economy, an enterprise’s competitiveness often depends on the efficiency with which important news travels to the right people at the right times. Knowledge workers now depend heavily on communication channels both inside and outside the enterprise to be kept up to date on the most important information, such as the latest news on competitors, memos on human resources, status of business proposals, and the progress of workflows. The efficiency of news spread in an organization determines not just how the organization might absorb and make sense of the information, but also how it might decide to respond and react.

For example, one study of how email impacts an organization showed that one piece of email may create an organizational footprint that is 30 times larger [1]. A large body of literature surrounds the issue of news flow in organizations, including information seeking, organizational memory, and expertise location. For example, more specific to organizational information flow, sociological research shows that there is greater homogeneity of information within groups of people than between groups of people [2].

News in general is about the communication of current events, where the timeliness of the information is key. ‘Timeliness’ might not necessarily be limited to just up-to-the-minute, ‘breaking’ news. For example, one interviewee in one of our studies recently said: “It's about the leading edge of something. Staying current in a professional sense, I go through bouts of finding information. And I share it”. In the organization, this may constitute keeping up with information for ‘knowing what’ is happening and ‘knowing how’ to do things.

How can organizations better respond to the complex social and technical situation involved in staying current in their areas of business? With respect to news at work, what roles, tools, and practices might we expect in the brokering of news?

INTERVIEW STUDY

We recently conducted an interview study within our research organization. The company is an established research organization, having approximately 200 staff members in one location. Most employees belong to an approximately 5- to 10-person group (we will call this a ‘team’) organized into 4 larger multi-team groups. Each employee has an office, generally located near the rest of his or her group.

The company uses wikis for project and group knowledge repositories. The project wikis typically receive brief but intense activity (e.g., collecting web links on a topic), and then lapse into occasional use. Group wikis are updated infrequently, usually when there is organizational change (e.g., new projects and people). External blogs on topic areas promoted by the company are encouraged. Internal blogs receive infrequent use for general information sharing on topics of wide interest. Microblogging (e.g., Yammer.com) was tried early, but did not persist.

Participants were chosen from a range of positions and tenure with the company, including staff members involved in the primary business production, service people in support of the staff members (e.g., marketing, administrators, staff services), managers, and executive level managers.

We conducted 16 interviews in people’s offices, starting with a critical-incident-style interview on the most recent news events received, followed by explicit probing to elicit the different ways in which news arrives, the frequency of such news, and who was involved.

STUDY FINDINGS

We found a relatively mature practice of relying on the communication channels most commonly used at work, such as email and face-to-face conversation. Not only does news travel along social networks in the organization, but there is also a strong effort to pass along news that is known to be relevant. People are conservative in their choices. Moreover, people tune their social network to ensure they receive the appropriate news.

We find three major ways in which people at the company go about receiving and transmitting news:

(1) Email is indeed the channel and medium of choice for news [3];

The figure below shows the frequency in which various ways of passing news back and forth are mentioned in the interview study. Although we find that news arrives and is diffused by many channels, with different levels of timeliness and audience, the primary means of communication is email (either directly or via company mailing lists) and face-to-face conversations in offices, hallways, and at lunch.

(2) News follows people’s social/work networks, and there is a strong effort to pass along only news seen as relevant to others;

People filter news streams for their peers as a part of their ongoing conversations at work. The filtering includes assessing the quality of the news, its uniqueness, and whether it merits the time needed to relay it:

One subject said to us,

"I have to read it [news related email] to find out if it is unique enough. I do try to filter if it is worth forwarding. There is a huge quality assessment thing, because I would hate clogging peoples’ streams. I would probably send it to people who are actually engaged in a conversation of this type."

(3) People structure their news networks to get news conveyed in short paths of only the ‘necessary, but sufficient’ recipients. They do this by structuring the channel so that it produces quality news, finding ways to avoid unnecessary communication, or setting up shortest paths.

For example, one subject said on who to follow in Twitter:

"I went through [lots of] phases. Imagine a spiral. I could overhear conversations and pick up derivative connections. Then it got to be a little overwhelming so I went and winnowed those down... and again. The people you follow dictate the information you get. And there were three factors. One is how informative or interesting they were to my interests. The second one was how frequently they updated. If they updated 50 times a day I couldn’t keep up with that. And the third reason is strategically, who I want to build a relationship with".

DESIGN IMPLICATIONS

We take from our findings above the following requirements for systems aimed at work news propagation:

1. Integrate into the email habitat to maximize chances of adoption;

2. Facilitate also putting news receivers in control. While email has its advantages, it is in some sense a sender-controlled system;

3. Allow targeting to continue but increase the chance of serendipitous but relevant connections in a way that keeps the social paths for news short and efficient;

4. Enhance the ability to target news to others without overloading email further;

5. Allow the emergence of shared interest spaces.

References

[1] IDC white paper "The Diverse and Exploding Digital Universe: An Updated Forecast of Worldwide Information Growth Through 2011", 2008.

[2] Burt, R. S. 2004. Structural holes and good ideas. American Journal of Sociology 110, 2 (September), 349–99.

[3] Ducheneaut, N. ; Bellotti, V. Email as habitat: An exploration of embedded personal information management. ACM Interactions. 2001 September-October; 8 (5): 30-38.

Technology Mediated Social Participation Workshop at PARC next week

2009-12-03T09:41:00.000-08:00

Earlier this year PARC and Univ. of Maryland approached NSF with the idea of funding two workshops on Technology-Mediated Social Participation. NSF eagerly provided funding and simultaneously started a new program on Social-Computational Systems (SoCS).

Technology Mediated Social Participation
WEST COAST WORKSHOP & PANEL - Dec 10-11 @ PARC
With the goal of drawing up a strong scientific research agenda and educational recommendations necessary for a new era of social participation technologies, PARC is hosting the first of two workshops designed to bring together a diverse set of researchers from a variety of disciplines.

The West Coast Workshop will focus on three major themes:

Integration of theory: from individual behavior to collective action
Social intelligence and capital: understanding connections
Research challenges: shareable infrastructure, ethics, and protection

In addition, Peter Pirolli and Jenny Preece will be hosting a special PARC Forum on Technology Mediated Social Participation on Thursday December 10, featuring panelists Ben Shneiderman, Amy Bruckman, Bernardo Huberman, and Cameron Marlow. The Forum will be streamed and recorded as well. We'll be live-testing/ soft launching the PARC Forum livestream at: http://www.justin.tv/parcinc

Check out extensive blog post by Peter here.

Twitter hashtag is #TMSP.

Workshop event page: Event - Technology Mediated Social Participation Workshop - PARC (Palo Alto Research Center)

Part 4 on WikiSym paper: A proposed modified model of Wikipedia Growth

2009-10-20T19:37:00.000-07:00

As mentioned in the first post on the slowing growth rate of Wikipedia, it appears that article growth reached a peak around 2007. Rather than exponential growth, it appears that Wikipedia displays logistic growth. A hypothetical logistic Lotka-Volterra population growth model bounded by a limit K is shown in the following Figure:

A hypothetical logistic Lotka-Volterra population growth model bounded by a limit K.

The above figure was generated by a Lotka-Volterra population model that assumes a resource limitation K. This K variable is known as the carrying capacity, which is the limit of the population growth. Translated into our case and using the articles as the stand-in for a population, this is the maximum number of articles that Wikipedia might reach eventually. This limit might be reached because knowledge below a threshold of notability is not eligible to become an encyclopedia entry, or because there is no one around in the community who knows enough about the subject to write it up.

In either case, according to this model, at the early stages of population growth the growth rate appears exponential, but the rate decelerates as it approaches the limit K. If the total amount of encyclopedic knowledge were some constant K, then the write-up of that knowledge into Wikipedia might be expected to follow a logistic curve such as the one in the Figure above.

But there is a general sense that the stock of knowledge in the world is also growing. For instance, studies of scientific knowledge (e.g., [13][23]) suggest that it exhibits exponential growth. Also, events in the world (e.g., the election of Barack Obama or Lindsay Lohan’s rehabilitations) create new possibilities for write-up.
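For readers who want to play with this curve, here is a minimal Python sketch of the standard logistic model dN/dt = r * N * (1 - N / K). The growth rate r, the starting size n0, and any numbers you plug in are illustrative assumptions; nothing here is fitted to Wikipedia data.

```python
import math


def logistic_population(t, n0, r, k):
    """Closed-form solution of dN/dt = r * N * (1 - N / K) with N(0) = n0."""
    return k / (1.0 + ((k - n0) / n0) * math.exp(-r * t))


def growth_rate(n, r, k):
    """Instantaneous growth dN/dt at population size n."""
    return r * n * (1.0 - n / k)
```

Far below K the factor (1 - N / K) is close to 1, so growth looks exponential. The growth rate is largest at N = K / 2 and falls toward zero as N approaches K, which is the deceleration described above.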

A possible modification to the logistic growth model is as follows: We suggest that if the total amount of knowledge exhibited some monotonic growth as a function of time, K(t), one might expect a variant of logistic growth as depicted in the Figure below:

A hypothetical Lotka-Volterra population growth model bound by a limit K(t) that itself grows as a function of time.
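A growing limit can be sketched numerically as well. The snippet below integrates dN/dt = r * N * (1 - N / K(t)) with simple Euler steps. The linear form K(t) = k0 + g * t and every number in it are assumptions chosen for illustration; the argument here only requires that K(t) grows monotonically.

```python
def simulate_growth(n0, r, capacity, t_end, dt=0.01):
    """Euler integration of dN/dt = r * N * (1 - N / K(t)).

    `capacity` is any function t -> K(t); returns the list of N values.
    """
    steps = int(round(t_end / dt))
    n = n0
    trajectory = [n]
    for i in range(steps):
        k = capacity(i * dt)
        n += r * n * (1.0 - n / k) * dt
        trajectory.append(n)
    return trajectory


def linear_capacity(t, k0=100.0, g=2.0):
    """Hypothetical carrying capacity that grows linearly with time."""
    return k0 + g * t


# Same start and rate, once with a fixed limit and once with a growing one.
fixed = simulate_growth(1.0, 1.0, lambda t: 100.0, t_end=50.0)
growing = simulate_growth(1.0, 1.0, linear_capacity, t_end=50.0)
```

With the fixed limit the population flattens out at K. With the growing limit it first rises like a logistic curve and then keeps climbing slowly, trailing just below K(t).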

As originally recognized by Darwin in relation to the growth of biological systems [7], competition (the “struggle for existence”) increases as populations hit the limits of the ecology, and advantages go to members of the population that have competitive dominance over others. By analogy, we suggest that:

(a) the population of Wikipedia editors is exhibiting a slowdown in its growth due to limited opportunities to make novel contributions, and

(b) the consequences of these (increasing) limitations in opportunities will manifest themselves in increased patterns of conflict and dominance.

The limitations in opportunities might be the result of multiple and diverse constraints. For example, on one hand, we expect that the capacity parameter K is determined by limits that are internal to the Wikipedia community such as the number of available volunteers that can be coordinated together, physical hours that the editors can spend, and the level of their motivation for contributing and/or coordinating.

On the other hand, we expect that the capacity depends also on external factors such as the amount of public knowledge (available and relevant) that editors can easily forage and report on (e.g., content that is searchable on the web) and the properties of the tools that the editors and administrators are using (e.g., usability and functionalities).

In summary, globally, the number of active editors and the number of edits, both measured monthly, have stopped growing since the beginning of 2007. Moreover, the evidence suggests they follow a logistic growth function.

Our paper will finally be presented by Bongwon Suh at the WikiSym 2009 conference. The citation and link to the full paper are:
Bongwon Suh, Gregorio Convertino, Ed H. Chi, Peter Pirolli. The Singularity is Not Near: Slowing Growth of Wikipedia. In Proc. of WikiSym 2009, (to appear). Oct, 2009.

Thanks go to my co-authors, who should receive equal credit for this research!

PART 3: Population Shifts in Wikipedia

2009-09-22T23:55:00.000-07:00

The research done at ASC continues to get more press, including Time magazine, NYTimes, and Repubblica [Italian newspaper]. We have been busy trying to put together a bunch more academic papers on Web2.0 (particularly some Twitter research we have been doing), so we haven't updated this blog in a while. I figured today I'd take some time and blog a bit more about our results.

To investigate which factors affected the slowdown in edit growth, we examine the evolution of the population of active editors. The stalled growth of edit activities that we have described might be partially explained by changes in the editor population. We use the same editor classification as previous posts to count the number of active editors in each month. The figures below show three views of the evolution of the population of the five editor classes.

Monthly active editors by editor class. (This is a breakdown of the total editor population depicted earlier)

The Figure above shows the monthly frequencies of active editors by class. As expected from the power law distribution, the distribution of editors is very skewed: most of the editors contribute very few edits and very few editors contribute most of the edits. In fact, the two most prolific classes of editors (100-999 and 1000+) account for only about 1% of the population, but they contribute about 55% of edits (33% and 23% respectively).

Monthly active editors by user class. The vertical axis uses a logarithmic scale.

The Figure above uses a logarithmic scale to show the consistent slowdown of the growth among all editor classes over time, which is not clear in the first figure for editors in 100-999 and 1000+ classes. The monthly population of active editors stops growing after March 2007: a surprisingly abrupt change in the evolution of the Wikipedia population for all the editor classes. This change is consistent with the slowdown of the editing activity shown in Part 1.

[Interestingly enough, even though we see that the number of editors in the 1000+ class plateaued, we know from Part 2 that this class of users has been increasing its contribution rate. Their average monthly edits per editor for the years 2005 to 2008 were 1740, 1859, 1869, and 2095, respectively.]

Percentages of monthly active editors by their class. Note that the graph is truncated to highlight the declining population of 10-99 editor classes [shown in purple]. (Sorry that the coloring of the editor classes is not consistent from the earlier plots.)

The last Figure shows the percentage of monthly active editors among the five classes. Note that the Y-axis is truncated: it omits the bottom 50% which represents the very long tail of once-monthly-editors. Notice how the 10-99 editor class [shown in purple] is being squeezed and becoming a small portion of the overall population. The 10-99 editor class went from 9% in 2005 to 6% in 2008.

A healthy community requires that people can move from novice contributors to occasional contributors to elite contributors. In other words, the upward mobility of the contributors is important for a healthy community. The trend here suggests that there is some resistance to moving beyond the 10-99 edits/month barrier. Could this be evidence of the Wiki-lawyering barriers?

One theory that I might suggest is that we want a well-balanced pyramid structure in the community population. Not too top heavy, and not too bottom heavy, and with a healthy middle class. How can we design the mechanisms [incentives and appropriate barriers] on the site so that we have this structure?

PART 2: More details of changing editor resistance in Wikipedia

2009-08-07T19:09:00.000-07:00

In the last week, we have received interesting press coverage in New Scientist (as well as Fast Company, Business Insider, and syndicated elsewhere) on the work done in our team on the Wikipedia growth rate, and how it has plateaued, changing from an exponential growth model to one that looks more linear. This wasn't necessarily a new finding; it was really a teaser for some other observations we have made in the Wikipedia data that are about to be published at the WikiSym 2009 conference in October.

In the figure below, we see how the slowdown in the growth of Wikipedia activity differs across editor classes. For each month, we first partition the editors into different classes based on their monthly editing frequency. We then compare the total edit activities among the different editor classes over time.

Monthly edits by user class (in thousands).

[Consistent with the power law, we classified users using an exponential scale: we defined the classes of editors using powers of 10, e.g. 10^0, 10^1, 10^2. This resulted in five classes of users for each month: editors contributing 1 edit (i.e., 10^0), 2 to 9 edits (2-9 class), 10 to 99 (10-99 class), 100 to 999 (100-999 class), and 1000 or more edits (1000+ class).] Note that the classification of the editors was recalculated for each month.

Since the beginning of 2007, the monthly edits of four of the classes have decreased slightly. In contrast, only the highest-frequency class of editors (1000+ edits, dark blue line) shows an increase in its monthly edits.
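The per-month classification can be sketched in a few lines of Python. This is an illustration rather than the analysis code behind the figures: the function names and the sample counts are made up, and the sketch assumes the 1000+ class starts at exactly 1000 edits.

```python
from collections import defaultdict


def editor_class(monthly_edits):
    """Return the class label for one editor's edit count in one month."""
    if monthly_edits < 1:
        raise ValueError("an active editor has at least one edit in the month")
    if monthly_edits == 1:
        return "1"
    if monthly_edits < 10:
        return "2-9"
    if monthly_edits < 100:
        return "10-99"
    if monthly_edits < 1000:
        return "100-999"
    return "1000+"  # assumption: this class includes exactly 1000 edits


def edits_by_class(edits_per_editor):
    """Sum one month's edits per class; input maps editor name -> edit count."""
    totals = defaultdict(int)
    for count in edits_per_editor.values():
        totals[editor_class(count)] += count
    return dict(totals)


# Hypothetical counts for a single month (not real Wikipedia data).
sample_month = {"alice": 1, "bob": 7, "carol": 42, "dave": 350, "erin": 1200, "frank": 3}
month_totals = edits_by_class(sample_month)
```

Because the classes are recalculated for each month, the same editor can land in different classes in different months; running `edits_by_class` once per month reproduces that behavior.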

Another way to look at this data is to analyze the relative amount of activities for each editor class by transforming the data into percentages of the total edits. The figure below complements the information in the figure above by showing the percentage of the volume of edits that each class contributes in relation to the total.

Monthly percentage of edits by each user class.

The two highest frequency classes of editors account for more than half of the total monthly edits (56% from 01/2005 to 08/2008). Furthermore, since 2005 the proportion of contributions by the highest-frequency editor class has increased slightly. In fact, the editors in 1000+ class have kept producing at an increasing rate over the past four years (their average monthly edits per editor for the years 2005 to 2008 were 1740, 1859, 1869, and 2095, respectively).

We now focus on specific evidence about what might have contributed to such slowdown. Revert is the action of deleting a prior edit. The following figure shows the percentage of edits that were reverted (reverted edits) monthly for each editor class. Note that edits related to vandalism and edits performed by robots are excluded.

Monthly ratio of reverted edits by editor class

This illustrates two indicators of a growing resistance from the Wikipedia community to new content.

First, the total percentage of monthly reverted edits (see dashed black line) has increased steadily over the years across all classes of editors: 2.9, 4.2, 4.9, and 5.8 percent of all edits for 2005 through 2008, respectively.

Second, more interestingly, low-frequency or occasional editors experience a visibly greater resistance compared to high-frequency editors [see the top two reddish lines, as compared to other lines]. The disparity of treatment of new edits from editors of different classes has been widening steadily over the years at the expense of low-frequency editors.

We consider this as evidence of growing resistance from the Wikipedia community to new content, especially when the edits come from occasional editors.
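For concreteness, the reverted-edit ratio discussed above can be computed as in the sketch below. The counts are invented for illustration, and the sketch assumes that vandalism-related and robot edits were already filtered out before counting.

```python
def reverted_percentage(edits, reverted):
    """Percentage of edits that were later reverted; 0.0 when there are no edits."""
    if edits == 0:
        return 0.0
    return 100.0 * reverted / edits


def monthly_revert_report(counts_by_class):
    """counts_by_class maps class label -> (edits, reverted edits) for one month.

    Returns the percentage per class plus an "all" entry for the overall ratio.
    """
    report = {
        label: reverted_percentage(edits, reverted)
        for label, (edits, reverted) in counts_by_class.items()
    }
    total_edits = sum(edits for edits, _ in counts_by_class.values())
    total_reverted = sum(reverted for _, reverted in counts_by_class.values())
    report["all"] = reverted_percentage(total_edits, total_reverted)
    return report


# Made-up counts for one month, not taken from the study.
sample_counts = {"1": (1000, 150), "1000+": (4000, 50)}
revert_report = monthly_revert_report(sample_counts)
```

Note that the overall ratio is weighted by edit volume, so a high revert rate among occasional editors can coexist with a much lower overall figure.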

PART 1: The slowing growth of Wikipedia: some data, models, and explanations

2009-07-22T19:10:00.000-07:00

In September of 2008, we blogged about a curious change in Wikipedia that we had known about for a while but didn't know how to explain, and the ASC group has been looking into this change for the last 6-9 months or so. The change that we were curious about was that the growth rates of Wikipedia have slowed. We were not the only ones wondering about this change. The Economist (archived here), for example, wrote about it.

We are about to publish a paper in WikiSym 2009 on this topic, and I thought we should start to blog about what we found.

Monthly edits and identified revert activity

The conventional wisdom about many Web-related growth processes is that they're fundamentally exponential in nature. That is, if you wait some fixed amount of time, the content size and number of participants will double. Indeed, prior research on Wikipedia has characterized the growth in content and editors as being fundamentally exponential in nature. Some have claimed that Wikipedia article growth is exponential because there is an exponential growth in the number of editors contributing to Wikipedia [1]. Current research shows that Wikipedia's growth rate has slowed, and has in fact plateaued (see figure at right). Since about March of 2007, the growth pattern is clearly not exponential. What has changed, and how should we modify our thinking about how Wikipedia works? Prior research had assumed Wikipedia works on an "edit begets edit" model (that is, a preferential attachment model where the more edits an article gets, the more likely it is to receive more edits, thus resulting in exponential growth [2]). Such a model does not preclude some ultimate limitation to growth, although at the time it was presented [2] there was an apparent trend of unconstrained article growth.
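The "fixed amount of time to double" property is easy to check for a pure exponential N(t) = N0 * exp(r * t), where the doubling time is ln(2) / r regardless of the current size. The small sketch below uses an arbitrary rate; it is not an estimate of Wikipedia's growth rate.

```python
import math


def doubling_time(rate):
    """Time for N(t) = N0 * exp(rate * t) to double; rate must be positive."""
    if rate <= 0:
        raise ValueError("exponential growth needs a positive rate")
    return math.log(2.0) / rate


def exponential_size(n0, rate, t):
    """Size of an exponentially growing quantity after time t."""
    return n0 * math.exp(rate * t)
```

A process that stops doubling on schedule, as the monthly edit counts did after early 2007, is therefore no longer following this model.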

Monthly active editor - number of users who have edited at least once in that month

The number of active editors shows exactly the same pattern. The 2nd figure on the right shows how since its peak in March 2007 (820,532), the number of monthly active editors in Wikipedia has been fluctuating between 650,000 and 810,000. This finding suggests that the conclusion in [1][2] may not be valid anymore. We have a different process going on in Wikipedia now.

Article growth per month in Wikipedia. Smoothed curves are growth rate predicted by logistic growth bounded at a maximum of 3, 3.5, and 4 million articles.

Some Wikipedians have modeled the recent data, and believe that a logistic model is a much better way to think about content growth. The figure here shows that article growth reached a peak in 2007-2008 and has been on the decline since then. This result is consistent with a growth process that hits a constraint, for instance due to resource limitations in systems. For example, microbes grown in culture will eventually stop duplicating when nutrients run out. Rather than exponential growth, such systems display logistic growth.

We will continue to blog about what we believe might be happening in the next few weeks, as we find time to summarize the results.

[1] Almeida, R.B., Mozafari, B., and Cho, J. On the evolution of Wikipedia. ICWSM 2007, Boulder, CO, 2007.
[2] Spinellis, D., and Louridas, P. The collaborative organization of knowledge. Communications of the ACM, 51(8), 68-73, 2008.

Social attention and interactions are key to learning processes

2009-07-20T16:02:00.000-07:00

I just finished reading a long article in the journal Science on how social factors are increasingly recognized as extremely important in a new science of learning [1].

Learning is fundamentally a social activity, the article partially argued. "Social cues highlight what and when to learn." Meltzoff et al. summarize a whole slew of recent research showing how young infants learn by imitating and copying others' actions, and how they build abstractions and models of others' behaviors. In fact,

"Children do not slavishly duplicate what they see but reenact a person’s goals and intentions. For example, suppose an adult tries to pull apart an object but his hand slips off the ends. Even at 18 months of age, infants can use the pattern of unsuccessful attempts to infer the unseen goal of another. They produce the goal that the adult was striving to achieve, not the unsuccessful attempts."

One point made in the article is how much the greater environment outside of school is becoming an important part of the ecology of learning.

"Elementary and secondary school educators are attempting to harness the intellectual curiosity and avid learning that occurs during natural social interaction. The emerging field of informal learning is based on the idea that informal settings are venues for a significant amount of childhood learning. Children spend nearly 80% of their waking hours outside of school. They learn at home; in community centers; in clubs; through the Internet; at museums, zoos, and aquariums; and through digital media and gaming."

Social learning, of course, is a major part of the social web. Wikipedia was designed to be an easy-to-use and freely available reference, and all of the social interactions offered by various online forums are rapidly becoming a part of the educational experience for secondary school pupils. I would argue, for example, that Wikipedia has done more for continuing education for all adult learners than any educational institution could have done by itself. ASC's research has purposefully been focused on learning and information access, instead of entertainment, because of our recognition of the importance of social factors in various kinds of learning.

As an example, social learning was explicitly part of the design of our SparTag.us prototype, which was announced at the recent CHI2009 conference and is now being offered as limited beta software to Firefox users. It streams the annotations you make as you browse the web. The stream is collected into your notebook, and by default this stream of annotations is made available to anyone interested in it. This makes it possible to aggregate social attention later.

[1] Foundations for a New Science of Learning. A. N. Meltzoff, P. K. Kuhl, J. Movellan and T. J. Sejnowski. Science, 325 (5938), 284-288. [DOI: 10.1126/science.1175626].

[2] Photo: Alan Decker and the Machine Perception Lab, UC San Diego.

Historical Roots behind TagSearch and MrTaggy

2009-07-17T11:25:00.000-07:00

Boing Boing recently covered our work on the TagSearch algorithm and the MrTaggy prototype. We built the prototype to show how ideas from search in the past are relevant in the new world of "social search".

Boing Boing published the following response after asking us about the historical roots of TagSearch algorithm and the MrTaggy UI:

Several pieces of earlier important PARC work inspired the TagSearch algorithm and MrTaggy's user interface and experience.

First, one of the most efficient ways of browsing and navigating toward a desired information space was illustrated by the pioneering research on Scatter/Gather, a collaborative project on large-scale document space navigation between amazing researchers such as Doug Cutting (of Lucene, Hadoop fame) and Jan Pedersen (chief scientist at AltaVista, Yahoo, Microsoft for search). The research, done in the early to mid 90s, showed how a textual clustering algorithm can be used to quickly divide up an information space (scatter step) and then ask the user to specify which subspaces they're interested in (gather step). By iterating over this process, one can very quickly narrow down to just the subset of information items they're interested in. Think of it as playing 20 questions with the computer.

Second, also around the mid-90s, an important information access theory was being developed at PARC in our research group called Information Foraging, which showed that you can mathematically model the way people seek information using the same ecological equations used to model how animals forage for food. We noticed that we can use information foraging ideas to model how people used Scatter/Gather to browse for information. It turns out that it was possible to predict how people use the information cues (which we called 'information scent') in each cluster to determine whether they were interested in the contents inside the cluster. It turns out that Scatter/Gather can be shown to be a very efficient way to communicate to the user the topic structure of a very large document collection. In other words, people learned the structure of the information space much more efficiently using Scatter/Gather interfaces.

I hope it is quite clear that the relevance feedback mechanisms are very much inspired by Scatter/Gather. The related tags communicate the topic structure of what's available in the collection. Through this process, we designed MrTaggy, hoping that it would be just as efficient as Scatter/Gather in communicating the topic structure of the space.

Third, our group had developed Information Scent algorithms and concepts to build real search and recommendation systems. These algorithms build upon earlier work on a human memory model called Spreading Activation. The TagSearch algorithm uses similar concepts here. It constructs a kind of Bayesian model of the topic space using the tag co-occurrence patterns. TagSearch's algorithm owes its heart and soul to concepts in Spreading Activation, which helps us find documents that are related to certain tags, and vice versa.
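To give a flavor of the idea, here is a toy Python sketch of one round of spreading activation over tag co-occurrence data. It is not the TagSearch implementation: the data structures, the equal-split weighting, and the sample bookmarks are all assumptions made up for this illustration.

```python
from collections import defaultdict


def build_graph(bookmarks):
    """Index (document, tag) pairs in both directions."""
    tag_to_docs = defaultdict(set)
    doc_to_tags = defaultdict(set)
    for doc, tag in bookmarks:
        tag_to_docs[tag].add(doc)
        doc_to_tags[doc].add(tag)
    return tag_to_docs, doc_to_tags


def spread(query_tag, tag_to_docs, doc_to_tags):
    """One round of activation: query tag -> documents -> related tags.

    Each node splits its activation equally among its neighbors
    (a simplifying assumption, not a claim about TagSearch).
    """
    docs = tag_to_docs.get(query_tag, set())
    doc_scores = {doc: 1.0 / len(docs) for doc in docs}
    tag_scores = defaultdict(float)
    for doc, score in doc_scores.items():
        tags = doc_to_tags[doc]
        for tag in tags:
            if tag != query_tag:
                tag_scores[tag] += score / len(tags)
    return doc_scores, dict(tag_scores)


# Hypothetical bookmarks: (document, tag) pairs.
bookmarks = [
    ("d1", "python"), ("d1", "tutorial"),
    ("d2", "python"), ("d2", "tutorial"), ("d2", "web"),
    ("d3", "cooking"),
]
tag_to_docs, doc_to_tags = build_graph(bookmarks)
doc_scores, related_tags = spread("python", tag_to_docs, doc_to_tags)
```

In this toy data, "tutorial" ends up more strongly activated than "web" because it co-occurs with "python" on more documents, while "cooking" receives no activation at all. Ranking related tags and documents by such scores works in both directions, from tags to documents and back.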

Visualization used to improve team coordination

2009-07-13T19:31:00.000-07:00

This past Thursday I spent some time at the IBM Almaden research center to attend the NPUC conference, which focused on the future of software development. In computing, software development is one of the most energy-intensive collaborations, often requiring significant coordination. There are elements of competition thrown in for good measure, and of course, everyone is working in the same workspace, which is often coordinated by version control software. Sounds quite like Wikipedia, doesn't it?

One of the interesting talks given at NPUC was Gina Venolia's talk on using visualizations to represent the structure of the code. This representation can be used individually to make sense of the system, as well as by a team to explain structure to others. As a map of the system, the visualization helps anchor conversations between developers by providing an intermediate representation of the knowledge structure that they must share for effective coordination.

This is a fascinating area in which to think about how augmented social cognition ideas could provide better tools for collaborative software development. For example:

* Each developer could get a color on the map. Overlaps between two developers can then be easily visualized to see areas where they needed to coordinate in the past and will need to in the future.

* By building up an understanding from the code of who is working with whom, annotations and comments made by one developer could be sent over to another developer's map when code is checked into the system.

* Social analytics can be used to discover where developers are clashing with each other (like how we have discovered conflicts in Wikipedia).
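As a rough sketch of the overlap idea in the list above, the Python snippet below finds pairs of developers who touched the same files. The input format, names, and file list are hypothetical; a real tool would pull this from version control history.

```python
from itertools import combinations


def overlapping_files(files_by_developer):
    """Map each pair of developers to the set of files both have touched.

    Pairs with no shared files are left out of the result.
    """
    overlaps = {}
    for dev_a, dev_b in combinations(sorted(files_by_developer), 2):
        shared = files_by_developer[dev_a] & files_by_developer[dev_b]
        if shared:
            overlaps[(dev_a, dev_b)] = shared
    return overlaps


# Hypothetical record of who recently changed which files.
recent_changes = {
    "ana": {"parser.py", "lexer.py"},
    "ben": {"parser.py", "ui.py"},
    "cho": {"docs.md"},
}
shared_work = overlapping_files(recent_changes)
```

Each pair in the result marks a region of the map where two developers' colors would overlap, and so a place where they may need to coordinate.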

Microsoft Research has indeed been thinking about awareness tools in this direction. A project named FASTDash works to increase awareness between developers in software teams. Lots of exciting possibilities!

Live data again: WikiDashboard visualizes the editing patterns of 'David Rohde' case...

2009-06-29T20:26:00.000-07:00

Yesterday, NYTimes finally broke the silence on the kidnapping of David S. Rohde by the Taliban. It turns out that Rohde had escaped, and the news media finally reported the kidnapping since publicity on the case would no longer be a bargaining chip for his captors. The NYTimes article showed how keeping this news off of Wikipedia would have been nearly impossible if it weren't for the coordinated effort of several administrators and Jimbo Wales himself.

WikiDashboard visualized this editing pattern directly. In the figure below, I've highlighted the various edit wars between the anonymous editors (97.106.51.95; 97.106.45.230; and 97.106.52.36, which are believed to be the same person) and some of the administrators such as Rjd0060 and MBisanz and the involvement of a robot XLinkBot. You can also see the huge attention on this article in the last week or so in the visualization.

Check out the editing history and the edit war in detail by reading the edit history.

All of this makes for a great way for us to announce that WikiDashboard now works on the live Wikipedia data again, thanks to the heroic efforts of Bongwon Suh in my group. He figured out how to execute his SQL query quickly on the new DB server.