Monday, August 16, 2010

Want to be Retweeted? Add URLs to Your Tweets!

In my previous post, I described a recent study [1] in which we found that including hashtags in a tweet may enhance its retweetability. In this post, I will focus on another factor that might affect retweetability: URLs.

As reported in my previous post, we collected a random sample of public tweets from Twitter's Spritzer feed over a 7-week period, yielding about 74 million tweets. From these tweets, we identified 8.24 million of them as retweets. That is, 11.1% of the 74 million tweets are retweets.

Next, we searched for those tweets and retweets that contain at least one URL. We found that 21.1% of tweets and 28.4% of retweets include URLs, suggesting that a tweet with URLs is more likely to get retweeted.
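To make the comparison concrete, the two percentages can be turned into the share of URL-bearing tweets that are retweets, using Bayes' rule. This is only a rough sketch: it assumes the rounded figures quoted above (11.1%, 21.1%, 28.4%) all refer to the same 74-million-tweet sample, and the function name is made up for illustration. Note that it measures the share of retweets among tweets with URLs, which is not quite the same as the probability that an original tweet gets retweeted.

```python
def share_retweeted(p_feature_given_retweet, p_feature, p_retweet):
    """Bayes' rule: P(retweet | feature) from the three reported shares."""
    return p_feature_given_retweet * p_retweet / p_feature


P_RETWEET = 0.111            # 11.1% of all sampled tweets are retweets
P_URL = 0.211                # 21.1% of all sampled tweets contain a URL
P_URL_GIVEN_RETWEET = 0.284  # 28.4% of retweets contain a URL

url_share = share_retweeted(P_URL_GIVEN_RETWEET, P_URL, P_RETWEET)
lift = P_URL_GIVEN_RETWEET / P_URL  # ratio relative to the 11.1% baseline

print(f"{url_share:.1%}")  # 14.9%
print(round(lift, 2))      # 1.35
```

Under these assumptions, roughly 14.9% of the tweets with URLs are retweets, compared with 11.1% overall.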

We further investigated whether the retweetability of a tweet has anything to do with the type of website it refers to. Since most of the URLs included in tweets are shortened URLs, we first expanded the abbreviated URLs into their original URLs, and then extracted the domain names from the original URLs. For example, given an abbreviated URL http://bit.ly/c1htE cited by a tweet, we first unshortened it to http://en.wikipedia.org/wiki/URL_shortening, and then extracted the domain name of en.wikipedia.org. The URL domains are indicative of the type of content sources visited and shared by Twitter users.
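The domain-extraction step can be sketched with Python's standard library. This is not the code we used in the study, just a minimal illustration; it assumes the shortened URLs have already been expanded (that step needs network access and is not shown), and the names `domain_of` and `top_domains` are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(expanded_url):
    """Return the host part of an already-expanded URL, lowercased."""
    host = urlparse(expanded_url).hostname
    return host or ""


def top_domains(expanded_urls, n=20):
    """Count how often each domain occurs and return the n most common."""
    counts = Counter(domain_of(u) for u in expanded_urls)
    counts.pop("", None)  # drop URLs that had no recognizable host
    return counts.most_common(n)


print(domain_of("http://en.wikipedia.org/wiki/URL_shortening"))
# en.wikipedia.org
```

Note that this keeps subdomains as they are, so www.formspring.me and formspring.me would be counted as two separate domains.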

Analyzing the 74 million tweets, we identified the 20 most popular URL domains referred to in our tweets and the number of tweets containing each URL domain:

Rank URL Domain Number of Tweets
1 twitpic.com 793,680
2 myloc.me 533,082
3 www.facebook.com 481,349
4 www.youtube.com 475,509
5 formspring.me 455,377
6 www.twitlonger.com 349,760
7 tweetphoto.com 258,049
8 youtu.be 196,557
9 twitcam.com 159,684
10 url4.eu 145,656
11 twitter.com 144,002
12 www.plurk.com 127,037
13 fun140.com 113,153
14 www.formspring.me 100,111
15 bit.ly 94,505
16 foursquare.com 90,328
17 www.ustream.tv 83,486
18 tinychat.com 80,406
19 blip.fm 74,647
20 www.funwebsites.org 52,148

On the other hand, the following table shows the 20 most popular URL domains cited in our 8.24 million retweets and the number of retweets containing each URL domain:
Rank URL Domain Number of Retweets
1 www.twitlonger.com 236,435
2 twitpic.com 129,692
3 myloc.me 121,950
4 www.youtube.com 79,404
5 www.facebook.com 55,186
6 tweetphoto.com 49,676
7 twitter.com 39,127
8 mashable.com 17,778
9 bit.ly 16,406
10 www.ustream.tv 9,638
11 www.nytimes.com 9,035
12 shar.es 8,636
13 url4.eu 8,213
14 dealspl.us 8,186
15 www.flickr.com 7,599
16 www.cnn.com 7,537
17 youtu.be 7,508
18 www.etsy.com 6,828
19 ax.itunes.apple.com 6,346
20 www.huffingtonpost.com 6,332

As can be seen, the two lists of URL domains do not match exactly. For example, formspring.me appears only in the first list, while mashable.com appears only in the second. That is, a website that is frequently cited in tweets is not necessarily frequently cited in retweets, and vice versa.
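The overlap between the two top-20 lists can be checked with a few set operations. The snippet below simply copies the domain names from the two tables above; the variable names are made up for illustration.

```python
top_tweet_domains = [
    "twitpic.com", "myloc.me", "www.facebook.com", "www.youtube.com",
    "formspring.me", "www.twitlonger.com", "tweetphoto.com", "youtu.be",
    "twitcam.com", "url4.eu", "twitter.com", "www.plurk.com", "fun140.com",
    "www.formspring.me", "bit.ly", "foursquare.com", "www.ustream.tv",
    "tinychat.com", "blip.fm", "www.funwebsites.org",
]
top_retweet_domains = [
    "www.twitlonger.com", "twitpic.com", "myloc.me", "www.youtube.com",
    "www.facebook.com", "tweetphoto.com", "twitter.com", "mashable.com",
    "bit.ly", "www.ustream.tv", "www.nytimes.com", "shar.es", "url4.eu",
    "dealspl.us", "www.flickr.com", "www.cnn.com", "youtu.be",
    "www.etsy.com", "ax.itunes.apple.com", "www.huffingtonpost.com",
]

tweet_set = set(top_tweet_domains)
retweet_set = set(top_retweet_domains)

shared = sorted(tweet_set & retweet_set)            # in both lists
only_in_tweets = sorted(tweet_set - retweet_set)    # popular in tweets only
only_in_retweets = sorted(retweet_set - tweet_set)  # popular in retweets only

print(len(shared), len(only_in_tweets), len(only_in_retweets))  # 11 9 9
```

So 11 of the 20 domains are shared, and 9 domains on each side appear in only one of the lists.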

For each URL domain, we computed a retweet rate by dividing the number of retweets containing the domain by the number of tweets containing the domain. We then normalized the rate so that a value of 1.0 represents the average retweet rate of 11.1%. For example, for twitpic.com, the retweet rate of 1.47 was calculated as (129,692/793,680)*(74/8.24). A URL domain with a retweet rate higher than 1.0 indicates that, compared to the average case, the tweets containing this domain have a higher chance of getting retweeted. The following table shows the retweet rates for the 10 most popular URL domains cited in our tweets:
Rank URL Domain Retweet Rate
1 twitpic.com 1.47
2 myloc.me 2.05
3 www.facebook.com 1.03
4 www.youtube.com 1.50
5 formspring.me 0.05
6 www.twitlonger.com 6.07
7 tweetphoto.com 1.73
8 youtu.be 0.34
9 twitcam.com 0.12
10 url4.eu 0.51

As can be seen from the above table, the retweet rates vary greatly depending on the URL domains. For example, formspring.me, which is the 5th most popular domain, has a retweet rate of 0.05, suggesting that tweets containing that domain are very unlikely to be retweeted. On the other hand, the retweet rate of twitlonger.com is 6.07, suggesting that tweets containing that domain have high retweetability.
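For readers who want to reproduce the numbers, here is a minimal sketch of the normalized retweet rate. It assumes the rounded totals quoted in this post (74 million tweets, 8.24 million retweets), so the last digit may occasionally differ from a calculation based on exact counts; the function name is made up for illustration.

```python
TOTAL_TWEETS = 74_000_000    # "about 74 million" (rounded)
TOTAL_RETWEETS = 8_240_000   # "8.24 million" (rounded)


def normalized_retweet_rate(retweets_with_item, tweets_with_item):
    """Raw retweet rate divided by the overall rate, so 1.0 means average."""
    raw_rate = retweets_with_item / tweets_with_item
    baseline = TOTAL_RETWEETS / TOTAL_TWEETS  # about 11.1%
    return raw_rate / baseline


# (tweets, retweets) for domains that appear in both tables above
counts = {
    "twitpic.com": (793_680, 129_692),
    "www.twitlonger.com": (349_760, 236_435),
    "youtu.be": (196_557, 7_508),
}

for domain, (tweets, retweets) in counts.items():
    print(domain, round(normalized_retweet_rate(retweets, tweets), 2))
# twitpic.com 1.47
# www.twitlonger.com 6.07
# youtu.be 0.34
```

Domains such as formspring.me cannot be recomputed this way from the tables alone, because their retweet counts fall outside the top 20 shown above.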

In the following plot, we show the retweet rates of the 50 most popular URL domains. The X-axis is the popularity rank of URL domains based on how many tweets contain each domain. The Y-axis represents the retweet rates of domains as computed above.


Overall, we see that not all popular URL domains in tweets are popular in retweets. The domain of URLs also matters.

References
[1] Suh, B., Hong, L., Pirolli, P., and Chi, E. H. Want to be Retweeted? Large Scale Analytics on Factors Impacting Retweet in Twitter Network. To appear in SocialCom'10.

Monday, August 9, 2010

Want to be Retweeted? Add Hashtags to Your Tweets!

In a recent study, Bongwon Suh, Peter Pirolli, Ed H. Chi, and I examined what factors might affect the retweetability of a tweet. We will report our findings in an upcoming paper this August at the Second IEEE International Conference on Social Computing [1]. In this post, I focus on one factor that we found to correlate with retweetability: hashtags.

Before I dive into the details, here is some information about the dataset that we used in our study: From Twitter's Spritzer feed, we collected a random sample of public tweets from January 18, 2010 to March 8, 2010, yielding about 74 million tweets. That is, we collected about 1.5 million tweets per day, representing approximately 2-3% of 50 million tweets appearing on Twitter daily.

For each of these 74 million tweets, we scanned for a variety of retweet markers such as "RT @", "RT:@", "retweeting @", "retweet @", "via @", "thx @", "HT @", and "r @" [2]. We found that there are about 8.24 million retweets, accounting for 11.1% of all the tweets. Next, we searched for those tweets and retweets that contain at least one hashtag. We found that 10.1% of tweets and 20.8% of retweets include hashtags, suggesting that a tweet with hashtags is more likely to get retweeted.
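The marker scan can be sketched as a simple pattern match. This is an illustration rather than the code used in the study: matching case-insensitively and requiring that a marker not start in the middle of a word are my assumptions here (without the latter, "r @" would also match text like "for @user"), and the function name is hypothetical.

```python
import re

# Retweet markers listed above, following [2]
RETWEET_MARKERS = [
    "RT @", "RT:@", "retweeting @", "retweet @",
    "via @", "thx @", "HT @", "r @",
]

# Assumption: a marker must not be preceded by a letter, digit, or underscore
_MARKER_RE = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(m) for m in RETWEET_MARKERS) + r")",
    re.IGNORECASE,
)


def looks_like_retweet(text):
    """Return True if the tweet text contains any of the retweet markers."""
    return _MARKER_RE.search(text) is not None


print(looks_like_retweet("RT @alice: hello"))   # True
print(looks_like_retweet("thanks for @bob"))    # False
```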

We further investigated whether the retweetability of a tweet has anything to do with the type of hashtag it contains. Analyzing the 74 million tweets, we identified the 20 most popular hashtags used in our tweets and the number of tweets containing each hashtag:

Rank Hashtag Number of Tweets
1 #nowplaying 355,147
2 #ff 224,760
3 #jobs 124,728
4 #fb 87,959
5 #tinychat 67,225
6 #vouconfessarque 51,578
7 #fail 49,248
8 #tcot 47,394
9 #1 47,373
10 #followfriday 39,986
11 #news 38,573
12 #shoutout 30,633
13 #tweetmyjobs 30,594
14 #bbb 28,590
15 #haiti 28,563
16 #letsbehonest 27,926
17 #iranelection 27,611
18 #quote 27,541
19 #followmejp 25,940
20 #follow 24,166


On the other hand, the following table shows the 20 most popular hashtags used in our 8.24 million retweets and the number of retweets containing each hashtag:
Rank Hashtag Number of Retweets
1 #ff 62,331
2 #vouconfessarque 43,628
3 #nowplaying 29,846
4 #tcot 18,527
5 #idothat2 16,583
6 #ohjustlikeme 16,531
7 #jafizisso 15,564
8 #haiti 13,829
9 #retweetthisif 12,602
10 #iranelection 12,334
11 #quote 11,475
12 #followfriday 11,170
13 #fb 10,994
14 #ihatequotes 9,982
15 #fail 9,759
16 #omgthatssotrue 9,286
17 #1 9,124
18 #terremotochile 8,892
19 #p2 8,719
20 #follow 8,084

As can be seen, the two lists of hashtags do not match exactly. For example, #jobs appears only in the first list, while #idothat2 appears only in the second. That is, a hashtag that is frequently used in tweets is not necessarily frequently used in retweets, and vice versa.

For each hashtag, we computed a retweet rate by dividing the number of retweets containing the hashtag by the number of tweets containing the hashtag. We then normalized the rate so that a value of 1.0 represents the average retweet rate of 11.1%. For example, for #nowplaying, the retweet rate of 0.75 was calculated as (29,846/355,147)*(74/8.24). A hashtag with a retweet rate higher than 1.0 indicates that, compared to the average case, the tweets containing this hashtag have a higher chance of getting retweeted. The following table shows the retweet rates for the 10 most popular hashtags used in our tweets:
Rank Hashtag Retweet Rate
1 #nowplaying 0.75
2 #ff 2.49
3 #jobs 0.16
4 #fb 1.12
5 #tinychat 0.04
6 #vouconfessarque 7.59
7 #fail 1.78
8 #tcot 3.51
9 #1 1.73
10 #followfriday 2.51

In the following plot, each point represents an individual hashtag. The X-axis is the popularity rank of hashtags based on how many tweets contain each hashtag. The Y-axis represents the retweet rates of hashtags as computed above.


From the figure, we see that the retweet rates vary greatly. Not all popular hashtags in tweets are popular in retweets. The type of hashtag does matter.

References
[1] Suh, B., Hong, L., Pirolli, P., and Chi, E. H. Want to be Retweeted? Large Scale Analytics on Factors Impacting Retweet in Twitter Network. To appear in SocialCom'10.
[2] boyd, d., Golder, S., and Lotan, G. Tweet, Tweet, Retweet: Conversational Aspects of Retweeting on Twitter. Proc. HICSS'10, 1-10.

Wednesday, August 4, 2010

Grounded research on Enterprise2.0: the creation of a social information stream browser


For a number of years, because our research group sits within a large corporation like Xerox, we have been studying the effectiveness of Enterprise2.0 tools. As Web2.0 consumer tools have changed over time, so have Enterprise2.0 tools. At the AVI conference held in Italy in May, we described one such tool that is applicable to both consumer and enterprise users.

The primary challenge in doing Enterprise2.0 research is the need to ground the research in real data, real user behaviors, and real practices. The class of knowledge workers that has emerged after the proliferation of Web2.0 and Enterprise2.0 tools is distinctly different from past knowledge workers.

We conducted field studies of two groups of senior professionals, and found that the primary challenge for them went far beyond information overload. Knowledge workers now face not just information overload, but also channel overload. That is, they must understand the intricacies of different channels and how often their co-workers pay attention to those channels, and then adjust their strategy for contributing the right content at the right time in the right places. They use these different channels to monitor the status and progress of both individual and group activities, and to forage for and organize new information.

The research involved several steps: we first characterized user behaviors and challenges, then iterated on the design with paper prototypes, and finally built and evaluated a software prototype. To make a long story short, we found a number of important requirements for new Enterprise2.0 tools that we've summarized into a table below:


From this set of requirements, we decided to tackle the issue around channel overload via a faceted-search browser for social information streams. The figure below shows an example screenshot of our FeedWinnower system:


We previously blogged about FeedWinnower in April. As shown in the figure above, we extract several kinds of metadata from the social information stream, such as the author of each posting, its source and media type, its topic, and the time when it was posted.
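To give a feel for what faceted filtering over such metadata looks like, here is a toy sketch. It is not FeedWinnower's implementation: the `StreamItem` fields simply mirror the facets named above, and the helper names and sample items are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class StreamItem:
    author: str
    source: str
    media_type: str
    topic: str
    posted_at: datetime


def filter_items(items, **facets):
    """Keep items whose attributes equal every requested facet value."""
    return [
        item for item in items
        if all(getattr(item, name) == value for name, value in facets.items())
    ]


def facet_counts(items, facet):
    """Count items per value of one facet, e.g. to label a filter sidebar."""
    return Counter(getattr(item, facet) for item in items)


# Hypothetical sample items, for illustration only
items = [
    StreamItem("alice", "twitter", "text", "search", datetime(2010, 8, 1, 9, 0)),
    StreamItem("bob", "blog", "text", "search", datetime(2010, 8, 2, 10, 0)),
    StreamItem("alice", "twitter", "photo", "travel", datetime(2010, 8, 3, 11, 0)),
]

print(len(filter_items(items, author="alice", source="twitter")))  # 2
print(facet_counts(items, "topic")["search"])                       # 2
```

Selecting values in several facets at once narrows the stream step by step, which is the basic idea behind a faceted browser.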




Of course, we are not the only ones to have realized these needs. Email is one of the oldest social information streams. Neustaedter et al. [1] found that sender, receiver, and time were the main attributes that people used to judge the importance of email. Whittaker et al. [2] noted that filing messages in folders is time-consuming and can be problematic if users’ focus changes frequently, suggesting the need for flexible interfaces that allow on-the-fly browsing of content. Hearst suggested that social tags provide an excellent basis for the formation of topic structures for faceted browsing [3], but stressed that acquisition of facet metadata is a problem remaining to be addressed. Related research also includes the design of blog search and browsing interfaces [4]. Hearst et al. [4] suggested design choices such as “the temporal/timelines aspect of blogging” and “automatic creation of a feed reader on the subtopics of interest”. Baumer and Fisher [5] proposed an interface for organizing blogs around a list of extracted topics. Probably the most closely related work is the tool by Dork et al. [6], which organizes RSS feeds along three dimensions (time, location, and tags) and also supports a faceted browsing interface. They assumed that feed items have titles and descriptions, creation times, locations, and tags. A key difference is that we make no assumptions about the presence of tags or manually added metadata. Instead, we construct the topic facet from the content of the items.

In summary, studying two communities from a large IT enterprise, we characterized the work practices and information-management needs of a growing class of busy knowledge workers. We found that they need:
  1. information aggregated across multiple channels, including the combination of content and status updates,
  2. filters that help to easily find important content, and
  3. organization and sharing functions for individual and collaborative sensemaking.
I have a feeling that building these types of information stream interfaces will be the subject of our tool research in the group for some time.


References
[1] Neustaedter, C., Brush, A., and Smith, M. Beyond “From” and “Received”: Exploring the Dynamic of Email Triage. Proc. CHI’05, 1977-1980.

[2] Whittaker, S. and Sidner, C. Email Overload: Exploring Personal Information Management of Email. Proc. CHI’96, 276-283.

[3] Hearst, M. UIs for Faceted Navigation: Recent Advances and Remaining Open Problems. Proc. 2008 Workshop on Human-Computer Interaction and Information Retrieval.

[4] Hearst, M., Hurst, M., and Dumais, S. What Should Blog Search Look Like? Proc. 2008 ACM Workshop on Search in Social Media, 95-98.

[5] Baumer, E. and Fisher, D. Smarter Blogroll: An Exploration of Social Topic Extraction for Manageable Blogrolls. Proc. HICSS’08.

[6] Dork, M., Carpendale, S., Collins, C., and Williamson, C. VisGets: Coordinated Visualizations for Web-based Information Exploration and Discovery. IEEE Trans on Vis. and Computer Graphics, 14(6), 1205-1212.