Tuesday, January 25, 2011

Further details on 'Location' field behavior on Twitter

The 'Location' field study described in the previous post, which was covered by various press outlets (Seattle PI, AllThingsD, ReadWriteWeb, NYTimes), has many more details than that post could cover. Several of them are worth pondering:

The first concerns the scale of the geographic information users entered. Out of the 66% of users with any valid geographic information, those who were judged to be outside of the United States were excluded from our study of scale. Users who indicated multiple locations (see below) were also filtered out. This left us with 3,149 users who were determined by both coders to have entered valid geographic information that indicated they were located in the United States.

When examining the scale of the location entered by these 3,149 users, an obvious city-oriented trend emerges (see the figure below). Left to their own devices, users by and large choose to disclose their location at exactly the city scale, no more and no less: approximately 64% of users specified their location down to the city scale. The next most popular scale was state-level (20%).

When users specified intrastate regions or neighborhoods, they tended to be regions or neighborhoods that engendered significant place-based identity. For example, “Orange County” and the “San Francisco Bay Area” were common entries, as were “Harlem” and “Hollywood”. Interestingly, studying the location field behavior of users located within a region could be a good way to measure the extent to which people identify with these places.

This might not have been a surprise. What's perhaps more interesting is the behavior around specifying multiple locations. 2.6% of the users (4% of the users who entered any valid geographic information) entered multiple locations. Most of these users entered two locations, but 16.4% of them entered three or more locations. Qualitatively, it appears many of these users either spent a great deal of time in all locations mentioned, or called one location home and another their current residence. An example of the former is the user who wrote “Columbia, SC. [atl on weekends]” (referring to Columbia, South Carolina and Atlanta, Georgia). An example of the latter is the user who entered that he is a “CALi b0Y $TuCC iN V3Ga$” (A male from California “stuck” in Las Vegas).

Looking at the 10,000 profiles we examined, the most categorically distinct entries we encountered were the automatically populated latitude and longitude tags that were seen in many users’ location fields. After much investigation, we discovered that Twitter clients such as ÜberTwitter for Blackberry smartphones entered this information. Approximately 11.5% of the 10,000 users we examined had these latitude and longitude tags in their location field. The vast majority of the machine-entered latitude and longitude coordinates had six digits after the decimal point, which is well beyond the precision of current geolocation technologies such as GPS. While it depends somewhat on the latitude, six decimal places result in geographic precision of well under a meter. This precision is in marked contrast with the city-level organic disclosure behavior of users.
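The precision claim above can be checked with a little arithmetic. The following is a minimal sketch, not the procedure used in the study: the regular expression is our assumption about what a machine-entered coordinate pair looks like, the function names are hypothetical, and the conversion uses the common approximation of about 111,320 meters per degree of latitude.

```python
import math
import re

# Assumed pattern: two signed decimal numbers separated by a comma,
# possibly preceded by a client-specific prefix.
COORD_RE = re.compile(r"(-?\d{1,3}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)")

# Approximate length of one degree of latitude on a spherical Earth.
METERS_PER_DEGREE_LAT = 111_320


def parse_coords(location_field):
    """Return (lat, lon, decimals) if the field holds a coordinate pair, else None."""
    match = COORD_RE.search(location_field)
    if match is None:
        return None
    lat_text, lon_text = match.groups()
    lat, lon = float(lat_text), float(lon_text)
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    # Count digits after the decimal point as written, not as parsed.
    decimals = min(len(lat_text.split(".")[1]), len(lon_text.split(".")[1]))
    return lat, lon, decimals


def implied_precision_m(lat, decimals):
    """Ground distance (north-south, east-west) of one unit in the last decimal place."""
    step_degrees = 10 ** -decimals
    north_south = step_degrees * METERS_PER_DEGREE_LAT
    # Meridians converge toward the poles, so east-west steps shrink with latitude.
    east_west = north_south * math.cos(math.radians(lat))
    return north_south, east_west
```

For a made-up coordinate near 40.7° N, six decimal places work out to roughly 0.11 m north-south and somewhat less east-west, which is consistent with the "well under a meter" figure above.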

This mismatch leads us to a fairly obvious but important implication for design. Any system automatically populating a location field should do so, not with the exact latitude and longitude, but with an administrative district or vernacular region that contains the latitude and longitude coordinate. It is likely that users would prefer not to reveal their location to such precise coordinates if they had the choice to specify the granularity.
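One way to picture this recommendation is a lookup that replaces a raw coordinate with the name of a region containing it. The sketch below is only illustrative: the bounding boxes are rough, hypothetical values we made up for the example, and a real system would use proper administrative or vernacular-region boundaries rather than rectangles.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    name: str
    south: float
    west: float
    north: float
    east: float


# Hypothetical, very rough bounding boxes, for illustration only.
REGIONS = [
    Region("San Francisco Bay Area", 36.9, -123.1, 38.9, -121.2),
    Region("Kansas City, MO", 38.8, -94.8, 39.4, -94.3),
]


def coarsen(lat: float, lon: float, regions=REGIONS) -> Optional[str]:
    """Return the name of the first region containing the point, or None."""
    for region in regions:
        if region.south <= lat <= region.north and region.west <= lon <= region.east:
            return region.name
    # No known region: better to show nothing than the exact coordinate.
    return None
```

A client could then write the returned name into the location field and fall back to leaving the field empty when no region matches.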

Overall, the picture that this data paints suggests a wide variety of ways in which people want to communicate to others about their location. Some are at multiple locations often, while others want to express a cultural or neighborhood identity through their location. Users often want to have the ability to express sarcasm, humor, or elements of their personality through their location field. In many ways, this is not a surprise; people’s geographic past and present have always been a part of their identity. We are particularly interested in the large number of users who expressed real geographic information in highly vernacular and personalized forms. Designers may want to invite users to choose a location via a typical map interface and then allow them to customize the place name that is displayed on their profile. This would allow users who enter their location in the form of “KC N IT GETS NO BETTA!!” (a real location field entry in our study) to both express their passion for their city and receive the benefits of having a machine-readable location, if they so desire.

Our findings also suggest that Web 2.0 system designers who wish to engender higher rates of machine-readable geographic information in users’ location fields may want to force users to select from a precompiled list of places.

People who entered multiple locations motivate an additional important implication for design: users should be able to list several locations and label what each one means to them, such as home, work, current location, a city they are visiting, or a favorite bar. Other directions of future work include examining per-tweet location disclosure, as well as evaluating location disclosure on social network sites such as Facebook.
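Two of the design ideas in this post, a free-text display name layered over a machine-readable place and several locations with different roles, fit naturally into one small data model. The sketch below is hypothetical; the class and field names are ours and do not correspond to any existing Twitter API.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ProfileLocation:
    display_name: str             # free text shown on the profile, in the user's own words
    place: Optional[str] = None   # machine-readable place picked from a map or list
    role: str = "current"         # e.g. "home", "work", "current", "visiting"


@dataclass
class Profile:
    locations: List[ProfileLocation] = field(default_factory=list)

    def machine_readable(self) -> List[Tuple[str, str]]:
        """Return (role, place) pairs for locations the user chose to make machine-readable."""
        return [(loc.role, loc.place) for loc in self.locations if loc.place is not None]


# The "CALi b0Y" user from above, modeled with a home and a current location.
profile = Profile([
    ProfileLocation("CALi b0Y", place="California", role="home"),
    ProfileLocation("$TuCC iN V3Ga$", place="Las Vegas, NV", role="current"),
    ProfileLocation("not telling you"),  # no machine-readable place attached
])
```

The display text stays entirely under the user's control, while applications that need geographic data read only the `place` values the user opted to provide.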

Tuesday, January 18, 2011

"Location" Field in Twitter User Profiles (and an interesting fact about Justin Bieber)

Interest in geographic information has intensified in the last year or two. One of the ways in which people obtain geolocation data is by decoding the "Location" field that users fill in during account sign-up. Many researchers have used this field to analyze where the users of a service might be coming from. For example, Mashable has a nice write-up of services that depend on geolocation data for Twitter. But little research exists on one of the most common, oldest, and most utilized forms of online social geographic information: the “location” field found in most virtual community user profiles.

Recently, our summer intern Brent Hecht, who was visiting us from Northwestern University, performed the first in-depth study of user behavior around the 'location' field in Twitter user profiles. Here is what we found.

From April 18 to May 28, 2010, we collected about 32 million English tweets from the Spritzer sample feed. Our 32 million English tweets were created by 5,282,657 unique users. Out of these users, we randomly selected 10,000 “active” users for our first study. We defined “active” as having more than five tweets in our dataset, which reduced our sampling frame to 1,136,952 users (or 22% of all users). We then extracted the contents of these 10,000 users’ location fields and placed them in a coding spreadsheet. Two coders examined the 10,000 location field entries. Coders were asked to use any information at their disposal, from their cultural knowledge and human intuition to search engines and online mapping sites.
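The sampling step described above can be sketched in a few lines. This is an illustration under our own assumptions, not the code used in the study: the function name and parameters are hypothetical, and the input is assumed to be one author identifier per collected tweet.

```python
import random
from collections import Counter


def sample_active_users(tweet_authors, more_than=5, sample_size=10_000, seed=None):
    """Randomly sample users who authored more than `more_than` tweets.

    tweet_authors: iterable with one user identifier per tweet in the dataset.
    """
    counts = Counter(tweet_authors)
    # The sampling frame: users with strictly more than five tweets by default.
    active = sorted(user for user, n in counts.items() if n > more_than)
    rng = random.Random(seed)
    # Sample without replacement; cap at the frame size for small inputs.
    return rng.sample(active, min(sample_size, len(active)))
```

Sorting the frame before sampling makes the result reproducible for a given seed, regardless of the order in which tweets were collected.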

As shown in the figure below, only 66% of users manually entered any sort of valid geographic information into the location field. This means that although the location field is usually assumed by practitioners and researchers to be a field that is as associated with geographic information as a date field is with temporal information, this is definitely not the case in our sample.

Put differently, 34% of users did not provide real location information, frequently incorporating fake locations or sarcastic comments that can fool traditional geographic information tools. This one-third of users was roughly split between those who did not enter any information and those who entered either non-real locations, obviously non-geographic information, or locations that did not have specific geographic footprints. When users did input their location, they almost never specified it at a scale any more detailed than their city.

An analysis of the non-geographic information entered into the location field revealed it to be highly unpredictable in nature. A striking trend was the theme of Justin Bieber, who is a teenage singer. A surprising 61 users (more than 1 in 200 users) co-opted the location field to express their appreciation of the pop star. For instance, a user wrote that s/he is located in “Justin Biebers heart” and another user indicated s/he is from “Bieberacademy”. Justin Bieber was not the only pop star that received plaudits from within the location field; United Kingdom “singing” duo Jedward, Britney Spears, and the Jonas Brothers were also turned into popular “locations”.

Another common theme involved users co-opting the location field to express their desire to keep their location private. One user wrote “not telling you” in the location field and another populated the field with “NON YA BISNESS!!” Sexual content was also quite frequent, as were “locations” that were insulting or threatening to the reader (e.g. “looking down on u people”). Additionally, there was a prevalent trend of users entering non-Earth locations such as “OUTTA SPACE” and “Jupiter”.

A relatively large number of users leveraged the location field to express their displeasure about their current location. For instance, one user wrote “preferably anywhere but here” and another entered “redneck hell”.

Entering non-real geographic information into the location field was so prevalent that it even inspired some users in our sample to make jokes about the practice. For instance, one user populated the location field with “(insert clever phrase here)”.

Note that, in the 66% of users who did enter real geographic information, we included all users who wrote any inkling of real geographic information. This includes those who merely entered their continent and, more commonly, those who entered geographic information in highly vernacular forms. For example, one user wrote that s/he is from “kcmo--call da po po”. Our coders were able to determine this user meant “Kansas City, Missouri”, and thus this entry was rated as valid geographic information (indicating a location at a city scale). Similarly, a user who entered “Bieberville, California” as her/his location was rated as having included geographic information at the state scale, even though the city is not real.

Our study of information quality has vital implications for leveraging data in the location field on Twitter (and likely other websites). Namely, many researchers have assumed that location fields contain strongly typed geographic information, but our findings show this is demonstrably false. To determine the effect of treating Twitter’s location field as strongly typed geographic information, we took each of the location field entries that were coded as not having any valid geographic information (the 16% slice of the pie chart) and entered them into Yahoo! Geocoder. This is the same process used by Java et al. [1]. A geocoder is a traditional geographic information tool that converts place names and addresses into a machine-readable spatial representation, usually latitude and longitude coordinates [2].

Of the 1,380 non-geographic location field entries, Yahoo! Geocoder determined 82.1% to have a latitude and longitude coordinate. As our coders judged all of these entries to contain either no geographic information or only highly ambiguous geographic information, this number should be zero (assuming no coding error). Some examples of these errors are quite dramatic. “Middle Earth” returned (34.232945, -102.410204), which is north of Lubbock, Texas. Similarly, “BieberTown” was identified as being in Missouri and “somewhere ova the rainbow” as being in northern Maine. Even “Wherever yo mama at” received an actual spatial footprint: in southwest Siberia.

Map: the location Yahoo! Geocoder returned for “Middle Earth”.

Since Yahoo! Geocoder assumes that all input information is geographic in nature, the above results are not entirely unexpected. The findings here suggest that geocoders alone are not sufficient for the processing of data in location fields. Instead, data should be preprocessed with a geoparser, which disambiguates geographic information from non-geographic information [2]. However, geoparsers tend to require a lot of context to perform accurately. Adapting geoparsers to work with location field entries is an area of future work.
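To make the idea of filtering before geocoding concrete, here is a minimal sketch. It is not a real geoparser: `looks_geographic` is a naive gazetteer lookup we made up as a stand-in, the gazetteer itself is a tiny hypothetical list, and `always_answers` is a stub that exaggerates the behavior described above by returning coordinates for every input.

```python
import re

# Hypothetical mini gazetteer; a real system would use a full place-name database.
GAZETTEER = {"kansas city", "california", "harlem", "hollywood", "las vegas"}


def looks_geographic(entry, gazetteer=GAZETTEER):
    """Naive stand-in for a geoparser: True if a 1- or 2-word phrase is a known place."""
    words = re.sub(r"[^a-z\s]", " ", entry.lower()).split()
    for size in (2, 1):
        for start in range(len(words) - size + 1):
            if " ".join(words[start:start + size]) in gazetteer:
                return True
    return False


def geocode_if_geographic(entry, geocode, is_geographic=looks_geographic):
    """Only pass entries to the geocoder if they survive the filter."""
    return geocode(entry) if is_geographic(entry) else None


def false_positive_rate(non_geo_entries, locate):
    """Share of entries coded as non-geographic that still receive coordinates."""
    if not non_geo_entries:
        return 0.0
    hits = sum(1 for entry in non_geo_entries if locate(entry) is not None)
    return hits / len(non_geo_entries)


def always_answers(entry):
    """Stub geocoder (an assumption for the demo) that never refuses an input."""
    return (34.232945, -102.410204)


non_geo = ["Middle Earth", "BieberTown", "somewhere ova the rainbow", "Wherever yo mama at"]
unfiltered = false_positive_rate(non_geo, always_answers)
filtered = false_positive_rate(
    non_geo, lambda entry: geocode_if_geographic(entry, always_answers)
)
```

The weakness of such a simple filter is exactly the vernacular behavior we observed: an entry like “kcmo--call da po po” matches nothing in the gazetteer and would be discarded, even though our coders recognized it as Kansas City, Missouri. That is part of why adapting geoparsers to location fields remains future work.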

[1] Java, A., Song, X., Finin, T. and Tseng, B. Why We Twitter: Understanding Microblogging Usage and Communities. Joint 9th WEBKDD and 1st SNA-KDD Workshop, San Jose, CA, 2007, 56-65.

[2] Hecht, B. and Gergle, D. A Beginner’s Guide to Geographic Virtual Communities Research. Handbook of Research on Methods and Techniques for Studying Virtual Communities, IGI, 2010.