Tuesday, January 18, 2011

"Location" Field in Twitter User Profiles (and an interesting fact about Justin Bieber)

Interest in geographic information has intensified in the last year or two. One common way to obtain geolocation data is to decode the "Location" field that users fill in when they sign up for an account. Many researchers have used this field to analyze where the users of a service might be coming from. For example, Mashable has a nice write-up of services that depend on geolocation data for Twitter. But little research exists on one of the most common, oldest, and most utilized forms of online social geographic information: the “location” field found in most virtual community user profiles.

Recently, our summer intern Brent Hecht, who was visiting us from Northwestern University, performed the first in-depth study of user behavior around the 'location' field in Twitter user profiles. Here is what we found.

From April 18 to May 28, 2010, we collected about 32 million English tweets from the Spritzer sample feed. Our 32 million English tweets were created by 5,282,657 unique users. Out of these users, we randomly selected 10,000 “active” users for our first study. We defined “active” as having more than five tweets in our dataset, which reduced our sampling frame to 1,136,952 users (or 22% of all users). We then extracted the contents of these 10,000 users’ location fields and placed them in a coding spreadsheet. Two coders examined the 10,000 location field entries. Coders were asked to use any information at their disposal, from their cultural knowledge and human intuition to search engines and online mapping sites.
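The sampling step described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the study: the function name `sample_active_users`, its parameters, and the toy data are assumptions made for the example. The final line simply checks the "22% of all users" figure from the counts given above.

```python
import random
from collections import Counter


def sample_active_users(tweet_user_ids, more_than=5, sample_size=10_000, seed=0):
    """Randomly sample users who have more than `more_than` tweets.

    `tweet_user_ids` holds one user id per tweet (hypothetical input format).
    """
    counts = Counter(tweet_user_ids)
    # "Active" means more than five tweets in the dataset.
    active = sorted(user for user, n in counts.items() if n > more_than)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(active, min(sample_size, len(active)))


# Toy data: user "a" has 6 tweets, "b" has 5, "c" has 7.
toy_tweets = ["a"] * 6 + ["b"] * 5 + ["c"] * 7
sample = sample_active_users(toy_tweets, sample_size=2)

# Share of all users who counted as "active" in the study (about 22%).
active_share = 1_136_952 / 5_282_657
```

With the toy data, only "a" and "c" qualify as active, since "b" has exactly five tweets and the definition requires more than five.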

As shown in the figure below, only 66% of users manually entered any sort of valid geographic information into the location field. Practitioners and researchers usually assume that the location field is as closely associated with geographic information as a date field is with temporal information, but this is definitely not the case in our sample.

The remaining 34% of users did not provide real location information, and many of them entered fake locations or sarcastic comments that can fool traditional geographic information tools. This group was roughly split between those who did not enter any information and those who entered non-real locations, obviously non-geographic information, or locations that did not have specific geographic footprints. When users did input their location, they almost never specified it at a scale any more detailed than their city.

An analysis of the non-geographic information entered into the location field revealed it to be highly unpredictable in nature. A striking trend was the theme of Justin Bieber, who is a teenage singer. A surprising 61 users (more than 1 in 200 users) co-opted the location field to express their appreciation of the pop star. For instance, a user wrote that s/he is located in “Justin Biebers heart” and another user indicated s/he is from “Bieberacademy”. Justin Bieber was not the only pop star that received plaudits from within the location field; United Kingdom “singing” duo Jedward, Britney Spears, and the Jonas Brothers were also turned into popular “locations”.

Another common theme involved users co-opting the location field to express their desire to keep their location private. One user wrote “not telling you” in the location field and another populated the field with “NON YA BISNESS!!” Sexual content was also quite frequent, as were “locations” that were insulting or threatening to the reader (e.g. “looking down on u people”). Additionally, there was a prevalent trend of users entering non-Earth locations such as “OUTTA SPACE” and “Jupiter”.

A relatively large number of users leveraged the location field to express their displeasure about their current location. For instance, one user wrote “preferably anywhere but here” and another entered “redneck hell”.

Entering non-real geographic information into the location field was so prevalent that it even inspired some users in our sample to make jokes about the practice. For instance, one user populated the location field with “(insert clever phrase here)”.

Note that, in the 66% of users who did enter real geographic information, we included all users who wrote any inkling of real geographic information. This includes those who merely entered their continent and, more commonly, those who entered geographic information in highly vernacular forms. For example, one user wrote that s/he is from “kcmo--call da po po”. Our coders were able to determine this user meant “Kansas City, Missouri”, and thus this entry was rated as valid geographic information (indicating a location at a city scale). Similarly, a user who entered “Bieberville, California” as her/his location was rated as having included geographic information at the state scale, even though the city is not real.

Our study of information quality has important implications for leveraging data in the location field on Twitter (and likely other websites). Namely, many researchers have assumed that location fields contain strongly typed geographic information, but our findings show this is demonstrably false. To determine the effect of treating Twitter’s location field as strongly typed geographic information, we took each of the location field entries that were coded as not having any valid geographic information (the 16% slice of the pie chart) and entered them into Yahoo! Geocoder. This is the same process used by Java et al. [1]. A geocoder is a traditional geographic information tool that converts place names and addresses into a machine-readable spatial representation, usually latitude and longitude coordinates [2].

Of the 1,380 non-geographic location field entries, Yahoo! Geocoder determined 82.1% to have a latitude and longitude coordinate. Because our coders judged all of these entries to contain either no geographic information or only highly ambiguous geographic information, this number should be zero (assuming no coding error). Some examples of these errors are quite dramatic. “Middle Earth” returned (34.232945, -102.410204), which is north of Lubbock, Texas. Similarly, “BieberTown” was identified as being in Missouri and “somewhere ova the rainbow” in northern Maine. Even “Wherever yo mama at” received an actual spatial footprint: in southwest Siberia.

[Embedded map: the location returned by the geocoder for “Middle Earth”]
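The measurement described above amounts to running every entry through a geocoder and counting how many come back with a coordinate. A rough sketch, under stated assumptions: `toy_geocode` is a hypothetical stand-in and not the Yahoo! Geocoder API, and the only coordinate it knows is the “Middle Earth” result quoted above.

```python
def geocoded_share(entries, geocode):
    """Fraction of entries for which `geocode` returns a coordinate."""
    hits = sum(1 for entry in entries if geocode(entry) is not None)
    return hits / len(entries)


def toy_geocode(text):
    """Hypothetical stand-in for a real geocoder (illustration only)."""
    known = {"middle earth": (34.232945, -102.410204)}
    return known.get(text.strip().lower())


# Two entries that coders would rate as non-geographic.
entries = ["Middle Earth", "not telling you"]
share = geocoded_share(entries, toy_geocode)

# Approximate number of the 1,380 non-geographic entries that were geocoded.
approx_geocoded = round(0.821 * 1380)
```

Since every entry in this set was coded as non-geographic, any value of `share` above zero counts as a false positive. At 82.1%, that works out to roughly 1,133 of the 1,380 entries.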

Since Yahoo! Geocoder assumes that all input information is geographic in nature, the above results are not entirely unexpected. The findings here suggest that geocoders alone are not sufficient for the processing of data in location fields. Instead, data should be preprocessed with a geoparser, which disambiguates geographic information from non-geographic information [2]. However, geoparsers tend to require a lot of context to perform accurately. Adapting geoparsers to work with location field entries is an area of future work.
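The idea of putting a filter in front of the geocoder can be sketched as below. This is only an illustration of the pipeline shape, not a real geoparser: the tiny `TOY_GAZETTEER`, the substring check in `toy_is_geographic`, and the `always_geocodes` stub (which returns a placeholder coordinate for any input, mimicking a geocoder that treats everything as geographic) are all assumptions made for the example.

```python
def geocode_if_geographic(text, is_geographic, geocode):
    """Geocode `text` only if the filter judges it to be geographic."""
    if not text or not text.strip():
        return None  # blank location field
    if not is_geographic(text):
        return None  # filtered out before it reaches the geocoder
    return geocode(text)


# Hypothetical, deliberately tiny list of place names.
TOY_GAZETTEER = {"kansas city", "california"}


def toy_is_geographic(text):
    """Naive filter: does the text mention a known place name?"""
    lowered = text.lower()
    return any(place in lowered for place in TOY_GAZETTEER)


def always_geocodes(text):
    """Stub geocoder that returns a placeholder coordinate for any input."""
    return (0.0, 0.0)


filtered = geocode_if_geographic("Middle Earth", toy_is_geographic, always_geocodes)
kept = geocode_if_geographic("Bieberville, California", toy_is_geographic, always_geocodes)
```

Here “Middle Earth” is stopped before it reaches the geocoder, while “Bieberville, California” passes through because of the state name. A filter this naive would still reject vernacular entries such as “kcmo--call da po po”, which is one reason adapting geoparsers to location fields remains open work.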

[1] Java, A., Song, X., Finin, T. and Tseng, B. Why We Twitter: Understanding Microblogging Usage and Communities. Joint 9th WEBKDD and 1st SNA-KDD Workshop ’07, San Jose, CA, 56-65.

[2] Hecht, B. and Gergle, D. A Beginner’s Guide to Geographic Virtual Communities Research. Handbook of Research on Methods and Techniques for Studying Virtual Communities, IGI, 2010.
