Thursday, April 16, 2009

Mapping the Contents in Wikipedia

I have just returned from the CHI2009 conference on Human-Computer Interaction, where many of the topics focused on where and how people obtain their information, and how they make sense of it all. A recent research topic in our group is understanding how people use Wikipedia for their information needs. One question that has constantly come up in our discussions about Wikipedia is what exactly is in it. So far, most of our analyses have centered on edit patterns, and not much analysis has gone into what people write about. What topics are the most well-represented? Which topic areas have the most conflict?

In one of our recent CHI2009 papers, we explored this issue. It turns out that Wikipedia has these things called Categories, which people use to organize the content into a pseudo-hierarchy of topics. We devised a simple path-based algorithm for assigning articles to large top-level categories in an attempt to understand which topic areas are the most well-represented. The top level categories are:

Using our algorithm, the page "Albert Einstein" can be assigned to these top-level categories:

This mapping makes some intuitive sense. You can see the impact Albert Einstein has made in various areas of our society such as science, philosophy, history, and religion. Using the same ideas and algorithm, we can now do this mapping for all of the pages in Wikipedia, and find out which top level categories have received the most representation. In other words, we can figure out the coverage of topic areas in Wikipedia.
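To make the idea of a path-based assignment concrete, here is a minimal sketch. It is not the exact algorithm from the paper: it assumes that each category on a page "votes" for whichever top-level categories are nearest to it when walking up the category graph, and that votes are split evenly. The function name, the toy category graph, and the category names below are all made up for illustration.

```python
from collections import deque

def top_level_weights(article_categories, parents, top_level):
    """Map an article to top-level categories by shortest upward path.

    Assumption (not from the paper): each category on the article votes
    for the top-level categories closest to it, splitting its vote evenly.
    """
    weights = {}
    for start in article_categories:
        # Breadth-first search upward through parent categories.
        dist = {start: 0}
        queue = deque([start])
        nearest, best = [], None
        while queue:
            cat = queue.popleft()
            if best is not None and dist[cat] > best:
                break  # everything left is farther than the nearest hit
            if cat in top_level:
                best = dist[cat]
                nearest.append(cat)
                continue
            for parent in parents.get(cat, ()):
                if parent not in dist:  # also guards against cycles
                    dist[parent] = dist[cat] + 1
                    queue.append(parent)
        for top in nearest:
            weights[top] = weights.get(top, 0.0) + 1.0 / len(nearest)
    total = sum(weights.values())
    return {top: w / total for top, w in weights.items()} if total else {}

# Hypothetical toy category graph: child category -> parent categories.
parents = {
    "Theoretical physicists": ["Physicists"],
    "Physicists": ["Science"],
    "Pacifists": ["Philosophy"],
}
top_level = {"Science", "Philosophy"}

weights = top_level_weights(["Theoretical physicists", "Pacifists"], parents, top_level)
print(weights)  # {'Science': 0.5, 'Philosophy': 0.5}
```

Summing these per-article weights over all pages would give a coverage figure for each top-level category. The even vote split is only one of several reasonable choices.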


We can see that the highest coverage has gone toward the top-level category of "culture and the arts" at 30%, followed by "people" at 15%, "geography" at 14%, "society and social science" at 12%, and "history" at 11%. What's perhaps more interesting is understanding which of these categories have generated the most conflict! We used the Conflict Revision Count (CRC), a concept developed in our CHI2007 paper, to show which top level categories have the most conflict:

In this figure, the categories are listed clockwise from "People" in order of their total amount of conflict. This means that People received the most conflict in total, followed by Society and Social Sciences, and so on. However, the percentages shown for each topic are normalized by the number of article-assignments in that topic. So the metric here can be interpreted as the amount of conflict in each topic relative to the size of the topic, that is, how contentious the articles in that topic are.
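The normalization can be sketched in a few lines. This assumes the figure's percentages are each topic's per-assignment conflict expressed as a share of the sum across topics, which is one reading of the description above. The function name and all numbers are made up for illustration and are not taken from the paper.

```python
def normalized_contentiousness(crc_by_topic, assignments_by_topic):
    """Conflict per article-assignment, expressed as a share across topics.

    Assumption: the reported percentage is each topic's per-assignment
    conflict divided by the sum of per-assignment conflict over all topics.
    """
    per_assignment = {
        topic: crc_by_topic[topic] / assignments_by_topic[topic]
        for topic in crc_by_topic
        if assignments_by_topic.get(topic)  # skip topics with no assignments
    }
    total = sum(per_assignment.values())
    if total == 0:
        return {topic: 0.0 for topic in per_assignment}
    return {topic: value / total for topic, value in per_assignment.items()}

# Hypothetical totals: summed CRC and number of article-assignments per topic.
crc = {"People": 1000, "Philosophy": 150}
assignments = {"People": 2000, "Philosophy": 100}

shares = normalized_contentiousness(crc, assignments)
print(shares)  # {'People': 0.25, 'Philosophy': 0.75}
```

In this toy example People has far more conflict in total (1000 vs. 150), yet Philosophy ends up with the larger normalized share because it has so few article-assignments.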

"Religion" and "Philosophy" stand out as highly contentious despite having relatively few articles.
It turns out that "philosophy" and "religion" each account for 28% of the normalized conflict (contentiousness). This is despite the fact that they make up only 1% and 2%, respectively, of the total distribution of topics shown above.

Digging into religion more closely, we see that "Atheism" has generated the most conflict, followed by "Prem Rawat" (the controversial guru and religious leader), "Islam", and "Falun Gong".

Wikipedia is the 8th-ranked website in the world, so it is clear that a lot of people get their information from it. The surprising thing about Wikipedia is that it succeeded at all. Common sense would suggest that an encyclopedia in which anyone can edit anything they want would result in utter nonsense. What happened is exactly the opposite: many users and groups have gotten together to make sense of complex topics and debate with each other about what information is the most relevant and interesting to include. This helps keep us sane in this information world, because we now have cheap and always accessible content on some of the most obscure topics you might be interested in. At lunch today, we were all wondering which countries have the lowest birth rate. Well, surprise!! Of course, there is a page for that, which we found using our iPhones.

The techniques we have developed here enable us to understand what content is available in Wikipedia and how various top level categories are covered, as well as the amount of controversy in each category.

There are of course many risks in using online content. However, we have been researching tools that might alleviate these concerns. For example, WikiDashboard is a tool that visualizes the social dynamics behind how a wiki article came into its current state. It shows the top editors of any Wikipedia page, and how much they have edited. It can also show the top articles that a user is interested in.

We are considering adding this capability to WikiDashboard, and would welcome your comments on the analysis and ideas here.

All web users can guide the content in Wikipedia by participating in it. If we realize that the existence of our society depends on healthy discourse between different segments of the population, then we will see it not just as a source of conflict, but as a source of healthy discussion that needs to occur in our world. By having these discussions in the open (with full social transparency), we can ensure all points of view are represented in this shared resource. Our responsibility is to ensure that the discussion and conflicts are healthy and productive.

Kittur, A., Chi, E. H., and Suh, B. 2009. What's in Wikipedia?: Mapping Topics and Conflict using Socially Annotated Category Structure. In Proceedings of the 27th international Conference on Human Factors in Computing Systems (Boston, MA, USA, April 04 - 09, 2009). CHI '09. ACM, New York, NY, 1509-1512.


RDH(Ghost In The Machine) said...

The next logical step would be to develop some metric for measuring the intensity of conflicts, then weighing them accordingly.

While some topic categories may produce many small conflicts, others produce few but far more intense ones.

For instance, one of the worst edit wars in Wikipedia history was over the proper name for Gdansk/Danzig. Yet geography and places generate only 2% of the conflicts.

The metric could take into account: A) the number of individuals involved, B) the conflict's duration, C) how many posts are made, D) on how many talk pages, and E) how far up the Wikipedia chain of command it went before it was resolved.

Counting the number of individuals involved could be seen as tricky due to the use of sock puppets (illicit, multiple accounts owned by the same user, for those who don't speak Wikinese:). But I would argue that in this case it actually helps us measure conflict intensity, since one must be pretty passionate about a subject in order to go to such lengths (along with the risk of being caught and banned) in an attempt to "win".

Also, I recommend using number of posts made, rather than how much data space is used, because a few "long winded" posters could skew results and make what is actually a tea pot tempest between a few "Stem winders", look like a major conflict.

Thank you for this study and your time.

Sage said...

This is wonderful! I look forward to seeing this in WikiDashboard.

One thing I'm confused about is the description of the normalized conflict. It says that "to determine the relative degree of conflict (or “contentiousness”) per article we normalized by the number of article-category assignments in a topic." Does that mean that random Philosophy content is expected to have twice as much conflict as random People content, but that there is 15 times as much total People conflict as Philosophy conflict?

That's what I gather from the paper, but you state in the blog post that ""philosophy" and "religion" have generated 28% of the conflicts each."

Could you help me understand what "normalized conflict" means?

Lisa T. said...

I like this Ed - could you use some sort of sensitivity encoding to show the scale of intensity of conflicts e.g using the grayscales we used all those years ago in the attribute explorer you could look at how many different conflicts individuals were involved in. Not sure maybe you have this in Wikidashboard already. Anyhow I love the fact that philosophy in a tiny number of posts causes the most controversy! Maybe that is why there are so few posts they haven't got round to doing any more yet they are too busy arguing!!

Ed H. Chi said...

RDH: In our previous CHI paper, we developed a metric for measuring the intensity of conflicts. We called it the Conflict Revision Count (CRC). That is precisely what we applied here, so the graphs contain some of the weighting you suggested. That is, a conflict with many revisions is weighted more heavily as a conflict.

Here is the reference:

Ed H. Chi said...

The problem was that some topic categories had more articles than others. For example, there were a lot more articles on "History and Events" than on "Philosophy and thinking".

What we wanted to measure instead was, given an article in a topic area, how likely it is to be contentious. That's why we normalize the sum of the conflict scores by the number of article-category assignments in that topic (which should be almost the same as the number of articles in that topic).

I don't think I was very clear in the post about this, so I will go and change this in the text.

Ed H. Chi said...

Lisa T.:
Thanks for your suggestions. Since much of the work we do for WikiDashboard is extra-curricular (we have real jobs trying to push these kinds of technologies in the Enterprise), I am not sure how much of this will make it into WikiDashboard eventually, but we'll be trying.