Tuesday, September 22, 2009

PART 3: Population Shifts in Wikipedia

The research done at ASC continues to get more press, including Time magazine, NYTimes, and Repubblica [Italian newspaper]. We have been busy trying to put together a bunch more academic papers on Web2.0 (particularly some Twitter research we have been doing), so we haven't updated this blog in a while. I figured I'd take some time today and blog a bit more about our results.

To investigate which factors affected the slowdown in edit growth, we examine the evolution of the population of active editors. The stalled growth of edit activities that we have described might be partially explained by changes in the editor population. We use the same editor classification as previous posts to count the number of active editors in each month. The figures below show three views of the evolution of the population of the five editor classes.

Monthly active editors by editor class. (This is a breakdown of the total editor population depicted earlier)

The Figure above shows the monthly frequencies of active editors by class. As expected from the power law distribution, the distribution of editors is very skewed: most of the editors contribute very few edits and very few editors contribute most of the edits. In fact, the two most prolific classes of editors (100-999 and 1000+) account for only about 1% of the population, but they contribute about 55% of edits (33% and 23% respectively).
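For readers who want to reproduce this kind of breakdown on their own data, here is a minimal sketch of the classification and counting step. The 10-99, 100-999 and 1000+ classes are the ones named in this post; the two lower classes ("1" and "2-9"), the function names, and the toy edit log are assumptions made for illustration, not the actual analysis pipeline.

```python
from collections import Counter, defaultdict

# Class boundaries by edits per month. The 10-99, 100-999 and 1000+ classes
# are named in the post; the "1" and "2-9" classes are assumed here.
CLASS_BOUNDS = [(1000, "1000+"), (100, "100-999"), (10, "10-99"), (2, "2-9"), (1, "1")]


def editor_class(monthly_edits):
    """Return the class label for an editor with the given edits in one month."""
    for lower_bound, label in CLASS_BOUNDS:
        if monthly_edits >= lower_bound:
            return label
    raise ValueError("an active editor must have at least one edit")


def active_editors_by_class(edit_log):
    """Count active editors per class for each month.

    edit_log is an iterable of (month, editor) pairs, one pair per edit.
    """
    edits = defaultdict(Counter)  # month -> editor -> number of edits
    for month, editor in edit_log:
        edits[month][editor] += 1
    return {
        month: Counter(editor_class(n) for n in per_editor.values())
        for month, per_editor in edits.items()
    }


# Toy log (made up): one once-a-month editor and one editor with 12 edits.
toy_log = [("2007-03", "alice")] + [("2007-03", "bob")] * 12
counts = active_editors_by_class(toy_log)
print(counts["2007-03"])
```

Note that an editor is classified separately for each month, so the same account can land in different classes in different months.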

Monthly active editors by user class. The vertical axis uses a logarithmic scale.

The Figure above uses a logarithmic scale to show the consistent slowdown of the growth among all editor classes over time, which is not clear in the first figure for editors in the 100-999 and 1000+ classes. The monthly population of active editors stops growing after March 2007: a surprisingly abrupt change in the evolution of the Wikipedia population for all the editor classes. This change is consistent with the slowdown of the editing activity shown in Part 1.

[Interestingly enough, even though we see that the number of editors in the 1000+ class plateaued, we know from Part 2 that this class of users has been increasing its contribution rate. Their average monthly edits per editor for the years 2005 to 2008 were 1740, 1859, 1869, and 2095, respectively.]
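To put those four averages in perspective, the short sketch below works out the year-over-year and overall percentage changes. Only the four figures come from the post; the variable and function names are illustrative. The overall change from 2005 to 2008 comes out at roughly 20%.

```python
# Average monthly edits per 1000+ editor, as quoted above.
avg_edits = {2005: 1740, 2006: 1859, 2007: 1869, 2008: 2095}


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


years = sorted(avg_edits)
# Change from each year to the next, keyed by the later year.
yearly_change = {
    later: percent_change(avg_edits[earlier], avg_edits[later])
    for earlier, later in zip(years, years[1:])
}
overall_change = percent_change(avg_edits[years[0]], avg_edits[years[-1]])

for year, change in yearly_change.items():
    print(f"{year}: {change:+.1f}%")
print(f"2005 to 2008: {overall_change:+.1f}%")
```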

Percentages of monthly active editors by their class. Note that the graph is truncated to highlight the declining population of the 10-99 editor class [shown in purple]. (Sorry that the coloring of the editor classes is not consistent with the earlier plots.)

The last Figure shows the percentage of monthly active editors among the five classes. Note that the Y-axis is truncated: it omits the bottom 50% which represents the very long tail of once-monthly-editors. Notice how the 10-99 editor class [shown in purple] is being squeezed and becoming a small portion of the overall population. The 10-99 editor class went from 9% in 2005 to 6% in 2008.
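The percentages in this kind of plot are simply each class's share of that month's active editors. A minimal sketch, assuming per-class counts are already available for a month; the counts, the "1" and "2-9" class labels, and the function name are made up for illustration and are not the real Wikipedia numbers:

```python
def class_shares(counts):
    """Convert per-class editor counts for one month into percentages."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no active editors in this month")
    return {label: 100.0 * n / total for label, n in counts.items()}


# Made-up counts for a single month, chosen only to exercise the function.
example_counts = {"1": 600, "2-9": 300, "10-99": 90, "100-999": 9, "1000+": 1}
shares = class_shares(example_counts)
print(f"10-99 share: {shares['10-99']:.1f}%")
```

Because the shares always sum to 100%, a shrinking 10-99 share can reflect growth in the other classes as well as a decline in the 10-99 class itself, so it is worth reading this plot alongside the absolute counts.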

A healthy community requires that people can move from novice contributors to occasional contributors to elite contributors. In other words, the upward mobility of the contributors is important for a healthy community. The trend here suggests that there is some resistance to moving beyond the 10-99 edits/month barrier. Could this be evidence of the Wiki-lawyering barriers?

One theory that I might suggest is that we want a well-balanced pyramid structure in the community population. Not too top heavy, and not too bottom heavy, and with a healthy middle class. How can we design the mechanisms [incentives and appropriate barriers] on the site so that we have this structure?


Sabine Hossenfelder said...

Did it occur to you the stagnation might be due to the change in information that is needed for improvement?

Ed H. Chi said...

Bee: If what you mean is that the nature of the work that is needed on Wikipedia has changed over time, then yes, we have thought about it. We haven't quite figured out how to measure that, however.

We have some ideas about how to measure content shifts, see for example this post earlier. It's quite hard though to figure out whether the edits are becoming more 'maintenance' oriented. We had attempted to approximate that by looking at the number of words per edit in the past. See Figure 13 in this alt.CHI paper.

Perhaps you meant something else?

Sabine Hossenfelder said...

Yes, the maintenance is one aspect.

The other one is that there's readily available information that many people can add, but at some point what's missing is the information with the experts, in the books that few people know, in the journals that few people read (not to mention understand).

I would suspect that for this reason the updates still being made shift towards entries that have a continuous supply of cheap information. Eg, events that are happening, people that are still alive, topics that are being reported on in the media etc.

Anonymous said...

This is purely anecdotal, but it feels like Wikipedia's servers have gotten much slower in the past year or so. Do you have access to data on this?

I ask because slow-to-load pages are known to cause web sites to lose significant traffic. If my personal experience is shared by others, this could be an important variable to consider in explaining why usage is plateau-ing.

Ed H. Chi said...

@Bee: We had definitely considered that as a possible explanation. What's somewhat perplexing is that if people are slowly running out of things to edit, then it should not be a phase shift. Instead, we should see a gradual leveling off in editing activity (perhaps eventually reaching a horizontal asymptote, or at least approaching a more linear rate of growth).

Instead, what we see is a sharp turn from exponential growth to a linear growth curve. It was a phase change. This makes that particular explanation somewhat suspect.

Ed H. Chi said...

We don't have any access to the data on web server performance on Wikipedia, unfortunately. But that's an interesting hypothesis that I hadn't thought of. It's possible that Wikipedia servers became very slow all of a sudden in March 2007, and that's what's contributing to the editing pattern shifts.

I'll have to ask the contacts I know at Wikipedia about your hypothesis.

Anonymous said...

Hi, I'm that same anonymous guy from earlier in the thread.

I thought of another anecdotal exogenous factor to consider. Five years ago, I don't remember Wikipedia search results coming up much in Google searches. Today they come up all the time.

If you do talk with Wikipedia contacts, it would be interesting to know the amount of traffic generated by search engines, and what the historical graph of that traffic looks like. I don't know quite how this would affect the numbers you're looking at, but if it has irregularities (a quick rise, a plateau, whatever) that might contribute to the curve you're seeing.

Sabine Hossenfelder said...

No, not necessarily. I suspect that if you gradually raise the level of effort people have to make to add new and useful information, it will not result in a gradual decline in their willingness to contribute, but there will be a threshold effect. Take for example the point at which new information is accessible mostly or entirely in subscription journals. This will suddenly cut off a very large group of people.

Ed H. Chi said...

@Bee: I think you have a reasonable explanation, and in fact, that was (and still is) our best explanation for the logistic growth curve. In fact, the next post on the blog will talk about the model in the paper when people start to run out of information to write about.

@Anonymous: correlating the data with visitor or search engine analytics is a very good idea. We are trying to build cooperative relationships with the WM foundation, so hopefully, if they haven't done these analyses already, we will be able to do them very soon!

Thanks for both of your suggestions!

Jon Awbrey said...


After all this time, your group still fails to distinguish between "editors" (real people) and "accounts" (anonyms & pseudonyms).

I'm afraid that all your stats will continue to be seriously compromised by a failure to establish an accurate population model.

That's Chi-cago!

Ed H. Chi said...

I really wish there was a way to distinguish "real people" from "accounts". Alas, no data in Wikipedia allows us to do that. Do you have a suggestion about how to do that?

Eric Goldman said...

I've offered some theories to explain the slowdown in editor absorption at http://ssrn.com/abstract=1458162. Regards, Eric.

Jon Awbrey said...

Re:I really wish there was a way to distinguish the "real people" with "accounts". Alas, no data in Wikipedia allows us to do that. Do you have a suggestion about how to do that?


Before you can think about how to answer that question, you have to use language that keeps the question open.

If the "population of people" is not an observable in the current frame of observation, then it only serves to confuse things and short circuit further inquiry if you use language that suggests it is.


Jon Awbrey said...


I couldn't be sure if my last message got through, so I posted what I can remember of my reply in a Sidewiki.