Wednesday, July 22, 2009

PART 1: The slowing growth of Wikipedia: some data, models, and explanations

In September of 2008, we blogged about a curious change in Wikipedia that we had known about for a while but didn't know how to explain. The ASC group has spent the last 6-9 months or so trying to understand it. The change was that the growth rates of Wikipedia have slowed. We were not the only ones wondering about this. The Economist (archived here), for example, wrote about it.

We are about to publish a paper in WikiSym 2009 on this topic, and I thought we should start to blog about what we found.


Monthly edits and identified revert activity

The conventional wisdom about many Web-related growth processes is that they're fundamentally exponential in nature. That is, if you wait some fixed amount of time, the content size and number of participants will double. Indeed, prior research on Wikipedia has characterized the growth in content and editors as fundamentally exponential. Some have claimed that Wikipedia article growth is exponential because there is exponential growth in the number of editors contributing to Wikipedia [1]. Current research shows that Wikipedia's growth rate has slowed, and has in fact plateaued (see figure at right). Since about March of 2007, the growth pattern is clearly not exponential. What has changed, and how should we modify our thinking about how Wikipedia works? Prior research had assumed Wikipedia works on an "edit begets edit" model, that is, a preferential attachment model in which the more edits an article gets, the more likely it is to receive further edits, resulting in exponential growth [2]. Such a model does not preclude some ultimate limitation to growth, although at the time it was presented [2] there was an apparent trend of unconstrained article growth.
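To make the "doubles after a fixed amount of time" property concrete, here is a minimal sketch. The starting count and the 12-month doubling time are made-up illustration values, not figures from our data, and the function names are ours.

```python
import math

def exponential_size(n0, doubling_time, t):
    """Size at time t of a quantity that doubles every `doubling_time` units."""
    return n0 * 2.0 ** (t / doubling_time)

def doubling_time_from_rate(growth_rate):
    """Periods needed to double at a constant fractional growth rate per period."""
    return math.log(2.0) / math.log(1.0 + growth_rate)

# Hypothetical numbers: 1,000 articles at the start, doubling every 12 months.
for months in (0, 12, 24, 36):
    print(months, round(exponential_size(1000.0, 12.0, months)))

# A constant 5% monthly growth rate corresponds to doubling about every 14 months.
print(round(doubling_time_from_rate(0.05), 1))
```

The point of the sketch is only that under exponential growth the doubling time never changes, no matter how large the system already is. That is exactly the property the recent data no longer shows.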


Monthly active editors - number of users who have edited at least once in that month


The number of active editors shows exactly the same pattern. The second figure on the right shows that, since its peak in March 2007 (820,532), the number of monthly active editors in Wikipedia has been fluctuating between 650,000 and 810,000. This finding suggests that the conclusions in [1][2] may no longer be valid. We have a different process going on in Wikipedia now.


Article growth per month in Wikipedia. Smoothed curves are growth rate predicted by logistic growth bounded at a maximum of 3, 3.5, and 4 million articles.

Some Wikipedians have modeled the recent data, and believe that a logistic model is a much better way to think about content growth. The figure here shows that article growth reached a peak in 2007-2008 and has been on the decline since then. This result is consistent with a growth process that hits a constraint, for instance due to resource limitations in the system. For example, microbes grown in culture will eventually stop duplicating when nutrients run out. Rather than exponential growth, such systems display logistic growth.
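As a rough illustration of what logistic growth bounded at 3, 3.5, and 4 million articles looks like, here is a small sketch. The starting size `N0` and rate `R` are hypothetical values picked for readability; they are not fitted to Wikipedia data and this is not the code behind the figure.

```python
import math

def logistic_size(t, k, n0, r):
    """Logistic curve: size at time t with ceiling k, initial size n0, rate r."""
    return k / (1.0 + ((k - n0) / n0) * math.exp(-r * t))

def monthly_growth(t, k, n0, r):
    """Articles added between month t and month t + 1 under the logistic curve."""
    return logistic_size(t + 1, k, n0, r) - logistic_size(t, k, n0, r)

# Assumed parameters, for illustration only.
N0 = 20_000   # hypothetical article count at t = 0
R = 0.08      # hypothetical intrinsic growth rate per month

for ceiling in (3.0e6, 3.5e6, 4.0e6):
    growth = [monthly_growth(t, ceiling, N0, R) for t in range(240)]
    peak_month = max(range(240), key=growth.__getitem__)
    size_at_peak = logistic_size(peak_month, ceiling, N0, R)
    print(f"ceiling={ceiling:,.0f}  peak month={peak_month}  "
          f"size at peak={size_at_peak:,.0f}")
```

Under this model the monthly growth rises, peaks when the total is about half the ceiling, and then declines toward zero. That rise-and-fall shape is what distinguishes it from an exponential curve, whose monthly growth keeps increasing.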

We will continue to blog about what we believe might be happening in the next few weeks, as we find time to summarize the results.

[1] Almeida, R.B., Mozafari, B., and Cho, J. On the evolution of Wikipedia. ICWSM 2007, Boulder, CO, 2007.
[2] Spinellis, D., and Louridas, P. The collaborative organization of knowledge. Communications of the ACM, 51(8), 68-73, 2008.

35 comments:

James Salsman said...

Which logistic curves are shown in those graphs? The standard logistic function has a horizontal asymptote, but based on the details of current Wikipedia practice, I believe you want the Gompertz function.

Ed H. Chi said...

James:

We're using the Verhulst equation used in population growth models, sometimes also attributed to Lotka. The equation models the rate of production as a function of the existing population and the amount of available resources.
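For readers following along, the textbook forms of the two models mentioned in this exchange can be written as below. This is the standard notation, not necessarily the exact parameterization used in the paper; $N$ is the population (here, article count), $r$ a growth rate, $K$ the resource ceiling, and $N_0$ the initial size.

```latex
% Verhulst (logistic) equation and its closed-form solution
\frac{dN}{dt} = r N \left(1 - \frac{N}{K}\right),
\qquad
N(t) = \frac{K}{1 + \frac{K - N_0}{N_0}\, e^{-rt}}

% Gompertz equation and its closed-form solution
\frac{dN}{dt} = r N \ln\!\left(\frac{K}{N}\right),
\qquad
N(t) = K \exp\!\left(\ln\!\left(\frac{N_0}{K}\right) e^{-rt}\right)
```

In both forms $N(t)$ approaches $K$ as $t$ grows. The two curves differ in how symmetric the approach is: the logistic growth rate peaks at $N = K/2$, while the Gompertz growth rate peaks earlier, at $N = K/e$.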

James Salsman said...

Interesting. What are your projections for total content size, including Commons? I recommend staying with the Gompertz or Verhulst for that.

Anonymous said...

I would like to see graphs of the following:

a) Rate of new editors
b) Rate of attrition among old editors
c) Rates of editing among editors overall.

That would help us figure out if this is a standard growth model under resource constraints, if there are internal changes to Wikipedia affecting things, or if there are exogenous forces (more competition from other user-generated content sites?) at work.

Ed H. Chi said...

@James Existing work on Wikipedia suggests that the upper limit is somewhere between 3.5M and 4M. But I'm not sure this is correct. The classic Lotka equations suggest a horizontal asymptote, as you said. However, we believe that Wikipedia is actually entering an era of linear growth, and so made some modifications to the standard Lotka model. I'll probably blog about this next.

RDH(Ghost In The Machine) said...

@Anon
Yes that would be interesting. There seems to be a strong correlation between drops in editing/article/user growth and major Wikipedia scandals such as the Essjay Affair and last year's JimboGate.

The editing environment has also changed (it has become more combative and politically charged) though that might prove trickier to gauge by statistical metrics. PARC's ongoing studies of Wiki-conflicts could help shed some light on this aspect of the trend.

RDH(Ghost In The Machine) said...

Here's an amusing little article:

http://wikitruth.info/index.php?title=For_Whom_the_Bell_Curve_Tolls

Which cites an earlier, independent analysis by a Wikipedia member, who also caught this trend:

http://en.wikipedia.org/wiki/User:Dragons_flight/Log_analysis

Canaries in the coal mine?
;-)

Somey said...

I realize there's a need for proper scientific analysis of these things, and I don't mean to seem dismissive, but Wikipedia and its various trends aren't something you can understand through numerical analysis. Even if you're only interested in "growth," you can't account for what actually occurs. For example, a single person can create literally thousands of worthless "stub" articles in a very short period of time, just so that he can say he's created more articles than anyone else, and most of these are kept on the site - mostly because of that person's reputation for creating large numbers of articles. If that person decides to simply quit one day, you'll have a significant reduction in the new-article growth rate due solely to that single event.

Besides, it sounds like you're not taking into account the fact that humans are territorial, and the number of topic areas that can be staked out as territories for individual or group control is limited. Ongoing observation suggests that this limit has already been reached, save for new areas that are added as a result of new events, inventions, discoveries, and so on. A traditional mindset might have one thinking that this would be a stabilizing factor, but on Wikipedia, it simply results in more content disputes, more conflict between users, and more gamesmanship, which is countered with more formalization and attempts to impose editorial and behavioral standards. This, in turn, further reduces new-article count.

It's especially important to note that a reduction in Wikipedia growth, by any measure, is a good thing. The more garbage you create, the bigger the mess you have, and the harder it is to dispose of it all when something better comes along.

Ed H. Chi said...

@ RDH(Ghost In The Machine)
Yes, we're aware of Dragons_flight's work. His real name is Robert Rohde (from UC Berkeley) and we had him down to our research center for a talk and lunch. We built some of our research on top of his findings.

@Anonymous
We have more data coming from the paper that answers some of your questions about plotting other statistics to find out what is really happening to the user population.

@Somey
I understand your point about measurement of an evolving system being difficult. However, there has been plenty of research in ecology and economics where you can find regularities in collective behavior when viewed both at the macro and micro scales.

I think you'd be surprised at the ability of ecological equations to model population growth, for example, after taking into account resource constraints, current predator populations, available food supply, and other environmental factors.

Regarding your second point about being territorial, the regularity we observe in the statistics actually relies on the fact that editors are territorial. This makes their behavior more predictable than random, and results in observable statistical distributions that differ from totally chaotic random distributions.

James Salsman said...

Ed, please see if you can help with the formula questioned in the recent edits to http://en.wikipedia.org/wiki/Gompertz_function

Thank you!

Piriczki said...

Wikipedia has become an absolute mess, driven by egos, incompetent administrators, and juvenile POV editors. Its articles are larded with biases and errors. This wouldn't be an issue had it been simply a social network site or blog, but to call itself an encyclopedia is a dangerous insult to knowledge. It's pretending to be something that it's not. It needs to be shut down.

Ed H. Chi said...

@Piriczki
I have to say that I disagree with you. While there are real and heated conflicts on Wikipedia, many people (including myself) have often found it to be a valuable source of reference information.

Our studies on conflicts in Wikipedia show that reverts (an indicator of conflict) are a small portion of overall activity on the site. (See the first figure and compare the larger portion of editing activity to the red portion of revert activity.)

However, I do agree with you that conflicts could be handled better in Wikipedia, and that some overzealous editors could be better controlled. But the exact mechanism for doing that is less than clear.

bellishabanera said...

I find that it's taking increasing effort to add content to Wikipedia. There are both good and bad reasons for this.

The good reason is that the standards are higher: it used to be that you could write an article based on your own knowledge and get away with it, but now editors expect citations and substantiation. Consequently, the quality of new articles is probably improving. This also calls for more experienced editors, who know how to properly make citations (for example).

The bad reason is that an increasing number of topics are guarded by POV warriors who are willing to do battle with anyone who is trying to bring balance (at least), or accuracy and nuance, into the articles.

Bee said...

Amazing. Is exactly what I predicted would happen. First time here (see comments).

RonaldB said...

Rather than taking the number of articles as a measure for growth, I think it is more appropriate to take the number of words. This imho is a better representation of the amount of content. See e.g. http://stats.wikimedia.org/EN/TablesWikipediaNL.htm (note: there is currently some inconsistency in the sequence of months and data for the English Wikipedia are incomplete due to past problems with database dumps).

Another phenomenon I've observed is that the flattening is happening in all languages. So saturation of the amount of content that can be published in an encyclopedia is not the explanation. The English Wikipedia is bigger because the initial community was big. The number two (German) will most likely never get to 3 million.

Therefore I agree with the observation that it is the community itself that is the cause for the slow-down.

Peter Pirolli said...

James, Ed: The logistic or Gompertz functions assume underlying process models that are obviously simplifications of the interactions going on at the micro level. The trick will be to dive into the underlying processes and user interactions to come up with a model that fits.

For instance, the logistic is a characterization of just the simplest Lotka-Volterra population growth processes (one kind of species in a niche with finite resources). As you introduce complexities such as competition (or cooperation) among the individuals or "species", or different models of the resource constraints, you can get growth dynamics that deviate from the pure logistic (and often can be chaotic).

James Salsman said...

The group of people authoring new content is a source roughly proportional to page views and growing along a traditional logistic as access to Wikipedia saturates in the human population. The tendencies toward deletionism act as a sink. Since the sink is the more complex process, it's worth modeling further, and I propose that it manifests in two component processes: judgments about what is an appropriate article and judgments about what article text is appropriate. I believe such a system in biology is very similar to what led Gompertz to propose his curve.

Mistaking an exponential for a logistic is a lot worse, but I'm from the old school that believes the degree-of-freedom-adjusted R^2 is actually the correct measure to use when searching for models to explain the observed processes.
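For reference, the degree-of-freedom-adjusted R^2 mentioned here can be computed as in the sketch below. It uses the usual linear-regression definition; applying it to nonlinear fits such as logistic or Gompertz curves is a judgment call. The data points are invented purely to show the arithmetic.

```python
def r_squared(observed, predicted):
    """Plain coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(observed, predicted, n_params):
    """Adjusted R^2; n_params counts fitted parameters excluding the intercept."""
    n = len(observed)
    r2 = r_squared(observed, predicted)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_params - 1)

# Invented data: the same predictions are penalized more when the
# model that produced them used more parameters.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
print(round(r_squared(observed, predicted), 4))
print(round(adjusted_r_squared(observed, predicted, 1), 4))
print(round(adjusted_r_squared(observed, predicted, 2), 4))
```

The adjustment matters when comparing candidate models with different numbers of free parameters, since a model with more parameters will almost always fit the same points at least as closely.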

Felipe Ortega said...

You can also find similar conclusions in a PhD dissertation published earlier this year (last April):

Thesis page:
http://libresoft.es/Members/jfelipe/phd-thesis

Manuscript (downloadable electronic version):
http://libresoft.es/Members/jfelipe/thesis-wkp-quantanalysis

This trend deserves a closer look, especially if you combine this finding about reverts with the typical profile of quality content editors and the demographic analysis of the Wikipedia community.

Many editors and sysops suggest that Wikipedia is becoming a more hostile environment for newcomers, something that may be supported by these empirical results. Nice work.

Anonymous said...

It's not uncommon for some system or entity to show different growth phenomena at different stages. A lot of populations show exponential growth in the early stages but it's not going to remain exponential long term (lack of nutrients, lack of resources, lack of editors, systemic changes, loss of novelty value, outside influences...)

Rather than compare the data to a model and conclude what this "means" is happening, it would be better to work from looking at joiners over time, mainstream life-cycles of joiners (including their fading or leaving) and how that has changed over time. Then look at measurable data on the editing environment and how that has changed in its interactions with users.

For example only, one might find joiners have changed profile or quantity; joiners in 2003-2005 tended to join, ramp up editing, then reach stability and had a typical profile of "editing patterns" before a given profile of fading or settling down; and that has changed by 2007-2009. One might find that popularity has led to a change in profile of joiners, with more vandals, and also more "serious editors", but proportionally fewer "middle of the road" editors.

One might find that the community has become more demanding on editors or articles, or handles certain matters qualitatively or quantitatively differently, and this has affected editing.

Taken as a whole this could have created changes anywhere from decline, to "fewer edits but more dedicated and competent editors", to many other patterns. Matching a raw number to a theory won't necessarily interpret what we're seeing.

FT2

Gregory Kohs said...

Given the apparent decline in growth in early 2007, it may be instructive to look at what was happening on Wikipedia around that time.

January 1st
Wikipedia temporarily blocks the entire nation of Qatar by mistake.

January (early) (Essjay Controversy)
'Essjay' is hired by Wikia.

January 7th (Essjay Controversy)
Essjay posts autobiographical details on his user page at Wikia (not Wikipedia), giving his supposed real name (Ryan Jordan), age, and previous employment history from age 19, and his positions within various Wikimedia Foundation projects. These details differ sharply from previous assertions on 'Essjay's' Wikipedia user page about his academic and professional credentials.

January 18th
The Ottawa Citizen examines the life of Wikipedia editor and arbitrator Simon Pulsifer, making light of the fact that he is unemployed and living with his parents.

Also January 20th
Jimmy Wales reverses a previous decision ignoring two polls to the contrary, to automatically add "no follow" tags to all outward links on Wikipedia. An exception to this policy are any of the "interwiki" formatted links, which include many to Wales' privately-held website, Wikia.com.

January 24th
Microsoft employees explain that the company paid a blogger to edit certain Wikipedia pages relating to Open Office standards. According to one Microsoft employee, the step was taken to avoid Wikipedia's Conflict Of Interest policy, and because articles were previously "heavily written by people at IBM, a rival standard supporter, and that Microsoft had gotten nowhere flagging mistakes to Wikipedia’s volunteer editors."

also January 24th
Journalist Brian Bergstein interviews MyWikiBiz founder Gregory Kohs on his travails with Wikipedia over paid editing.

January 28th
Wikimedia Foundation announces the creation of an Advisory Board. The board is headed by Angela Beesley, a business partner of Jimmy Wales.

February 16th
Distinguished Turkish scholar Taner Akçam is wrongly detained at the Montreal airport on the basis of false anonymous insertions in his Wikipedia biography. (see Wikipedia Review thread)

February 22nd
Fuzzy Zoeller sues a Miami firm due to defamatory posts made on Wikipedia.

February 23rd (Essjay Controversy)
Jimmy Wales announces the appointment of 'Essjay' to Wikipedia's Arbitration Committee (ArbCom). Wales later asserts that the appointment was "at the request of and unanimous support of" ArbCom.

February 26th (Essjay Controversy)
The New Yorker publishes a correction for its July 31 issue. Jimmy Wales is quoted on Essjay's false persona, “I regard it as a pseudonym and I don’t really have a problem with it.”

March 3rd (Essjay Controversy)
After an outpouring of rage from Wikipedians, and much negative publicity in the major media, Wales asks Essjay to resign his "positions of trust". Essjay promptly retires from Wikipedia altogether and later resigns from his position at Wikia. In his initial apology, Essjay makes an extraordinary claim that New Yorker journalist Schiff had offered to pay him during his interview, which was flatly denied.

March 8th
Jimmy Wales announces plans for Wikia's proposed search engine ("Search Wikia") to rival those of Google and Yahoo. It would utterly fail, about two years later.

March 16th
Wikipedia falsely claims that US entertainer 'Sinbad' has died.

March 22nd
Brad Patrick resigns as General Counsel to the Wikimedia Foundation. Danny Wool also quits as Wikimedia Foundation "grants coordinator" and resigns his roles on Wikipedia. Both Wool and Patrick cite disagreements with the Board of Trustees.

Pekka said...

I think there is likely to be huge potential for growth in Wikipedia - especially in scientific topics. Of course bioscientists for instance are trying to create their own Wikis that are more scientifically oriented in content but this would not preclude the expansion of the general Wiki in these areas. So I don't think we are running out of topics to cover. Probably a very minuscule proportion of all human knowledge is covered by Wikipedia.

Maarten Vonder said...

There is a precedent to Wikipedia. The New York Times tried to collect its readers' knowledge about all kinds of things around 2000-2001 through a project called Abuzz. It did succeed in interesting thousands of readers all over the world in writing about subjects as far apart as wine, gardening, history, architecture, art, literature, lifestyle, animals, health, you name it. These writers grew into a community not unlike today's Wikipedia. Ultimately, it failed for two reasons:

* it failed to cope with the amount of vandalism;

* it failed to cope with the conflicts around politics and religion.

Wikipedia has succeeded in coping with these problems so far. It could also fail in the near future. Research into Wikipedia would do better if it were able to take into account the influence of religion and politics. Those subjects could either obstruct growth a lot, or be a serious threat to maintenance.

Ed H. Chi said...

@Maarten: If you're interested in understanding conflicts in politics and religion in Wikipedia, you'll want to read this earlier post.

Kaldari said...

I don't see why this is so mysterious. If you want to know why the rate of article growth on Wikipedia has slowed, just try coming up with a notable topic that doesn't already have a Wikipedia article. Basically, there aren't any (unless it's a news item or something newly discovered). OK, so there are a few German folk songs from the 1600s that still don't have articles, but just wait a few months.

Peter Pirolli said...

I just want to point out that it isn't just new content that is getting less editing. I wrote an early post on Wikipedia slowdown analyses that graphed pages having different "ages" and the editing trends seem to happen for all page ages. See the graphs at

http://web.mac.com/peter.pirolli/Professional/Blog/Entries/2008/9/9_Is_Wikipedia_becoming_less_productive.html

George said...

I agree with this. There are thousands of articles I could create, and would have created in the past, on notable places, species, villages, and people - but it's not worth the effort of having to defend a good number of them against deletion, and the initial effort required to source them all to reliable sources.

Maarten Vonder said...

Ed and Peter, I just came hopping in after I was pointed to this blog, but there's discussion about this inside Wikipedia as well. First, there's a heck of a difference between the English Wikipedia and all others (with the exception maybe of the German). The English Wikipedia has 3 million articles and 10 million total users, of which 150,000 are shown to be "active" (http://meta.wikimedia.org/wiki/List_of_Wikipedias). The Dutch Wikipedia, 7th in the world, counts 550,000 articles and 250,000 users, of which 4,700 are active. As our local statistician Ronald Beelaard has shown, that last figure isn't correct: the number of active users during the month of May was 1,500. To maintain the Dutch encyclopedia as it is, every user has a workload of 360 articles on average, including editing, fighting vandalism, writing new articles, and so on. Your blog part 1 showed there are between 650,000 and 810,000 monthly active users on the English-language Wikipedia, which amounts to a workload of between 3 and 4 articles per user. Still, the graphs are pretty much the same for all Wikipedias, and all show the same resistance since their peak around 2007. This shows me that there is a factor you don't grasp as yet, which I'm very curious to learn more about.
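The workload comparison above is a simple ratio of articles to active editors. A quick sketch of the arithmetic, using only the figures quoted in this comment (the function name is ours):

```python
def articles_per_active_editor(articles, active_editors):
    """Average number of articles each active editor would have to look after."""
    return articles / active_editors

# Figures as quoted in the comment above.
dutch = articles_per_active_editor(550_000, 1_500)
english_low = articles_per_active_editor(3_000_000, 810_000)
english_high = articles_per_active_editor(3_000_000, 650_000)

print(round(dutch))            # Dutch Wikipedia
print(round(english_low, 1))   # English Wikipedia, 810,000 active editors
print(round(english_high, 1))  # English Wikipedia, 650,000 active editors
```

With these inputs the Dutch ratio comes out near 367 and the English ratio between about 3.7 and 4.6, so the two communities differ by roughly two orders of magnitude in articles per active editor. Note that the two sides use different definitions of "active", so the comparison is only indicative.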

One hypothesis is: the "free for grabs" information to be added has run out, leaving users to augment existing material. If there's truth in that theory, it can be corroborated by focusing on relatively small but fast-growing Wikipedias. You could try: the Russian; the Turkish; the Thai; the Georgian and the Serbo-Croatian, all with a high-standard culture and pride in their home country.

The second theory: Wikipedia is at the top of every page everywhere in every search engine, so you will only find its own content repeated once you want to add to it. The dragon has bitten its own tail. This would cause a drastic decline in the interest of potential users, because it's hardly possible to find more information via the internet. This theory could be checked, though with somewhat more difficulty, by showing that the augmentation no longer comes from information available since 1995 (the start of the Internet), but from books. Say it's about man's first steps on the Moon, 1969. How much of that information is from the internet, and how much could be added from sources before the internet period?

There might be a third theory: people have just gotten dead tired of the whole thing, bored out of their minds by standard features like "Did you know" or the fact that every living or dead person is presented by the same standard formula. If you've read and edited 1574 articles about tropical reptiles, how many out there are going to edit no. 1575? Especially if your specialism is being dumped by a youngster who's read 1.5 books about the subject and you have to quarrel with him for the next 1.5 years?

I know I'm not a statistics expert, but your basics might need a little looking after. Hope you'll take this as constructive criticism, thanks for listening,

Maarten Vonder.

llywrch said...

One factor that I haven't seen mentioned but could be the cause of this slowing growth/plateauing is a simple one: the pool of potential contributors is finite. All of the people who would contribute to Wikipedia either are doing so -- or have.

That shouldn't be a shocking observation: there are only so many people interested in stamp collecting, for example. And although literacy is almost universal, the number of people who write books (whether the books are published or not) is only a tiny fraction of all of those who are capable of doing so.

If this observation is correct, then to keep Wikipedia functioning there is a more daunting challenge ahead that must be met: to keep the willing interested in volunteering, & prevent WikiBurnout.

Geoff

Rob said...

OK, so it's becoming less rewarding to post or edit on Wikipedia. I see that. Lots of reasons for it, not the least of which is just how BIG Wikipedia is now: individual articles get less exposure, and so do individual contributors who contribute once a month. BUT ALSO, it's becoming harder to improve on the existing content, because so much of it is really very good, and often pretty thorough. The average guy on the street might find that his niche of knowledge is already well represented on Wikipedia, and there's just not much for him to add. And finally, hasn't some of the newness and excitement worn off? Isn't that inevitable?

carl said...

Two ideas:

1. Show the 'rise of templates'. Over time, what percentage of articles wear a cleanup template?
http://en.wikipedia.org/wiki/Wikipedia:Template_messages/Cleanup

2. Compute the 'net edit' for each user. This is the number of characters added or deleted, in total, over the course of all a user's edits. We could ask whether this figure has gone down for the top 10 editors over time (I will wager it has gone from positive to negative). Does this correlate to the other stats you are collecting?

-Carl
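A minimal sketch of the "net edit" idea from point 2, assuming each revision is available as a (user, size before, size after) record. That record layout, the sample data, and ranking the "top" editors by edit count are all assumptions made for illustration; they are not a real Wikipedia dump format.

```python
from collections import defaultdict

def net_edit_per_user(revisions):
    """Sum of characters added minus characters removed, per user.

    `revisions` is a list of (user, size_before, size_after) tuples.
    """
    totals = defaultdict(int)
    for user, before, after in revisions:
        totals[user] += after - before
    return dict(totals)

def top_editors_net(revisions, n=10):
    """Net edit for the n users with the most edits (ties broken by name)."""
    counts = defaultdict(int)
    for user, _, _ in revisions:
        counts[user] += 1
    net = net_edit_per_user(revisions)
    top = sorted(counts, key=lambda u: (-counts[u], u))[:n]
    return {user: net[user] for user in top}

# Hypothetical revision records.
sample = [
    ("alice", 100, 250),  # +150 characters
    ("bob", 250, 200),    # -50 characters
    ("alice", 200, 180),  # -20 characters
    ("carol", 180, 400),  # +220 characters
]
print(net_edit_per_user(sample))
print(top_editors_net(sample, n=2))
```

Running the same computation over successive time windows would show whether the most active editors have shifted from net adding to net removing text.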

Gdgourou said...

Perhaps a good comparison should be made with quality, though I assume that is something very subjective.

The number of edits could be stagnant, or might even decline, but I hope that the encyclopedic part of the site will improve.

kettlewell said...

This really isn't a surprise to me. Part of it has to do with TOMA (top of mind awareness).

The snowball will grow all on its own for a while, giving the great big exponential curve, but people's interests get spread out with all the other things that life has to offer, and without a marketing campaign to push TOMA, the users drop off.

It will be interesting to see if Twitter falls into the same category. Documents were revealed by TechCrunch a while back that Twitter aims to be the 1st to have 1 Billion (with a big B) users.

They currently have only 20 million, but expect much more shortly because of this exponential curve.

My guess is that they *might* hit 100 million users, but that users are going to drop off like flies to all the other fun things that life has to offer.

If Wikipedia, Twitter and others want to continue on this growth curve, they need to stay competitive, they need to market, and they need to create new and exciting products/ideas. Look at Google & Yahoo. Love em or hate em, they have a solid model of long-term growth.

Anonymous said...

You should look at page views, since that is what virtual communities really drive. You will find some numbers at stats.wikimedia.org. If people move from Wikipedia to e.g. Twitter and get "followers", that will lead to slowing growth and in the end to decline.

Robert W.

Anonymous said...

Yeah I used to really like contributing to Wikipedia; but it's increasingly become more Nazi like as the place is over-run by pedantic shit heads who think they can lord it over the rest of us for writing in, what they have staked out as "THEIR CLAIM".

Now that Wikipedia has become a little squat of fiefdoms, the peasants are fleeing from the overlords and their oppressive bullshit.
