Tuesday, October 20, 2009

Part 4 on WikiSym paper: A proposed modified model of Wikipedia Growth

As mentioned in the first post on the slowing growth rate of Wikipedia, it appears that article growth reached a peak around 2007. Rather than exponential growth, it appears that Wikipedia displays logistic growth. A hypothetical logistic Lotka-Volterra population growth model bounded by a limit K is shown in the following Figure:


A hypothetical logistic Lotka-Volterra population growth model bounded by a limit K.

The above figure was generated by a Lotka-Volterra population model that assumes a resource limitation K. This K variable is known as the carrying capacity, which is the limit of the population growth. Translated into our case and using the articles as the stand-in for a population, this is the maximum number of articles that Wikipedia might eventually reach. This limit might be reached because knowledge below a threshold of notability is not eligible to become an encyclopedia entry, or because there is no one around in the community who knows enough about the subject to write it up.

In either case, according to this model, at the early stages of population growth the growth rate appears exponential, but the rate decelerates as the population approaches the limit K. If the total amount of encyclopedic knowledge were some constant K, then the write-up of that knowledge into Wikipedia might be expected to follow a logistic curve such as the one in the Figure above.
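To make the shape of this curve concrete, here is a minimal sketch in Python of the standard logistic equation dN/dt = r*N*(1 - N/K), using its closed-form solution. The starting count N0, the rate R, and the limit K below are hypothetical values picked only for illustration; they are not fitted to Wikipedia data and are not the parameters behind the Figure.

```python
import math

def logistic(t, n0, r, k):
    """Closed-form solution of dN/dt = r*N*(1 - N/K) with N(0) = n0."""
    return k / (1.0 + ((k - n0) / n0) * math.exp(-r * t))

# Hypothetical parameters, chosen only for illustration (not fitted to any data).
N0 = 1_000.0      # assumed starting number of articles
R = 1.0           # assumed intrinsic growth rate, per year
K = 3_000_000.0   # assumed carrying capacity (maximum number of articles)

# Article count at the start of each year, and the gain during each year.
trajectory = [logistic(year, N0, R, K) for year in range(0, 21)]
yearly_gain = [later - earlier for earlier, later in zip(trajectory, trajectory[1:])]

# Index of the year with the largest gain: growth accelerates before it
# and decelerates after it.
peak_year = yearly_gain.index(max(yearly_gain))

for year, count in enumerate(trajectory):
    print(f"year {year:2d}: {count:12.0f} articles")
print("largest yearly gain occurs in year", peak_year)
```

With any such parameters, the yearly gain rises while the count is well below K, peaks near the midpoint K/2, and then shrinks toward zero as the count approaches K.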

But there is a general sense that the stock of knowledge in the world is also growing. For instance, studies of scientific knowledge (e.g., [13][23]) suggest that it exhibits exponential growth. Also, events in the world (e.g., the election of Barack Obama or Lindsay Lohan’s rehabilitations) create new possibilities for write-up.

A possible modification to the logistic growth model is as follows: We suggest that if the total amount of knowledge exhibited some monotonic growth as a function of time, K(t), one might expect a variant of logistic growth as depicted in the Figure below:


A hypothetical Lotka-Volterra population growth model bounded by a limit K(t) that itself grows as a function of time.
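As a rough sketch of this variant, the Python snippet below integrates dN/dt = r*N*(1 - N/K(t)) numerically with a simple Euler step, once with a fixed limit and once with a limit that grows linearly in time. The linear form of K(t), its slope, and all other numbers are assumptions made purely for illustration; the post does not specify how K(t) grows beyond it being monotonic.

```python
def simulate(n0, r, k_of_t, years, steps_per_year=1000):
    """Euler integration of dN/dt = r*N*(1 - N/K(t)).

    Returns the population at the start of each year, including year 0.
    """
    dt = 1.0 / steps_per_year
    n = n0
    yearly = [n]
    for step in range(years * steps_per_year):
        t = step * dt
        n += dt * r * n * (1.0 - n / k_of_t(t))
        if (step + 1) % steps_per_year == 0:
            yearly.append(n)
    return yearly

def fixed_limit(t):
    """Constant carrying capacity (hypothetical value)."""
    return 3_000_000.0

def growing_limit(t):
    """Carrying capacity that grows linearly with time (hypothetical slope)."""
    return 3_000_000.0 + 300_000.0 * t

# Same hypothetical starting count and growth rate for both runs.
fixed = simulate(1_000.0, 1.0, fixed_limit, years=20)
growing = simulate(1_000.0, 1.0, growing_limit, years=20)

for year in range(0, 21, 5):
    print(f"year {year:2d}: fixed K {fixed[year]:10.0f}   growing K(t) {growing[year]:10.0f}")
```

Under these assumptions the fixed-limit run flattens out completely, while the growing-limit run slows down after its early burst but keeps rising at roughly the rate at which K(t) itself grows.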

As originally recognized by Darwin in relation to the growth of biological systems [7], competition (the “struggle for existence”) increases as populations hit the limits of the ecology, and advantages go to members of the population that have competitive dominance over others. By analogy, we suggest that:

(a) the population of Wikipedia editors is exhibiting a slowdown in its growth due to limited opportunities to make novel contributions, and

(b) the consequences of these (increasing) limitations in opportunities will manifest themselves in increased patterns of conflict and dominance.

The limitations in opportunities might be the result of multiple and diverse constraints. For example, on one hand, we expect that the capacity parameter K is determined by limits that are internal to the Wikipedia community such as the number of available volunteers that can be coordinated together, physical hours that the editors can spend, and the level of their motivation for contributing and/or coordinating.

On the other hand, we expect that the capacity also depends on external factors such as the amount of public knowledge (available and relevant) that editors can easily forage and report on (e.g., content that is searchable on the web) and the properties of the tools that the editors and administrators are using (e.g., usability and functionalities).

In summary, globally, the number of active editors and the number of edits, both measured monthly, have stopped growing since the beginning of 2007. Moreover, the evidence suggests they follow a logistic growth function.

Our paper will finally be presented by Bongwon Suh at the WikiSym 2009 conference. The citation for the full paper is:
Bongwon Suh, Gregorio Convertino, Ed H. Chi, Peter Pirolli. The Singularity is Not Near: Slowing Growth of Wikipedia. In Proc. of WikiSym 2009, (to appear). Oct, 2009.

Thanks go to my co-authors, who should receive equal credit for this research!

22 comments:

randall said...

as always, ed, interesting stuff.

i wonder, do you have the means to track the extent to which contributors to wikipedia (etc.) are positively or negatively influenced to contribute further edits to wikipedia by the behavior of others in the wikipedia community? a simple example would be someone who contributed either a new article or edits to an existing article, only to stop contributing after some negative feedback (edits were reverted, article was deleted, or something along those lines).

Ed H. Chi said...

randall: Currently we don't have an automatic way to do the tracking that you mentioned, but that would be an amazing way to do the research, as it would point to the actual mechanism by which people are discouraged from contributing. I would love to do that analysis, but alas, I haven't found a way to do that.

However, some of the ethnographic studies that we have done do suggest that there are anecdotal stories that point to that as a possible mechanism and explanation for the reduced contribution rates.

You may wish to look at the strategy.wikipedia.org site for some other opinions.

James Salsman said...

Since most people probably decide how much they want to be editors after their first 5-10 editing experiences, I recommend asking why the "new articles created per day" statistic on http://stats.wikimedia.org/reportcard/ is so much more volatile (and even periodic?)

HenkvD said...

A very nice possible explanation why growth is not logistic. On the graph K(t) grows about 300.000 articles per year, but the graph is just hypothetical, as in 2009 it reaches 6M, whereas the actual article count is just over 3M. Do you have any idea how large the yearly growth of knowledge (for wikipedia articles) could be?

Ed H. Chi said...

@James: Good question on the new article per day metric. The data from wikistats is incomplete and curiously missing for the English Wikipedia, but yet the graph contains data for en.wikipedia. Will have to look into this more.

@HenkvD nice catch. We simply plotted the idea, but didn't pay close attention to the actual numbers on the Y-axis. Sloppy and embarrassing.

Current estimates of new articles per month in Wikipedia are around 1K-2K articles. Interestingly enough, ja.wikipedia follows a logistic curve here also, going from 200 up to a max of around 450 and down to 270-ish recently. This metric would be a very rough measure of knowledge going into the system.

HenkvD said...

@Ed, you probably mean currently 1K-2K a DAY, as for Japan 270 per day as well. What I am interested in is the estimation of K(t) factor.

Jon Awbrey said...

Lotka-Volterra?

That has something to do with Predator-Prey, don't it?

It just shows to go ya …

Keep on following Wikipedia down the Goldbrick Road of Pseud-o-Scholarship and you may actually converge on some kind of e-piphany after all …

James Salsman said...

Jon, having a model which represents underlying behavior is better than the alternative, regardless of how you feel about the system as a whole. And yes, contributors are to prey as deletionists are to predators, if you want to think of it that way.

Ed H. Chi said...

@HenkvD You're right that the wikistats data says the figures are per DAY, not per MONTH. Sigh... Sorry for the confusion. From what I am told (we can't analyze this ourselves because WP dumps do not include deleted articles), many created articles do not survive long after being created (as much as 1/2 to 2/3 apparently are due to vandalism.) Survivability analysis would be very interesting to do at a more detailed level.

@Jon: I rather like to take your comments to mean that we haven't gotten to a good answer as to why the community seems to be more closed and that participation seems to have plateaued. Your earlier suggestions on doing more detailed and content-level analysis are where we hope to be, and will aspire to.

I hope we're convincing others that we are not doing pseudo-science here, but rather we're following the same methodology of others before us on understanding knowledge and information systems. We're not looking for an epiphany, but a model that can explain the evolution of these systems. This model will hopefully reduce conflict and enhance knowledge capture.

Jon Awbrey said...

Ed, James —

How does one get at a "model which represents underlying behavior"?

Right off the bat, I would suggest applying a modicum of qualitative anthropological-sociological field methods to get acquainted with the "underlying behavior" in the field — before you even think of twiddling the knobs on your favorite curve-fitting machine.

I just don't get that sense of realism that I should be getting from research that makes contact with its domain.

Another question that you ought to be asking on a continual basis is:

What are threats to the validity of our theoretical constructs?

I've pointed out numerous assumptions hiding in your population model that are glaringly dubious from what I know about Wikipedia, but I don't get the sense that you are addressing these obvious issues, much less actively searching for potential problems on your own.

Erik Zachte said...

Quote: "The data from wikistats is incomplete and curiously missing for the English Wikipedia, but yet the graph contains data for en.wikipedia. Will have to look into this more."

There has not been a full archive dump for English Wikipedia for years now. As a stop gap measure wikistats recently started to use stub-meta-history.xml.gz (which contains meta data for each revision but no article content) to produce at least some stats.

Thus growth trends in number of articles, edits and editors can be found, not in growth in word count, article size or number of links, etc.

http://stats.wikimedia.org/EN/TablesArticlesNewPerDay.htm does not use the stop gap metrics, but it could.

---

BTW wikistats so far totally ignores deleted articles, it makes no difference between articles that were deleted and articles that never existed. It only measures trends in number of articles that survived to this day.

For some community dynamics deletion stats are obviously relevant as Ed showed in earlier findings.

James Salsman said...

Jon, you're right; we need to extend from the traditional producer/consumer model of economic simulation and start modeling the reward-based motivations for editing. Are you suggesting that "Keep on following Wikipedia down the Goldbrick Road of Pseud-o-Scholarship and you may actually converge on some kind of e-piphany after all" represents numerous assumptions hiding in your population model that are glaringly dubious? I hope you will explain that, please. Sadly, qualitative anthropological-sociological field methods are not very available or forthcoming unless self-reported, and the community already recognizes the inaccuracy of self-reported information, so it would seem to be asking for trouble to rely on surveys alone.

In the mean time, I hope everyone will suggest more reasons that the number of new articles per day could be so volatile. Is there a wider zeitgeist in society correlated with how much people in general want to create new articles?

Ed H. Chi said...

@Jon: I will freely admit that we are not anthropologists nor ethnographers. And to get at the cultural questions of Wikipedia, numbers present only a one-sided view.

I agree that a hybrid methodology involving quantitative analysis as well as ethnographic (qualitative) and participatory field work will yield an even better understanding.

Having said that, we conducted some 8-10 interview studies with Wikipedians in early 2008 to try and ascertain the reasons for the drop in contribution rate. During that study period, Diane Schiano (ex-PARC researcher with Stanford ties on communication and CMC research) transcribed and analyzed much of that data, but we never published the results from that ethnographic study, due to management pulling the resources out of that project. One conclusion from that study that I remember is that many elite Wikipedians regard the principles and methods that evolve out of the practice in Wikipedia as very deeply culturally ingrained (almost like a religion). It had been a goal of mine to revive that study, but we haven't gotten the resources to do that.

In the mean time, I will point you to some good ethnographic research done by others on Wikipedia:

Bryant, Susan, Andrea Forte and Amy Bruckman. (2005). "Becoming Wikipedian: Transformation of participation in a collaborative online encyclopedia" In Proceedings of Group 2005: International Conference on Supporting Groupwork.

Forte, Andrea and Amy Bruckman. (2008). Scaling consensus: increasing decentralization in Wikipedia governance. In the Proceedings of the Hawaii International Conference on System Sciences (HICSS).


Viégas, Fernanda B., Martin Wattenberg, and Matthew M. McKeon. (2007). The Hidden Order of Wikipedia. In HCII.


Reagle, J. (2008). In good faith: Wikipedia collaboration and the pursuit of the universal encyclopedia. PhD thesis, New York University, New York, NY. [ http://reagle.org/joseph/2008/03/dsrtn-in-good-faith ]

Also: http://reagle.org/joseph/2006/disp/proposal.html

Ed H. Chi said...

@Erik: Good to hear from you again. Our paper uses data from the stub-meta-history.xml dump from 2008-10. The new articles per day could indeed be generated from this file.

Your other stats on word count, article size, etc, would obviously depend on the full revision dump, which as you said is not available. This is really quite unfortunate, as it prevents us from getting a good understanding of the better metrics that go beyond edit counts. Since you're on the Foundation, perhaps you can get them to put this at a higher level of priority?

Jon Awbrey said...

James,

I was referring to comments about population models that I made with regard to earlier reports in this series.

There is, for example, the dubious assumption of a 1-to-1 relation between accounts and persons, constantly being confounded under the construct "editors", when any reasonably experienced informant from the trenches of Wikipedia could tell you that the relation is many-to-many, and in ways that undermine presumptive stratifications into "elite" and "novice" participants.

If you are really dealing with "accounts that index many edits" then you need to say "accounts that index many edits" and stop assuming a 1 person equals 1 account model just because it's easier to think about. That course serves only to import rather naive assumptions about the persons and interest groups behind the accounts.

Jon Awbrey said...

Ed,

DGMW (Don't Get Me Wrong), I believe that you and your group have asked many of the right questions, duly noted by myself and others at The Wikipedia Review — well, back in the days when some us still bothered to needle a few threads on the Big Tapestry, for example, here.

And it's precisely the importance of those questions that makes us so impatient, not for answers right away — who could expect that? — but for a hint at least that some intrepid researchers somewhere, with the resources to do so, are pointed more in the direction of reality than some wiki-wishful fantasy.

Have you looked much into the literature streams on Learning Organizations (e.g. Peter Senge et al.)? — or Argyris and Schön on the difference between espoused and enacted goals (link at random)?

Ed H. Chi said...

Jon:

Thanks for sending over the links. I read the first link at Wikipedia Review with interest. I knew for some time that there was some good discussion over there about governance models, and had read some bits of it, but since Wikipedia research is mainly a side research project of my group, I haven't been able to get involved in the discussion much.

I am more familiar with learning theories relating to Schemata (Piaget's work and subsequent builds), constructivism and social constructivism, Active Learning, and theories of sensemaking from Russell, Card, Pirolli, and Stefik at PARC.

On work that is probably more related to the learning organization stuff you mentioned, I am more influenced by the work of Herbert Simon (Nobel prize winner) on organizational theory. Stuart Card, my former mentor, studied with Allen Newell and Herbert Simon at CMU, so my influences come from that side of academia.

Jon Awbrey said...

Ed,

Piaget (Genetic Epistemology, Structuralism) was a big influence on my thinking in the 70s, blending into Minsky and Papert as I gradually shifted to AI, and I do remember a lot of curve-fitting and power laws in Newell and Simon, but also protocol analysis, the fons et origo of cyber-ethnography, the very model of modern nitty-grittism, and a constant inspiration for my "exploratory analysis of sequential observations" all through the 80s.

¤ sigh ¤

good times …

Erik Zachte said...

Regarding full archive English dump
"Since you're on the Foundation, perhaps you can get them to put this at a higher level of priority?"

Ed, I wish I could. I have been pressing for this for 3+ years. Like Cato's "Ceterum censeo Carthaginem esse delendam", I can utter "there is one more thing..." and some tech people know what follows.

I hear you will visit WMF soon. I wish I could hear your talk. Please make your case for the broken dump as well.

If the answer is 'asap', you might ask if there still is a p at all in this equation. ;-)

Ed H. Chi said...

@Jon:
Yep, we have done quite a bit of protocol analysis, some of which is published and some of which is not. See, for example, the ref below [1]. Good times indeed, but we don't get to go to that level of detail very often these days.

I have appreciated your desire to get deep on the problem. I can't say that our research is as deep as protocol analysis or cognitive task analysis, but we're trying to get there. More importantly, I'm aware how rough our measures still are (esp. given sockpuppetry, and all kinds of noise in these systems). I wish someone would fund my group to spend all day looking at these problems.

@Erik:
I'm visiting the WM Foundation on Nov 23rd, after I get back from my vacation, so I will definitely bring up the issue again about the importance of getting full dumps. I heard from reports at WikiSym that BrionV is leaving as CTO, so this might put a monkey wrench in the mix. I don't know who is leading technologically there at this point. I guess we'll find out soon.

[1] Card, S. K., Pirolli, P., Van Der Wege, M., Morrison, J. B., Reeder, R. W., Schraedley, P. K., and Boshart, J. 2001. Information scent as a driver of Web behavior graphs: results of a protocol analysis method for Web usability. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Seattle, Washington, United States). CHI '01. ACM, New York, NY, 498-505. DOI= http://doi.acm.org/10.1145/365024.365331

Jon Awbrey said...

Hi Ed,

I linked to the ASC Blog in the discussions relating to this CyberLaw Course Project.

Jon Awbrey

P.S. You forgot to put any relevant tags on this article.