Wednesday, July 22, 2009

PART 1: The slowing growth of Wikipedia: some data, models, and explanations

In September of 2008, we blogged about a curious change in Wikipedia that we had known about for a while but didn't know how to explain, and the ASC group has spent the last 6-9 months or so trying to understand it. The change that made us curious was that Wikipedia's growth rates have slowed. We were not the only ones wondering about this; The Economist (archived here), for example, wrote about it.

We are about to publish a paper in WikiSym 2009 on this topic, and I thought we should start to blog about what we found.


Monthly edits and identified revert activity

The conventional wisdom about many Web-related growth processes is that they are fundamentally exponential in nature. That is, if you wait some fixed amount of time, the content size and number of participants will double. Indeed, prior research on Wikipedia has characterized the growth in content and editors as fundamentally exponential. Some have claimed that Wikipedia article growth is exponential because there is exponential growth in the number of editors contributing to Wikipedia [1]. Our current research shows that Wikipedia's growth rate has slowed, and has in fact plateaued (see figure at right). Since about March of 2007, the growth pattern is clearly not exponential. What has changed, and how should we modify our thinking about how Wikipedia works? Prior research had assumed Wikipedia works on an "edit begets edit" model, that is, a preferential attachment model in which the more edits an article receives, the more likely it is to receive further edits, resulting in exponential growth [2]. Such a model does not preclude some ultimate limitation to growth, although at the time it was presented [2] there was an apparent trend of unconstrained article growth.
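To make that assumption concrete, here is a minimal Python sketch of what purely exponential growth predicts: a fixed doubling time, no matter how large the encyclopedia already is. The starting size and doubling time below are made up for illustration, not fitted to Wikipedia data.

    # Purely exponential growth: the article count doubles every fixed interval,
    # regardless of how large the encyclopedia already is.
    def exponential_articles(n0, doubling_time_months, months):
        """Predicted article count after `months`, starting from n0 articles."""
        return n0 * 2 ** (months / doubling_time_months)

    # Illustrative numbers only: 1,000,000 articles with a 12-month doubling time
    # predicts 2,000,000 articles after one year and 4,000,000 after two.
    for m in (0, 12, 24):
        print(m, round(exponential_articles(1_000_000, 12, m)))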


Monthly active editors: the number of users who have edited at least once in that month


The number of active editors shows exactly the same pattern. The second figure on the right shows that, since its peak in March 2007 (820,532), the number of monthly active editors in Wikipedia has been fluctuating between 650,000 and 810,000. This finding suggests that the conclusions in [1][2] may no longer be valid. A different process is going on in Wikipedia now.


Article growth per month in Wikipedia. Smoothed curves are growth rate predicted by logistic growth bounded at a maximum of 3, 3.5, and 4 million articles.

Some Wikipedians have modeled the recent data and believe that a logistic model is a much better way to think about content growth. The figure here shows that article growth per month peaked in 2007-2008 and has been declining since then. This result is consistent with a growth process that hits a constraint – for instance, one caused by resource limitations in the system. Microbes grown in culture, for example, will eventually stop duplicating when nutrients run out. Rather than exponential growth, such systems display logistic growth.
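For readers who want to see the shape of the logistic model behind those smoothed curves, here is a minimal Python sketch. The carrying capacities (3, 3.5, and 4 million articles) come from the figure caption; the starting size and per-month growth rate below are assumed for illustration, not fitted to the actual data.

    import math

    def logistic_articles(t_months, n0, r, K):
        """Logistic article count: growth slows as the total approaches the ceiling K."""
        return K / (1 + (K / n0 - 1) * math.exp(-r * t_months))

    def monthly_growth(n, r, K):
        """Articles added per month under logistic growth: r * N * (1 - N / K)."""
        return r * n * (1 - n / K)

    n0, r = 1_500_000, 0.05  # assumed starting size and per-month rate (illustrative)
    for K in (3_000_000, 3_500_000, 4_000_000):
        n = logistic_articles(24, n0, r, K)   # predicted size two years in
        print(K, round(n), round(monthly_growth(n, r, K)))

The monthly growth term is what rises, peaks when the article count reaches half the ceiling, and then declines, which is the qualitative pattern in the figure above.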

Over the next few weeks, as we find time to summarize the results, we will continue to blog about what we believe might be happening.

[1] Almeida, R.B., Mozafari, B., and Cho, J. On the evolution of Wikipedia. ICWSM 2007, Boulder, CO, 2007.
[2] Spinellis, D., and Louridas, P. The collaborative organization of knowledge. Communications of the ACM, 51(8), 68-73, 2008.

Monday, July 20, 2009

Social attention and interactions are key to learning processes


I just finished reading a long article in the journal Science on how social factors are increasingly recognized as extremely important in a new science of learning [1].

Learning is fundamentally a social activity, the article argued in part. "Social cues highlight what and when to learn." Meltzoff et al. summarize a whole slew of recent research showing how young infants learn by imitating and copying others' actions, and how they build abstractions and models of others' behaviors. In fact,
"Children do not slavishly duplicate what they see but reenact a person’s goals and intentions. For example, suppose an adult tries to pull apart an object but his hand slips off the ends. Even at 18 months of age, infants can use the pattern of unsuccessful attempts to infer the unseen goal of another. They produce the goal that the adult was striving to achieve, not the unsuccessful attempts."


One point made in the article is how much the greater environment outside of school is becoming an important part of the ecology of learning.
"Elementary and secondary school educators are attempting to harness the intellectual curiosity and avid learning that occurs during natural social interaction. The emerging field of informal learning is based on the idea that informal settings are venues for a significant amount of childhood learning. Children spend nearly 80% of their waking hours outside of school. They learn at home; in community centers; in clubs; through the Internet; at museums, zoos, and aquariums; and through digital media and gaming."


Social learning, of course, is a major part of the social web. Wikipedia was designed to be an easy-to-use and freely available reference, and the social interactions offered by various online forums are rapidly becoming part of the educational experience for secondary school pupils. I would argue, for example, that Wikipedia has done more for the continuing education of adult learners than any educational institution could have done by itself. ASC's research has purposefully focused on learning and information access, rather than entertainment, because we recognize the importance of social factors in many kinds of learning.

As an example, social learning was explicitly part of the design of our SparTag.us prototype, which was announced at the recent CHI 2009 conference and is now being offered as limited beta software to Firefox users. It streams the annotations you make as you browse the web. The stream is collected into your notebook, and by default this stream of annotations is made available to anyone interested in it. This makes it possible to aggregate social attention later.


[1] Meltzoff, A. N., Kuhl, P. K., Movellan, J., and Sejnowski, T. J. Foundations for a New Science of Learning. Science, 325(5938), 284-288, 2009. [DOI: 10.1126/science.1175626]

[2] Photo: Alan Decker and the Machine Perception Lab, UC San Diego.

Friday, July 17, 2009

Historical Roots behind TagSearch and MrTaggy

Boing Boing recently covered our work on the TagSearch algorithm and the MrTaggy prototype. We built the prototype to show how ideas from past search research remain relevant in the new world of "social search".

Boing Boing published the following response after asking us about the historical roots of the TagSearch algorithm and the MrTaggy UI:

Several pieces of important earlier PARC work inspired the TagSearch algorithm and MrTaggy's user interface and experience.

First, one of the most efficient ways of browsing and navigating toward a desired part of an information space was illustrated by the pioneering research on Scatter/Gather, a collaborative project on large-scale document space navigation involving amazing researchers such as Doug Cutting (of Lucene and Hadoop fame) and Jan Pedersen (chief scientist for search at AltaVista, Yahoo, and Microsoft). The research, done in the early to mid 90s, showed how a text clustering algorithm can be used to quickly divide up an information space (the scatter step) and then ask the user to specify which subspaces they are interested in (the gather step). By iterating over this process, one can very quickly narrow down to just the subset of information items of interest. Think of it as playing 20 questions with the computer.
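To make the scatter and gather steps concrete, here is a rough Python sketch of the browsing loop, using scikit-learn's TF-IDF vectorizer and k-means purely as stand-ins; the original Scatter/Gather system used its own, much more scalable clustering algorithms.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    def scatter(docs, k=5):
        """Scatter step: cluster the current document set into k topical groups,
        each summarized by its top terms."""
        tfidf = TfidfVectorizer(stop_words="english")
        X = tfidf.fit_transform(docs)
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        terms = np.array(tfidf.get_feature_names_out())
        clusters = []
        for i in range(k):
            members = [d for d, label in zip(docs, km.labels_) if label == i]
            top_terms = list(terms[km.cluster_centers_[i].argsort()[::-1][:5]])
            clusters.append((top_terms, members))  # (summary, contents)
        return clusters

    def gather(clusters, chosen):
        """Gather step: keep only the documents from the clusters the user picked."""
        return [doc for i in chosen for doc in clusters[i][1]]

    # The loop: scatter, let the user pick, gather, then scatter again on the
    # smaller set -- each round narrows the collection, like 20 questions.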

Second, also around the mid-90s, an important information access theory called Information Foraging was being developed in our research group at PARC. It showed that you can mathematically model the way people seek information using the same ecological equations used to model how animals forage for food. We noticed that we could use information foraging ideas to model how people used Scatter/Gather to browse for information. It was possible to predict how people use the information cues in each cluster (which we called 'information scent') to decide whether they were interested in the cluster's contents. Scatter/Gather turned out to be a very efficient way to communicate the topic structure of a very large document collection to the user. In other words, people learned the structure of the information space much more efficiently using Scatter/Gather interfaces.
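To give a flavor of what "the same ecological equations" means: one classic form that Information Foraging adapts from optimal foraging theory is the rate-of-gain equation R = G / (T_B + T_W), where G is the total value of the information gained, T_B is the time spent moving between patches (clusters, pages, or collections), and T_W is the time spent working within patches. Like a forager deciding when to leave a berry patch, an information seeker is modeled as acting to maximize R, and information scent is the cue that guides those stay-or-go decisions.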

I hope it is clear that MrTaggy's relevance feedback mechanisms are very much inspired by Scatter/Gather. The related tags communicate the topic structure of what is available in the collection. We designed MrTaggy with this process in mind, hoping it would be just as efficient as Scatter/Gather at communicating the topic structure of the space.

Third, our group had developed Information Scent algorithms and concepts to build real search and recommendation systems. These algorithms build upon earlier work on a human memory model called Spreading Activation. The TagSearch algorithm uses similar concepts: it constructs a kind of Bayesian model of the topic space from tag co-occurrence patterns. TagSearch owes its heart and soul to Spreading Activation concepts, which help us find documents related to certain tags, and vice versa.
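The following is only an illustrative Python sketch of the spreading activation idea over a tag/document graph, not the actual TagSearch implementation (the function, weights, and parameters are invented for the example): activation is seeded at the query tags, pushed along co-occurrence links for a few steps, and documents are ranked by how much activation reaches them.

    import numpy as np

    def spreading_activation(links, seeds, decay=0.5, steps=3):
        """links: symmetric (tags+docs) x (tags+docs) matrix of association strengths.
        Returns an activation score for every node."""
        links = np.asarray(links, dtype=float)
        row_sums = links.sum(axis=1, keepdims=True)
        # Normalize rows so each node passes on at most its own activation.
        transfer = np.divide(links, row_sums, out=np.zeros_like(links), where=row_sums > 0)
        activation = np.zeros(links.shape[0])
        activation[seeds] = 1.0
        for _ in range(steps):
            activation = activation + decay * (transfer.T @ activation)
        return activation

    # Toy graph: nodes 0-1 are tags, nodes 2-4 are documents; weights are co-occurrence counts.
    links = [[0, 2, 1, 1, 0],
             [2, 0, 0, 1, 1],
             [1, 0, 0, 0, 0],
             [1, 1, 0, 0, 0],
             [0, 1, 0, 0, 0]]
    # Scores for all five nodes; compare entries 2-4 to rank the documents for tag 0.
    print(spreading_activation(links, seeds=[0]))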

Monday, July 13, 2009

Visualization used to improve team coordination


This past Thursday I spent some time at the IBM Almaden research center to attend the NPUC conference, which focused on the future of software development. In computing, software development is one of the most energy-intensive collaborations, often requiring significant coordination. There are elements of competition thrown in for good measure, and of course everyone is working in the same workspace, which is often coordinated by version control software. Sounds quite like Wikipedia, doesn't it?

One of the interesting talks at NPUC was Gina Venolia's, on using visualizations to represent the structure of the code. Such a representation can be used by an individual to make sense of the system, as well as by a team to explain its structure to others. As maps of the system, these visualizations help anchor conversations between developers by providing an intermediate representation of the knowledge structure they must share for effective coordination.

This is a fascinating area in which to think about how augmented social cognition ideas could provide better tools for collaborative software development. For example:

* Each developer could get a color on the map. Overlaps between two developers could then be easily visualized to see areas where they have needed to coordinate in the past and will need to in the future.

* By building up an understanding from the code of who is working with whom, annotations and comments made by one developer could be sent over to another developer's map when code is checked into the system.

* Social analytics could be used to discover where developers are clashing with each other (much as we have discovered conflicts in Wikipedia).

Microsoft Research has indeed been thinking about awareness tools in this direction: a project named FASTDash works to increase awareness among developers on software teams. Lots of exciting possibilities!