Monday, January 26, 2009

Governing and authorship models at Wikipedia and Britannica

Elsewhere, we have spoken about the complex and interesting governing and authorship model of Wikipedia. How counter-intuitive is it that a model like "anyone can edit anything they want" could produce a useful information resource?!

We have characterized some of the social dynamics within this community and tracked its changes over time. Interestingly, in the last few days, both Wikipedia and Britannica have been in the news for debates over their authorship and editorial models.

First, on Jan 24th, we learned from the BBC that the president of Britannica wrote a blog entry in which he outlined a new plan at Britannica to enable readers, as well as more experts and editors, to help expand and maintain the articles. While he did not mention Wikipedia by name, it was a clear nod toward a more collaborative relationship Britannica will have with its readers. Specifically, in the blog entry, Jorge Cauz says that "We believe that the creation and documentation of knowledge is a collaborative process but not a democratic one." Most would agree that, in the past, the collaborative process at Britannica was much more restrictive, and now they seem to have decided to open the door wider to include more people in the editorial process.

Then, today, we learned, also from the BBC, that Jimmy Wales has caused a huge stir at Wikipedia by suggesting a more restrictive approach to the editing process. He now believes that Wikipedia should follow a model in which edits from anonymous users have to be vetted by one of the site's editors before going live.

Apparently, the heated debate is now spreading, and is being mentioned as a big news item on the Yahoo! front page after being written up by AFP. So here we have a system that has been extremely liberal with its editorial policy moving toward a more restrictive authorship model.

So what gives? Is there a right or wrong way to construct and compile knowledge resources? As designers of social systems, what should be the governance model for these systems?

For one thing, we still know awfully little about the social dynamics in these large social systems. We have been quoted in the past saying that our characterizations of editors show that the top 1% of editors in Wikipedia generate 50% of the edits. While that is true, the other 50% is generated by the other 99% of editors. This other 50% is just as important as the first 50%!

We have recently been conducting some additional research to understand class structures in Wikipedia. We already know that the distribution of editors and their edit frequencies in Wikipedia follows a classic power-law curve. In order to understand editors throughout this distribution, we first ranked editors by their edit frequency, and then divided all of the edits into four quarters according to this ranking.

For one month's worth of edit data, about 220 editors sit at the very top of the pyramid. These top (most frequent) editors produce the first quarter (25%) of the edits. The next 25% of the edits come from about 1,000 editors; the third quarter comes from about 4,000 editors, and the last quarter from about 15,000 editors.
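The partitioning above is straightforward to reproduce from a dump of per-editor edit counts. Here is a minimal sketch (the function name and input format are our own illustration, not the actual analysis code): rank editors by edit count, then walk down the ranking, starting a new class each time another quarter of the total edits has been accumulated.

```python
def quartile_classes(edit_counts):
    """Partition editors (given as per-editor edit counts) into four
    classes, each responsible for roughly 25% of all edits.
    Editors are ranked from most to least active first."""
    ranked = sorted(edit_counts, reverse=True)
    total = sum(ranked)
    classes = [[] for _ in range(4)]
    cutoff = total / 4
    cum, k = 0, 0
    for count in ranked:
        cum += count
        classes[k].append(count)
        # Move to the next class once another quarter of edits is covered.
        if k < 3 and cum >= cutoff * (k + 1):
            k += 1
    return classes
```

On a power-law-shaped input, the classes come out exactly as described in the text: a handful of very active editors cover the first quarter, while the last quarter requires a much larger crowd of occasional editors.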

So now the research question is whether you want to design your editing policy to favor the upper class (top editors and administrators), the middle class (the 5,000-6,000 editors who contribute the middle 50% of all edits), or the lower class (the 15,000 editors who contribute the last 25%).

One way to think about this problem is to study the amount of resistance each of these four classes of editors experiences on Wikipedia. A metric that we used is the reverts-to-edits ratio: on average, what percentage of edits was reverted, as experienced by each of these four classes of editors? It turns out that the reverts-to-edits ratios for the four classes were 1.3%, 1.4%, 1.5%, and 4.7%, respectively, meaning that the lower class of editors clearly experiences greater resistance: on average, about 1 out of every 20 edits they contribute is reverted. Moreover, the resistance they experience has generally increased over time (from about 3% in early 2006 to 5-6% in 2007-2008, and back down to around 5% in late 2008).
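The metric itself is simple to compute once each edit is labeled with its editor's class and whether it was later reverted. A minimal sketch, assuming per-edit revert labels are available (the class names and input format here are illustrative, not our actual pipeline):

```python
from collections import defaultdict

def reverts_to_edits_ratio(edits):
    """Compute the per-class reverts-to-edits ratio.

    `edits` is an iterable of (editor_class, was_reverted) pairs,
    where editor_class is e.g. 'top', 'upper', 'middle', 'lower'
    and was_reverted is True if the edit was later undone.
    Returns {class: fraction of that class's edits reverted}.
    """
    totals = defaultdict(int)
    reverted = defaultdict(int)
    for cls, was_reverted in edits:
        totals[cls] += 1
        if was_reverted:
            reverted[cls] += 1
    return {cls: reverted[cls] / totals[cls] for cls in totals}
```

A class whose ratio is 0.05 is experiencing exactly the "1 in 20 edits reverted" resistance described above.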

So, even without a "flagged revision" mechanism such as the one suggested by Jimmy Wales, it has already been getting harder for the lowest class of occasional editors to produce edits that remain as contributions in Wikipedia.

The AFP article points to the fact that the debate over the policy came about because of vandalism on Ted Kennedy's page, which falsely suggested he had died after suffering a collapse at a luncheon during Obama's inauguration. But apparently this was corrected within minutes, suggesting that the current system still corrects most mistakes quite rapidly. Moreover, after I did some sleuthing in the editing history, it appears that the original vandalism edit was done by a registered user named "Gfdjklsdgiojksdkf", and not an anonymous user.

So, it is unclear to me that the current system is not working. Are we fixing something that isn't broken (at least not yet)?

Friday, January 23, 2009

Activities, Workflows, and Structured Wikis

Gregorio Convertino, who recently joined the ASC research area here at PARC, has been looking at how Web2.0 tools like wikis support workflows within the enterprise. By workflow, we mean activities that are important enough to be documented in the enterprise (either because the client is important, or because the activity is often repeated).

For this purpose, we have been doing an overall review of structured wikis available in the marketplace (whether open source, hosted solutions, or supported installations). By "structured wiki", we mean wiki engines that are enhanced with lightweight programming features and database functionality. The focus of our review is primarily, but not only, on the user interface and on interesting new functionality for organizing content, such as templating and database functions. Important criteria for us are ease of use, the power of end-user programming/organizing functionality, and licensing.

Over the past few weeks, we have been looking for useful resources on content management systems and wikis that support structured activity management. On the wiki side, we have found WikiMatrix to be one of the best guides to understanding this space.

Exemplars of wikis enabling some structure include TWiki, XWiki, TikiWiki CMS/Groupware, JSPWiki, MediaWiki, and OpenRecord (not on WikiMatrix's list). For a comparison matrix, click here. In this example comparison, all of the systems support page templates. It's clear that many people are looking for these kinds of functionalities, and we have found some discussion around this on the net. TWiki developers seem to have documented some of their thinking.

But the discussion so far isn't very deep, because we don't really seem to know yet how much structure is too much structure, or how different enterprise needs are met by each of these solutions. In our own work, we are finding it quite difficult to figure out what an enterprise should implement:

(1) There are so many different flavors of Wikis out there, and they don't always inter-operate well. Choosing one appears to mean that you're stuck with it forever.

(2) Research on end-user templates has received little focused attention. We have found references in the academic literature, but they are pretty sparse. Here is what we have collected so far:

  • There was the work of Sparrow at PARC.

  • Di Iorio A., Vitali F., and Zacchiroli S. Wiki Content Templating. WWW 2008.

  • Anslow C. and Riehle D. 2008. Towards End-User Programming with Wikis. In Proceedings of the 4th International Workshop on End-User Software Engineering.

  • Riehle D. 2008. End-User Programming with Application Wikis. In Proceedings of the 2008 International Symposium on Wikis (WikiSym ‘08).

  • Haake et al. Wiki Templates. WikiSym 2005.

  • Reinhold. WikiTrails: building context and structure around the content and existing information organization, using trails, or paths, through the wiki content. WikiSym 2006.

  • Jochen Rode. 2005. Web Application Development by Nonprogrammers: User-Centered Design of an End-User Web Development Tool. Ph.D. dissertation. Virginia Tech. Click system.

  • Also, there was of course the concentrated effort on activity-centric computing at IBM, as well as the work done on CoScripter (workflows that can be collaboratively built using a wiki model).

  • Anyone who can help us understand this area, please get in touch!

Sunday, January 4, 2009

Cloud Computing, Science2.0, and the Social Web

Start off 2009 with a more philosophical entry...

I was recently in Asia to give the keynote talk at the International Conference on Asia-Pacific Digital Libraries (in Bali, Indonesia!) In my recent travels and talks, I have been asked about the relationship between the latest buzz on "Cloud Computing" and Web2.0 (with its already-evident connections to service-based computing, the social web, and social science).

The cloud computing trend might be best motivated by the understanding that data management and computational processing are moving away from personal computing frameworks into collaborative workspaces that are managed in the network. The impact is wide and deep. It's intertwined with service-based computing, Web2.0, and other trends.

The main value proposition is further "abstraction" that reduces management costs. For example, backup storage is abstracted into the cloud, so you don't have to worry about your hard disk failing. Computation is abstracted into the cloud, so you don't have to worry about not having enough computational nodes for your data analysis job. It is an inevitable trend in computing, driven by the need to reduce complexity and data-management/computation-management costs. It's clear that, in the near future, backup storage and computation will continue to evolve into collaborative workspaces that you never have to administer, and whose contents you never have to worry about backing up.

Cloud computing has been touted as the second coming of computing science: all science endeavors will now rely on cloud computation capabilities. Jim Gray (the missing sailor, Turing Award winner, and database guru) once said that the fourth paradigm of scientific discovery will involve "data-intensive explorations which unify theory, simulation, and experiment". I was asked what I thought of this new direction. Jim Gray is (was) a big figure in computing, so his opinion is certainly worth its weight in gold. It's certainly one approach that would enable us to tackle bigger and more complex problems.

Jim Gray's fourth paradigm is rooted in his belief that data is at the heart of science -- essentially a kind of fundamental 'empiricism'. This kind of empiricism certainly has been at the heart of social experiments in Web2.0 applications. This viewpoint was argued by Shneiderman in a recent issue of the journal Science as being a kind of 'Science2.0'. The label '2.0' certainly has some relation to Web2.0 and cloud computing, in that the same computational techniques being invented to handle social analytics and cloud computing are needed to do this new kind of empirical science.

The big bet is that big data sets will enable bigger science to be done (if you believe that all science derives fundamentally from observations). I do worry that this viewpoint places too much faith in blackbox science (i.e., input a large data set into a database, apply MapReduce or other parallelized machine-learning techniques, and then, wham! Patterns emerge!). It asks machine learning to do too much of the heavy lifting. True scientific model building isn't just finding some parameters for some statistical algorithm. Science has more creativity than that.

From a practical perspective, the need for models and patterns for design is pressing; we certainly can't rely on rationalism alone to generate all of the understanding needed to push forward. So Jim Gray's paradigm and other versions of Science2.0 are certainly part of the answer to really advancing scientific understanding. Big-data science has certainly been a huge propeller of advanced web analytics, enabling Google/Yahoo/Microsoft to be the big winners in computing. So investing in big-data science is a 'no-brainer' in my book, but one needs to combine it with truly creative scientific work.