Sunday, January 4, 2009

Cloud Computing, Science2.0, and the Social Web

Start off 2009 with a more philosophical entry...

I was recently in Asia to give the keynote talk at the International Conference on Asia-Pacific Digital Libraries (in Bali, Indonesia!). In my recent travels and talks, I have been asked about the relationship between the latest buzz around "Cloud Computing" and Web2.0 (with its already-evident connections to service-based computing, the social web, and social science).

The cloud computing trend might be best understood as data management and computational processing moving away from personal computing frameworks into collaborative workspaces managed in the network. The impact is wide and deep, and it is intertwined with service-based computing, Web2.0, and other trends.

The main value proposition is further "abstraction" that reduces management costs. For example, backup storage is abstracted into the cloud, so you don't have to worry about your hard disk failing. Computation is abstracted into the cloud, so you don't have to worry about not having enough computational nodes for your data analysis job. This is an inevitable trend in computing, driven by the need to reduce complexity and the costs of managing data and computation. It's clear that, in the near future, storage and computation will continue to evolve into collaborative workspaces that you never have to administer or back up yourself.
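To make the abstraction concrete, here is a minimal sketch (in Python, with entirely made-up interfaces rather than any real cloud SDK) of what it means for application code to stop caring where its bytes live:

```python
# Hypothetical storage abstraction -- not any real cloud API.
from abc import ABC, abstractmethod


class Storage(ABC):
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class LocalDiskStorage(Storage):
    """Old model: the user manages (and worries about) the disk."""
    def __init__(self, root: str) -> None:
        self.root = root

    def put(self, key: str, data: bytes) -> None:
        with open(f"{self.root}/{key}", "wb") as f:
            f.write(data)

    def get(self, key: str) -> bytes:
        with open(f"{self.root}/{key}", "rb") as f:
            return f.read()


class CloudStorage(Storage):
    """New model: replication and backup happen behind the interface."""
    def __init__(self) -> None:
        self._replicas = [{}, {}, {}]  # stand-in for replicated network storage

    def put(self, key: str, data: bytes) -> None:
        for replica in self._replicas:
            replica[key] = data

    def get(self, key: str) -> bytes:
        return self._replicas[0][key]


def save_draft(store: Storage, text: str) -> None:
    # Application code is identical either way -- that is the abstraction.
    store.put("draft.txt", text.encode("utf-8"))
```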

Cloud computing has been touted as the second coming of computing science, with the expectation that all scientific endeavors will now rely on cloud computation capabilities. Jim Gray (the missing sailor, Turing Award winner, and database guru) once said that the fourth paradigm of scientific discovery will involve "data-intensive explorations which unify theory, simulation, and experiment". I was asked what I thought of this new direction. Jim Gray is (was) a big figure in computing, so his opinion is certainly worth its weight in gold. It's certainly one approach that would enable us to tackle bigger and more complex problems.

Jim Gray's fourth paradigm is rooted in his belief that data is at the heart of science -- essentially a kind of fundamental 'empiricism'. This kind of empiricism has certainly been at the heart of social experiments in Web2.0 applications. Shneiderman argued in a recent issue of Science that this viewpoint amounts to a kind of 'Science2.0'. The label '2.0' relates to Web2.0 and cloud computing in that the same computational techniques being invented to handle social analytics and cloud computing are needed to do this new kind of empirical science.

The big bet is that big data sets will enable bigger science to be done (if you believe that all science derives fundamentally from observations). I do worry that this viewpoint places too much faith in blackbox science (i.e., load a large data set into a database, apply MapReduce or other parallelized machine-learning techniques, and then, wham! Patterns emerge!). It asks machine learning to do too much of the heavy lifting. True scientific model building isn't just fitting the parameters of some statistical algorithm. Science involves more creativity than that.
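For the skeptical reader, here is a toy sketch of the kind of blackbox pipeline I mean -- plain Python standing in for a real MapReduce cluster, with an invented record format -- which makes clear that what "emerges" is frequency statistics, not a scientific model:

```python
# Toy illustration of the "blackbox" pipeline: map over the records,
# reduce the counts, and out come "patterns" -- which are really just
# frequency statistics, not a model. Record format and fields are made up.
from collections import Counter
from itertools import chain


def map_phase(record: dict) -> list[tuple[str, int]]:
    # Emit (feature, 1) pairs for every tag attached to the observation.
    return [(tag, 1) for tag in record.get("tags", [])]


def reduce_phase(pairs) -> Counter:
    # Sum the counts per feature; on a real cluster this step is
    # partitioned across machines, but the logic is the same.
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


observations = [
    {"tags": ["clicked", "shared"]},
    {"tags": ["clicked"]},
    {"tags": ["shared", "commented"]},
]

patterns = reduce_phase(chain.from_iterable(map_phase(r) for r in observations))
print(patterns.most_common(3))  # frequent behaviors, nothing more
```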

From a practical perspective, the need for models and patterns for design is pressing, and we certainly can't rely on rationalism alone to generate all of the understanding needed to push forward. So Jim Gray's paradigm and other versions of Science2.0 are part of the answer to really advancing scientific understanding. Big-data science has been a huge driver of advanced web analytics, enabling Google, Yahoo, and Microsoft to be the big winners in computing. Investing in big-data science is a 'no-brainer' in my book, but one needs to combine it with truly creative scientific work.

4 comments:

Anonymous said...

Thanks for taking a critical stance on blackbox machine learning methods. What about not-so-blackbox machine learning methods such as inductive logic programming and statistical relational learning?

Ed H. Chi said...

Emre,

Your question reminds me of the computer-assisted proofs that were controversial back in the mid-70s. Appel and Haken (1976?) at UIUC used computer-assisted case analysis to prove the four-color theorem, then a long-standing open problem in graph theory. The machine produced proofs for each of the 1600+ cases, which were then verified by hand by hundreds of mathematicians.

I think this case was a good example of how not-so-blackbox techniques require careful and _creative_ application by scientists. There was no input-the-data-and-wham-get-results moment.

Science2.0 is going to require both data-intensive analytics and creative applications of those techniques by well-educated scientists and engineers.

Anonymous said...

I fully appreciate and respect "creative applications of those techniques by well-educated scientists and engineers". When I asked my question, the example I had in mind was the famous protein folding case, in which a discovery published in a scientific journal was actually made by an inductive logic program. So maybe we can say that augmenting scientific creativity and intelligence with strong logical reasoning tools should be one of the ways to go (of course, I never underestimate that even the starting point of choosing a proper representation for your machine learning algorithm, be it statistical or logical, may require lots of thinking and creativity).

Ed H. Chi said...

Emre,

There has been a lot of debate within my lab about the possibility of building hybrid systems in which symbolic reasoning systems are combined with statistical machine learning. Many in NLP research believe this is the wave of the future, and that the best research will take the best parts of both approaches. I think you are very much on the right track in pursuing this direction.
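To give a flavor of what such a hybrid might look like, here is a small illustrative sketch (the rules, data, and task are all invented): hand-written symbolic rules act as feature detectors, and a simple statistical learner decides from data how much to trust each rule.

```python
# Hybrid sketch: symbolic rules as features, a bare-bones perceptron
# learning rule weights from labeled examples. Everything here is invented
# for illustration, not a real system.

# Symbolic side: human-authored rules over a structured record.
RULES = [
    ("mentions_gene", lambda r: "gene" in r["text"]),
    ("has_citation",  lambda r: r["citations"] > 0),
    ("is_question",   lambda r: r["text"].strip().endswith("?")),
]


def featurize(record: dict) -> list[int]:
    return [1 if rule(record) else 0 for _, rule in RULES]


# Statistical side: learn how much weight each rule deserves.
def train_perceptron(examples, labels, epochs: int = 10):
    weights = [0.0] * len(RULES)
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            pred = 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0
            if pred != y:
                weights = [w + (y - pred) * xi for w, xi in zip(weights, x)]
                bias += (y - pred)
    return weights, bias


records = [
    {"text": "Does this gene regulate folding?", "citations": 3},
    {"text": "Random chatter with no substance", "citations": 0},
]
labels = [1, 0]  # 1 = scientifically relevant, 0 = not (toy labels)

weights, bias = train_perceptron([featurize(r) for r in records], labels)
print(dict(zip([name for name, _ in RULES], weights)))
```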