Thursday, September 2, 2010

Open data manipulation and visualization: Challenges

I typically blog about research results here, but this post is more conversational and discussion-oriented. My good friend m.c. schraefel asked me a question via email: "What are 1 or 2 key priorities you think must be addressed that will aid citizen focused manipulation of open data sources for personal/social knowledge building?"

Here is my answer to her:

The issue you raised was precisely the inspiration for my Ph.D. thesis work on creating a visualization spreadsheet. The idea, from over 10 years ago, was that if people can easily use spreadsheets, then they ought to be able to take that model further and start creating visualizations with them. The thesis was an exploration of how to design such systems. I think of ManyEyes and Jeff Heer's later work as heading in the same direction.

We have since learned a lot about user-contributed content on systems like Wikipedia, Delicious, and Twitter, and they show a very interesting participation architecture that consists of readers, contributors, and leaders. Not all users want to be leaders, and not all users want to contribute. We have sometimes used the derogatory term "lurkers" to describe "readers", which I think is a bit unfair. Ronald Burt's work has shown that a lot of us would like to be brokers of information among social groups, but there is also a need for an audience, or followers, who might become brokers later, just not everyone all at once.

I believe that manipulation of open data sources will follow the same curve. Yes, some cancer patients will want to read all they can about their condition and do the analytical work, while others (not necessarily because of tool limitations) would prefer to take a backseat and let others curate the information for them. What's interesting is that they might want very simple interactions that enable basic sorting of data, or maybe even services that interpret the data for them (e.g. doctors), but they would prefer that someone else does the bulk of the work (even if it becomes very easy, due to tool research and development).

Given that, what can we do?

First, it's quite clear that much of the hard work remains in data import and cleaning. To democratize data analytics and manipulation, the bulk of the difficulty is dealing with data acquisition. Unfortunately, most of this is engineering and not sexy research, so there isn't much innovative work in this area, although some information extraction techniques (AI-style algorithms and some machine learning) are making inroads. I also believe that mixed-initiative research for data import is sorely needed. We're doing a bit of this work in my lab at the moment.
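
To make "mixed-initiative" concrete for data import, here is a minimal sketch in which the machine proposes a type for each column and asks the person only about the columns it is unsure of. Everything in it (the function names, the sample data, the single date format, the confidence threshold) is a made-up assumption for illustration, not a description of any existing tool or of the work in my lab.

```python
import csv
import io
from datetime import datetime


def guess_type(values):
    """Guess a column type and the fraction of values that fit the guess."""
    def fraction_parsing(parse):
        ok = 0
        for v in values:
            try:
                parse(v)
                ok += 1
            except ValueError:
                pass
        return ok / len(values) if values else 0.0

    candidates = {
        # Assumption: numbers may contain thousands separators.
        "number": fraction_parsing(lambda v: float(v.replace(",", ""))),
        # Assumption: only ISO-style dates are recognised in this sketch.
        "date": fraction_parsing(lambda v: datetime.strptime(v, "%Y-%m-%d")),
    }
    best = max(candidates, key=candidates.get)
    if candidates[best] == 0.0:
        return "text", 1.0  # nothing parsed, so treat the column as text
    return best, candidates[best]


def propose_schema(csv_text, ask_threshold=1.0):
    """Machine proposes column types; uncertain columns are flagged for a person."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    proposal = {}
    for name in rows[0].keys():
        values = [(r[name] or "").strip() for r in rows]
        values = [v for v in values if v]
        kind, confidence = guess_type(values)
        proposal[name] = {
            "type": kind,
            "confidence": confidence,
            "ask_user": confidence < ask_threshold,
        }
    return proposal


# Hypothetical sample data with one messy value ("n/a") in the visits column.
SAMPLE = (
    "name,visits,last_seen\n"
    "Ann,12,2010-09-02\n"
    "Bob,n/a,2010-08-30\n"
    "Cy,7,2010-08-15\n"
)

for column, guess in propose_schema(SAMPLE).items():
    print(column, guess)
```

The point of the design is the `ask_user` flag: the system does the routine guessing, and the person is interrupted only where the data is ambiguous.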

Second, there is the issue of data literacy. What kind of visualization works with what kind of data? What analytic technique is appropriate? Early work by Jock Mackinlay (from our old UIR research group) pointed to the possibility of automating some of these design choices in his Ph.D. research, and we haven't made a huge amount of progress in this area since then. He is now at Tableau Software trying to solve some of these issues. Wizards and try-visualize-refine loops have all been tried in research. We need to stop inventing new visualizations and instead build actual usable tools for people. By going to vertical domains, we will learn how to solve this problem.
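
As a toy illustration of what "automating some of these design choices" could look like at its very simplest, here is a lookup from a pair of column types to a suggested chart. The pairings are my own rules of thumb, assumed for the example; this is not Jock's algorithm, and real systems reason about far more than two column types.

```python
# Rule-of-thumb table: (type of first column, type of second column) -> chart.
# These pairings are illustrative assumptions, not rules from any cited work.
SUGGESTIONS = {
    ("date", "number"): "line chart",
    ("category", "number"): "bar chart",
    ("number", "number"): "scatter plot",
    ("category", "category"): "heat map of counts",
    ("place", "number"): "map with sized markers",
}


def suggest_chart(x_type, y_type):
    """Suggest a chart for two column types, falling back to a plain table."""
    chart = SUGGESTIONS.get((x_type, y_type)) or SUGGESTIONS.get((y_type, x_type))
    return chart or "plain table"


print(suggest_chart("date", "number"))
print(suggest_chart("number", "category"))
print(suggest_chart("text", "text"))
```

Even a table this small shows where the hard part lies: deciding the column types and the user's intent, not drawing the chart.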

These two are the biggest problems, IMHO. Of course, there are other technical challenges such as data scale and compute power, security, privacy, and social sharing, which are all fascinating, and research such as ManyEyes has taught us a good deal about these issues.


Aaron said...

I'm not in a position to speak about data import and cleaning, although I'm sure you're right. I do, however, agree completely that supporting casual, but useful, interactions with data sets (which I think is part of what you're getting at) is a key challenge.

Some of the great stuff available through time series manipulations (sliders for the "current time", increasing/decreasing the frame size, etc.) and in geovisualizations (simple layering and filtering, contextual zoom to selected targets) makes it easy for lay people to use the visualizations and even contribute meaningfully, even if only for their own edification.

I would say that, yes, we don't need more new visualizations; instead we need to move in the direction advocated by MacEachren (in geocollaboration): support the simple manipulation of visualization frameworks by non-experts. We see this behaviour every day in map use; we just need to expand it to work in more visualization domains.

Ed H. Chi said...

Google Maps has done a great job in helping create better map-based visualizations. But even then, not everything is easy. If I want a map of Europe, with major flight connections from North America highlighted, it's not simple. End-user visualization mashups like this are still just too difficult.

I want a visual analytics system that does for visualization what the spreadsheet did for numbers. It should be simple to cut and paste flight connections into, say, a table in the system, and then I would like to say, "please visualize this on a map."

lgrammel said...

Thanks for this interesting blog post. I agree with you on those two challenges. I also see data integration as another major challenge in the long run. I believe that integrated data sets could facilitate insights that are hard to see when analyzing data sets in isolation.

With regard to data import & cleaning, there has been interesting work done by David Huynh in the Gridworks project. It uses a combination of faceted browsing, visualization & spreadsheet elements to help the user clean a data set.

When it comes to supporting flexible visual data exploration, I found in an exploratory user study that tools need to suggest potentially useful visualizations, allow for iterative refinements, and provide more learning support. I think that lowering the entry barrier as much as possible is very important; however, this might not be sufficient if people are reluctant to "invest time in open-ended, ill-defined tasks". I am currently working on Choosel, a web-based visualization environment that explores some of those ideas.

Ed H. Chi said...

@lgrammel: I hadn't heard about GridWorks, but know of David, so it was nice to learn about what he has been working on. That looks like a nice tool. I wish the big spreadsheet company (ahem! Microsoft) would work on new innovative features like that!

I think you are right that systems need to support data exploration better by suggesting operations and visual representations to try. More importantly, the exploration ought to be one in which human and machine intelligence work hand-in-hand in a mixed-initiative way, with the interface changing and suggesting new steps as the data analysis gets more complex.