A while back, I made a graph overlaying the Senate party majority with the national debt. While correlation isn’t sufficient for causation, I did learn a lot about the shape of the national debt, and that the relationship between political party and national debt is not as clear-cut as political commentators and pundits would like you to believe. However, the entire experience of producing the graphs in that blog post left me dumbfounded at how tedious it was just to see the shape of the data.
First, finding the data was difficult. Many of the Census Bureau and other government websites have confusingly bad user interfaces. It took me a while to figure out how to find the number of single females by year and state. Go ahead, try it.
Once you’ve found the data, you need to shape it into what you need for the visualization, which is difficult because data on the web is in a sad state of affairs. Images, video, and text on the web are limited to a few popular formats. Not only that, the actual format is abstracted away by the browser or Flash behind an image tag (and soon a video tag). No such widespread consensus exists for raw data, with the exception of a limited number of domains covered by microformats, KML, and their ilk. Many publish data through HTML tables or CSV files, which come in a surprising variety of hard-to-parse formats.
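To give a rough sense of what that shaping looks like, here is a minimal Python sketch. It is not the code behind the graphs mentioned above. The sample text, the column names (`Year`, `State`, `Count`), and the helper `parse_messy_csv` are all made up for illustration. It assumes a typical export with a title line on top, a footnote at the bottom, and numbers written with thousands separators.

```python
import csv
import io

# Hypothetical export: title line, blank line, header, data, footnote.
# All numbers here are invented for the example.
RAW = '''"Table 1. Example population counts (made-up numbers)"

Year,State,Count
2006,"Ohio","1,234,567"
2007,"Ohio","1,250,001"
Note: figures are illustrative only.
'''


def parse_messy_csv(text, header_first_cell="Year"):
    """Return a list of dicts, skipping preamble and footnote rows."""
    rows = list(csv.reader(io.StringIO(text)))

    # Locate the real header row by the name of its first column.
    start = next(
        i for i, row in enumerate(rows)
        if row and row[0].strip() == header_first_cell
    )
    header = [cell.strip() for cell in rows[start]]

    records = []
    for row in rows[start + 1:]:
        # Blank lines and footnotes do not have the full set of columns.
        if len(row) != len(header):
            continue
        record = dict(zip(header, (cell.strip() for cell in row)))
        # Convert the numeric columns, dropping thousands separators.
        record["Year"] = int(record["Year"])
        record["Count"] = int(record["Count"].replace(",", ""))
        records.append(record)
    return records


records = parse_messy_csv(RAW)
for record in records:
    print(record)
```

Every source tends to need its own variation of this: a different number of preamble lines, different footnote markers, different ways of writing missing values. That is exactly the tedium.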
The whole thing was harder than it should have been, and if I haven’t conveyed how mad I am, just insert expletives above wherever possible.
As a programmer, I can figure out how to find and graph the data, however painfully. But it’s completely inaccessible to the regular, everyday person who uses the internet. There’s a lot of public data out there, but people aren’t able to access it easily. If it doesn’t show up in Google’s search results, that’s where they stop.
It’s not that people aren’t interested in that data; it’s that the data is completely inaccessible. If blog posts and news articles about unemployment, STD rates, and gas prices are any indication, people want to know this kind of information.
Even more so, people want to be able to explore different aspects of this data, as they can with the New York Times’ interactive visualizations, by cutting and slicing the data as they see fit. Oftentimes a statistic only makes sense when you can answer the question, “compared to what and how?” And people want to be able to explore it visually in a way that gives insight.
This is why I’m working on Graphbug: making public data more easily accessible through visualizations. Some questions are best answered visually.
We want to make it easy to find the data you need, and to make it as fluid as possible to move between different datasets to compare them, in the same way that Google Maps made it fluid to navigate a map. Then if you need to do more hardcore analysis, you can download the data and play with it however you like.
Of course, the topic is too broad at the moment. We’d like to focus on particular datasets first, and though we picked the US Census to start, we’d like to hear from you what sorts of datasets you’d be interested in. iPhone market share? Number of earthquakes in each state by year? The male-to-female ratio at colleges over the years?