Paul Bradshaw

Too big for Excel? What to do with big datasets

In Uncategorized on February 12, 2013 at 10:00 pm

Recently the NICAR mailing list (for journalists who use computer assisted reporting) discussed how they dealt with datasets that were ‘too big for Excel’. With their permission, I’m reproducing a digest of the highlights.

How much is too much

Different versions of Excel have different limits to the data they can handle. From a million rows in Excel 2010 to just 16,000 rows by 256 columns in Excel 5, Office Watch gives a good rundown of the various versions.

Tom Torok points out that Excel 2007’s million row limit is per sheet, rather than per workbook (spreadsheet), so if you have more than 1 million rows you could spread them across mulitple sheets. 

Peter Aldhous recommends the Power Pivot plugin for Excel 2010 – here’s what it does. Most relevant is that it has “the ability to handle up to 100 million rows” in a workbook.

Stephen Doig suggests that if the data was only going to be used as a large directory, you could “save it as a text file and search it with a decent text editor”. And Dan Nguyen provides a list of useful text editors here.

Query languages and servers

Ed Borasky notes that Excel has a query language built in. “Put the whole file in Access or SQL Server’s free version and teach the users how to query a database in their spreadsheets.” And Tim Henderson adds: “You’ve hit on the perfect reason to learn online programming. Get a $7/month MySQL server account, a book on PHP or Python or whatever and you’ll have it up and running in a week I bet.” – although Borasky disagreed that you needed a dedicated server: “Something that small you can build in a virtual machine or even run natively in Windows.

Another option is PANDA, “a tool to host large datasets locally (or on an EC2) so everyone in your news room can search them” (Brian Abelson, seconded by Nolan Hicks, who suggested this setup process).

Margie Roswell tried that out, and came up with the following list of additional tips:

  • You need to get Java working. For some reason, it wouldn’t work for me in Firefox. In Chrome. I had to go into chrome://plugins/ and check to enable Java(TM) to be “always allowed.” (not sure how secure that is… but nothing would work until I did that.)
  • If you’ve already set up amazon AWS EC2 once before, and don’t have a record of your keypair, you need to create a new one.
  • You need to know the URL
  • Even though they provide the default username of “root,”  they tell you that the user name needs to be “ubuntu” 

 

…and these helpful blog posts: 

 

SQL plugins

I also added a mention of the SQLite plugin for Firefox, which turns your browser into a database. But Brian Bowling tried it out with a big dataset and crashed it:

The import of the first county, Adams, worked, but it crashed with an JavaScript out of memory error when I tried to import the next county, Allegheny, which is one of the largest counties in the state. While SQLite can handle larger databases, it doesn’t look like the plugin can.”

Back it up!

And finally a word of warning from Kate Martin: “Whatever you do, save the original file in read-only format. The last thing you want is someone accidentally changing something important.

 

Advertisements
  1. I’ve been playing around with R and R Studio (as part of a course on data analytics from Coursera) over the last few weeks and while it’s a bit of a chore to learn, I reckon it could be a really useful tool for journalists – especially when it comes to looking for relationships between different data variables that Excel isn’t so good at revealing.

  2. That’s a great observation Adam! The other really cool thing about R which is particularly helpful for journos starting on the dark path of data analysis is that it’s free and comes with a great community of people ready to help one get started. To complement your comment here is a link to learning more about R http://www.r-project.org/ and one for the Coursera Data Analysis course running at the moment (note that they also have a series of videos on getting started with R from a previous course they ran – these are available on your lecture videos list at the very bottom) https://class.coursera.org/dataanalysis-001/class/index Hope this helps 🙂

  3. Thanks for putting the links in Teodora – I really should have done that! http://www.r-bloggers.com is also really good, and a lot of students on the Coursera are recommending http://tryr.codeschool.com/ too.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: