For the second consecutive year, I was unable to get away this holiday season. So, instead of the zaniness of Duval Street in Key West, I had to content myself with discovering some of Chicagoland’s great BYOB restaurants, watching the NFL and college sports on TV, and playing on my trusty computer.
The tech challenge I gave myself this year was finding/assembling a decent-sized data set and putting it through some data management and statistical paces. Fortunately, it didn’t take me long to latch on to the U.S. Census, American Community Survey, 2006-2010, 5-year Public Use Microdata Sample (PUMS), available in CSV and SAS file formats. I downloaded both 2 GB compressed files and started my data journey.
First up was the SAS data for which I dusted off the trusty WPS from World Programming, a low-cost clone to the ubiquitous, if pricey, SAS platform for statistical analysis. The latest 64 bit version of WPS was able to recognize and process the downloaded SAS format data set without a hitch on my 8 GB RAM Wintel notebook. Using “proc contents” and “proc sql,” I determined I was working with more than 15 M records and 227 attributes. “proc freq” and “proc univariate” then helped me decide which variables were worthy of further consideration. Eliminating the less interesting attributes and creating half a dozen new ones, I wrote a second, “skinny” PUMS SAS file with just 30 attributes.
Once that data was in place, I computed quantiles on the earnings and income variables by education as an illustration of basic statistical processing. For good measure, I tested the WPS ODBC connectivity by using “proc freq” on the columns of a MySQL table. Performance on these basic tasks was outstanding. Based on this and other experiences, I’d certainly recommend WPS to those committed to SAS data step programming.
On the CSV side, I first developed Python programs to “munge” and “wrangle” the data into a usable file format with attribute names in row one followed by delimited data in subsequent records. The Anaconda Python environment available from Continuum Analytics, which includes the latest interpreter plus many core libraries such as numpy, matplotlib, scipy and pandas, along with the wonderful interactive ipython programming environment, made that work easy.
The Python program that assembled the final CSV file of just over 15 M records and the 15 selected attributes ran in about five and a half minutes. Loading the CSV data into an R data frame took two and a half minutes, while the save of the slightly less than one GB R structure to disk finished in just under two minutes.
I suppose I should consider myself fortunate to have created such a “large N” data frame, given that R must consume its objects in RAM and is generally inefficient with memory allocation. I can’t perform predictive models on all 15 M records, but can efficiently create random samples of size, say, 100,000 for subsequent modeling work. And, using the snappy community-developed package “data.table”, I can do quite sophisticated “by group” processing on the larger data. For example, computing annual earnings quantiles of those 18 years old or more by race and education for the 15 M data frame is simple and fast (< 10 seconds). Once data are aggregated in the “by groups” tables, R’s trellis graphics capabilities with packages lattice and ggplot2 sparkle.
My disappointment with this year’s exercise was being unable to load the 15 M CSV file into the pandas’ Python programming environment. I’ve been playing with pandas ever since I bought the “Python for Data Analysis” book by Wes McKinney several months ago, and like the library a lot.
Pandas attempts to mimic the functionality of R in the Python environment. As is the case with R, though, pandas is memory-constrained, and I’d only used it on data sets of 1.5 M records or less. Alas, despite trying several approaches, I was unable to read the entire PUMS file into pandas’ version of the R data frame. After 13 M records, the job ran out of memory, Windows Task Manager showing the process slamming the RAM wall. I’ll continue to search for a solution.
Though I enjoyed the geeky week, I hope not to write a techie holiday blog next year, an indication that I was taking in the sun somewhere a lot warmer than Chicago. But if forced to stay put I’m sure I’ll be at it again – if lucky, learning as much as I did in 2012.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access