I quite enjoyed the recent O’Reilly Webcast “Python for Data Analysis” by Wes McKinney. Besides learning about IPython, the useful Python shell, I was introduced to the powerful pandas package, which sits on top of NumPy and matplotlib and adds a wealth of easily accessible “munging” and data analysis capabilities for Python programmers. McKinney was an excellent presenter, and I can’t wait to get his book this week at the Strata + Hadoop World conference in New York City.
A collateral benefit of the Webcast was exposure to several “munge-friendly” data sets of historical baby names from the U.S. Social Security Administration. For his analyses, McKinney used the National data, a series of annual U.S. birth name frequency files starting in 1880. He first consolidated all the data into a single CSV file with the attributes year, name, sex and frequency, a total of 1,724,892 records. He then manipulated and analyzed that data using the combination of Python and pandas. I was struck by the similarities between this approach and the R Project for Statistical Computing techniques I routinely use to perform similar tasks.
Indeed, McKinney acknowledged he’s also a big proponent of R for analytics work. He cautioned, however, that R is memory-bound and has performance limitations with large data sets, and he thinks Python/pandas might be the better choice for meaty analyses.
Alas, McKinney’s right about R’s weaknesses. Not only must the data to be analyzed fit into available memory, but R is often very inefficient with that memory, maintaining multiple copies of the data in RAM. And as an interpreted language it can be sluggish for “big” computations, in many cases forcing package developers to resort to C or C++ to optimize performance.
On the other hand, even though it’s memory-bound and can be inefficient with in-memory storage, R will use whatever additional RAM is available. On my Wintel notebook with 8 GB of RAM, for example, I’ve processed a data frame of 10M+ records and 20 attributes in R, though I can run random forest predictive models on only a sample of that data. Linux users can access scores of gigabytes for in-memory R storage and computations, allowing for sophisticated models on quite large data sets. All is not woe with R.
Its physical limitations aside, R has a lot going for it as a language of choice for data munging/analysis. Its vector and object orientations promote compact, concise and easily extensible code. And the wealth of both core functionality and add-on package capabilities for reading, selecting, consolidating, aggregating, analyzing and visualizing data makes R a very compelling choice for analysis.
Add in the consideration that munging/analysis/visualization often consists of quick, one-off exercises where the biggest cost is development, and a case can be made to simply load up a Linux box with RAM and let R rip for many data science challenges.
To determine whether the R data munging strategy is practical, I decided to take on an analysis challenge with the larger, State-specific birth name frequency data set. I first downloaded the zip file to my PC and extracted all 50 individual state files into a single directory. The format of each was identical: state, sex, year, name and frequency attributes, for a total of 5,365,794 records. My data analysis exercise was to determine, by sex and state, how many names make up the top 25%, the top 50%, and the top 75% of all name frequencies for each decade. Were those numbers increasing over time?
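To make the question concrete, here is a minimal Python sketch of the calculation, using only the standard library. It is an illustration of the idea, not the R script used for this analysis. The function name names_to_cover, the tuple layout of the records, and the sample counts are all hypothetical. For each state, sex and decade, it sums frequencies per name, sorts the names from most to least frequent, and counts how many are needed before the running total reaches each share of the group total.

```python
from collections import defaultdict


def names_to_cover(records, shares=(0.25, 0.50, 0.75)):
    """Count how many of the most frequent names reach each share of births.

    records: iterable of (state, sex, year, name, count) tuples (assumed layout).
    shares:  target fractions, which must be given in ascending order.
    Returns {(state, sex, decade): (n_for_share_1, n_for_share_2, ...)}.
    """
    # Sum the frequency of each name within its (state, sex, decade) group.
    totals = defaultdict(lambda: defaultdict(int))
    for state, sex, year, name, count in records:
        decade = (year // 10) * 10
        totals[(state, sex, decade)][name] += count

    result = {}
    for group, by_name in totals.items():
        counts = sorted(by_name.values(), reverse=True)
        grand_total = sum(counts)
        needed, running, used = [], 0, 0
        for share in shares:
            # Keep adding the next most frequent name until the share is met.
            while used < len(counts) and running < share * grand_total:
                running += counts[used]
                used += 1
            needed.append(used)
        result[group] = tuple(needed)
    return result


# Made-up sample records, purely for illustration (not real SSA figures).
sample = [
    ("CA", "F", 1990, "Mary", 25),
    ("CA", "F", 1995, "Mary", 15),
    ("CA", "F", 1991, "Anna", 30),
    ("CA", "F", 1992, "Emma", 20),
    ("CA", "F", 1993, "Lily", 10),
    ("CA", "M", 1990, "John", 60),
    ("CA", "M", 1990, "Paul", 40),
]
coverage = names_to_cover(sample)
```

In the made-up sample, one female name covers the top 25%, two cover the top 50% and three cover the top 75% for the 1990 decade. A rising count from decade to decade would indicate that births are spread across more names.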
It didn’t take long to develop an R script to answer those questions, perhaps two or three hours total. Getting the data into an R data frame was easy: little more than a loop through the individual data sets, appending each in turn. My big decision was which of the many “by group” R programming motifs to use to aggregate the data. One possibility was the “by” and “apply” family of functions that are part of core R. Add-on packages like doBy, for calculating “groupwise summary statistics (much in the spirit of PROC SUMMARY of the SAS system)”, and plyr, “tools for splitting, applying and combining data”, could have done the trick as well. But I settled on the powerful data.table package for “fast access and grouping”, which has served me well on many occasions over the last few years.
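The load step, a loop over the state files that appends each in turn, can be sketched in Python as well. This is a rough standard-library analogue, not the R code described above. The function name load_state_files and the "*.TXT" file pattern are assumptions, as is the premise that each file is header-less, comma-separated text with the five attributes in the order state, sex, year, name, frequency.

```python
import csv
from pathlib import Path


def load_state_files(directory, pattern="*.TXT"):
    """Read every matching file in `directory` into one list of records.

    Assumes (not confirmed by the article) header-less CSV rows of the form
    state,sex,year,name,count. The "*.TXT" pattern is likewise an assumption.
    """
    rows = []
    for path in sorted(Path(directory).glob(pattern)):
        with path.open(newline="") as handle:
            for row in csv.reader(handle):
                if not row:
                    continue  # skip blank lines
                state, sex, year, name, count = row
                # Convert the numeric fields so they can be aggregated later.
                rows.append((state, sex, int(year), name, int(count)))
    return rows
```

Calling load_state_files on the directory holding the extracted files would return a single list of tuples, ready to be grouped by state, sex and decade.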
When I was done with development, the initial load of all data into R ran in a little less than two minutes. The remaining aggregation and analytics computations completed in under 20 seconds. It turns out that Americans are indeed using more birth names over time, and the increase is more pronounced for females than males.
Figure 1 below summarizes the findings for California.
The “decade” 2010 consists of the years 2010-2011 only. On the x axis is decade, and on the y axis is the cumulative number of birth names that constitute the top 25%, 50%, and 75% of all names, respectively. Separate calculations are presented for male and female, with log frequencies used to visually even out the percentile distributions. Note the differences between the sexes. Perhaps the war decade of the 1940s temporarily slowed the name growth?
The takeaway from this exercise? Don’t be too quick to dismiss R for data munging and analysis, especially if your computer has lots of memory. R’s powerful programming language, data management and summarization functions, along with superior analytics and visualization, make it a strong competitor for the munging and analysis challenges of data science.