Open Thoughts on Analytics
for Information Management Blogs
OCT 25, 2012 11:22am ET

U.S. Baby Names: an Analytic Approach

I quite enjoyed the recent O’Reilly Webcast “Python for Data Analysis” by Wes McKinney. Besides learning about the useful Python shell, IPython, I was introduced to the powerful package pandas, which sits on top of NumPy and matplotlib, adding a wealth of easily accessible “munging” and data analysis capabilities for Python programmers. McKinney was an excellent presenter, and I can’t wait to get his book this week at the Strata + Hadoop World Conference in New York City.

A collateral benefit of the Webcast was exposure to several “munge-friendly” data file folders on historical baby names from the U.S. Social Security Administration. For his analyses, McKinney used the National data, a series of annual U.S. birth name frequency files starting in 1880. The author first consolidated all data into a single CSV file with attributes year, name, sex and frequency – a total of 1,724,892 records. He then manipulated and analyzed that data using the combination of Python and pandas. I was struck by how closely this approach resembles the R Project for Statistical Computing techniques I routinely use for similar tasks.

Indeed, McKinney acknowledged he’s also a big proponent of R for analytics work. He cautioned, however, that R is memory-bound, with performance limitations on large data sets, and he thinks Python/pandas might be the better choice for meaty analyses.

Alas, McKinney’s right about R’s weaknesses. Not only must the data to be analyzed fit into available memory, but R is often very inefficient with that memory, maintaining multiple copies of the data in RAM. And its interpreted language can be sluggish for “big” computations, in many cases forcing package developers to resort to C or C++ to optimize performance.

On the other hand, even though it’s memory-bound and can be inefficient for in-memory storage, R will use whatever additional RAM is available. On my Wintel notebook with 8 GB of RAM, for example, I’ve processed a 10M+ record, 20-attribute data frame in R, though I can run random forest predictive models on only a sample from that data. Linux users are able to access scores of gigabytes for in-memory R storage and computations, allowing for sophisticated models on quite large data sets. All is not woe with R.

Beyond its physical limitations, R has a lot going for it as a language of choice for data munging/analysis. Its vector and object orientations promote compact, concise and easily extensible code. And the wealth of both core functionality and add-on package capabilities for reading, selecting, consolidating, aggregating, analyzing and visualizing data makes R a very compelling choice for analysis.

Add in the consideration that munging/analysis/visualization often consists of quick, one-off exercises where the biggest cost is development, and a case can be made to simply load up a Linux box with RAM and let R rip for many data science challenges.

To determine if the R data munging strategy is practical, I decided to take on an analysis challenge with the larger, State-specific birth name frequency data set. I first downloaded the zip file to my PC and unpacked all 50 individual state files into a single directory. The format of each was identical: state, sex, year, name and frequency attributes, for a total of 5,365,794 records. My data analysis exercise was to determine, by sex and state, how many names make up the top 25%, the top 50%, and the top 75% of all name frequencies for each decade. Were those numbers increasing over time?
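For readers who want the calculation spelled out, here is a minimal sketch in plain Python. It is not the R script used for this post; the function names and the tuple layout of the records are assumptions made for illustration. The idea is to total each name's frequency within a state, sex and decade, rank names from most to least frequent, and count how many are needed before the running total reaches each share.

```python
from collections import defaultdict


def names_to_reach(freqs, shares=(0.25, 0.50, 0.75)):
    """For each share, return the fewest top-ranked names whose
    combined frequency reaches that share of the group total."""
    ordered = sorted(freqs, reverse=True)  # most popular names first
    total = sum(ordered)
    targets = sorted(shares)
    result = {}
    running = 0
    next_target = 0
    for count, freq in enumerate(ordered, start=1):
        running += freq
        # One name can push the running total past several thresholds.
        while next_target < len(targets) and running >= targets[next_target] * total:
            result[targets[next_target]] = count
            next_target += 1
    return result


def top_share_counts(records):
    """records: iterable of (state, sex, year, name, frequency) tuples.
    Returns {(state, sex, decade): {share: number_of_names}}."""
    # Step 1: total each name's frequency within its state/sex/decade.
    per_name = defaultdict(int)
    for state, sex, year, name, freq in records:
        decade = year // 10 * 10  # 1994 -> 1990; 2011 -> 2010
        per_name[(state, sex, decade, name)] += freq

    # Step 2: collect the per-name totals for each group.
    groups = defaultdict(list)
    for (state, sex, decade, _name), freq in per_name.items():
        groups[(state, sex, decade)].append(freq)

    # Step 3: count the names needed to reach each share.
    return {key: names_to_reach(freqs) for key, freqs in groups.items()}
```

As a quick check, a group whose per-name totals are 100, 60, 30 and 10 has a total of 200, so a single name covers both the 25% and 50% marks and two names are needed for 75%.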

It didn’t take long to develop an R script to answer those questions, perhaps two or three hours total. Getting the data into an R data frame was easy – little more than a loop through the individual data sets, appending each in turn. My big decision was which of the many “by group” R programming motifs to use to aggregate the data. One possibility was the “by” and “apply” family of functions that are part of core R. Add-on packages like doBy for calculating “groupwise summary statistics (much in the spirit of PROC SUMMARY of the SAS system)” and plyr, “tools for splitting, applying and combining data”, could have done the trick as well. But I settled on the powerful data.table package for “fast access and grouping” that’s served me well on many occasions over the last few years.
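The loading loop described above was written in R, but the same pattern can be sketched with the Python standard library. The helper name `load_state_files` and the `*.TXT` file pattern are assumptions for illustration; the column order (state, sex, year, name, frequency) follows the file format described earlier, and the files are assumed to have no header row.

```python
import csv
from pathlib import Path


def load_state_files(directory, pattern="*.TXT"):
    """Read every matching file in `directory` and append its rows
    to one list of (state, sex, year, name, frequency) tuples."""
    records = []
    for path in sorted(Path(directory).glob(pattern)):
        with path.open(newline="") as handle:
            for state, sex, year, name, freq in csv.reader(handle):
                # Year and frequency arrive as text; convert to integers.
                records.append((state, sex, int(year), name, int(freq)))
    return records
```

Holding every row in one in-memory list mirrors the single-data-frame approach, with the same caveat that the combined data has to fit in RAM.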

When I was done with development, the initial load of all data into R ran in a little less than two minutes. The remaining aggregation and analytics computations completed in under 20 seconds. It turns out that Americans are indeed using more birth names over time, and the increase is more pronounced for females than males.

Figure 1 below summarizes the findings for California.

The “decade” 2010 consists of the years 2010-2011 only. On the x axis is decade, and on the y axis is the cumulative number of birth names that constitute the top 25%, 50%, and 75% of all names, respectively. Separate calculations are presented for male and female, with log frequencies used to visually even out the percentile distributions. Note the differences between the sexes. Perhaps the war decade of the 1940s temporarily slowed the name growth?

The takeaway from this exercise? Don’t be too quick to dismiss R for data munging and analysis, especially if your computer has lots of memory. R’s powerful programming language, data management and summarization functions, along with superior analytics and visualization, make it a strong competitor for the munging and analysis challenges of data science.

Comments (1)
Steve, good article. One other open source technology to look at is HPCC Systems from LexisNexis, a data-intensive supercomputing platform for processing and solving big data analytical problems. Their open source Machine Learning Library and Matrix processing algorithms assist data scientists and developers with business intelligence and predictive analytics. Its integration with Hadoop, R and Pentaho extends further capabilities providing a complete solution for data ingestion, processing and delivery. In fact, executing HPCC Systems commands within R helps ease the burden of memory limitations with just R alone. More at http://hpccsystems.com
Posted by HAANA M | Thursday, November 01 2012 at 9:45AM ET