I decided to expedite my promise from a blog post two weeks ago of “porting several soup-to-nuts R examples I’ve evolved over the years to a comparable 2014 Python environment”, to show the emerging similarity of the Python and R platforms for data analysis and statistics. There’s more than a little motivation in the fear of being late to the game.
My first example draws on a medium-sized data set from the U.S. Census Bureau’s annual American Community Survey sample of households and individuals, which provides a wealth of information on population demographics, income, education, residence, family characteristics, etc. over time. I’ve worked with versions of the Public Use Microdata Sample (PUMS) for over 5 years, finding the data highly informative and a more-than-toy technical test as well.
Both Python and R handle these challenges with aplomb. With Python, I first import the libraries the task requires. In R, the supporting libraries are loaded by default, but I include “utils” for grins. There’s no exception handling in the examples.
The initial Python code looks as follows:
The corresponding R:
Once the data’s in place on my PC, it’s ready to be loaded into both Python and R structures for analysis. Each PUMS file has 227 comma-delimited attributes, and together the files total over 15.2M records. Of all the attributes, I’m interested in only seven for now, and wish to create “data frames” in R and, using the pandas package, in Python.
Building the initial-cut pandas data frame is pretty straightforward, given the list of attributes of interest. R, surprisingly, is recalcitrant, without a simple means of subsetting the desired columns. After considering several accommodations, including a Python filtering script, I found a workaround on Stack Overflow. R’s text file reader function supports a parameter that specifies the type of each column, with a value of “NULL” indicating that the column be skipped. So I do a little metadata manipulation on the first hundred records and am ready to go. Finally, I change all attribute names to lower case to save my eyesight.
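To make the column-subsetting idea concrete, here is a minimal sketch of the pandas side. It is not the code from this post: the tiny in-memory file and the three column names are stand-ins I chose for illustration, assuming the real file is read the same way with its own list of attributes.

```python
import io
import pandas as pd

# Hypothetical stand-in for a wide PUMS file: a tiny in-memory csv.
raw = io.StringIO(
    "SERIALNO,AGEP,SEX,PINCP,EXTRA1,EXTRA2\n"
    "1,34,1,52000,x,y\n"
    "2,17,2,0,x,y\n"
)

# Attributes of interest (illustrative only, not the post's actual seven).
wanted = ["AGEP", "SEX", "PINCP"]

# usecols asks the parser to keep only the named columns,
# so the unwanted attributes never reach the data frame.
df = pd.read_csv(raw, usecols=wanted)

# Lower-case the attribute names.
df.columns = [c.lower() for c in df.columns]
```

With a real file, the `io.StringIO` object would simply be replaced by the path to the csv.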
The Python code:
And the R version:
After the initial data are loaded, I subset the data frames to include only records for individuals 18 years or older with annual incomes in excess of $100. I then create new categorical variables using the “where” function from numpy and its analog in R, “ifelse”. The functions are so similar that I was able to move between them with simple “replace all” edits. Finally, I rename and drop columns to produce the final data frames, consisting of 6 attributes (income, age, sex, race, education and marital status) and roughly 10.6M cases.
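The filter-then-recode step can be sketched roughly as follows. This is an illustration under assumptions, not the post's code: the column names, the 1/2 coding for sex, and the age bands are all hypothetical choices made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the loaded data frame.
df = pd.DataFrame({
    "agep": [17, 25, 40, 63],
    "pincp": [0, 50, 42000, 150000],
    "sex": [1, 2, 1, 2],
})

# Keep individuals 18 or older with income above $100.
adults = df[(df["agep"] >= 18) & (df["pincp"] > 100)].copy()

# np.where(condition, value_if_true, value_if_false) mirrors R's ifelse.
# The 1 = male, 2 = female coding is an assumption for this sketch.
adults["sex"] = np.where(adults["sex"] == 1, "male", "female")

# Nested calls give more than two categories (age bands are illustrative).
adults["agecat"] = np.where(
    adults["agep"] < 45, "18-44",
    np.where(adults["agep"] < 65, "45-64", "65+"),
)
```

The R analog of the first recode would read almost identically with `ifelse` in place of `np.where`, which is why “replace all” edits go a long way.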
Note that the re-coded variables are of type string. By additionally invoking the factor function in R, the variables would change to the much-more-desirable factor data type, wherein string values are stored as numeric and can be flexibly reordered. Alas, pandas is just now starting to implement factors.
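For readers on a later pandas release, the categorical type plays a role similar to R's factor. The sketch below assumes a pandas version that includes `pd.Categorical`; the education labels and their ordering are made up for illustration.

```python
import pandas as pd

# Hypothetical re-coded string values.
educ = ["college", "high school", "graduate", "college"]

# Explicit level order, analogous to the levels argument of R's factor().
levels = ["high school", "college", "graduate"]

# Store the strings as integer codes plus an ordered set of categories.
educ_cat = pd.Series(pd.Categorical(educ, categories=levels, ordered=True))

# The underlying integer codes, one per value.
codes = educ_cat.cat.codes.tolist()
```

Here `codes` holds the position of each value within `levels`, so reordering the levels changes the codes without touching the labels.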
In the final step of 1A, I write the data to both csv files and binary storage, using Pickle for Python and standard rdata files for R. For 1B, I’ll be able to simply load the binary files to start the analyses.
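A rough sketch of the Python half of that write-and-reload step is below. The file names and the two-row data frame are hypothetical, and a temporary directory is used so the example cleans up after itself.

```python
import os
import tempfile
import pandas as pd

# Hypothetical miniature of the final data frame.
df = pd.DataFrame({"income": [42000, 150000], "sex": ["male", "female"]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "pums.csv")  # illustrative file names
    pkl_path = os.path.join(tmp, "pums.pkl")

    # Text output: portable, and readable from R as well.
    df.to_csv(csv_path, index=False)

    # Binary output via pickle: preserves the pandas dtypes.
    df.to_pickle(pkl_path)

    # Reload both to confirm the round trip.
    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)
```

The csv is the version another tool such as R can consume, while the pickle file is the quicker way back into pandas.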
The Python and R code in this admittedly simple example is remarkably similar. Indeed, driven by the numpy/pandas libraries, Python reads more like R than it does core Python.
Python/pandas is the performance winner in this exercise on my 16 GB Wintel notebook. In the early morning, with plenty of wifi bandwidth, both download/unzips complete in about 10 minutes. The data frame creations, though, are much faster in Python/pandas than in R: 2.5 minutes vs. 14 minutes, probably reflecting to some extent the R kluge for column selection.
Writing text and binary versions of the data frames and reading them back into memory take about the same time in both: a minute to write the binary file, 30 seconds for the csv. Subsequent reads complete in 30 seconds or less.
A more efficient path for creating the ultimate 10.6M-case, 6-attribute R data frame with this data? Use Python to build a pandas data frame, then invoke its to_csv method to write a csv file for a subsequent read into R. Total post-download elapsed time for R is about 3.5 minutes, compared to the 14+ minutes of the all-R solution.
1B will contrast a few statistical models/graphs on this data with R-Python.