I recently debriefed with an OpenBI consultant who’d just completed a challenging big data analytics assignment. In a two-month period, he worked through a procession of Hadoop tools, including Hive, Pig, Pentaho Data Integration and finally Java MapReduce – in addition to SQL and R.
In the end, mission accomplished and many lessons learned. Two summary observations emerged: first, the need for higher-level, integrated query/computation Hadoop tools; second, an appreciation of the low-level MapReduce pattern of computation.
Simplistically, with MapReduce, “Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.” Indeed, you can make the argument that MapReduce is essentially an illustration of what is called the Split-Apply-Combine metaphor for computation that’s popular in the statistical world.
With S-A-C, you first split your data by one or more grouping variables. You next perform a series of computations on each group independently. Finally, you reassemble the group computations into a unified whole.
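The three steps can be sketched in a few lines of generic python (a toy illustration of the metaphor, not code from the consulting engagement): split records on a group key, apply a statistic to each group independently, and combine the per-group results into one structure.

```python
from itertools import groupby
from statistics import mean

# Toy records: (group, value) pairs.
records = [("a", 1.0), ("b", 4.0), ("a", 3.0), ("b", 6.0)]

# Split: sort by the grouping key, then partition into groups.
records.sort(key=lambda rec: rec[0])
groups = groupby(records, key=lambda rec: rec[0])

# Apply: compute a statistic on each group independently.
# Combine: reassemble the group results into a unified whole.
result = {key: mean(v for _, v in recs) for key, recs in groups}
# result == {"a": 2.0, "b": 5.0}
```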
Several years ago, R statistical computing luminary Hadley Wickham published an accessible paper introducing his new R S-A-C package, plyr. The paper provides many illustrations of common S-A-C applications in statistical computing, observing that we’ve been doing it for years – SQL group by, Excel Pivot tables and SAS “by group” processing being ready illustrations. Along with detailing plyr, Wickham outlines a strategy for designing S-A-C statistical programs:
- Identify the groups and computations, and pick one group to prototype.
- Compute and confirm the results for the chosen group.
- Encapsulate the computation in a function.
- “Use the appropriate plyr function to split up the original data, apply the function to each piece and join the pieces back together.”
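plyr itself is R software, but the same four-step workflow translates directly to python with pandas (a hedged sketch with made-up data and column names, standing in for a plyr call): prototype on one group, confirm, wrap the computation in a function, then let groupby split, apply and combine.

```python
import pandas as pd

# Toy data standing in for the real grouped dataset.
df = pd.DataFrame({
    "grp": ["x", "x", "y", "y"],
    "val": [2.0, 4.0, 10.0, 20.0],
})

# Steps 1-2: prototype and confirm the computation on a single group.
one_group = df[df["grp"] == "x"]
assert one_group["val"].mean() == 3.0

# Step 3: encapsulate the computation in a function.
def group_mean(vals: pd.Series) -> float:
    return vals.mean()

# Step 4: split by group, apply the function, combine the results.
summary = df.groupby("grp")["val"].agg(group_mean)
```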
I like this bottom-up approach a lot; it fits well with the way I approach analytics computing. It also turns out I’ve been doing S-A-C on stock portfolio returns data I’ve maintained for some time now.
My sources of data are the index postings of various Russell stock portfolios. Years ago I wrote a script to download daily, up-to-date historical values for 21 Russell indexes, ultimately producing a CSV file with the columns “portfolio name,” “date,” “index value without dividends” and “index value with dividends.”
With this “stacked” data grouped by portfolio, there are several computations I immediately make. First, for each portfolio, I create additional with-dividend and without-dividend index columns normalized to a starting value of 1.0 on the first day. Second, I compute the daily percent changes in the indexes, beginning at day two; there’s one less percent change calculation per portfolio than there are index values. My “final” file then consists of eight columns – portfolio, date, four indexes and two percent changes – and can be easily visualized/analyzed in Tableau and R.
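For a single portfolio, the two computations amount to the following (a toy prototype with hypothetical index values, not the actual Russell data):

```python
# Hypothetical index values for one portfolio, in date order.
index_values = [100.0, 102.0, 101.0, 104.03]

# First computation: normalize to a starting value of 1.0 on day one.
normalized = [v / index_values[0] for v in index_values]

# Second computation: daily percent changes, beginning at day two --
# one fewer value than there are index values.
pct_changes = [(curr / prev) - 1.0
               for prev, curr in zip(index_values, index_values[1:])]

assert len(pct_changes) == len(index_values) - 1
```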
When I first started working with the Russell data, I wrote traditional control-break python code for processing a data structure sorted by portfolio and date. The program works fine but seems obsolete today. Sorted-file, record-at-a-time, explicit-loops processing is so 1980s.
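That older style looks roughly like this (a schematic reconstruction, not the original script, with made-up portfolio names): records sorted by portfolio and date, an explicit loop, and a change in the portfolio value – the control break – triggering per-group bookkeeping.

```python
# Records sorted by (portfolio, date): (portfolio, index_value) pairs.
rows = [("R1000", 100.0), ("R1000", 102.0), ("R2000", 50.0), ("R2000", 51.0)]

results = {}              # portfolio -> list of normalized index values
current, base = None, None
for portfolio, value in rows:
    if portfolio != current:      # control break: a new group begins
        current, base = portfolio, value
        results[portfolio] = []
    results[portfolio].append(value / base)
```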
My second attempt was using R statistical and data management software. R’s array processing is a natural for S-A-C programming. Traditionalists work with the R “apply” family of functions to invoke computations on group data. Wickham’s “plyr” software/methodology provides an even more comprehensive S-A-C solution. My current preference is the data.table package, which significantly enhances the capabilities of R’s ubiquitous data frame. Using a data table, I implement a 2-line function to compute the four new columns and let it rip with portfolio as the “by” group. Simple, fast and efficient.
The third approach uses python along with the numpy and pandas add-on packages that make python look a lot like R. Numpy arrays promote vectorized computations, while pandas replicates many R data frame capabilities. I struggled with the pandas family of grouping functions before deciphering “transform” to solve my problem. The two libraries perform very well.
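A sketch of that transform approach (hypothetical column names, standing in for the real eight-column file): transform returns a result aligned to the original rows, which is exactly what the per-portfolio normalization and percent-change columns need.

```python
import pandas as pd

# Toy stand-in for the stacked portfolio file.
df = pd.DataFrame({
    "portfolio": ["R1000", "R1000", "R2000", "R2000"],
    "index_wd":  [100.0, 102.0, 50.0, 51.0],  # hypothetical with-dividend index
})

grouped = df.groupby("portfolio")["index_wd"]

# transform returns a Series aligned to df's rows, so per-group
# results slot straight back in as new columns.
df["norm_wd"] = grouped.transform(lambda s: s / s.iloc[0])
df["pct_wd"] = grouped.transform(lambda s: s.pct_change())
```

The percent-change column is NaN on each portfolio’s first day, matching the one-fewer-values-than-indexes observation above.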
A final hybrid solution involves both python and R, “pushing” python data to R for S-A-C computation and then “pulling” the output back to python. Using the pypeR python library, a numpy or pandas data structure can be assigned to an R variable, have an R function applied to it, and then be returned to python. Even if it feels like cheating, the capability of seamlessly moving data between python and R with access to the best S-A-C computations of each is pretty cool.
The Split-Apply-Combine metaphor is both important and pervasive in statistical computing and should be part of a core analytics programming arsenal. Fortunately, there’s a growing number of excellent S-A-C tools available for solving statistical problems.