By-Group Processing, the R data.table and the Power of Open Source
I just finished developing a couple of statistical analysis R scripts/functions for a customer. The work was pretty mundane: read several text files, merge the results, reshape the intermediate data, calculate some new variables, take care of missing values, attend to metadata, execute a few predictive models and graph the results.
Then repeat the models and graphs for groups or sub-populations marked by distinct values of one or more dimension variables of interest. The latter step is commonly referred to as “by-group processing.”
SAS programmers will recognize by-group processing as syntax that invokes a procedure on a sorted data set, something like:
proc reg data = dblahblah; by vblahblah;
In R, it’s all about functions and objects. So analysts invoke “by,” “aggregate” or one of the “apply” family of functions. Each of these takes another function as an argument and calls it once for each group's data, which makes for very powerful computations.
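As a rough illustration of the base R approach, here is a minimal sketch using the built-in mtcars data set, with cyl standing in for the dimension variable. The data set, the variable names and the regression formula are my own choices for illustration, not the customer's data or models.

```r
# Base R by-group processing on the built-in mtcars data set.
# Grouping variable (an illustrative choice): cyl, the number of cylinders.

# aggregate(): one summary row per group, returned as a data frame.
mean_mpg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# by(): call an arbitrary function, here a regression, on each group's rows.
fits <- by(mtcars, mtcars$cyl, function(d) lm(mpg ~ wt, data = d))

# tapply(): one of the apply family; returns a result named by group.
n_per_group <- tapply(mtcars$mpg, mtcars$cyl, length)

print(mean_mpg)
print(n_per_group)
```

The by() call is the closest analogue of the SAS by statement: the same model is fit separately for each value of cyl, and the fitted models come back in a single object indexed by group.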
It took me a while to get comfortable with the R approach. Not only are the functions somewhat arcane, but often the data structure returned, while holding all pertinent calculations, must be massaged into something easier to manipulate.
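To give a sense of the massaging involved, the sketch below takes per-group regression coefficients returned by by() and flattens them into an ordinary data frame. Again, mtcars and the mpg ~ wt model are stand-ins I picked for illustration, and this is only one of several common idioms.

```r
# by() returns an object of class "by": a list-like array with one
# element per group, which is awkward to filter, merge or plot directly.
fits <- by(mtcars, mtcars$cyl, function(d) coef(lm(mpg ~ wt, data = d)))

# Stack the per-group coefficient vectors into a matrix, one row per group.
coef_table <- do.call(rbind, fits)

# Turn the group labels (carried as row names) into a regular column.
coef_df <- data.frame(
  cyl = as.numeric(rownames(coef_table)),
  coef_table,
  row.names = NULL,
  check.names = FALSE
)

print(coef_df)
```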
By-group processing is so prevalent it’s often the topic of questions to the R help lists. At least once every few weeks, a query on how to handle it surfaces. And the community generally responds dutifully, except if the request is ill-specified, in which case the questioner is torched.
A recent thread started with the basic question of how to code a specific by-group task in R, followed by a half dozen well-articulated responses that included sample code fragments. One terse note, however, simply asked if the questioner had considered the data.table package to complete his task. I hadn’t heard of data.table and so decided to install it on my R instance and take a look at the help page.
Once I’d read the available vignettes and worked through the simple examples, I was intrigued enough to test drive data.table with a real challenge. So I took one of the completed scripts and set out to translate the by-group code to the data.table structure. After a few programming iterations, I was able to replicate the script’s functionality with more compact and readable code.
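For a flavor of what such a translation can look like, here is a small sketch of by-group summaries and per-group models in data.table syntax. It assumes the data.table package is installed, and it uses mtcars and an mpg ~ wt regression as illustrative stand-ins rather than the actual script.

```r
library(data.table)  # assumes the data.table package is installed

dt <- as.data.table(mtcars)

# Summaries per group: the j expression is evaluated once per value of cyl.
# .N is the number of rows in the current group.
summary_dt <- dt[, .(n = .N, mean_mpg = mean(mpg)), by = cyl]

# Per-group regression coefficients, returned directly as a flat table.
# .SD is the subset of the data for the current group.
coef_dt <- dt[, as.list(coef(lm(mpg ~ wt, data = .SD))), by = cyl]

print(summary_dt)
print(coef_dt)
```

Both results are themselves data.tables with one row per group, so there is no separate flattening step before merging or graphing them.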
The data structures that emerged were simpler and easier to manipulate than those they replaced. And while the original script ran quickly, the new version was even faster, a consequence of the package author's strategic use of optimized C code and more efficient search structures. Indeed, I was quite gratified with the performance of data.table on even complicated calculations from a 6M record data set. All this and I’d explored less than 50 percent of the package's functionality. I was sold. data.table will be a staple of my R tool chest going forward.
The point of this ramble? One of the major strengths of open source projects like R is the significant contributions of unpaid users. There are now well over 2,000 packages written by the worldwide R community freely available for download. And the functionality they provide often laps that of commercial competition.
As an illustration, I’m a big proponent of the statistical learning work espoused by Trevor Hastie, Rob Tibshirani and Jerome Friedman in their classic “Elements of Statistical Learning.” Most of the procedures presented in ESL are complemented by packages written in R, often developed by the authors themselves or their Stanford students. And all are freely available. You get the latest techniques, written by the very experts who developed them, far in advance of the commercial competition, and for free.
It’s comforting to know that the statistical package I embrace has many top worldwide practitioners as contributors. It’s also heartening that the platform will continue to expand as volunteer developers add freely-usable features that make statistical programming ever simpler and more efficient.