I've been spending quite a bit of time lately working with the data.table package in R. data.table's functionality builds on R's ubiquitous data.frame to provide, according to lead developer Matt Dowle, "Fast aggregation of large data (e.g. 100 GB in RAM) … fast subset, fast grouping, fast update, fast ordered joins and list columns … and a fast file reader (fread) … in a short and flexible syntax, for faster development".
I'm a big fan and have been since the early days, watching data.table's functionality increase dramatically over time. Four years ago, I adopted data.table as a more comprehensible "split-apply-combine" programming metaphor than R's arcane "apply" family of functions. Over time, as both the capabilities and my understanding of data.table have progressed, the package has become central to more and more of my data management work in R. And if my R network is an unbiased sample of the community, data.table is enjoying rapidly growing success.
Last week I presented on data.table to an all-hands gathering of my company, Inquidia Consulting. Using a 9M+-record, 25-attribute, 1.7 GB health care data set, I put data.table through its data management paces: loading, updating, sorting, selecting, projecting, grouping and aggregating data. On my PC, the initial data load took about 45 seconds, while every other demonstration step completed in under 5 seconds, most in a second or two.
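A minimal sketch of those paces on a toy stand-in for the health care file (the column names here are invented, not from the actual data set):

```r
library(data.table)

# Toy stand-in for the 9M-row health care file; columns are hypothetical
dt <- data.table(id     = 1:6,
                 state  = c("IL", "WI", "IL", "WI", "IL", "WI"),
                 charge = c(100, 200, 150, 250, 120, 180))

# update by reference -- adds a column without copying the table
dt[, discounted := charge * 0.9]

# sort in place, descending by charge
setorder(dt, -charge)

# select rows (i) and project columns (j)
il <- dt[state == "IL", .(id, charge)]

# group and aggregate
agg <- dt[, .(total = sum(charge)), by = state]
```

On real data the load step would be `fread("file.csv")` rather than a hand-built table; the in-place `:=` and `setorder` operations are what keep the non-load steps down in the seconds range.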
My approach to explaining data.table to the class was to link its operations to the SQL everyone already knows well: data.table's i, j and "by" are SQL's where, select and group by. Indeed, about three quarters of the way through my presentation, one wise guy blurted "all well and good, but I can do everything you've shown in SQL!". His timing couldn't have been better, since my final examples showed sophisticated statistical computations and graphics in the "by" functions. Take that, SQL.
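The mapping can be sketched like so, on a hypothetical claims table (not the data set from the talk): i plays WHERE, j plays SELECT, and "by" plays GROUP BY — except that j will happily run arbitrary R, not just SQL-style aggregates.

```r
library(data.table)

# Hypothetical claims table for illustration
claims <- data.table(state  = c("IL", "IL", "WI", "WI", "IL"),
                     payer  = c("A", "B", "A", "B", "A"),
                     charge = c(100, 250, 175, 300, 125))

# SQL:        SELECT payer, AVG(charge) AS avg_charge
#             FROM claims WHERE state = 'IL' GROUP BY payer
# data.table: i is the WHERE, j the SELECT, "by" the GROUP BY
claims[state == "IL", .(avg_charge = mean(charge)), by = payer]

# ...and j goes beyond SQL aggregates: any R computation per group
claims[, .(q75 = quantile(charge, 0.75)), by = state]
```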
As I was preparing for the class, I revisited some data.table work I blogged on last winter. Lo and behold, I was able to lop off another minute, 33% of the data creation time, with just the understanding I'd gained since then. I also found what appeared to be a bug in data.table's speedy fread function. Rather than go through the R bug reporting process like I should have, however, I emailed Dowle directly. After a few quick go-rounds, he determined the problem was related to the dreaded Windows platform (real stats guys use Linux), fixed it, and uploaded the revision to CRAN for my update. Not only that, but he offered me gratis access to the online training materials he and co-developer Arun Srinivasan have assembled. His thinking was, better the largess than my persistent questions, I'm sure.
There's no shortage of excellent data.table documentation available for free, but for those looking to turbo-charge their learning, the $95 for the Dowle/Srinivasan video training is a good investment. I generally find it advantageous to address multiple sets of learning materials simultaneously, working through the toy examples while attempting to extrapolate the new techniques to a meaningful data set, in this case the 9M record health care data.
The training video is divided into three lessons: Novice, Yeoman and Expert. Each lesson consists of example-based lectures followed by exercises. In total, there's about 30 minutes of instructor material followed by a welter of online questions. The material can be completed in 3-4 hours by someone familiar with basic R programming.
Not surprisingly, there was learning for me in all sections, even Novice, where I discovered syntax options I hadn't seen before. It's also handy to know that "by" variables can be computations on existing attributes. One particularly nice feature of data.table well covered in the training is "chaining", in which the result of one data.table expression is fed directly into the next, all in a single statement.
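Both ideas in one small sketch, on invented data: "by" grouping on a computed decade-of-age, then a chain that aggregates, filters the aggregate, and sorts it in a single statement.

```r
library(data.table)

# Invented ages and charges, purely for illustration
dt <- data.table(age    = c(25, 34, 47, 52, 61, 39),
                 charge = c(100, 150, 200, 250, 300, 175))

# "by" can be a computation on an existing column: group by decade of age
dt[, .(mean_charge = mean(charge)), by = .(decade = 10 * (age %/% 10))]

# chaining: aggregate, filter the result, then sort -- one statement
top <- dt[, .(mean_charge = mean(charge)),
          by = .(decade = 10 * (age %/% 10))
         ][mean_charge > 150
         ][order(-mean_charge)]
```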
Since Dowle's day job is in financial services where time series data rules, Expert revolved around the means to handle such data in data.table. Much of that material was new to me, and while I won't be using it every day, I'm sure the techniques will find their way into my scripts in time.
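One hallmark of data.table's time-series support is the rolling join, in which roll = TRUE carries the last observation forward. A minimal sketch on made-up tick data (my illustration, not an example from the course):

```r
library(data.table)

# Made-up tick data: prices observed at times 1, 5 and 10
prices <- data.table(time = c(1, 5, 10), price = c(100, 101, 103))
trades <- data.table(time = c(4, 7, 12))
setkey(prices, time)
setkey(trades, time)

# roll = TRUE carries the last observation forward, so each trade
# picks up the most recent price at or before its own timestamp
prices[trades, roll = TRUE]
```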
I just took possession of a new notebook with 32 GB RAM, so I'm now in a position to put data.table through its paces on some pretty hefty data. I've worked with a 15 GB data.table so far and have plans for 25 GB soon. The performance for load, select, group and apply has been quite gratifying. It's certainly nice to know I can maneuver around 100 M records with 50 attributes at memory speeds on a notebook. I look forward to the continued development of data.table as a go-to analytics solution in my data science tool chest.