I met up with an old analytics colleague over the holidays. We worked together from the mid-'90s to the early '00s. Our bond was the use of SAS as a foundation for much of the intelligence work we did at the time, including data programming, reporting and statistical analysis. Shortly after 2000, I started to migrate from SAS, first to S+ and then to R, while my friend remained steadfast in his loyalty to SAS.
Though we continue to be good friends and have the highest regard for each other's work, we joust light-heartedly about our statistical platform choices whenever we meet. And we tend to hit all the marketing stereotypes: I decry SAS's dated language portfolio of data steps, procs and macros in contrast to object-oriented, array-centric R, while he counters that R is quirky and poorly documented. He lambastes the quality risks of open source software while I rue the expense and oft-cited heavy-handedness of SAS business practice.
He boasts of SAS's current dominance in the statistical software market, and I counter that the future belongs to the R platform that's now the choice of statistical academia worldwide. He notes the strides that SAS has made in usability while I tout having almost immediate access to the latest techniques developed by top practitioners and their students. And I cite R's superiority in graphics while he scoffs that in-memory processing limitations consign R to academic-sized problems.
I must admit the dig that R is useful for only toy applications and prototypes bugs me. Though I occasionally run into capacity problems doing statistical work on my PC, I've never experienced the draconian limitations with Wintel R that some naysayers claim. Indeed, since I started using the 64-bit R install – which can utilize over 80% of available RAM – on my 4 GB machine, I've run out of memory maybe half a dozen times. So I decided to test the limits of R on my machine by creating and working with some pretty large data.
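For readers who want to run the same kind of capacity check, a minimal sketch of R's built-in memory instrumentation follows. The object here is just illustrative filler; on Windows, `memory.limit()` reports the cap the session is working under.

```r
# Create a throwaway object and see what it costs in RAM.
x <- rnorm(1e6)                        # one million doubles, roughly 8 MB
print(object.size(x), units = "MB")

# Report memory currently in use (and trigger a garbage collection).
gc()

# Windows-only: the session's memory ceiling in MB --
# effectively unbounded on a 64-bit build, ~2-3 GB on 32-bit.
# memory.limit()
```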
Unable to immediately find an existing data set that met my size requirements, I decided to build my own from the 5-Percent Public Use Microdata Sample (PUMS) files of the U.S. Census. Using a Ruby program that processed over 5 million household and almost 14 million person records from the 2000 Census, I was able to assemble a 5.4M case, 14 attribute, comma-delimited file of wage earners. Attributes include age, sex, race, education, housing status, residence state, worker class, annual salary and annual income.
The resulting data file causes no problems for R per se, loading quickly into a data frame and consuming less than 300 megabytes of memory in the process. “By group” descriptive statistics and summary graphics derived from the entire data frame are fast and efficient, leaving me plenty of memory room to maneuver.
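The load-and-summarize workflow above can be sketched as follows. The file and column names are hypothetical stand-ins for my PUMS extract; declaring `colClasses` up front spares `read.csv` its type-guessing pass, which matters at this scale.

```r
# Hypothetical file/column names standing in for the 5.4M-row PUMS extract.
# Naming colClasses lets base R match classes to columns by name.
earners <- read.csv("pums_earners.csv",
                    colClasses = c(age = "integer", sex = "factor",
                                   state = "factor", salary = "numeric"))

# "By group" descriptive statistics: median salary by state and sex.
aggregate(salary ~ state + sex, data = earners, FUN = median)

# A quick summary graphic on the full frame.
boxplot(log10(salary) ~ sex, data = earners)
```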
It's predictive models on the large data set, however, that cause problems. In fact, objects produced from R model functions such as “lm” and “gam” are often much bigger than the original data, since they include that data in addition to forecasts, residuals and other computations. Alas, it's become pretty clear to me that my machine as currently configured isn't consistently up to the task of building multi-variable models on 5+ million row data sets.
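One way to see the bloat for yourself: by default `lm` embeds a copy of the model frame in the fitted object, which can be dropped with `model = FALSE` when memory is tight (some downstream methods will then need the data supplied again). A small self-contained illustration, with simulated data standing in for the Census file:

```r
set.seed(1)
n <- 1e5
d <- data.frame(age    = sample(18:70, n, replace = TRUE),
                sex    = factor(sample(c("M", "F"), n, replace = TRUE)),
                salary = rnorm(n, 40000, 12000))

fit  <- lm(salary ~ age + sex, data = d)                 # keeps the model frame
fit2 <- lm(salary ~ age + sex, data = d, model = FALSE)  # drops it

print(object.size(fit),  units = "MB")   # noticeably larger ...
print(object.size(fit2), units = "MB")   # ... than the slimmed-down fit
```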
Not easily deterred, I made an accommodation: I maintain the basic descriptive "by group" processing against the entire 5.4M cases, while using a random sample of 1M records for the predictive models. This seems to work quite well, especially if I periodically restart the R session to release dangling memory.
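The sampling compromise is a one-liner in R. Again the frame name is a hypothetical stand-in for the full extract; between model runs, `rm` plus `gc` reclaims the big intermediates without a full session restart.

```r
# "earners" stands in for the full 5.4M-row frame (simulated here).
set.seed(2000)
n <- 5e5
earners <- data.frame(age    = sample(18:70, n, replace = TRUE),
                      salary = rnorm(n, 40000, 12000))

# Reproducible random sample for the predictive work.
idx      <- sample(nrow(earners), 1e5)
modeldat <- earners[idx, ]

fit <- lm(salary ~ age, data = modeldat)
summary(fit)$coefficients

# Release the large intermediates between modeling runs.
rm(fit, modeldat); gc()
```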
For the entire duration of my R tests, I had a copy of the complete 5.4 million records loaded into a Tableau visualization session, with the performance of its spiffy graphical operations proving very serviceable. And Tableau on a random sample of 2M records is downright speedy. Now if only Tableau and R could inter-operate, data scientists and science-of-business BI practitioners would be in analytical heaven …
All things considered, I'd say I'm pleased with the results of experiments with R on my 4 GB machine. Using the 64-bit build, I'm able to quickly load and manipulate pretty large data sets. And with a bit of programming prudence and planning, I can do just about all of my descriptive/graphical “by group” statistical tricks. I have to be smart with big data, though, as R's predictive modeling functions and packages consume memory ravenously. In those cases, random sampling is a friend of necessity.
In several weeks, I'll say goodbye to 4 GB RAM and welcome a new Wintel 8 GB RAM notebook. I can't wait to measure the impact of the increased capacity on the size of problems I can address. I'll report back to IM readers when I make sense of it all.