I owe an apology to the R Project for Statistical Computing. In last week’s blog on R and Python, I conjectured that Python statistical learning functions may, in many cases, perform better than their R counterparts, noting that “I use the Python versions of gradient boosting and random forests on models of a half million cases without hesitation, while I can’t recall running an equivalent R model on N > 200,000 with impunity.” While the Python scikit-learn modules do appear to perform well, I hadn’t at the time re-calibrated the latest version of R on my new 16 GB RAM, high-end processor Wintel notebook. It turns out there’s quite a boost in performance from both hardware and software.
Once I finished my preliminary tests in Python, I turned to the same data in R, starting with a sample of 100,000 cases from a data set that includes the dependent variable wages, along with predictors age, sex and education. My usual-suspect R learning models include linear regression with cubic splines, multivariate adaptive regression splines (MARS), generalized additive models (GAM), and gradient boosting. The first three generally run quite a bit faster than gradient boosting, which involves computationally intensive resampling.
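For readers who'd like to try the faster models themselves, here's a minimal sketch of the spline regression and GAM fits using only packages that ship with R (splines and mgcv). The data frame below is a simulated stand-in for the wage data described above — the column names logwage, age, sex and education follow the post, but the values are made up:

```r
library(splines)   # base R: natural cubic spline bases
library(mgcv)      # ships with R: generalized additive models

set.seed(1)
# simulated stand-in for the post's wage data
n <- 10000
d <- data.frame(
  age       = sample(18:65, n, replace = TRUE),
  sex       = factor(sample(c("male", "female"), n, replace = TRUE)),
  education = factor(sample(1:5, n, replace = TRUE))
)
d$logwage <- 1 + 0.04 * d$age - 0.0004 * d$age^2 +
  0.1 * as.numeric(d$education) + 0.2 * (d$sex == "male") +
  rnorm(n, 0, 0.3)

# linear regression with a cubic spline in age
fit_lm  <- lm(logwage ~ ns(age, df = 4) + sex + education, data = d)

# generalized additive model with a smooth term for age
fit_gam <- gam(logwage ~ s(age) + sex + education, data = d)

pred_lm  <- predict(fit_lm,  newdata = d)
pred_gam <- predict(fit_gam, newdata = d)
```

MARS models would be fit similarly with the CRAN earth package; the pattern — formula interface, then predict() on new data — is the same across all of these.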
I was pleased with the very speedy performance of all models at 100,000 cases, so I next sampled 200,000 records and saw computing time roughly double – still excellent. I then included all 500,000+ records in the data set and continued to see near-linear scaling. I was on a roll.
Next up was a similar dataset with a total of over 15M records, of which I sampled 2.5M – 2M to train and 500,000 to test. To my surprise, R again was up to the challenge, the linear regression, GAM and MARS models all completing in less than 25 seconds on the 2M record training data. The thirsty gradient boosting model, which could barely complete with 200,000 cases on my old computer with prior releases of R, finished in a respectable 5 minutes.
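The post doesn't name the gradient boosting implementation; assuming the CRAN gbm package, a fit-and-time sketch might look like the following. The data are again a simulated stand-in, scaled well down from the 2M training records for illustration, and the tuning parameters shown are arbitrary:

```r
library(gbm)   # CRAN package: one common R gradient boosting implementation

set.seed(4)
n <- 20000   # scaled down from the post's 2M records
d <- data.frame(
  age       = sample(18:65, n, replace = TRUE),
  sex       = factor(sample(c("male", "female"), n, replace = TRUE)),
  education = factor(sample(1:5, n, replace = TRUE))
)
d$logwage <- 1 + 0.04 * d$age - 0.0004 * d$age^2 +
  0.1 * as.numeric(d$education) + 0.2 * (d$sex == "male") +
  rnorm(n, 0, 0.3)

# wall-clock the fit, the same way the timings above were measured
t <- system.time(
  fit <- gbm(logwage ~ age + sex + education, data = d,
             distribution = "gaussian", n.trees = 100,
             interaction.depth = 3, shrinkage = 0.1)
)
t["elapsed"]   # seconds to fit
```

system.time is the simplest way to reproduce these kinds of benchmarks on your own hardware.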
I was so delighted with the findings that I decided to use some of the model results for an R graphics presentation I’m developing. Fitting a gradient boosting model of logwage as a function of age, education level and sex to the training data, I then looked at model predictions against both the training and test partitions. Figure 1 shows the relationship between actual and predicted values for both training and test data sets using R's hexbinplot. The red line represents a simple linear regression between actual and predicted; the blue “curve” depicts a cubic spline. That the relationships are visually quite similar between panels is a good sign, suggesting what’s good for training is also good for test.
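A Figure 1-style panel plot can be sketched with the CRAN hexbin package, which supplies a lattice-based hexbinplot. The actual/predicted values below are simulated, and smooth.spline stands in for the cubic spline overlay:

```r
library(hexbin)   # CRAN package: hexagonal binning, lattice-based plots

set.seed(2)
# simulated actual vs predicted values for two partitions
n <- 50000
mk <- function(lab) {
  a <- rnorm(n)
  data.frame(actual = a, predicted = 0.8 * a + rnorm(n, 0, 0.3), set = lab)
}
both <- rbind(mk("train"), mk("test"))

p <- hexbinplot(actual ~ predicted | set, data = both,
  panel = function(x, y, ...) {
    panel.hexbinplot(x, y, ...)
    panel.abline(lm(y ~ x), col = "red")       # simple linear regression
    ss <- smooth.spline(x, y)                  # spline smooth (stand-in
    panel.lines(ss$x, ss$y, col = "blue")      #  for a cubic spline)
  })
```

Printing p renders the two side-by-side hexbin panels with the red regression line and blue spline curve overlaid.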
Figure 2 depicts model predictions for all age, education and sex intersections that occur in the data. R’s “trellis” features clearly show the curvilinear relationship between age and wages, the positive impact of education on wages, and the sex wage differential. At first I actually included a fourth feature, race, in the model and graphic, but later discarded it as relatively unimportant given the other features.
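A Figure 2-style trellis display can be built with lattice, which ships with R: predict over every age × education × sex combination and panel by education, grouping by sex. Everything here — the data, the education levels, the spline model — is a hypothetical stand-in for the post's actual model:

```r
library(lattice)   # ships with R: trellis graphics
library(splines)

set.seed(3)
# simulated stand-in for the post's wage data
n <- 20000
ed_levels <- c("HS", "College", "Advanced")   # hypothetical levels
d <- data.frame(
  age       = sample(18:65, n, replace = TRUE),
  education = factor(sample(ed_levels, n, replace = TRUE), levels = ed_levels),
  sex       = factor(sample(c("male", "female"), n, replace = TRUE))
)
d$logwage <- 1 + 0.05 * d$age - 0.0005 * d$age^2 +
  0.15 * as.numeric(d$education) + 0.2 * (d$sex == "male") +
  rnorm(n, 0, 0.3)

fit <- lm(logwage ~ ns(age, df = 4) + education + sex, data = d)

# every age x education x sex intersection, as in Figure 2
grid <- expand.grid(age = 18:65,
                    education = factor(ed_levels, levels = ed_levels),
                    sex = factor(c("male", "female")))
grid$pred <- predict(fit, newdata = grid)

p <- xyplot(pred ~ age | education, groups = sex, data = grid,
            type = "l", auto.key = list(columns = 2),
            xlab = "age", ylab = "predicted logwage")
```

Each panel shows the curvilinear age-wage profile for one education level, with separate curves by sex.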
Having a large data set to partition into ample train-tune-test subsets can be a big advantage for modeling. With ample data, it’s less likely the model will be overfit – i.e., that the results from training data will be significantly more flattering than those from test. When data are at a premium, on the other hand, cross validation (CV) is generally deployed to combat overfitting. With CV, the data are first subdivided into, say, 10 exclusive, equal-sized, random partitions. Each of the 10 partitions, in turn, serves as test data for the remaining 9 combined for training. The final model performance estimate is generally some sort of average over the individual partition estimates. Alas, fitting 10 models, each on 90% of the data, can be very expensive, especially when the computation involves extensive resampling.
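The CV procedure just described can be sketched in a few lines of base R. The toy data and the use of RMSE as the performance measure are illustrative choices, not anything from the post:

```r
set.seed(42)
# toy data: one predictor, noise sd of 1
n <- 1000
d <- data.frame(x = rnorm(n))
d$y <- 2 * d$x + rnorm(n)

k <- 10
fold <- sample(rep(1:k, length.out = n))   # 10 exclusive, equal-sized,
                                           # random partitions

rmse <- sapply(1:k, function(i) {
  fit  <- lm(y ~ x, data = d[fold != i, ])        # train on the other 9 folds
  pred <- predict(fit, newdata = d[fold == i, ])  # test on the held-out fold
  sqrt(mean((d$y[fold == i] - pred)^2))
})

cv_rmse <- mean(rmse)   # average over the 10 held-out partitions
```

The expense is plain in the sapply loop: the model is refit 10 times, each on 90% of the data, which multiplies the cost of an already resampling-heavy learner like gradient boosting.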
R naysayers are indeed correct when they argue that the platform’s full bounty is limited by available memory and processing power. But R can scale quite a bit when it has access to resources. R modelers should invest in hardware and not be afraid to take on larger problems: the combination of enhanced computing power and more efficient R releases has raised the performance bar.