I owe an apology to the R Project for Statistical Computing. In last week’s blog on R and Python, I conjectured that Python statistical learning functions may, in many cases, perform better than their R counterparts, noting that “I use the Python versions of gradient boosting and random forests on models of a half million cases without hesitation, while I can’t recall running an equivalent R model on N > 200,000 with impunity.” While the Python scikit-learn modules do appear to perform well, I hadn’t at the time re-benchmarked the latest version of R on my new Wintel notebook with 16 GB of RAM and a high-end processor. It turns out there’s quite a boost in performance from both hardware and software.

Once I finished my preliminary tests in Python, I turned to the same data in R, starting with a sample of 100,000 cases from a data set that includes the dependent variable wages, along with predictors age, sex and education. My usual-suspect R learning models include linear regression with cubic splines, multivariate adaptive regression splines (MARS), generalized additive models (GAM), and gradient boosting. The first three generally run quite a bit faster than gradient boosting, which involves computationally intensive resampling.
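
The post doesn’t show the actual code, but a minimal R sketch of timing those four model fits on a 100,000-case sample might look something like the following. The simulated data frame is a stand-in for the real wage data, and the specific formulas, spline degrees of freedom, and boosting parameters are illustrative assumptions, not the author’s settings.

```r
library(splines)   # ns() for natural cubic splines
library(earth)     # MARS
library(mgcv)      # generalized additive models
library(gbm)       # gradient boosting

set.seed(123)

# Simulated stand-in for a 100,000-case wage sample with
# predictors age, sex, and education (assumed structure)
n <- 100000
samp <- data.frame(
  age       = runif(n, 18, 70),
  sex       = factor(sample(c("F", "M"), n, replace = TRUE)),
  education = factor(sample(c("HS", "College", "Graduate"), n, replace = TRUE))
)
samp$wage <- with(samp, 20 + 0.6 * age - 0.005 * age^2 +
                    3 * (sex == "M") + 5 * (education == "Graduate") +
                    rnorm(n, 0, 10))

# Linear regression with a natural cubic spline on age
system.time(fit_lm <- lm(wage ~ ns(age, df = 5) + sex + education, data = samp))

# Multivariate adaptive regression splines (MARS)
system.time(fit_mars <- earth(wage ~ age + sex + education, data = samp))

# Generalized additive model with a smooth term for age
system.time(fit_gam <- gam(wage ~ s(age) + sex + education, data = samp))

# Gradient boosting -- the resampling/cross-validation makes this
# the most computationally intensive of the four
system.time(fit_gbm <- gbm(wage ~ age + sex + education, data = samp,
                           distribution = "gaussian", n.trees = 500,
                           interaction.depth = 3, cv.folds = 5))
```

Wrapping each fit in system.time() is one simple way to compare elapsed times across the four approaches on the same sample.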
