I owe an apology to the R Project for Statistical Computing. In last week’s blog on R and Python, I conjectured that Python statistical learning functions may, in many cases, perform better than their R counterparts, noting that “I use the Python versions of gradient boosting and random forests on models of a half million cases without hesitation, while I can’t recall running an equivalent R model on N > 200,000 with impunity.” While the Python scikit-learn modules do appear to perform well, I hadn’t at the time re-calibrated the latest version of R on my new 16 GB RAM, high-end processor Wintel notebook. It turns out there’s quite a boost in performance from both hardware and software.
Once I finished my preliminary tests in Python, I looked at the same data in R, starting with a sample of 100,000 cases from a data set that includes the dependent variable wages, along with the predictors age, sex and education. My usual suspects among R learning models include linear regression with cubic splines, multivariate adaptive regression splines (MARS), generalized additive models (GAM), and gradient boosting. The first three generally run quite a bit faster than gradient boosting, which involves computationally intensive resampling.