I recently participated in two of the four sessions of the just-completed webinar series: “The Evolution of Regression Modeling: from Classical Linear Regression to Modern Ensembles” from Salford Systems.
The odd-numbered sessions, which I sat through, provide the conceptual background for statistical learning from a regression perspective, in which a dependent or response variable is predicted from a set of independent, predictor or feature variables. Sessions two and four illustrate the methods using SPM, the Salford Predictive Modeler software suite. I plan to watch those videos soon and will detail my reactions in a future blog.
Salford Systems is a software and services company focused on supervised statistical learning. The name Salford itself is a shout-out to Stanford and U.C. Berkeley, homes of the two top-rated Statistics departments in the U.S. Much of Salford’s IP originates from the work of the late Leo Breiman of Berkeley and the much-alive Jerome Friedman of Stanford. Salford “combined their groundbreaking technological innovations with our own experience as practitioners of data analysis and predictive modeling.” CEO Dan Steinberg, a Ph.D. econometrician from Harvard, is himself a hands-on practitioner.
If you’re new to statistical learning, the first session, moderated by Mikhail Golovnya, is an excellent introduction, starting from ordinary least squares (OLS) and progressing to regularized models, generalized path seeker (GPS) and multivariate adaptive regression splines (MARS).
Using the ubiquitous Boston Housing data, Golovnya demonstrates that overfitting is a serious problem with OLS: models that look good on training data are often not so compelling on test data. One approach to redressing overfit is to “regularize,” balancing model complexity against performance on the test sample. The choice of regularization parameter determines how closely the final model resembles OLS. Friedman’s GPS estimators dramatically expand the pool of candidate models, facilitating the selection of an optimal one for a given feature set size based on the test data.
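To make the role of the regularization parameter concrete, here is a minimal Python sketch. It is not taken from the webinar or from SPM: it uses ridge regression on a single feature with no intercept, and the toy numbers and the grid of penalties are invented for illustration. A penalty of zero reproduces OLS, and larger penalties shrink the slope toward zero; the penalty is then chosen by test-sample error.

```python
def ridge_slope(xs, ys, lam):
    """Ridge estimate for a one-feature model with no intercept.

    Minimizing sum((y - b*x)**2) + lam * b**2 gives
    b = sum(x*y) / (sum(x*x) + lam); lam = 0 is plain OLS.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def mse(xs, ys, slope):
    """Mean squared error of the line y = slope * x."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy numbers invented for illustration (not the Boston Housing data).
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [2.6, 3.9, 6.8, 8.3]
test_x, test_y = [1.5, 2.5, 3.5], [2.9, 4.8, 6.9]

penalties = [0.0, 0.5, 1.0, 5.0, 25.0]  # hypothetical grid of lam values
slopes = {lam: ridge_slope(train_x, train_y, lam) for lam in penalties}

# Pick the penalty whose fitted slope does best on the held-out test data.
best_lam = min(penalties, key=lambda lam: mse(test_x, test_y, slopes[lam]))

for lam in penalties:
    print(f"lam={lam:5.1f}  slope={slopes[lam]:.3f}  "
          f"test MSE={mse(test_x, test_y, slopes[lam]):.3f}")
print("penalty chosen on test data:", best_lam)
```

With many features the same idea applies coefficient by coefficient, which is where path-style methods such as GPS come in; that part is beyond this sketch.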
Alas, as big an advance as regularization represents, the models are linear only and incapable of discerning local behavior in the data. MARS to the rescue, its splines flexible and adaptive to the shape of the data. The MARS algorithm combines a forward stage that produces “pieces” or basis functions, a backward stage that prunes those pieces, and a selection stage that optimizes on test data. With Boston Housing, MARS handily won the prediction contest against its OLS and GPS competitors.
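The “pieces” MARS works with are usually hinge functions, which are zero on one side of a knot and linear on the other. The sketch below shows only how such basis functions combine into a piecewise-linear prediction. The knot, coefficients, and function names are made up for illustration, and the forward, backward, and selection stages that would actually choose them are not implemented here.

```python
def hinge_pos(x, knot):
    """max(0, x - knot): zero left of the knot, rising to the right."""
    return max(0.0, x - knot)


def hinge_neg(x, knot):
    """max(0, knot - x): zero right of the knot, rising to the left."""
    return max(0.0, knot - x)


def predict(x, intercept, terms):
    """Sum of coefficient * basis_function(x, knot) plus an intercept."""
    return intercept + sum(coef * basis(x, knot) for coef, basis, knot in terms)


# Hypothetical fitted model: slope +2 above x = 5 and slope +1 below it.
intercept = 10.0
terms = [
    (2.0, hinge_pos, 5.0),
    (-1.0, hinge_neg, 5.0),
]

for x in (3.0, 5.0, 7.0):
    print(x, predict(x, intercept, terms))
```

Because each hinge is active on only one side of its knot, the fitted slope can differ from region to region, which is the local behavior a single global linear model cannot capture.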
Session three on trees, ensembles and boosting, moderated by CEO Steinberg, is highly informative as well. Steinberg’s starting point is the non-parametric classification and regression tree (CART), co-created by Breiman and Friedman. CART has been a mainstay learning model for over 35 years, its strengths including automatic variable selection and implicit handling of missing values. On the Boston Housing data, CART’s performance falls between that of GPS and MARS.
A step up from CART exploits the wisdom of crowds to build an ensemble of trees that averages predictions for regression and takes a majority vote for classification. A modification of this approach, popularized by Breiman, uses bootstrap aggregating, or “bagging,” to build its disparate trees from random samples of the training data drawn with replacement. The individual trees are then combined to produce predictions. Bagged CART handily beats the to-this-point competition for fidelity to test data.
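A rough sketch of bagging for regression follows, assuming one-split trees (“stumps”) as the base learner rather than full CART trees. The data, seed, and function names are invented for illustration. Each stump is fit to a bootstrap sample of the rows, and the ensemble prediction is the average of the individual predictions.

```python
import random


def fit_stump(xs, ys):
    """One-split regression tree: returns (threshold, left_mean, right_mean)."""
    overall = sum(ys) / len(ys)
    best_sse, best = float("inf"), (None, overall, overall)
    for thr in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if sse < best_sse:
            best_sse, best = sse, (thr, lm, rm)
    return best


def predict_stump(stump, x):
    thr, lm, rm = stump
    # thr is None when the sample had a single distinct x: constant prediction.
    return lm if thr is None or x <= thr else rm


def fit_bagged(xs, ys, n_trees=25, seed=0):
    """Fit each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def predict_bagged(trees, x):
    """Regression ensemble: average the individual tree predictions."""
    return sum(predict_stump(t, x) for t in trees) / len(trees)


# Toy step-shaped data invented for illustration.
xs = [float(i) for i in range(10)]
ys = [1.2, 0.8, 1.1, 0.9, 1.0, 5.1, 4.9, 5.2, 4.8, 5.0]

trees = fit_bagged(xs, ys)
for x in (0.0, 4.5, 9.0):
    print(x, round(predict_bagged(trees, x), 3))
```

For classification, the averaging step would be replaced by a majority vote over the trees.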
Breiman’s Random Forests is bagging on steroids, choosing randomly not only the records to consider, but also the features or predictors. Random Forests lowers test prediction error even more than bagged CART. When a tuning parameter related to the number of available features is judiciously set, the error drops further still.
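The two sources of randomness can be sketched in a few lines of Python. This shows only the sampling, not the tree building. The data set shape and the `max_features` name (borrowed from common library usage) are assumptions for illustration, with `max_features` standing in for the tuning parameter that limits how many features a split may consider.

```python
import random


def bootstrap_rows(n_rows, rng):
    """Row indices for one tree: n_rows draws with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def candidate_features(n_features, max_features, rng):
    """Feature indices one split may consider: drawn without replacement."""
    return sorted(rng.sample(range(n_features), max_features))


rng = random.Random(42)     # fixed seed so reruns give the same draws
n_rows, n_features = 10, 6  # hypothetical data set shape
max_features = 2            # hypothetical setting of the tuning parameter

rows = bootstrap_rows(n_rows, rng)
print("rows for this tree:", rows)
for split in range(3):
    feats = candidate_features(n_features, max_features, rng)
    print("split", split, "may use features", feats)
```

Setting `max_features` equal to `n_features` removes the feature randomness and leaves plain bagging; smaller values make the trees less alike.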
A different but nonetheless equally effective ensemble method is known as boosting. Boosting starts with a small prediction tree. It then “grows” a second tree to fit the residuals from the first. A third tree is fit to the residuals that remain after the second, and the process continues. Update factors from new trees are “shrunk” towards zero to minimize the risk of overfitting. As might be expected, the TreeNet boosting procedure of SPM produces better results with the Boston data than Random Forests, with the tuned TreeNet giving the best results of all. Steinberg concludes his head-spinning talk with discussions of hybrid models and tests for feature interaction effects. I must admit that after the 60 minutes I had a headache, but a welcome one!
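The residual-fitting loop can be sketched as below. This is a generic squared-error boosting sketch, not Salford's TreeNet. As simplifications, it starts from the mean of the response rather than from a small tree and uses one-split stumps as the base learner. The toy data, the `shrinkage` name, and its value of 0.1 are assumptions for illustration.

```python
def fit_stump(xs, ys):
    """One-split regression tree: returns (threshold, left_mean, right_mean)."""
    overall = sum(ys) / len(ys)
    best_sse, best = float("inf"), (None, overall, overall)
    for thr in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if sse < best_sse:
            best_sse, best = sse, (thr, lm, rm)
    return best


def predict_stump(stump, x):
    thr, lm, rm = stump
    return lm if thr is None or x <= thr else rm


def fit_boosted(xs, ys, n_trees=50, shrinkage=0.1):
    """Each new stump is fit to the current residuals, then added in shrunk."""
    base = sum(ys) / len(ys)          # starting prediction: the mean response
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Only a fraction of each stump's correction is applied.
        preds = [p + shrinkage * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, shrinkage, stumps


def predict_boosted(model, x):
    base, shrinkage, stumps = model
    return base + shrinkage * sum(predict_stump(s, x) for s in stumps)


def train_mse(model, xs, ys):
    return sum((y - predict_boosted(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy curved data invented for illustration.
xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]

for n in (1, 10, 100):
    model = fit_boosted(xs, ys, n_trees=n)
    print(n, "trees: training MSE =", round(train_mse(model, xs, ys), 3))
```

Training error keeps falling as trees are added, which is exactly why the number of trees and the shrinkage are tuned against test data rather than training data.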
I like these conceptual sessions a lot and recommend the videos without hesitation for those looking to get started with statistical learning. And while the SPM software is specialized to a few models and is not a general-purpose replacement for statistical packages such as R or SAS, my experience with the tool so far is that it’s easy to use and can readily handle large data sets. I suspect that for those looking to get that extra 1% lift from highly specialized learning models, SPM could be a sound investment.