Sitting through the seemingly inevitable three-hour delay at Logan does that to me. But, truth be told, I was tired even in the taxi ride to the airport after completing the intense two-day seminar Statistical Learning and Data Mining III, taught by Stanford Statistics professors Trevor Hastie and Robert Tibshirani. I had sprinted with the instructors through the fast-paced seminar and was proud I kept up till almost the very end. Exhausting.
Three years ago, I took SLDM II and decided to consider the course again when II became III. I was a bit hesitant to sign up, though, until an email correspondence with Hastie offered assurance of no more than 50% overlap of material. I'm now glad I made the decision to take the latest version and sign up early. All 70 classroom seats were taken.
The opening lecture set the pace, with Hastie providing a nice perspective on statistical learning as well as a 10,000-foot introduction to many of the methods to be covered. The focus of this discussion was supervised learning, which relates a response or outcome measure to a set of features or inputs, with N cases and p features. The outcome variable can either be continuous (regression) or categorical (classification). Hastie addressed different models including nearest neighbors, kernel smoothing, linear models, additive models, ensembles and support vector machines.
A recurring theme of the two days was the central SL problem of overfitting the training data, which generally leads to poor future predictions. As remedies, Hastie proposed input selection, where the number of independent variables is limited by an established criterion such as best subset, and regularization, where all inputs remain in the model but with restricted coefficients.
Overfit and underfit SL models translate to the traditional variance-bias tradeoff in statistical estimation. After the break, Tibshirani was relentless in emphasizing cross-validation as a means of tuning/testing potential models to balance bias and variance. If participants took away nothing else from the two days, it was the imperative of testing models with data "independent" of the training set. And with limited data, cross-validation becomes a critical tool to achieve that goal.
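To make the cross-validation idea concrete, here is a minimal sketch in plain Python. It is my own illustration, not code from the seminar: the one-feature ridge fit, the function names (`ridge_slope`, `kfold_indices`, `cv_error`, `choose_lambda`) and the simulated data are all assumptions chosen to keep the example small. Each fold is held out once, the model is fit on the remaining folds, and the tuning parameter with the lowest held-out error wins.

```python
import random

def ridge_slope(xs, ys, lam):
    """One-feature ridge fit with no intercept: minimizes sum((y - b*x)^2) + lam * b^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_error(xs, ys, lam, k=5, seed=0):
    """Mean squared prediction error over held-out folds for a given penalty lam."""
    total = 0.0
    for held_out in kfold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit only on the training folds, then score on the fold the fit never saw.
        b = ridge_slope([xs[i] for i in train], [ys[i] for i in train], lam)
        total += sum((ys[i] - b * xs[i]) ** 2 for i in held_out)
    return total / len(xs)

def choose_lambda(xs, ys, grid, k=5, seed=0):
    """Pick the penalty from the grid with the smallest cross-validated error."""
    return min(grid, key=lambda lam: cv_error(xs, ys, lam, k, seed))

if __name__ == "__main__":
    # Simulated data (an assumption for illustration): y = 2x + noise.
    rng = random.Random(1)
    xs = [rng.gauss(0, 1) for _ in range(40)]
    ys = [2.0 * x + rng.gauss(0, 1) for x in xs]
    grid = [0.0, 0.1, 1.0, 10.0, 100.0]
    for lam in grid:
        print(lam, round(cv_error(xs, ys, lam), 3))
    print("chosen lambda:", choose_lambda(xs, ys, grid))
```

The same loop works for any model with a tuning knob; only the fit and predict steps change.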
After a detailed survey of both linear and logistic regression, Hastie transitioned to the forward stepwise regression procedure that uses a subset of the p inputs. Two problems with this algorithm are that overfitting can set in early and variables can't be removed once they enter the model. Better is the more "democratic" Least Angle Regression (LAR), along with methods that combat overfitting (and reduce variance) by shrinking coefficients toward zero. Many of the discussed techniques, including LAR, lasso regression and the hybrid elastic net, were developed by the instructors with their colleagues and students; ridge regression is an older method. Algorithmic and computational advances have generated order-of-magnitude performance gains in some instances.
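The difference between the two main styles of shrinkage is easiest to see in a textbook special case. The sketch below is my own illustration and assumes orthonormal inputs, so each least-squares coefficient can be shrunk on its own: under that assumption the lasso reduces to soft-thresholding and ridge to proportional scaling. The function names and the example coefficients are hypothetical.

```python
def soft_threshold(b, lam):
    """Lasso-style shrinkage: move b toward zero by lam; exactly zero if |b| <= lam."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

def ridge_shrink(b, lam):
    """Ridge-style shrinkage: scale b toward zero; never exactly zero unless b is."""
    return b / (1.0 + lam)

# Hypothetical least-squares coefficients for four orthonormal inputs.
ols = [3.0, -1.5, 0.4, -0.2]
lam = 0.5

lasso = [soft_threshold(b, lam) for b in ols]
ridge = [ridge_shrink(b, lam) for b in ols]

print("lasso:", lasso)  # small coefficients are set to exactly zero: [2.5, -1.0, 0.0, 0.0]
print("ridge:", ridge)  # every coefficient is scaled down but stays in the model
```

This is why the lasso doubles as a variable selector while ridge keeps every input. With correlated inputs neither has this simple closed form, and the fitting is done by algorithms such as LAR or coordinate descent.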
With roots in medical and genomic research, the professors are especially attentive to models where the number of inputs is greater than the number of cases. One of their illustrations involved microarray data from leukemia patients that included 38 cases and 7129 variables (genes). Lasso models appear to perform especially well with p >> N.
After discussions of the different models, the instructors pointed students toward software that implements the techniques. Not surprisingly, most of the cited modules were packages developed in R. In fact, much of the software mentioned was programmed by the instructors, their colleagues and students. It now seems that the latest SL algorithms are not considered legit until an R implementation package is published. That's certainly good news for practitioners!
The second day started with Classification and Regression Trees (CART), a long-used method implemented in just about all statistical packages. With classification problems, sensitivity, the proportion of positive cases predicted as positive, and specificity, the proportion of negative cases predicted as negative, are central. While CART is ubiquitous and easy to work with, other classification and regression methods now dominate. I'm currently testing a CART-competing algorithm, Patient Rule Induction Method (PRIM), recommended by Tibshirani and developed by Stanford colleague Jerome Friedman.
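The two classification measures defined above are simple to compute from actual and predicted labels. Here is a minimal sketch; the function name and the toy labels are mine, with 1 standing for a positive case and 0 for a negative one.

```python
def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) for binary labels coded 1 = positive, 0 = negative."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # positives predicted positive
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # positives missed
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # negatives predicted negative
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false alarms
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)

# Toy labels, made up for illustration.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 0, 1]

sens, spec = sensitivity_specificity(actual, predicted)
print("sensitivity:", sens)  # 3 of 4 positives caught: 0.75
print("specificity:", spec)  # 3 of 4 negatives cleared: 0.75
```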
The most notable extensions to CART are "wisdom of crowds" ensembles that average many trees to produce better classifiers. Bagging, Boosting and Random Forests are all such methods now popular with modelers. The instructors' experience suggests that predictive accuracy ranks as CART < Bagging < Random Forests < Boosting. They particularly noted the positive features of Gradient Boosting, which can handle regression, classification and risk modeling. I can confirm from my own work that Boosting and Random Forests generally outperform CART for both classification and regression.
Friday afternoon was devoted primarily to unsupervised learning, with data having no singled-out response or outcome. One of the primary challenges with unsupervised data is to reduce feature set dimensionality – to extract low-dimensional features from high-dimensional data. Principal Components, along with the attendant Singular Value Decomposition, is a tried and true method. Hastie and Tibshirani again turned to expression arrays where p (genes) >> N (samples) to illustrate the computations. I was particularly enamored with the use of SVD to impute missing values in a data set. The instructors also used this technique to outline a solution to the Netflix movie recommendation competition.
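For the curious, here is a rough sketch of how SVD-based imputation can work. It is my own simplified version using NumPy, not the instructors' code or any particular R package: fill the missing cells with column means, take a low-rank SVD approximation, overwrite only the missing cells with the approximation, and repeat until the fill-ins settle. The function name `svd_impute`, the default rank and the stopping rule are all assumptions.

```python
import numpy as np

def svd_impute(X, rank=1, n_iter=100, tol=1e-9):
    """Fill NaN entries of X by iterating a rank-`rank` SVD approximation.

    Assumes every column has at least one observed value.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from column means of the observed values.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
        # Observed cells are kept as-is; only missing cells take the low-rank value.
        updated = np.where(missing, low_rank, X)
        change = np.max(np.abs(updated - filled))
        filled = updated
        if change < tol:
            break
    return filled

if __name__ == "__main__":
    # Toy rank-1 matrix with one cell knocked out (true value is 1.0).
    truth = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
    X = truth.copy()
    X[0, 0] = np.nan
    print(svd_impute(X, rank=1)[0, 0])
```

In a recommendation setting the rows would be users, the columns movies, and the missing cells the ratings to be predicted; real data would call for a higher rank chosen by cross-validation.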
A suggestion from my experience of three years ago will be at least partially addressed over the next 12-18 months. There's a great deal of interest in an applied companion course to SLDM III that focuses on implementing the highlighted techniques/models using packages developed in R by the statistical learning community. Hastie confided that while there are no imminent plans for such a curriculum, a book co-authored by the instructors and several students focusing on methods taught in SLDM III with lots of examples and R code will probably be published next year. I'm putting my Amazon order in early.
Though probably impractical, one analysis I'd like to see is a comparison of how different models perform on dimensions of interest to analysts. Hastie and Tibshirani addressed relative prediction accuracy with several types of data sets, including email spam and microarrays, where the number of features is larger than the number of observations. They also contrasted computer resource consumption in certain models for which they've developed software, finding in some instances an order of magnitude difference. To be sure, prediction performance and computer resource consumption are key.
My experience is that different models perform better in some tasks than others. I often find with my tall data sets (N >> p), for example, that while ensembles deliver the best predictions, there's a substantial CPU and memory price to pay for that performance. Another consideration might be how convenient it is to include factors, interactions and curvilinearity (key for BI and data science applications) in the models to facilitate predictions.
The more I think about it, though, the more I realize that high-level guidelines are probably the best that can be achieved: which models are fast, which are more resource-consuming, which are better for p > N, which for p < N, which are most flexible/convenient with inputs, which are best for highly collinear features, etc.
All things considered, I'd highly recommend SLDM for BI/analytics practitioners and data scientists. There's little question that predictive analytics' time is now. And as a foundation for PA, statistical learning fits in nicely between traditional statistical methods and computer science data mining – bringing out the best in both with less mathematical obsession than the former and less black-box mystery than the latter.
SLDM III is both a comprehensive and comprehensible survey of SL methods. Hastie and Tibshirani are at the top of their games, leading the SL world from Stanford, the #1 ranked Statistics program in the country. That they're also developing freely-available code in R to implement the latest methods long before the techniques are available in proprietary software is icing on the cake.
I can't wait for SLDM IV. After two classes in Boston though, I'll look to do the next one in Palo Alto. There's a much better chance of getting home on time.