I’ve spent parts of the last few days developing a presentation on statistical learning for Friday, August 24’s OpenBI Day in Chicago.
Then and there, all OpenBI consultants will gather downtown for 10 hours of cross-pollination: updates on existing projects, technical presentations by staff and a happy hour++. In addition to the state of our business, my contribution will be a one-hour introduction to SL with R.
When I first sat down to assemble the SL material, I started feeling anxious, fretting over what to include for the session. My concerns were that I could easily speak for three or four hours, and I didn’t want to overwhelm the group with geeky stats stuff.
The point of departure for the presentation is the challenge to the Statistics discipline to maintain its stature in statistical science. Ten years ago, the late Berkeley Statistician Leo Breiman admonished his colleagues that Statistics was at risk of becoming irrelevant in an increasingly data-rich world because of its fixation on mathematical models: “This commitment (to models) has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. … If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.” Sounds like the critique of economic models four years ago.
In an interview a few years back, eminent Stanford Statistician Brad Efron, though not a hand-wringer like Breiman, nonetheless acknowledged that Statistics is now but one among many players in the statistical science world. “There is now much more statistical work being done in the scientific disciplines, what with biometrics/biostatistics, econometrics, psychometrics, etc. – and business as well. Statistics is now even entrenched in hard sciences like physics. There are also the computer science/artificial intelligence contributions of machine learning and other data mining techniques.”
I pretty much agree with these assessments, and like the path that statistical learning has taken as a compromise between traditional top-down, mathematics-driven Statistics and bottom-up machine learning from Computer Science. SL generally does without the onerous data-generating model assumptions that so bothered Breiman. Its approach is much more algorithmic, with challenges of optimization, programming and computation often replacing mathematical derivations. At the same time, an obsession with the importance of validation and the perils of model overfitting keeps statistical learning aptly cautious. It’s like having the learning cake and eating it too.
From this starting point, I settled on a few themes for my talk, deciding at the outset to limit the discussion to supervised learning, which relates a set of features or independent variables to a response, outcome or dependent variable. The outcome can either be a numeric measure, in which case the learning is referred to as regression, or a categorical variable, where the learning is known as classification. Features can be either numeric or categorical.
I would explain the bias-variance tradeoff. I’d introduce the often-overlooked risk of fitting models to a single set of training data. I’d pay more than lip service to the overfitting-redressing strategies of separate training, tuning and testing data sets, cross-validation and the bootstrap. And I’d highlight shrinkage/regularization techniques that combat overfitting by limiting the size of coefficients – and often eliminating features altogether.
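To make two of those ideas concrete, here is a minimal sketch of k-fold cross-validation wrapped around a shrinkage estimator. It is not the script from my talk, and it's in plain Python rather than R, purely for illustration. The data are simulated, and the function names (ridge_slope, k_fold_indices, cv_mse) and the grid of penalty values are assumptions made up for this example. The model is deliberately tiny: one feature, no intercept, with the closed-form ridge estimate.

```python
import random

def ridge_slope(xs, ys, lam):
    """Closed-form ridge estimate for a one-feature, no-intercept model.

    lam = 0 gives ordinary least squares; larger lam shrinks the slope toward 0.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_mse(xs, ys, lam, k=5, seed=0):
    """Average held-out mean squared error across k folds."""
    folds = k_fold_indices(len(xs), k, seed)
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit on the training folds only, then score on the held-out fold.
        beta = ridge_slope([xs[i] for i in train], [ys[i] for i in train], lam)
        sq = [(ys[i] - beta * xs[i]) ** 2 for i in held_out]
        fold_errors.append(sum(sq) / len(sq))
    return sum(fold_errors) / k

# Simulated data (made up for illustration): y = 2x + Gaussian noise.
rng = random.Random(42)
xs = [rng.uniform(-1, 1) for _ in range(200)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]

lambdas = [0.0, 1.0, 10.0, 100.0]  # hypothetical penalty grid
slopes = [ridge_slope(xs, ys, lam) for lam in lambdas]
cv_errors = [cv_mse(xs, ys, lam) for lam in lambdas]

for lam, b, e in zip(lambdas, slopes, cv_errors):
    print(f"lambda={lam:6.1f}  slope={b:.3f}  cv_mse={e:.3f}")
```

The fitted slope shrinks as the penalty grows, and the cross-validated error is what you would use to pick the penalty instead of trusting the fit to a single training set. With real data and many features, the same loop applies, with a proper penalized regression routine in place of the one-line estimate.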
Besides a slide deck to introduce the concepts, I put together a pretty comprehensive script showcasing half a dozen SL methods using the R Platform for Statistical Computing, drawing a random sample of 100,000 from my 5.4M record Current Population Survey (CPS) data set. The intent was to show the power of R as an integrated data analysis platform that supports object-oriented programming, data management, statistical analysis and visualization. I also wished to introduce the non-R staff to how a modern analyst might think using R.
I decided to showcase a number of different types of techniques, including traditional stats models, shrinkage models, and ensemble methods. Specific illustrations include linear and logistic regression, multivariate adaptive regression splines, generalized additive models, random forests, gradient boosting, and generalized linear models via penalized maximum likelihood (yikes!). The oft-discussed support vector machine was in my initial plan, but I dropped it for poor computational performance with my modest-sized data set.
After several passes, I’m finally feeling pretty comfortable with what I’ve assembled. We’ll see. I’ll find out if the presentation’s well-received not long after this is posted.