I vividly remember one of the times as a teenager I received a dose of tough love from my parents. I had just spent two consecutive miserable days looking for a summer job without a trace of success. At dinner, I shared my frustrations with mom and dad, seeking solace. They listened intently, nodded approvingly and were seemingly sympathetic - but then dropped a parental bomb. Well, you gotta get used to it. Lifes tough. Tomorrows another day. You should try a different approach. Pick yourself up by your bootstraps.
Id never realized my folks were so literary - familiar with Adventures of Baron Munchausen, by Rudolph Erich Raspe, in which the baron saves himself from drowning in a deep lake by picking himself up by his own bootstraps. Apparently, the parents of many budding statisticians were just as well read, for the metaphor of the bootstrap is pervasive in statistical theory and machine learning today.
Though the bootstrapping techniques of the statistical world can be quite complicated, the basic concepts are remarkably simple. Statisticians are generally interested in making intelligent statements about a population when they have to make do with a fractional sample of that population as data points. An obvious example is a voting population thats sampled by a poll to predict an election outcome. Historically, statisticians have worked with strong assumptions about the probability models underpinning the population and then used arduous mathematics to derive conclusions about the adequacy of sample statistics. If the population looks like such-and-such, then sample statistics that look like this-and-that will do a reasonable job estimating population characteristics.
The bootstrap, however, uses a combination of random sampling and computer finesse to bypass the often intractable mathematics relating population to sample, encouraging statisticians to lift themselves up by their bootstraps in their quests to learn about the population. As a starting point, the bootstrap assumes the given sample is a reasonable approximation of the underlying population. Rather than derive results with math, the bootstrap uses the computer to conduct many random resamples with replacement on the original data, calculating and tracking a distribution of relevant statistics that shed light on the population. That original sample thus assumes the role of the population, and computer simulation is used to generate independent samples from that population. With statistics computed from a large number of these simulations, the resulting distributions often have desirable properties that are useful for making strong inferences about the real population. In fact, the bootstrap has performed so well that its use, along with other computer simulation techniques, is now de rigueur in the statistical world. Prudent business intelligence (BI) analysts are wise to consider adding the bootstrap to their tool chests as well.
Statistical Machine Learning
The BI world is the fortunate beneficiary of exciting collaboration in the academic world of statistics, computer science, decision science, operations research and the social and physical sciences. The statistics department at the University of California, Berkeley, with a reputation among the the best in the world, is helping to lead the charge in the field of statistical machine learning with a research mandate that is driven by applied problems in science and technology, where data streams are increasingly large-scale, dynamical and heterogeneous and where mathematical and algorithmic creativity are required to bring statistical methodology to bear.1
The approaches of learning from data can be categorized as either supervised or unsupervised. With supervised learning, the goal is to predict a known outcome measure from other input attributes. Once models are trained from outcome and input attributes of the past, they are tested with new input measures to determine how well they perform as future predictors. Examples of supervised learning candidates include fraud detection or credit worthiness in finance, up-sell/cross-sell in retail, recidivism with substance abusers and churn prediction in telecom. Unsupervised learning, on the other hand, has no outcome measure to predict. Rather, its intention is to describe the patterns of associations and correspondence among the input measures.2
Theres no shortage of statistical machine learning approaches to developing supervised predictive models. Of course, there are the traditional, statistical linear regression and classification models in addition to multivariate techniques such as linear and quadratic discriminant analysis. There are also classification trees and forests, as well as Bayesian and nearest-neighbor methods. Then there are the kernel methods, neural nets and support vector machines. Each of these attempts in some manner to optimize the accuracy of predicting outcome measures from inputs.3
The results of many of the previously discussed models can often be enhanced by the finesse of statistical approaches. Ensemble methods refer generally to the approach of improving model performance by either iteratively bundling disparate models or by averaging the models produced from multiple passes of the data.4 Ensembles thus introduce crowd concepts to machine learning methods.