The Bootstrap

I vividly remember one of the times as a teenager I received a dose of tough love from my parents. I had just spent two consecutive miserable days looking for a summer job without a trace of success. At dinner, I shared my frustrations with mom and dad, seeking solace. They listened intently, nodded approvingly and were seemingly sympathetic - but then dropped a parental bomb. “Well, you gotta get used to it. Life’s tough. Tomorrow’s another day. You should try a different approach. Pick yourself up by your bootstraps

I’d never realized my folks were so literary - familiar with Adventures of Baron Munchausen, by Rudolph Erich Raspe, in which the baron saves himself from drowning in a deep lake “by picking himself up by his own bootstraps.” Apparently, the parents of many budding statisticians were just as well read, for the metaphor of the bootstrap is pervasive in statistical theory and machine learning today.

Though the bootstrapping techniques of the statistical world can be quite complicated, the basic concepts are remarkably simple. Statisticians are generally interested in making intelligent statements about a population when they have to make do with a fractional sample of that population as data points. An obvious example is a voting population that’s sampled by a poll to predict an election outcome. Historically, statisticians have worked with strong assumptions about the probability models underpinning the population and then used arduous mathematics to derive conclusions about the adequacy of sample statistics. If the population looks like “such-and-such,” then sample statistics that look like “this-and-that” will do a reasonable job estimating population characteristics.

The bootstrap, however, uses a combination of random sampling and computer finesse to bypass the often intractable mathematics relating population to sample, encouraging statisticians to “lift themselves up by their bootstraps” in their quests to learn about the population. As a starting point, the bootstrap assumes the given sample is a reasonable approximation of the underlying population. Rather than derive results with math, the bootstrap uses the computer to conduct many random resamples with replacement on the original data, calculating and tracking a distribution of relevant statistics that shed light on the population. That original sample thus assumes the role of the population, and computer simulation is used to generate independent samples from that population. With statistics computed from a large number of these simulations, the resulting distributions often have desirable properties that are useful for making strong inferences about the real population. In fact, the bootstrap has performed so well that its use, along with other computer simulation techniques, is now de rigueur in the statistical world. Prudent business intelligence (BI) analysts are wise to consider adding the bootstrap to their tool chests as well.

Statistical Machine Learning

The BI world is the fortunate beneficiary of exciting collaboration in the academic world of statistics, computer science, decision science, operations research and the social and physical sciences. The statistics department at the University of California, Berkeley, with a reputation among the the best in the world, is helping to lead the charge in the field of statistical machine learning with a research mandate that “is driven by applied problems in science and technology, where data streams are increasingly large-scale, dynamical and heterogeneous and where mathematical and algorithmic creativity are required to bring statistical methodology to bear.”1

The approaches of learning from data can be categorized as either supervised or unsupervised. With supervised learning, the goal is to predict a known outcome measure from other input attributes. Once models are “trained” from outcome and input attributes of the past, they are tested with new input measures to determine how well they perform as future predictors. Examples of supervised learning candidates include fraud detection or credit worthiness in finance, up-sell/cross-sell in retail, recidivism with substance abusers and churn prediction in telecom. Unsupervised learning, on the other hand, has no outcome measure to predict. Rather, its intention is to describe the patterns of associations and correspondence among the input measures.2

There’s no shortage of statistical machine learning approaches to developing supervised predictive models. Of course, there are the traditional, statistical linear regression and classification models in addition to multivariate techniques such as linear and quadratic discriminant analysis. There are also classification trees and forests, as well as Bayesian and nearest-neighbor methods. Then there are the kernel methods, neural nets and support vector machines. Each of these attempts in some manner to optimize the accuracy of predicting outcome measures from inputs.3

Bagging

The results of many of the previously discussed models can often be enhanced by the finesse of statistical approaches. Ensemble methods refer generally to the approach of improving model performance by either iteratively bundling disparate models or by averaging the models produced from multiple passes of the data.4 Ensembles thus introduce crowd concepts to machine learning methods.

Bagging, or bootstrap aggregating, uses bootstrap concepts to resample training data many times, ultimately producing a unified model from an “averaging” of results of the individual computations. These averaged or bagged predictors are often superior to those from a single data pass. Research shows that bagging produces the largest gains in performance when individual bootstrap samples lead to wide variation in prediction coefficients. The averaging of these disparate predictors may produce enhanced predictions.5

Boosting

A second ensemble technique that is used to improve the accuracy of predictions of a single algorithm is boosting. The boosting methodology is built on the premise that while it’s generally very difficult to build a single, highly accurate decision rule, it’s often easy to come up with simple rules of thumb - weak learners - that have at least some accuracy superior to random guessing. A gambler attempting to pick the results of professional football games might, for example, use team records in the last five games and key personnel injuries as starter weak learners.

The challenge for boosting is first to derive a comprehensive list of the weak learners and then to combine them in a way that produces superior predictions. One approach that has proven successful is to formulate the weak learners so each new one in turn is focused on cases misclassified by its predecessors. Once the weak learner list is complete, a simple majority rule determines the final prediction.6

The Wisdom of Crowds Redux

Part 1 of BI Ensemble series discussed James Surowiecki’s The Wisdom of Crowds, which argues the nonintuitive notion that under certain controlled conditions groups can make estimates and decisions that are, in many cases, superior to those of individuals - even experts. Surowiecki then outlines a number of conditions that, taken together, provide a supportive environment for the wisdom of crowds.7 The first is diversity of opinion. Each individual in the group should have private information that is offered for the group decision. The information held by individuals should be independent, so that the decision-making influence of person A on person B is minimal. Individuals in the group should draw on personal and local information, so that the group is decentralized. Finally, there needs to be a mechanism to aggregate the disparate information into a collective decision.8

Developments in statistical and machine learning methodologies indicate that “crowd” wisdom is not constrained to human decision-making. The statistical bootstrap creates its own crowd to opine on decisions, using independent resamples of the diverse population in such a way that each decentralized sample has an equal opportunity to participate. Collective statistics are then aggregated and computed, often producing superior results.

Bagging applies crowd-enhancing bootstrap principals to models derived from machine learning. In many cases, it achieves superior predictive accuracy, especially when there’s a diversity of results from individual cases. And, finally, the wisdom of boosting demonstrates that the whole is often greater than the sum of the parts, with a series of weak individual predictions that aggregate into a strong final model.

A subsequent BI Ensemble column will expand on the use of crowd wisdom for BI, with special emphasis on Web 2.0 applications.

References:

  1. University of California, Berkeley, Department of Statistics Faculty. ”Statiscial Machine Learning: Research Statement.” University of California, Berkeley, Department of Statistics Web page, 2008. http://www.stat.berkeley.edu/users/gvrocha/statlearning/.
  2. Trevor Hastie, Robert Tibshirani and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer: 2001.
  3. Hastie, et. al.
  4. Richard A. Berk. “An Introduction to Ensemble Methods for Data Analysis.” Department of Statistics, UCLA, March 27, 2005.
  5. Leo Breiman. “Bagging Predictors.” Department of Statistics, University of California at Berkeley, September 1994.
  6. Robert E. Schapire. “The Boosting Approach to Machine Learning: An Overview.” AT&T Labs 􀀀 Research, December 19, 2001.
  7. James Surowiecki. The Wisdom of Crowds. Anchor Books: 2004.
  8. Surowiecki.