Once upon a time, a predictive modeler would assemble a data set with 20 variables and maybe 2000 records, then set her favorite statistical package to work on an automated regression procedure to build a model.
Using a forward or backward stepwise algorithm – or a combination of the two approaches – the software would spit out coefficients and p-values. In the final model, the coefficients of all “surviving” variables would be statistically significant, indicating important predictors. These coefficients would, in turn, be used to forecast new observations. Life was good.
This was statistical orthodoxy back in the day – the book, if you will. That the same data were used to both train and test the regression models was problematic, however, leading to results too closely tied – “overfit” – to the specific data set. Indeed, if you were to present such a modeling scenario to the R help list today, you'd be torched, with accusations of basic assumption violations, data snooping, overfitting and other statistical atrocities. Oh how statistical times have changed!
Today, more prediction and forecasting are being done outside traditional, mainstream linear statistics. Many newer techniques are “black box”: they come without the benefits of estimated coefficients and significance levels, but also without the baggage of strong assumptions that weigh down linear models. Indeed, the data mining approaches from computer science and the statistical learning models from the intersection of computer science and statistics look very different from the regression framework many of us grew up with. That's not a bad thing, though.
The continued growth of computing power over the past 20 years has pushed statistics in new directions. Now, the discipline is as much computational as it is mathematical. And computer-intensive techniques like simulation, Monte Carlo, and re-sampling are today front and center in the statistical world. More a programmer than a mathematician, I'm OK with this evolution!
In the current statistical and machine learning worlds, the re-sampling techniques of the jackknife, the bootstrap, permutation testing and cross-validation are prominent. These methods all assume that sample data is a viable proxy for the population it purports to represent. With this assumption, the distribution of repeated random sub-samples from the original sample has desirable properties that make it suitable for testing and validating predictive models. And the computer does all the heavy lifting. You just have to get used to seeing somewhat different answers each time you run your procedures!
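To make the re-sampling idea concrete, here is a minimal bootstrap sketch in plain Python. Everything in it is an assumption for illustration: the simulated sample, the function names `bootstrap_means` and `percentile_interval`, and the choice of 1000 resamples. It treats the sample as a stand-in for the population, draws repeated samples from it with replacement, and looks at how the sample mean varies across those draws.

```python
import random
import statistics


def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Resample with replacement and return the mean of each resample."""
    rng = random.Random(seed)
    n = len(sample)
    return [
        statistics.fmean(rng.choices(sample, k=n))  # one bootstrap replicate
        for _ in range(n_resamples)
    ]


def percentile_interval(values, lower=0.025, upper=0.975):
    """Rough percentile interval read off the sorted bootstrap replicates."""
    ordered = sorted(values)
    lo_idx = int(lower * (len(ordered) - 1))
    hi_idx = int(upper * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]


# Hypothetical sample: 200 simulated measurements (not real data).
data_rng = random.Random(42)
sample = [data_rng.gauss(50, 10) for _ in range(200)]

means = bootstrap_means(sample, seed=0)
low, high = percentile_interval(means)
print("sample mean:", statistics.fmean(sample))
print("approximate 95% bootstrap interval:", (low, high))

# A different seed gives a slightly different interval, as expected.
other_low, other_high = percentile_interval(bootstrap_means(sample, seed=1))
print("interval with another seed:", (other_low, other_high))
```

Fixing the seed makes a run repeatable; changing it shows the “somewhat different answers” you get from one run to the next.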
I recently came across a comprehensible slide deck on cross-validation for evaluating predictive models. In contrast to the example above, where a model is trained and tested on the same data and so runs a high risk of overfitting, CV breaks the data set into disjoint subsets for training and testing. In the simplest case, my 2000-record data set might be randomly partitioned into 1400 cases for training models and 600 for testing. The performance of the different models developed in training would be evaluated on the test data using mean squared error or mean absolute error. Models with lower test error are preferred, and models overfit to the training data would be exposed by poor performance on the test set.
Giving up 600 of 2000 cases for testing is a steep price to pay for evaluating model performance, however. Is there a method that would provide the benefits of testing without surrendering 30 percent of the data set? The answer is yes; the technique is called k-fold cross-validation. With this method, the entire data set is divided randomly into k subsets of equal (or nearly equal) size. Each of the k subsets takes one turn as the test partition: models are trained on the remaining k-1 subsets combined and tested on the held-out subset, with mean squared or mean absolute error computed on the test data. The error calculations from the k rounds are then averaged. Competing models are compared on this averaged error, with lower values preferred. The “sacrifice” of data for testing in any one round is 100*(1/k) percent, and every case is used for testing exactly once.
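Here is a from-scratch sketch of k-fold cross-validation in plain Python, following the description above. The helper names (`k_fold_indices`, `cross_validated_mse`), the simulated data, and the two toy models are all my own assumptions for illustration; a real analysis would plug in its own fitting routine.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the row indices 0..n-1 and deal them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]  # fold sizes differ by at most one


def fit_mean(xs, ys):
    """Baseline model: always predict the training mean of y."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """One-variable least-squares line, fit by the closed-form formulas."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return lambda x: (mean_y - slope * mean_x) + slope * x


def cross_validated_mse(fit, xs, ys, k=10, seed=0):
    """Average the test MSE over k rounds, each holding out one fold."""
    fold_errors = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        mse = sum((ys[i] - model(xs[i])) ** 2 for i in test_idx) / len(test_idx)
        fold_errors.append(mse)
    return sum(fold_errors) / len(fold_errors)


# Hypothetical data set: 2000 simulated records, y = 3 + 2x + noise.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(2000)]
ys = [3 + 2 * x + rng.gauss(0, 1) for x in xs]

folds = k_fold_indices(len(xs), k=10)
mean_cv_mse = cross_validated_mse(fit_mean, xs, ys, k=10)
line_cv_mse = cross_validated_mse(fit_line, xs, ys, k=10)
print("10-fold CV MSE, mean-only model:", mean_cv_mse)
print("10-fold CV MSE, straight line:", line_cv_mse)
```

Passing the same seed for both models means they are scored on identical folds, which keeps the comparison fair.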
For the special case where k equals the number of cases in the data set (2000 in our example), the technique is called leave-one-out (LOO) cross-validation. With large data sets, LOO is very expensive computationally, since a model must be fit once per case. More typically k=10, for 10-fold CV, which holds out 10 percent of the data for testing in each round. For classification problems, false positive and false negative prediction rates are used as test criteria rather than mean squared error.
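A small Python sketch can show both ideas at once: leave-one-out validation and false positive/false negative rates as the test criteria. The assumptions are mine: a simulated two-class data set with one feature, and a 1-nearest-neighbor classifier chosen only because it needs no fitting code. Each case is predicted from all the other cases, so the loop runs once per record.

```python
import random


def nearest_neighbor_label(train, x):
    """Predict the label of the training case whose feature is closest to x."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def leave_one_out_predictions(data):
    """For each case, predict its label using every other case as training data."""
    predictions = []
    for i, (x, _) in enumerate(data):
        train = data[:i] + data[i + 1:]  # everything except case i
        predictions.append(nearest_neighbor_label(train, x))
    return predictions


def error_rates(actual, predicted):
    """Return (false positive rate, false negative rate) for 0/1 labels."""
    false_pos = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    false_neg = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return false_pos / actual.count(0), false_neg / actual.count(1)


# Hypothetical data: 50 negatives centered at 0 and 50 positives centered at 3.
rng = random.Random(3)
data = [(rng.gauss(0, 1), 0) for _ in range(50)] + \
       [(rng.gauss(3, 1), 1) for _ in range(50)]

actual = [label for _, label in data]
predicted = leave_one_out_predictions(data)
fp_rate, fn_rate = error_rates(actual, predicted)
print("LOO false positive rate:", fp_rate)
print("LOO false negative rate:", fn_rate)
```

With 100 cases this is instant; the cost grows quickly with the number of records when each round requires a real model fit, which is why 10-fold CV is the more common choice.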
Cross-validation should become a staple evaluation method for every predictive modeler. Implementing 10-fold cross-validation with modern statistical software such as R is straightforward. Most new predictive modeling packages in R come equipped with CV, but it's also easy to program cross-validation from scratch. So there are no excuses for analysts not being attentive to testing and the dangers of overfitting with their models. What's a p-value, anyway?