I'm getting ready to start another predictive modeling effort and decided to turn to several trusted stats books for a quick review. Three favorites are Maindonald and Braun's Data Analysis and Graphics Using R; The Elements of Statistical Learning (ESL), by Hastie, Tibshirani and Friedman; and Frank Harrell's Regression Modeling Strategies (RMS). Together, the books provide a nice balance of theory and practice, and of statistical inference and statistical learning.
I didn't even get past the Preface to RMS before I started taking notes on important considerations for planning my new prediction studies. Indeed, I found the emphases spot on, even though I'm not certain whether I'll use the regression models that Frank espouses or the statistical learning models of ESL.
The following are nuggets of wisdom from RMS for planning and executing modeling studies, along with a statistical blogger's commentary:
- The cost of data collection outweighs the cost of data analysis. This means it's critical to maximize the value of the data in hand and to analyze it judiciously. It also underscores the oft-heard warning from Predictive Analytics World that data quality is perhaps the leading risk/success factor for predictive analytics projects.
- Prudent handling of missing data is critical. Simply deleting cases that have missing attributes can lead to coefficient estimates that are terribly biased, grossly inefficient, or both. There are well-developed methodologies and statistical procedures for imputing missing values that should be a part of the analyst's arsenal.
- Mean squared error, which equals variance plus squared bias, is a common criterion for evaluating a model. Statisticians often look first for unbiased estimates, but in many cases it may be better to accept a small amount of bias in exchange for reduced variance.
- Analysts need to pay special attention to non-linearity and non-additivity in their models. The careless deployment of simple linear models is often a by-product of the regression capabilities of BI tools. A misspecified model may lead to erroneous predictions and conclusions. Techniques like cubic splines are available for testing for these complications and incorporating them in standard models.
- Graphical methods to support the understanding of complex models are critical. The connection of predictive models to graphics is particularly strong in R. R's lattice graphics, an implementation of the trellis displays pioneered by William Cleveland, are central to its productivity and popularity.
- Methods for handling large numbers of predictors are central to today's predictive models. Fortunately, there are answers like data reduction methods (e.g., principal components) from the multivariate statistics world, as well as Least Angle Regression (LARS), the Lasso, Random Forests, and Gradient Boosting from statistical learning.
- Overfitting is a common problem. Model validation approaches that include the bootstrap and cross-validation are now central to estimating and testing. The stepwise regression procedures I learned in grad school 30 years ago are now non grata in the prediction world. Fortunately, resampling techniques that are part and parcel of modern statistical practice have come to the rescue.
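
The bias-variance point in the list above is easy to check with a small simulation. What follows is a minimal sketch of my own, not something taken from RMS: it compares the ordinary sample mean with a deliberately shrunken (and therefore biased) version of it. The function name `simulate`, the shrinkage factor of 0.8, and the settings for the true mean, standard deviation, and sample size are all assumptions chosen purely for illustration.

```python
import random
import statistics


def simulate(shrink, true_mean=1.0, sd=2.0, n=10, reps=20000, seed=0):
    """Estimate bias, variance, and MSE of the estimator shrink * sample_mean.

    All parameter values are illustrative assumptions, not from RMS.
    """
    rng = random.Random(seed)  # fixed seed so both estimators see the same samples
    estimates = []
    for _ in range(reps):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        estimates.append(shrink * statistics.fmean(sample))

    bias = statistics.fmean(estimates) - true_mean
    variance = statistics.pvariance(estimates)  # spread of estimates around their own mean
    mse = statistics.fmean([(e - true_mean) ** 2 for e in estimates])
    return bias, variance, mse


# shrink = 1.0 is the usual unbiased sample mean; shrink = 0.8 pulls it toward zero.
unbiased = simulate(1.0)
shrunk = simulate(0.8)

for label, (bias, variance, mse) in (("unbiased", unbiased), ("shrunk", shrunk)):
    print(f"{label:9s} bias={bias:+.3f} variance={variance:.3f} "
          f"bias^2+variance={bias ** 2 + variance:.3f} mse={mse:.3f}")
```

Under these assumed settings, theory says the unbiased mean has MSE equal to its variance, 2²/10 = 0.4, while the shrunken estimator has variance 0.64 × 0.4 = 0.256 and squared bias 0.2² = 0.04, for an MSE of 0.296. The simulated figures should land close to those values. The trade is not free in general: how much shrinkage helps depends on the true mean and the noise level, neither of which is known in a real study.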