Predictive Models, Mars to Earth – Part 1
I'm not sure exactly why, but predictive analytics seems to be front and center for OpenBI as winter turns to spring. Existing customers, having built their BI platforms, appear ready to elevate from retrospective dashboards and OLAP cubes to prospective predictive models. And new customers are prioritizing analytics earlier now, instead of relegating models to “phase 2 – after we've built the platform.”
IBM's “Smarter Planet” initiative and SPSS acquisition are certainly helping to move analytics to the forefront, as is the enhanced modeling attention of Oracle, Microsoft and SAP. The R Project for Statistical Computing continues to dazzle in the open source world, with exciting new leadership at Revolution Computing promising to align commercial R with business needs. And let's not forget analytics juggernaut SAS, who recently reported yet another record year.
I'm personally jazzed about predictive analytics as well. I just returned from the highly successful Predictive Analytics World in San Francisco. Three excellent new books covering the latest developments surrounding the business, theory, and application of predictive models are prominent on my desk. I'm now collaborating with R commercial vendor Revolution Computing and SAS compiler company World Programming Systems, while investigating statistical/data mining vendors Salford Systems and StatSoft, to meet the enhanced demand OpenBI sees in the analytics market. I'm also intrigued by visualization leader Spotfire, whose new 3.1 release includes statistics services to access R-based capabilities. I'll evaluate that high-profile marriage in future blogs.
A month or so ago, I checked on the Machine Learning Task View to get a handle on the latest predictive learning methods developed by the R community. In addition to traditional modeling tools like multiple and logistic regression, the R project has developed more machine learning packages than you can imagine. One that's really caught my fancy is Multivariate Adaptive Regression Splines (Mars), implemented as the “earth” package in R.
Mars is a great illustration of statistical learning, an analytic discipline championed at Stanford that sits between traditional statistical modeling and the machine learning of computer and mathematical science. According to Stanford statisticians Trevor Hastie, Robert Tibshirani and Jerome Friedman, “Mars is an adaptive procedure for regression, and is well suited for high-dimensional problems (i.e., a large number of inputs). It can be viewed as a generalization of stepwise linear regression or a modification of the CART procedure to improve the latter's performance in the regression setting.” Richard Berk, author of Statistical Learning from a Regression Perspective, notes that in contrast to traditional least squares regression: “In statistical learning, there is far less reliance on prior information when functional forms are determined to link predictors to the response ... the functional forms are, by and large, arrived at inductively from the data.”
Inductive and adaptive well characterize Mars and statistical learning in general. “Spline” is a term borrowed from engineering where, before computers, draftsmen used long, flexible strips of plastic or metal called splines to draw curves that changed shape at points called knots. Mars adapts to its data in a spline-like way by fitting piecewise linear basis functions of the explanatory variables that best predict the dependent variable. Together, the many fitted pieces often resemble a polynomial curve. An initial forward pass in the Mars algorithm chooses the predictors and accompanying basis, or hinge, functions so as to maximize the reduction in the residual sum of squares. The process continues until a given number of terms is reached or the reduction in residual error is no longer meaningful. A major advantage of Mars is the routinized (and tunable) handling of curvilinearity and interactions among predictors, both of which are challenges with parametric linear regression. At the same time, Mars is flexible in providing modelers the option to enter predictors linearly, in which case the results look a lot like least squares regression.
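To make the hinge-function idea concrete, here is a minimal sketch in plain Python of one simplified forward-pass step on a single predictor. It is not the earth package's code: the function names (hinge, fit_least_squares, best_knot) and the toy data are my own assumptions for illustration. For each candidate knot it fits an intercept plus a mirrored pair of hinge functions by least squares and keeps the knot that leaves the smallest residual sum of squares.

```python
# Hypothetical sketch of one simplified Mars forward-pass step on one predictor.
# All names and data here are illustrative, not taken from any library.

def hinge(x, knot, sign):
    """Hinge basis function: max(0, x - knot) for sign=+1, max(0, knot - x) for sign=-1."""
    return max(0.0, sign * (x - knot))

def fit_least_squares(rows, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    for col in range(p):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(p):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    return [a[i][p] / a[i][i] for i in range(p)]

def residual_ss(rows, y, coefs):
    """Residual sum of squares of the fitted linear combination."""
    return sum((yi - sum(c * v for c, v in zip(coefs, r))) ** 2
               for r, yi in zip(rows, y))

def best_knot(x, y):
    """Try each interior observed x value as a knot; keep the lowest-RSS fit."""
    best = None
    for knot in sorted(set(x))[1:-1]:
        rows = [[1.0, hinge(xi, knot, +1), hinge(xi, knot, -1)] for xi in x]
        coefs = fit_least_squares(rows, y)
        rss = residual_ss(rows, y, coefs)
        if best is None or rss < best[1]:
            best = (knot, rss, coefs)
    return best

# Toy data with a kink at x = 5: flat at 2, then rising with slope 3.
x = [i * 0.5 for i in range(21)]
y = [2.0 + 3.0 * max(0.0, xi - 5.0) for xi in x]

knot, rss, coefs = best_knot(x, y)
print("knot:", knot, "rss:", rss, "coefficients:", coefs)
```

On this noise-free toy data the search recovers the kink at 5. A real forward pass repeats the same search across every predictor, and across products with terms already in the model when interactions are allowed, adding the winning pair each round.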
After the forward pass, the usually large (many predictors) model that remains is almost always overfit to its training data, the “victim” of an effective adaptive modeling process. Overfitting is odious since the “model” that emerges from training will generally not project to new data. To counter it, the model from the forward pass is then “pruned”, much as trees are with CART (Classification and Regression Trees): the least effective term is removed at each step, one at a time, until the best submodel is found using a criterion called generalized cross-validation (GCV), which penalizes residual error by model complexity. Models can then be further cross-validated with test data to assess their fidelity. Analysts can also address overfitting by proactively limiting the number of terms in the model and “penalizing” new entrants on the forward pass.
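As a rough illustration of how GCV trades fit against complexity, here is a small Python sketch. It assumes the commonly cited form of the criterion, in which the average squared residual is divided by (1 - C/N) squared, with the effective number of parameters C taken as the number of terms plus a penalty per knot. The function name, the penalty default of 3, and the candidate submodels are made-up values for illustration, not output from any package.

```python
def gcv(rss, n_obs, n_terms, n_knots, penalty=3.0):
    """Generalized cross-validation score: mean squared residual inflated by complexity.

    effective_params = n_terms + penalty * n_knots is an assumed, commonly
    used form; the penalty is a tuning choice, not a fixed constant.
    """
    effective_params = n_terms + penalty * n_knots
    if effective_params >= n_obs:
        return float("inf")  # model too complex for the data to score
    return (rss / n_obs) / (1.0 - effective_params / n_obs) ** 2

# Hypothetical submodels from a backward pass: (number of terms, knots, RSS).
candidates = [(1, 0, 500.0), (3, 1, 120.0), (5, 2, 100.0), (9, 4, 95.0)]
n_obs = 50

scores = [gcv(rss, n_obs, terms, knots) for terms, knots, rss in candidates]
best = min(range(len(candidates)), key=scores.__getitem__)

for (terms, knots, rss), score in zip(candidates, scores):
    print(f"terms={terms} knots={knots} rss={rss:.1f} gcv={score:.3f}")
print("selected submodel:", candidates[best])
```

With these made-up numbers the three-term submodel is selected even though the larger submodels have lower residual error, which is the point of pruning: the extra terms do not reduce the error enough to pay for their added complexity.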
Mars has a lot to offer as a predictive modeling mainstay. Though it's a non-parametric technique that makes no assumptions about how dependent variables relate to predictors, Mars feels a lot like traditional least squares regression, albeit with much more flexibility, and is easier to interpret than “black box” machine learners like neural nets and random forests. That it can handle continuous and categorical independent and dependent variables makes Mars powerfully general-purpose. Like CART and stepwise regression, Mars computations can be automated, with modelers having to choose only input variables and tuning parameters. In addition, Mars offers a reasonable compromise to the bias/variance conundrum and has developed a justified reputation for predictive accuracy – though not quite as good as the more computationally intensive bagging or boosting. And Mars is pretty efficient, able to handle “large” models in a reasonable amount of time and computer resources – dependent, of course, on execution options.
I'm in the process of evaluating Mars and other statistical learning models with R and commercial software from Salford Systems and StatSoft. My working data set consists of over 500,000 census records comprising demographic and job-related attributes of individuals, including age, sex, race, marital status, education, annual wages, region of residence, job category, health status, health insurance status, etc. Over the next few months, I'll report some of my impressions from predicting wages as a function of the other attributes using the different statistical packages and modeling techniques like Mars.