I received an email from software vendor Salford Systems the other day announcing new training classes for its data mining and predictive analytics product SPM, the Salford Predictive Modeler. SPM 7 supports a number of the latest and most powerful statistical learning models developed by academics at Stanford and Cal, two of the top-rated statistics departments (hence, perhaps, the name Salford). The "bible" that underpins much of this work, "The Elements of Statistical Learning," was co-authored by Stanford statistics faculty Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
SPM is all about supervised learning, in which a number of inputs (predictors, independent variables, or features) are used to predict outputs (responses or dependent variables). Features and responses can be either qualitative, such as gender, or quantitative, like income. Problems with a qualitative response are known as classification; those with a quantitative output are called regression. The models supported by SPM readily handle both.
I was very high on the demo version of SPM I tested a while back. SPM's models are state-of-the-art. I also found the GUI easy to navigate and liked the way the interface "leads" the user down a methodologically sound path, probing for model options with reasonable defaults. Clean and simple. In addition, SPM systematically addresses the dreaded problem of overfitting by promoting both random splits into training and test data sets and cross-validation.
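For readers working outside SPM, both of those overfitting guards take only a few lines of base R. This is a minimal sketch, assuming 506 observations (the Boston Housing sample size used later) and a conventional 80/20 split with 10 cross-validation folds:

```r
# Two guards against overfitting: a random train/test split
# and k-fold cross-validation fold assignment, in base R.
set.seed(123)                                    # make the random draws reproducible
n <- 506                                         # e.g., the Boston Housing sample size

train_idx <- sample(n, size = round(0.8 * n))    # 80% of rows for training
test_idx  <- setdiff(seq_len(n), train_idx)      # remaining 20% held out for testing

# Assign each observation to one of k = 10 cross-validation folds
k <- 10
folds <- sample(rep(seq_len(k), length.out = n)) # near-equal fold sizes, shuffled
val_rows <- which(folds == 1)                    # e.g., fold 1's validation rows
```

The held-out test rows are touched only once, at the end, to estimate generalization error; the folds drive model selection along the way.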
The models comprising SPM 7 include CART (Classification and Regression Trees), MARS (Multivariate Adaptive Regression Splines), Random Forests, and TreeNet gradient boosting, all of which have a significant Stanford and Berkeley legacy. Indeed, Salford holds a trademark on the name MARS, a technique developed by Stanford's Friedman. SPM comes in a variety of packages, the more "exclusive" of which include even more sophisticated modeling variations.
I've found the Salford tutorial videos quite informative and would recommend them to anyone desiring an introduction to statistical learning, even if they're not SPM users. The data sets used in the examples are available for download and can serve either SPM or another SL tool. For me, that other tool is the R Project for Statistical Computing.
It turns out my two favorite learning models are MARS and Random Forests, both of which are also available in R. In a nod to the trademark protection, the R adaptive regression splines variant lives in the "earth" package, while RF is implemented in the randomForest package. Both are mature packages heavily used by the R community.
In contrast to SPM 7, which leads the analyst down a sound path with a knowledgeable GUI driven by reasonable defaults, the R programmer gets the defaults but must otherwise fend for herself in R scripts. Even those with a modicum of R programming experience, however, can readily translate from SPM 7's GUI to R function calls.
The Salford MARS tutorial involves the ubiquitous Boston Housing data set. The regression challenge is to predict median housing value from a sample of 506 observations from the early 1970s as a function of age, number of rooms, local crime rate, percent lower-status population, et al. MARS is an especially appropriate algorithm here, since the relationship between the response and the features involves non-linearities readily handled by its basis functions.
Using the earth package in R, I was able to produce computations similar to the tutorial's. Since I randomly split the entire data set into an 80% train and 20% test partition, my exact results were somewhat different, but the feature importance rankings and plotted basis functions for each variable were aligned.
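The whole exercise fits in a short script. This is a sketch of the sort of run described above, assuming the CRAN earth package is installed; the Boston data comes from MASS, which ships with R (the seed and split are mine, so exact numbers will differ from anyone else's run):

```r
# MARS-style regression on Boston Housing via the earth package
library(MASS)     # supplies Boston: 506 rows, medv = median home value
library(earth)    # adaptive regression splines; also attaches plotmo

set.seed(42)
train_idx <- sample(nrow(Boston), size = round(0.8 * nrow(Boston)))
train <- Boston[train_idx, ]      # 80% training partition
test  <- Boston[-train_idx, ]     # 20% held-out test partition

fit <- earth(medv ~ ., data = train)    # hinge basis functions selected automatically
evimp(fit)                              # variable importance ranking
pred <- predict(fit, newdata = test)
mean((test$medv - pred)^2)              # test-set mean squared error
plotmo(fit)                             # per-variable basis-function plots
```

The evimp() ranking and the plotmo() panels are the R counterparts of the importance table and basis-function plots shown in the Salford tutorial.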
The Random Forest tutorial involves a prostate cancer classification problem that uses 111 characteristics of 772 tissue samples to predict whether cancer is present and, if so, which type. The target variable takes three values: 0 indicating benign, 1 indicating one type of malignant cancer, and 2 indicating another.
Using R's randomForest package, I was able to follow the Salford tutorial steps pretty faithfully. Since sampling is part and parcel of the Random Forest algorithm, my findings were somewhat different from theirs. Seventeen of the twenty most important variables identified in the Salford tutorial were replicated in mine. Interesting, though, were the divergences in the Prediction Success Tables, aka confusion matrices. Overall, both models were close in accuracy, but my run was more successful at predicting benign when in fact there was no cancer, and trailed the Salford model badly in predicting the first type of cancer. Run either model a second time and the results would again differ somewhat.
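The prostate tissue data from the Salford tutorial isn't shipped with R, so this minimal sketch runs the same steps on iris, a built-in data set that is likewise a three-class problem, assuming the CRAN randomForest package is installed:

```r
# Random Forest classification with importance ranking and confusion matrix
library(randomForest)

set.seed(7)   # sampling is built into the algorithm, so fix the seed to reproduce a run
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

importance(rf)    # per-class and overall variable importance
rf$confusion      # out-of-bag confusion matrix, aka Prediction Success Table
```

With the real tissue data, importance(rf) is where the top-twenty ranking would come from, and rf$confusion is the table whose benign/malignant cells diverged between my run and Salford's.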
I would encourage statistical newbies looking to learn the right way to develop and deploy the latest statistical learning techniques to experiment with the SPM demo software and work through the tutorials with the data sets provided. SPM is a good choice for your supervised modeling needs, but even if you decide to go with other SL variants from platforms such as R or SAS, you can still learn a lot about the modeling process from Salford.