I started to panic thinking about my December OpenBI Forum column. A procrastinator, I usually get a bit nervous as the submission deadline approaches, but this time was different. My column is posted the fourth Thursday of each month and is due for edit a week earlier. This year, however, the fourth Thursday falls late in the month, after Christmas. And with a title that included Stocking Stuffer, I was justifiably concerned about timing. Not to worry. My buddy, DMReview Editor-in-Chief Mary Jo Nott, bumped me up a week in the queue so I could get out before the 25th. Now, with MJ's reprieve, all of you last-minute shoppers stumped for presents for geeky business intelligence (BI) practitioner loved ones have an answer to your gift-finding dilemma - a great statistics book available for next-day delivery through Amazon. Hurry, supply is limited.
There are all kinds of statistics texts on the market. At one extreme are the impenetrable mathematical theory stats books. One sighting of a triple integral is enough for me. At the other extreme are statistics for the clueless - books that use an entire chapter to define the mundane correlation coefficient. They seem a waste of paper. Somewhere in between are some good texts, mathematical enough, but applied, with scores of illustrations of modern methods suitable for business applications. Increasingly, these texts provide complete code solutions using state-of-the-art statistical packages like SAS, S-Plus and open source R. The code alone is often worth the price of these publications.
Even among those in-between texts there's variation. In the S-Plus and R worlds, Modern Applied Statistics by Venables and Ripley is king.[1] Written at the level of Oxford master's degree students, however, it is not for the faint of heart. Similarly, Frank Harrell's beefy Regression Modeling Strategies, with generous doses of theory, examples and code, derives from master's-level coursework.[2] Dalgaard's Introductory Statistics with R, on the other hand, is more a book on R with statistical examples to showcase the language.[3]
Data Analysis and Graphics Using R, by Maindonald and Braun, sits between these as an excellent intermediate-level text highly relevant to the BI world and suitable for readers with little more than an intro-to-stats background.[4] The catalog of methods they discuss is uncannily pertinent to BI. "The strengths of this book include the directness of its encounter with research data, its advice on practical data analysis issues, the inclusion of code that reproduces analyses, careful critiques of analysis results, attention to graphical and other presentation issues, and the use of examples drawn from across the range of statistical applications," the authors conclude in the preface.
Chapter one provides an extensive introduction to the R language, focusing on functions, data types, programming constructs, objects, data access and management and graphics. To the benefit of readers, the authors are obsessed with graphs, detailing the use of both base and lattice (dimensional) plots. One of the major benefits of the object orientation of R and S-Plus is the kinship of graphics and statistical functions. Chapter 14, at the conclusion of the book, provides additional discussion of R, noting much of what comprises an intermediate level of understanding. Though a bit more terse than Dalgaard's Introductory Statistics with R, Maindonald and Braun's exposition of the R language is nonetheless first rate.
The authors put their freshly explicated language to good use in a chapter on styles of data analysis, which borrows liberally from the exploratory data analysis (EDA) tenets of the late John Tukey. With its emphasis on examining data without preconceived notions, and on basic summarization techniques accompanied by dimensional stem-and-leaf plots, boxplots, histograms, density plots, scatterplots, dotplots and time series plots, this chapter is worth close scrutiny by BI analysts looking to design dashboards to measure corporate performance.
Four full chapters are devoted to the topics of regression analysis and linear models, perhaps the statistical focus of most importance to BI analysts. While not quite the exhaustive treatment of Regression Modeling Strategies, the material is still comprehensive, building from elementary regression with a single predictor, to multiple linear regression, to the treatment of factors/indicator independent variables, to smoothing techniques, to generalized linear models and logistic regression, to ordinal regression and, finally, to survival analysis. For each technique, the authors articulate statistical best practices, following up with solid examples and pertinent R code. Critical to their approach is persistent demonstration of the tight linkages between statistical procedures and graphs.
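For readers who want to see the starting point of that progression in miniature, here is a rough sketch of regression with a single predictor, fitted by ordinary least squares. It is written in plain Python rather than the R the book uses, and the function names and the toy numbers are my own assumptions for illustration, not anything taken from the text.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("predictor has no variation")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Share of the variation in y accounted for by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical figures: ad spend (x) against units sold (y).
    spend = [1.0, 2.0, 3.0, 4.0, 5.0]
    units = [2.9, 5.2, 6.8, 9.1, 11.0]
    b0, b1 = fit_line(spend, units)
    print("intercept:", round(b0, 3), "slope:", round(b1, 3))
    print("R-squared:", round(r_squared(spend, units, b0, b1), 3))
```

Multiple regression, factors and generalized linear models all build on this same least-squares idea, which is why a plot of the fitted line against the raw points is such a natural companion to the numbers.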
After a quick chapter on time series models that illustrates useful basic techniques and code, the authors undertake a discussion of multilevel models and repeated measures that is particularly timely for BI. The Journal Report in the Wall Street Journal of December 1 to 2, 2007 has a fascinating article, entitled Raising Your Market IQ, that promotes the use of panel or longitudinal surveys for marketing intelligence. Such marketing panels are sweeping, detailed and continuing surveys of a large, carefully selected group of consumers who reflect a statistically reliable sample of a much larger market ... taken every six to 12 months or so.[5] A well-known example is the tracking of television viewing habits by Nielsen Co. Panels are a much richer source of over-time information about the buying population than what is available from existing customer activity databases, one-time surveys and focus groups. At the same time, panels are expensive and must continually demonstrate appropriate return on investment. The statistical models discussed in this chapter - random effects, nested, clustering, repeated measures and multilevel designs - are well suited to making sense of this type of over-time data.
Maindonald and Braun expand the horizon of statistical analysis to include coverage of the popular tree-based classification and regression machine learning techniques. The treatment of the rpart and randomForest procedures is a godsend for R users: the packages are extensions to the core language, and comprehensive examples and documentation for them are difficult to find. The authors also contrast tree-based approaches with traditional statistical regression, noting that machine learning is appropriate for large data sets with limited assumptions needed for inference, while statistical models are appropriate for smaller data sets with stronger parametric requisites.
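To give a feel for what a regression tree does at each node, here is a minimal sketch of a single split, or "stump": try every cut point on one predictor and keep the one that leaves the smallest total squared error on the two sides. This is only in the spirit of tree methods, not the rpart implementation or the authors' code; the function names and the toy data are assumptions of mine.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Find the threshold on x that minimizes SSE(left) + SSE(right).

    Returns (threshold, cost, left_mean, right_mean), or None when x takes
    a single value and no split is possible.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[1]:
            best = (threshold, cost,
                    sum(left) / len(left), sum(right) / len(right))
    return best


if __name__ == "__main__":
    # Hypothetical data: customer tenure in months against monthly spend.
    tenure = [1, 2, 3, 10, 11, 12]
    spend = [5.0, 6.0, 5.0, 20.0, 21.0, 19.0]
    print(best_split(tenure, spend))
```

A full tree repeats this search on each side of the cut, and a random forest averages many such trees grown on resampled data. Note that nothing here assumes a straight-line relationship, which is the "limited assumptions" point in the contrast above.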
The survey of multivariate data analysis includes examination of several methods used to reduce dimensionality in data sets important for BI. The authors first introduce multidimensional scatterplots and perspective plots, providing a visual point of departure for principal component analysis. They then discuss the discriminant function, a multivariate alternative to logistic regression and classification trees, and outline best-practice approaches to using training and testing data when applying the techniques in the field. The reduced-dimension variables that result from these models can then be used as input for predictive models, either regressions or trees.
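As a small illustration of dimension reduction, the sketch below finds the first principal component of two-variable data using the closed-form solution for a 2x2 covariance matrix, then collapses each point to a single score along that direction. It is a toy under my own assumptions (function name, data, two variables only), not the authors' R code; real analyses with many variables would rely on a statistical package.

```python
import math


def first_principal_component(points):
    """Return (direction, explained_share, scores) for 2-D points.

    direction is the unit vector of greatest variance, explained_share is
    the fraction of total variance it captures, and scores are the centered
    points projected onto that direction.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sample variances and covariance.
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    if sxx + syy == 0:
        raise ValueError("data have no variation")
    # Angle of the leading eigenvector of [[sxx, sxy], [sxy, syy]].
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))
    # Largest eigenvalue of the same matrix.
    largest = (sxx + syy) / 2.0 + math.hypot((sxx - syy) / 2.0, sxy)
    explained_share = largest / (sxx + syy)
    scores = [(x - mean_x) * direction[0] + (y - mean_y) * direction[1]
              for x, y in points]
    return direction, explained_share, scores


if __name__ == "__main__":
    # Hypothetical, strongly correlated measures for six customers.
    data = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.2),
            (4.0, 3.8), (5.0, 5.1), (6.0, 6.0)]
    direction, share, scores = first_principal_component(data)
    print("direction:", direction)
    print("share of variance explained:", round(share, 3))
```

The scores are the reduced-dimension variable: one column standing in for two, ready to feed a regression or a tree.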
The final multivariate technique discussed, propensity scores, is an approach near and dear to marketing analysts everywhere. The OpenBI Forum has long espoused the use of randomized experiments as an optimal way of determining the effects of a given strategy. There are many situations, however, for which randomization is impractical or even unethical. In instances where treatment group assignment is determined non-randomly, there's the potential for systematic bias that would invalidate comparisons. If a marketing offer is given to group A only, with group B measured as control, without random assignment it could well be the case that groups A and B are different out of the gate - and that difference, not the offer itself, might cause the measured response.
Propensity scoring is a statistical technique that attempts to adjust for this potential confounding by consolidating information on other pertinent variables measured on study participants. In its simplest form, covariates measured for all participants are used to predict the group, treatment A or control B, to which an individual belongs. The propensity scores associated with the predictions are then used to adjust the comparison of treatment and control offer responses. If differences between the groups, and not the offer itself, are responsible for the noted difference, the inclusion of propensity scores will dampen the treatment effect. By contrast, if the offer and not group differences determines response, the addition of propensity scores to the model will have little impact on the results. Maindonald and Braun outline a propensity score adjustment method using logistic regression that has applicability to a wide range of business analyses subject to non-experimental contamination. And, it goes without saying, the authors provide code and graphics to implement their approach.
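The dampening described above can be seen in a small simulation. The sketch below is not the authors' method or code: it is plain Python on made-up data, with a hand-rolled one-covariate logistic regression for the propensity score and stratification into score quintiles standing in as the adjustment step. All function names, parameter values and the data-generating story are my assumptions. The simulated offer has no true effect, yet group A looks better simply because customers with a higher covariate are more likely both to get the offer and to respond.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean(values):
    return sum(values) / len(values)


def fit_logistic(xs, groups, steps=300, lr=0.5):
    """Fit P(treated | x) = sigmoid(b0 + b1 * x) by gradient ascent."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, g in zip(xs, groups):
            err = g - sigmoid(b0 + b1 * x)
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


def simulate(n=1000, seed=1):
    """Hypothetical data: covariate x drives both who gets the offer and
    the response; the offer itself has NO true effect."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        treated = 1 if rng.random() < sigmoid(1.5 * x) else 0
        response = 2.0 * x + rng.gauss(0.0, 1.0)
        rows.append((x, treated, response))
    return rows


def naive_effect(rows):
    """Raw difference in mean response, treatment minus control."""
    treated = [r for _, g, r in rows if g == 1]
    control = [r for _, g, r in rows if g == 0]
    return mean(treated) - mean(control)


def stratified_effect(rows, n_strata=5):
    """Compare groups within propensity-score strata, then average."""
    b0, b1 = fit_logistic([x for x, _, _ in rows], [g for _, g, _ in rows])
    scored = sorted(rows, key=lambda row: sigmoid(b0 + b1 * row[0]))
    size = len(scored) // n_strata
    total = weight = 0.0
    for k in range(n_strata):
        end = (k + 1) * size if k < n_strata - 1 else len(scored)
        stratum = scored[k * size:end]
        treated = [r for _, g, r in stratum if g == 1]
        control = [r for _, g, r in stratum if g == 0]
        if treated and control:  # skip strata missing one of the groups
            total += len(stratum) * (mean(treated) - mean(control))
            weight += len(stratum)
    return total / weight


if __name__ == "__main__":
    rows = simulate()
    print("naive difference:     ", round(naive_effect(rows), 2))
    print("propensity-stratified:", round(stratified_effect(rows), 2))
```

Comparing like with like inside each stratum removes most of the spurious advantage, though five coarse strata leave a little residual bias. Including the score as a covariate in a regression model is another common way to make the adjustment.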
From the OpenBI Forum to stocking stuffers everywhere, Happy Holidays!
1. W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Springer-Verlag. Fourth Edition, 2002.
2. Frank E. Harrell, Jr. Regression Modeling Strategies, With Applications to Linear Models, Logistic Regression, and Survival Analysis. Springer-Verlag. 2001.
3. Peter Dalgaard. Introductory Statistics with R. Springer-Verlag. 2002.
4. John Maindonald and John Braun. Data Analysis and Graphics Using R - An Example-Based Approach. Cambridge University Press. Second Edition, 2007.
5. Calvin P. Duncan, Constance M. O'Hare and John M. Matthews. Raising Your Market IQ. The Wall Street Journal. December 1 to 2, 2007.