I started to panic thinking about my December OpenBI Forum column. A procrastinator, I usually get a bit nervous as the submission deadline approaches, but this time was different. My column is posted the 4th Thursday of each month and is due for edit a week earlier. This year, however, the 4th Thursday falls late in the month, after Christmas. And with a title that included Stocking Stuffer, I was justifiably concerned about timing. Not to worry. My buddy, DMReview Editor-in-Chief Mary Jo Nott, bumped me up a week in the queue so I could get out before the 25th. Now, with MJ's reprieve, all of you last-minute shoppers stumped for presents for geeky business intelligence (BI) practitioner loved ones have an answer to your gift-finding dilemma - a great statistics book available for next-day delivery through Amazon. Hurry, supply is limited.
There are all kinds of statistics texts on the market. At one extreme are the impenetrable mathematical theory stats books. One sighting of a triple integral is enough for me. At the other extreme are statistics for the clueless - books that use an entire chapter to define the mundane correlation coefficient. They seem a waste of paper. Somewhere in between are some good texts, mathematical enough, but applied, with scores of illustrations of modern methods suitable for business applications. Increasingly, these texts provide complete code solutions using state-of-the-art statistical packages like SAS, S-Plus and open source R. The code alone is often worth the price of these publications.
Even among those in-between texts there's variation. In the S-Plus and R worlds, Modern Applied Statistics by Venables and Ripley is king.1 Written at the level of Oxford master's degree students, however, it is not for the faint of heart. Similarly, Frank Harrell's beefy Regression Modeling Strategies, with generous doses of theory, examples and code, derives from master's-level coursework.2 Dalgaard's Introductory Statistics with R, on the other hand, is more a book on R with statistical examples to showcase the language.3
Data Analysis and Graphics Using R, by Maindonald and Braun, sits between these as an excellent intermediate-level text highly relevant to the BI world and suitable for readers with little more than an intro-to-stats background.4 The catalog of methods they discuss is uncanny in its pertinence to BI. In the preface, the authors list the book's strengths: the directness of its encounter with research data, its advice on practical data analysis issues, the inclusion of code that reproduces analyses, careful critiques of analysis results, attention to graphical and other presentation issues, and the use of examples drawn from across the range of statistical applications.
Chapter 1 provides an extensive introduction to the R language, focusing on functions, data types, programming constructs, objects, data access and management, and graphics. To the benefit of readers, the authors are obsessed with graphs, detailing the use of both base and lattice (dimensional) plots. One of the major benefits of the object orientation of R and S-Plus is the kinship of graphics and statistical functions. Chapter 14, at the conclusion of the book, provides additional discussion of R, noting much of what constitutes an intermediate level of understanding. Though a bit terser than Dalgaard's Introductory Statistics with R, Maindonald and Braun's exposition of the R language is nonetheless first rate.
The authors put their freshly explicated language to good use in a chapter on styles of data analysis, which borrows liberally from the exploratory data analysis (EDA) tenets of the late John Tukey. With its emphasis on examining data without preconceived notions, and on using basic summarization techniques along with accompanying dimensional stem-and-leaf plots, boxplots, histograms, density plots, scatterplots, dotplots and time series plots, this chapter is worth close scrutiny by BI analysts looking to design dashboards to measure corporate performance.
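For readers who want a feel for the kind of basic summarization EDA starts with, here is a minimal sketch of the five-number summary that underlies a boxplot, along with Tukey's 1.5 x IQR rule for flagging unusual points. It is written in Python rather than the R used in the book, it is not taken from the book, and the function names and the sales figures are invented purely for illustration.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    ordered = sorted(values)
    # "inclusive" treats the data as the whole population of interest;
    # other quartile conventions give slightly different answers.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


def boxplot_outliers(values):
    """Flag points more than 1.5 * IQR beyond the quartiles."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical monthly sales figures, invented for illustration only.
monthly_sales = [12, 15, 14, 13, 16, 15, 14, 48, 13, 15, 14, 16]

summary = five_number_summary(monthly_sales)
outliers = boxplot_outliers(monthly_sales)
print("min, Q1, median, Q3, max:", summary)
print("points a boxplot would flag:", outliers)
```

On a dashboard, a summary like this is what sits behind each box in a boxplot, and the flagged points are the ones an analyst would want to look at before trusting an average.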
Four full chapters are devoted to regression analysis and linear models, perhaps the statistical focus of most importance to BI analysts. While not quite the exhaustive treatment of Regression Modeling Strategies, the material is still comprehensive, building from elementary regression with a single predictor, to multiple linear regression, to the treatment of factor and indicator independent variables, to smoothing techniques, to generalized linear models and logistic regression, to ordinal regression and, finally, to survival analysis. For each technique, the authors articulate statistical best practices, following up with solid examples and pertinent R code. Critical to their approach is persistent demonstration of the tight linkages between statistical procedures and graphs.
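The starting point of that progression, regression with a single predictor, can be sketched in a few lines. What follows is my own minimal illustration of ordinary least squares, in Python rather than the book's R, and not the authors' code; the function names and the ad spend and revenue numbers are hypothetical.

```python
def fit_simple_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squares of x about its mean, and cross-products of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(x, y, intercept, slope):
    """Observed minus fitted values; plotting these is the usual first check."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical data, invented for illustration only.
ad_spend = [1, 2, 3, 4, 5]
revenue = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = fit_simple_regression(ad_spend, revenue)
resid = residuals(ad_spend, revenue, intercept, slope)
print(f"revenue = {intercept:.2f} + {slope:.2f} * ad_spend")
print("residuals:", [round(r, 2) for r in resid])
```

The residuals are returned separately because the graphical checks the authors stress, such as plotting residuals against fitted values, begin from exactly that list.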