DEC 19, 2007 3:19am ET


A Statistical Stocking Stuffer for the Holidays


I started to panic thinking about my December OpenBI Forum column. A procrastinator, I usually get a bit nervous as the submission deadline approaches, but this time was different. My column is posted on the fourth Thursday of each month and is due for editing a week earlier. This year, however, the fourth Thursday falls late in the month, actually after Christmas. And with a title that included Stocking Stuffer, I was justifiably concerned about timing. Not to worry. My buddy, DMReview Editor-in-Chief Mary Jo Nott, bumped me up a week in the queue so I could get out before the 25th. Now, with MJ's reprieve, all of you last-minute shoppers stumped for presents for the geeky business intelligence (BI) practitioners you love have an answer to your gift-finding dilemma - a great statistics book available for next-day delivery through Amazon. Hurry, supply is limited.


There are all kinds of statistics texts on the market. At one extreme are the impenetrable mathematical theory stats books. One sighting of a triple integral is enough for me. At the other extreme are statistics for the clueless - books that use an entire chapter to define the mundane correlation coefficient. They seem a waste of paper. Somewhere in between are some good texts, mathematical enough, but applied, with scores of illustrations of modern methods suitable for business applications. Increasingly, these texts provide complete code solutions using state-of-the-art statistical packages like SAS, S-Plus and open source R. The code alone is often worth the price of these publications.


Even among those in-between texts there's variation. In the S-Plus and R worlds, Modern Applied Statistics by Venables and Ripley is king.1 Written at the level of Oxford master’s degree students, however, it is not for the faint of heart. Similarly, Frank Harrell's beefy Regression Modeling Strategies, with generous doses of theory, examples and code, derives from master’s degree-level coursework.2 Dalgaard's Introductory Statistics with R, on the other hand, is more a book on R with statistical examples to showcase the language.3


Data Analysis and Graphics Using R, by Maindonald and Braun, sits between these as an excellent intermediate-level text, highly relevant to the BI world and suitable for readers with little more than an intro-to-stats background.4 The catalog of methods they discuss is uncanny in its pertinence to BI. “The strengths of this book include the directness of its encounter with research data, its advice on practical data analysis issues, the inclusion of code that reproduces analyses, careful critiques of analysis results, attention to graphical and other presentation issues, and the use of examples drawn from across the range of statistical applications,” the authors conclude in the preface.


Chapter one provides an extensive introduction to the R language, focusing on functions, data types, programming constructs, objects, data access and management and graphics. To the benefit of readers, the authors are obsessed with graphs, detailing the use of both base and lattice (dimensional) plots. One of the major benefits of the object orientation of R and S-Plus is the kinship of graphics and statistical functions. Chapter 14, at the conclusion of the book, provides additional discussion of R, noting much of what comprises an intermediate level of understanding. Though a bit more terse than Dalgaard's Introductory Statistics with R, Maindonald and Braun's exposition of the R language is nonetheless first rate.


The authors put their freshly explicated language to good use in a chapter on styles of data analysis, which borrows liberally from the exploratory data analysis (EDA) tenets of the late John Tukey. With its emphasis on examining data without preconceived notions and on using basic summarization techniques along with accompanying dimensional stem-and-leaf plots, boxplots, histograms, density plots, scatterplots, dotplots and time series plots, this chapter is worth close scrutiny by BI analysts looking to design dashboards to measure corporate performance.
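
To give a flavor of the basic summarization behind a boxplot, here is a minimal Python sketch. It is not taken from the book, whose examples are in R. The function names and the weekly_sales figures are invented for illustration, the "inclusive" quartile method is just one of several conventions in use, and the 1.5 x IQR fence is the customary Tukey rule of thumb rather than anything the text above specifies.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the five numbers a boxplot draws."""
    data = sorted(values)
    # Quartile conventions differ between packages; "inclusive" is an assumption.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


def flag_outliers(values, fence=1.5):
    """Return the values lying more than `fence` IQRs outside the quartiles."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - fence * iqr, q3 + fence * iqr
    return [v for v in values if v < low or v > high]


if __name__ == "__main__":
    # Hypothetical weekly sales counts, invented for this example.
    weekly_sales = [12, 15, 14, 13, 16, 15, 14, 48]
    print(five_number_summary(weekly_sales))
    print(flag_outliers(weekly_sales))
```

On the invented data, the only value flagged is the 48, which is the kind of point an analyst would want to inspect before putting a summary on a dashboard.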


Four full chapters are devoted to the topics of regression analysis and linear models, perhaps the statistical focus of most importance to BI analysts. While not quite the exhaustive treatment of Regression Modeling Strategies, the material is still comprehensive, building from elementary regression with a single predictor, to multiple linear regression, to the treatment of factors/indicator independent variables, to smoothing techniques, to generalized linear models and logistic regression, to ordinal regression and, finally, to survival analysis. For each technique, the authors articulate statistical best practices, following up with solid examples and pertinent R code. Critical to their approach is persistent demonstration of the tight linkages between statistical procedures and graphs.
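
The starting point of that progression, regression with a single predictor, can be sketched in a few lines. The following is a minimal Python illustration of ordinary least squares, not the book's R code. The function names and the small data set are hypothetical, and the sketch stops well short of the diagnostics and graphs the authors emphasize.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squares of x and sum of cross-products, both about the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x has no variation, so the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Share of the variation in y accounted for by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical predictor and response values, invented for this example.
    x = [1, 2, 3, 4, 5]
    y = [2, 4, 5, 4, 5]
    intercept, slope = fit_line(x, y)
    print(intercept, slope, r_squared(x, y, intercept, slope))
```

Multiple regression, factors and generalized linear models build on the same least squares idea, but are better left to a statistical package than to hand-rolled code.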


After a quick chapter on time series models that illustrates useful basic techniques and code, the authors undertake a discussion of multilevel models and repeated measures that is particularly timely for BI. The Journal Report from the Wall Street Journal of December 1-2, 2007 has a fascinating article, entitled Raising Your Marketing IQ, that promotes the use of panel or longitudinal surveys for marketing intelligence. Such marketing panels are “sweeping, detailed and continuing surveys of a large, carefully selected group of consumers who reflect a statistically reliable sample of a much larger market ... taken every six to 12 months or so.” 5 A well-known example is the tracking of television viewing habits by Nielsen Co. Panels are a much richer source of over-time information about the buying population than what is available from existing customer activity databases, one-time surveys and focus groups. At the same time, panels are expensive and must continually demonstrate appropriate return on investment. The statistical models discussed in this chapter - random effects, nested, clustering, repeated measures and multilevel designs - are well suited to making sense of this type of over-time data.
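
One simple quantity that motivates these models is the intraclass correlation: the share of total variation that comes from stable differences between panel members rather than from wave-to-wave fluctuation within each member. The Python sketch below computes the one-way random-effects ANOVA estimate for a balanced panel. It is an illustration only, not the book's R code and far from a full multilevel model. The function name and the panel data are hypothetical, and the sketch assumes every respondent has the same number of measurements.

```python
def intraclass_correlation(panel):
    """One-way random-effects ICC for a balanced panel.

    `panel` maps each respondent to a list of repeated measurements;
    every list must have the same length k >= 2.
    """
    groups = list(panel.values())
    n = len(groups)
    k = len(groups[0])
    if n < 2 or k < 2 or any(len(g) != k for g in groups):
        raise ValueError("need at least 2 respondents with equal-length series")

    grand_mean = sum(sum(g) for g in groups) / (n * k)
    group_means = [sum(g) / k for g in groups]

    # Between-respondent and within-respondent mean squares.
    ms_between = k * sum((m - grand_mean) ** 2 for m in group_means) / (n - 1)
    ms_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    ) / (n * (k - 1))

    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("no variation in the data, so the ICC is undefined")
    return (ms_between - ms_within) / denominator


if __name__ == "__main__":
    # Hypothetical spending by three panel members over two survey waves.
    panel = {"A": [10, 12], "B": [20, 22], "C": [30, 32]}
    print(intraclass_correlation(panel))
```

A value near 1, as in the invented data here, says that repeated measurements on the same respondent are far from independent, which is exactly the situation where treating panel waves as separate one-time surveys would mislead.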

