A few months ago, my company updated its email service, in the process loosening what were pretty tight default spam filters. Now, I rarely miss an important email, but at the expense of having to manually dispatch quite a few more bogus messages. I guess I’m trading off Type II errors for Type I. 

One set of advertisements that now regularly finds my inbox is from analytics company Statistical Horizons, which conducts seminars of interest to applied statisticians and data analysts. The message that first got my attention was promoting a class entitled Missing Data, a topic currently near and dear to my data science work. Much has evolved in the treatment of missing data in recent years, and there are now many statistical packages available to help data-deprived analysts.

Not only did the course seem relevant, but I recognized the name of the instructor, Paul Allison. I’d attended grad school many years ago with a Paul Allison. Could this be the same guy? A quick examination confirmed that it was: Paul Allison, Ph.D., President of Statistical Horizons and Professor at the University of Pennsylvania. It’s good to know at least some of my compatriots have found success!

Reviewing Allison’s curriculum vitae, I noticed he’d written a book on missing data, so he seemed a natural for this seminar. I was also impressed with the course synopsis, which cut to the chase of the “what and why” of new techniques for handling missing data in statistical analysis, opining: “Older methods … are prone to three serious problems: Inefficient use of available information … Biased estimates of standard errors … and, biased parameter estimates …” Newer methods, in contrast, “depend on less demanding assumptions than those required for conventional methods … Maximum likelihood is available for linear models, logistic regression and Cox regression. Multiple imputation can be used for virtually any statistical problem. … This course will cover the theory and practice of both maximum likelihood and multiple imputation.” Got my statistical juices going!
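To make the "biased estimates of standard errors" complaint concrete, here is a minimal sketch in Python. It uses mean imputation as the stand-in "older method"; that choice is mine, since the synopsis doesn't name the conventional methods it has in mind, and the numbers are made up for illustration.

```python
import statistics

# Hypothetical measurements; None marks a missing value.
values = [2.0, 4.0, None, 6.0, 8.0, None]

# Keep only the cases that were actually observed.
observed = [v for v in values if v is not None]

# Mean imputation: replace every missing value with the observed mean.
fill = statistics.mean(observed)
imputed = [fill if v is None else v for v in values]

# Sample variance and the standard error of the mean, before and after filling.
var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)
se_observed = (var_observed / len(observed)) ** 0.5
se_imputed = (var_imputed / len(imputed)) ** 0.5

print(f"observed only:  variance={var_observed:.3f}, SE={se_observed:.3f}")
print(f"mean-imputed:   variance={var_imputed:.3f}, SE={se_imputed:.3f}")
```

The filled-in values sit exactly at the mean, so they add sample size without adding spread. The variance shrinks and the standard error looks smaller than the data can justify, which is roughly the problem that multiple imputation is designed to avoid.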

Statistical Horizons will be quite busy this Fall, delivering no fewer than 11 two-day classes, many of interest to data scientists. In addition to Missing Data, Allison will teach Survival Analysis Using Stata and Longitudinal Data Analysis Using Stata, both relevant for business quants. Indeed, panel designs are pervasive in BI, and I’m increasingly seeing “censored” data with OpenBI customers seeking to answer questions like if and when their customers will defect and if and when their products will fail.
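For readers new to censoring: a customer who hasn't defected by the end of the observation window still tells us something, namely that they lasted at least that long. A minimal sketch of the Kaplan-Meier estimator, one standard survival-analysis tool for such data, follows. The tenure figures and the names `durations` and `observed` are hypothetical, and I can't say whether the course presents the estimator this way.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: time until the event or until censoring
    observed:  1 if the event (e.g. defection) was seen, 0 if censored
    """
    data = sorted(zip(durations, observed))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = 0
        leaving = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            events += data[i][1]
            leaving += 1
            i += 1
        if events:
            # Censored subjects still count as at risk at time t.
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical customer tenure in months; 0 means still a customer when
# observation ended (censored).
durations = [3, 5, 5, 8, 12, 12]
observed = [1, 1, 0, 1, 0, 0]

curve = kaplan_meier(durations, observed)
for month, prob in curve:
    print(f"month {month}: estimated retention {prob:.3f}")
```

Simply dropping the censored customers, or treating their last-seen date as a defection date, would understate retention. The estimator instead keeps them in the at-risk count for as long as they were observed.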

Other SH courses of interest to data scientists include Social Network Analysis, Propensity Score Analysis, Data Mining, and Statistics with R. At just about any data science gathering these days, social network analysis is front and center. The Exponential Random Graph Model (ERGM) presented in SNA particularly caught my eye; the prospect of moving “beyond simple peer association models (network autocorrelation models) toward disentangling causality from selection” is quite intriguing.

Propensity analysis is an important tool for data scientists who don’t have the luxury of randomized experiments to test their hypotheses. In fact, I like to think of it as a technique for statistical cleansing, allowing analysts to summarize one or more potentially confounding attributes into a single score, the estimated probability of receiving the intervention. That score, in turn, drives the matching of treated and control subjects to produce less biased estimates of intervention effects, provided the important confounders have actually been measured.
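The idea can be sketched in a few lines of Python. This is a toy illustration under my own assumptions, not the method taught in the course: the covariates, the simulated outcome with a built-in effect of 1.5, and the helper names (`fit_logistic`, `match_nearest`) are all hypothetical, and the matching is a simple greedy 1:1 nearest-neighbor scheme without replacement.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, t, lr=0.5, epochs=5000):
    """Fit P(treated | x) by batch gradient descent; returns [intercept, *coefs]."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, ti in zip(X, t):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = sigmoid(z) - ti
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w


def propensity(w, xi):
    """Collapse all covariates of one subject into a single score in (0, 1)."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))


def match_nearest(scores, t):
    """Greedy 1:1 matching of each treated subject to the closest unused control."""
    controls = [i for i, ti in enumerate(t) if ti == 0]
    pairs = []
    for i in [idx for idx, ti in enumerate(t) if ti == 1]:
        if not controls:
            break
        j = min(controls, key=lambda c: abs(scores[c] - scores[i]))
        controls.remove(j)
        pairs.append((i, j))
    return pairs


# Hypothetical subjects with two scaled covariates each.
X = [[0.1, 0.3], [0.2, 0.1], [0.3, 0.4], [0.4, 0.2], [0.5, 0.6],
     [0.6, 0.3], [0.7, 0.8], [0.8, 0.7], [0.9, 0.7], [1.0, 0.4]]
treated = [0, 0, 0, 0, 1, 0, 1, 0, 1, 0]
# Simulated outcome: depends on the covariates plus a built-in effect of 1.5.
outcome = [x1 + 2 * x2 + 1.5 * t for (x1, x2), t in zip(X, treated)]

weights = fit_logistic(X, treated)
scores = [propensity(weights, xi) for xi in X]
pairs = match_nearest(scores, treated)

# Compare a raw group difference with the matched-pair difference.
n_treated = sum(treated)
naive = (sum(y for y, t in zip(outcome, treated) if t) / n_treated
         - sum(y for y, t in zip(outcome, treated) if not t) / (len(treated) - n_treated))
matched = sum(outcome[i] - outcome[j] for i, j in pairs) / len(pairs)
print(f"naive difference:   {naive:.2f}")
print(f"matched difference: {matched:.2f}")
```

With ten subjects this is only a demonstration of the mechanics. In practice one would also check covariate balance after matching and consider calipers or matching with replacement.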

The description of Robert Stine’s Data Mining course reminds me of the wonderful Statistical Learning and Data Mining seminar of Stanford professors Trevor Hastie and Robert Tibshirani. One thing’s for certain: statistical learning/data mining will be a core competency of every data scientist going forward.

In contrast to other statistical seminar curricula, which are often overly mathematical, the focus of Statistical Horizons’ instruction seems to be on state-of-the-art methods that can be applied to common, real-world data science problems. Most of the instructors are statistically oriented social scientists who work with the same types of messy data and designs as today’s data scientists. As an exemplar, think of Duncan Watts, sociologist and Yahoo! researcher.

If I could change one thing about the SH instruction, it’d be the choice of statistical platforms to showcase the methods. Depending on the course, the currently highlighted statistical software might be SAS, SPSS, Stata, open source R, or some combination. In data science, R reigns supreme, while SAS remains the 800-pound statistical gorilla of the business world. SPSS, popular among psychologists, marketing researchers and business strategists, is generally not the choice of data scientists. And Stata is mostly consigned to social science academia and rarely used in business. My druthers would be to see a lot of R and a dash of SAS.

While I haven’t yet participated in a Statistical Horizons seminar, I like the current topics and apparent focus of instruction. The testimonials cited on the website are impressive, and at $895 per 2-day course ($795 with early registration), the seminars are competitively priced. Putting it all together, I wouldn’t hesitate to recommend Missing Data, Social Network Analysis, Propensity Score Analysis, Survival Analysis Using Stata, Longitudinal Data Analysis Using Stata, Statistics with R, and Data Mining to prospective students. For data scientists looking to advance their statistical portfolio, these courses look like good bets.