I participate in a data mining/predictive modeling discussion group on a major social networking web site. Recently, a topic entitled “Misconceptions about statistics” surfaced in reaction to an article in the BI media. That article had leveled several criticisms at statistical methods for predictive modeling. Among them:
“Traditional statistical analysis is often of limited value. It is not that these tools are somehow flawed. Rather, it is that they are overly simplistic and, in many cases inappropriate for the task of modeling human behavior.”
“Traditional statistical techniques are overly simplistic as they are suitable for only the most basic support of our decision making. They typically assume that the interactions in our decision variables are independent of each other, when, in fact, we are bombarded with multiple inputs that are highly interrelated.”
“Additionally, these simple modeling techniques generally attempt to build linear relationships between the inputs and the desired output. It is often the case that the basic recognition of the non-linear aspects of a solution space will generate improved decision making.”
“The advanced modeling tools used in data mining are not ‘better’ tools. They are simply better suited to modeling the realities of human behavior.”
The reactions from the group were generally negative, calling out what members considered the author's erroneous assumptions about normality, independence and linearity. He was also chastised for the limited and naïve range of models he considered.
My take on the article was less literal. I think the author's point that traditional statistical models might not be up to the task of predicting the realities of human behavior is quite valid. In fact, one of the giants of the statistical world, the late UC Berkeley professor Leo Breiman, originator of Classification and Regression Trees (CART) and Random Forests, said as much in a provocative 2001 article, “Statistical Modeling: The Two Cultures,” in which he criticized the statistics status quo. The abstract of this paper is telling:
“There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.”
In Breiman's critique, of course, stochastic data models are the grist of traditional statistics, while algorithmic models have emerged from computer/mathematical science and other disciplines such as psychology and biology. The exciting field of statistical learning, with champions at the elite stats department of Stanford University, now seems a promising compromise between the two camps.
Eminent Stanford statistics professor and Breiman colleague Brad Efron shared his wisdom on the topic with me last year. For Efron, statistics is the first and most successful information science, a small discipline concerned with careful analysis of inference. If we abandon statistical inference, we’re doomed to reinvent it.
At the same time, traditional statistics is experiencing only modest growth, with more and more analysis being done in scientific disciplines, giving rise to biometrics/biostatistics/bioinformatics, econometrics, psychometrics, statistical physics, etc. And then there are the computer science/artificial intelligence contributions of machine learning and other data mining techniques. Efron believes the competition and collaboration between statistics, the scientific disciplines and computer science can rapidly advance predictive science, much to the benefit of business.
The challenge for today's modelers is to assimilate the vast and ever-growing body of predictive science knowledge from all the disparate fields, choosing the best of each for their tool chests. At the same time, the core curricula for Statistics MS programs at top schools like Stanford, Chicago and Wisconsin are not unlike what they were years ago, with emphasis on math, probability theory, statistical theory and mainstream statistical methods, except now with a very healthy dose of computation. On the positive side, Bayesian analysis is more accepted than in the past, a welcome development for BI. My guess, though, is that depending on choice of electives, one could get an advanced degree from a top school with just limited exposure to the descriptive and predictive learning algorithms increasingly popular in business.
It's not enough to be a statistician, psychometrician or computer scientist in today's business quant world. Predictive modelers must now be data jocks and computational experts, touch many quantitative disciplines and be committed to lifelong learning. And they must increasingly contribute in the larger data warehousing and BI contexts of business. Daunting but exciting challenges, indeed.
Steve Miller also blogs at miller.openbi.com.