Politics of Data Models and Mining

I recently came across an interesting thread, “Is data mining still a sin against the norms of econometrics?”, in the Advanced Business Analytics LinkedIn Discussion Group. The point of departure for the dialog is a paper entitled “Three attitudes towards data mining”, written by a couple of academic econometricians.

The data mining “attitudes” range from one extreme, that DM techniques are to be avoided like the plague, to the other, that “data mining is essential and that the only hope that we have of using econometrics to uncover true economic relationships is to be found in the intelligent mining of data.” The authors note that machine learning phobia is currently the norm in economics research.

Why is this? “Data mining is considered reprehensible largely because the world is full of accidental correlations, so that what a search turns up is thought to be more a reflection of what we want to find than what is true about the world.” In contrast, “Econometrics is regarded as hypothesis testing. Only a well specified model should be estimated and if it fails to support the hypothesis, it fails; and the economist should not search for a better specification.”
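The “accidental correlations” worry is easy to see in a small simulation. What follows is a minimal sketch, not anything taken from the paper: the function names, the sample sizes and the 0.361 cutoff (roughly the two-sided 5% critical value of a correlation with 30 observations) are all my own illustrative assumptions. Every candidate predictor is pure noise, yet searching across 200 of them reliably turns up one that looks “significant” in-sample.

```python
import random
from math import sqrt

# Approximate two-sided 5% critical value for |r| with n = 30 (assumption for illustration).
CRITICAL_R = 0.361


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def search_noise(n_obs=30, n_candidates=200, seed=0):
    """Search pure-noise predictors for the one most correlated with a pure-noise target."""
    rng = random.Random(seed)
    # Draw twice as many observations as needed: first half for the search, second half held out.
    y = [rng.gauss(0, 1) for _ in range(2 * n_obs)]
    candidates = [[rng.gauss(0, 1) for _ in range(2 * n_obs)]
                  for _ in range(n_candidates)]
    train = slice(0, n_obs)
    hold = slice(n_obs, None)

    in_sample = [pearson(x[train], y[train]) for x in candidates]
    best = max(range(n_candidates), key=lambda j: abs(in_sample[j]))
    return {
        "best_index": best,
        "in_sample_r": in_sample[best],
        # Correlation of the "winning" predictor on data the search never saw.
        "holdout_r": pearson(candidates[best][hold], y[hold]),
        # How many noise predictors clear the nominal 5% bar in-sample.
        "n_nominal_hits": sum(abs(r) > CRITICAL_R for r in in_sample),
    }


result = search_noise()
```

Printing `result` shows the in-sample correlation of the winning predictor next to its correlation on the held-out half. Since nothing here is related to anything else, the in-sample “finding” is produced entirely by the search, which is exactly the objection the authors describe.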

In other words, econometrics focuses on explanation, expecting its practitioners to generate hypotheses and then test them with regression models. ML, on the other hand, obsesses over discovery and prediction, often content to let the data talk directly, without the distraction of “theory.” Just as bad, from the econometrician's standpoint, the results of black-box ML might not be readily interpretable for tests of economic hypotheses.

The traditional regression versus machine learning dialectic was addressed many years ago by the late statistician Leo Breiman, who contrasted the former unfavorably with the latter. “There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets.”

Political scientist Philip Schrodt seems to agree with Breiman, linking much of ‘the current regression malaise to the philosophic obsession with explanation over prediction, for him a false and dangerous dichotomy. With Schrodt you can’t explain what you can’t predict. Regression models, though, are popular with “theorists,” in part because they provide coefficients for the predictor variables, the direction and strength of which are routinely used to accept/reject hypotheses. Contrast the purported explanatory power of linear models with “black box” statistical learning techniques that generally do a better job of prediction at the expense of explanatory power.’

It’s not surprising that economists and political scientists are late to the machine learning party. Eminent Stanford statistician Brad Efron opined in an interview several years ago that ML is more “liberal” than pure statistics, which is, in turn, less conservative than applied statistical disciplines such as econometrics and psychometrics. “If data analysis were political, biometrics/econometrics/psychometrics would be ‘right wing’ conservatives, traditional statistics would be ‘centrist,’ and machine learning would be ‘left-leaning.’ The conservative-liberal scale reflects how orthodox the disciplines are with respect to inference, ranging from very to not at all.”

I still use regression extensively, but have adopted newer techniques such as smoothing, shrinkage and resampling, espoused by Frank Harrell in his excellent book “Regression Modeling Strategies”. Indeed, I think Harrell’s treatment provides a seamless transition from the traditional hypothesis-testing regression I learned many years ago to the modern statistical learning approaches of Hastie, Tibshirani, and Friedman.
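For readers who have not met shrinkage and resampling, here is a minimal sketch of the two ideas on a single predictor. It is not taken from Harrell’s book: the simulated data, the function names and the penalty value are all hypothetical choices of mine. Shrinkage is shown as a ridge penalty that pulls the slope toward zero, and resampling as a bootstrap that re-estimates that slope on rows drawn with replacement.

```python
import random


def make_data(n=40, true_slope=2.0, seed=1):
    """Simulated data (an assumption for illustration): y = true_slope * x + noise."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [true_slope * a + rng.gauss(0, 1) for a in x]
    return x, y


def ridge_slope(x, y, lam=0.0):
    """Slope of a one-predictor regression with an unpenalized intercept.

    lam = 0 gives ordinary least squares; lam > 0 shrinks the slope toward zero.
    """
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / (sxx + lam)


def bootstrap_interval(x, y, lam, n_boot=500, seed=0):
    """Rough 95% percentile interval for the slope from resampled (x, y) pairs."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    for _ in range(n_boot):
        # Resample row indices with replacement, keeping each x paired with its y.
        idx = [rng.randrange(n) for _ in range(n)]
        slopes.append(ridge_slope([x[i] for i in idx], [y[i] for i in idx], lam))
    slopes.sort()
    return slopes[int(0.025 * n_boot)], slopes[int(0.975 * n_boot) - 1]


x, y = make_data()
ols = ridge_slope(x, y, lam=0.0)
shrunk = ridge_slope(x, y, lam=10.0)
low, high = bootstrap_interval(x, y, lam=10.0)
```

Comparing `ols` with `shrunk` shows the penalty trading a little bias for stability, and `low` and `high` give a sense of the slope’s variability without leaning on a textbook standard-error formula. In practice the penalty would be tuned, for instance by cross-validation, rather than fixed at 10.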

For me, though, there’s no turning back: regression is now just one of many potential ML methods. When it comes to explanation versus prediction in my current work, I’ll choose prediction three times out of four. The pluses of newer ML models, which use shrinkage and resampling techniques to handle ever-expanding data sets, more than compensate for the minuses surrounding the interpretation of model coefficients.

After much dense academese, I think the “Three attitudes ...” authors agree, defecting to the econometrics dark side of machine learning. “A regulated specification search, such as the general-to-specific methodology proposes, is an attempt to use econometrics to bring an economic reality into focus that would otherwise remain hidden. It aims, quite literally, to discover the truth.”

What do readers think? Where do you stand on the traditional regression-machine learning continuum? On explanation vs. prediction with your models? Are you a conservative econometrician? Or a liberal machine learner?