On two occasions I worked with students writing their Ph.D. dissertations in clinical psychology. I remember feeling sorry for their plight of having to do “empirical” studies that included statistical analysis when their interests and training couldn't have been more removed. The students were given “puzzles” to solve by their advisers and then tasked with the statistical design, programming and analyses of the data. The shocker was not that the unwitting students didn't know that stuff, but that their advisers didn't know it either. When one frustrated student's ANOVA came back empty, her adviser “ordered” a regression analysis, apparently unaware that both derived from the same linear model and hence would yield identical results. Both harried students paid me well to “fish” for the statistical significance I was almost certain not to find. As a result of these hapless exercises, I developed a healthy skepticism towards much of what I read in behavioral research.
I was reminded of those frustrations when I came across a trenchant article on the misuses of statistics in the March 27, 2010 ScienceNews, “Odds Are, It's Wrong: Science fails to face the shortcomings of statistics.” The basis of author Tom Siegfried's argument is that “The ‘scientific method’ of testing hypotheses by statistical analysis stands on a flimsy foundation. Statistical tests are supposed to guide scientists in judging whether an experimental result reflects some real effect or is merely a random fluke, but the standard methods mix mutually inconsistent philosophies and offer no meaningful basis for making such decisions. Even when performed correctly, statistical tests are widely misunderstood and frequently misinterpreted. As a result, countless conclusions in the scientific literature are erroneous, and tests of medical dangers or treatments are often contradictory and confusing.” Yikes.
Siegfried identifies a number of problem areas in the application/interpretation of modern statistical techniques. First and foremost is the problematic notion of statistical significance, unfortunately the holy grail of many investigations. “Correctly phrased, experimental data yielding a P value of .05 means that there is only a 5 percent chance of obtaining the observed (or more extreme) result if no real effect exists (that is, if the no-difference hypothesis is correct). But many explanations mangle the subtleties in that definition.”
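That definition is easy to check with a quick Monte Carlo sketch: if the no-difference hypothesis is true, a test run at the .05 level should flag roughly 5 percent of experiments anyway. The sample size, trial count, and known-variance z-test here are hypothetical choices for illustration:

```python
import random
import statistics

random.seed(0)

# Monte Carlo check of the p-value definition: when no real effect exists,
# a two-sided test at the .05 level should still "find" an effect about
# 5 percent of the time.
def fraction_flagged(trials=2000, n=50):
    crit = 1.96  # two-sided z critical value for alpha = .05
    flagged = 0
    for _ in range(trials):
        sample = [random.gauss(0, 1) for _ in range(n)]  # null is true: mean 0
        z = statistics.mean(sample) * n ** 0.5  # z-test, known sd = 1
        if abs(z) > crit:
            flagged += 1
    return flagged / trials

print(fraction_flagged())  # hovers near 0.05
```

The 5 percent is not the chance the finding is wrong; it's the rate at which pure noise gets crowned "significant."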
If there are problems with significance testing of a single variable of interest, what about the complications of considering multiple effects simultaneously? “When several drugs are tested at once, or a single drug is tested on several groups, chances of getting a statistically significant but false result rise rapidly.” Fortunately, techniques for simultaneous inference are being developed at top stats schools like Stanford. The appropriate application of those techniques, though, is still a concern.
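The arithmetic behind that inflation is simple: with independent tests each at the .05 level, the chance of at least one false positive is 1 − (1 − .05)^k. A minimal sketch (the test counts are arbitrary illustrations):

```python
# Familywise error rate: probability of at least one false positive across
# k independent tests, each run at significance level alpha.
def familywise_error_rate(k, alpha=0.05):
    return 1 - (1 - alpha) ** k

for k in (1, 5, 20, 100):
    print(k, round(familywise_error_rate(k), 3))
# 1 -> 0.05, 5 -> 0.226, 20 -> 0.642, 100 -> 0.994
```

Twenty comparisons and the odds of a spurious "discovery" are nearly two in three — which is why corrections like Bonferroni (testing each hypothesis at alpha/k) or the false discovery rate exist at all.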
One of my pet statistical peeves, duly confirmed in the article, is that slaves to statistical significance confuse their “positive” findings with practical worldly importance. “Another common error equates statistical significance to ‘significance’ in the ordinary use of the word. Because of the way statistical formulas work, a study with a very large sample can detect ‘statistical significance’ for a small effect that is meaningless in practical terms.” Indeed, I now always look at the “significance” of interventions visually first and am often satisfied with that alone. For much of the work I do, a statistically significant/practically insignificant result is of no interest. This distinction between statistical and practical significance – relative vs absolute risk – is well chronicled in the delightful book The Illusion of Certainty.
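A quick sketch shows how sample size manufactures "significance." Here a trivially small effect — one hundredth of a standard deviation, a number I've made up for illustration — sails past the .05 bar once the sample gets big enough:

```python
import math

# Two-sided z-test p-value for a sample mean (sketch; assumes known sd).
def z_test_p(mean_diff, sd, n):
    z = mean_diff / (sd / math.sqrt(n))
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # normal tail
    return z, p

# The same tiny 0.01-sd effect, tested at three sample sizes
for n in (100, 10_000, 1_000_000):
    z, p = z_test_p(0.01, 1.0, n)
    print(f"n={n:>9}  z={z:5.2f}  p={p:.4f}")
```

At n = 100 the effect is statistical noise; at n = 1,000,000 it is wildly "significant" — and exactly as practically meaningless as before.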
Randomization, the platinum assignment technique for ensuring that treatment and control groups are “equal” on other potentially confounding variables, can also fail, a victim of probability mathematics. “...statistics do not guarantee an equal distribution any more than they prohibit 10 heads in a row when flipping a penny.” If 95% of randomizations do just fine, that still leaves 5% that don't. And when randomization fails, the risk of plausible alternative explanations to the intervention grows.
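It's easy to simulate how often randomization deals a lopsided hand. In this hypothetical setup — 40 subjects, 12 of whom carry some confounding trait, split evenly into two arms — I count how often one arm ends up with a markedly unequal share of the trait (the threshold is my own arbitrary choice):

```python
import random

random.seed(1)

# Simulate randomizing 40 subjects, 12 of whom share a confounding trait,
# into two equal arms; count how often the trait lands lopsidedly.
def imbalance_rate(trials=10_000, n=40, trait=12, threshold=9):
    lopsided = 0
    for _ in range(trials):
        subjects = [1] * trait + [0] * (n - trait)
        random.shuffle(subjects)
        in_treatment = sum(subjects[: n // 2])  # trait carriers in treatment arm
        # "balanced" would put about 6 in each arm; flag 9+ or 3-or-fewer
        if in_treatment >= threshold or in_treatment <= trait - threshold:
            lopsided += 1
    return lopsided / trials

print(imbalance_rate())  # a nontrivial fraction of randomizations misfire
```

A visible slice of perfectly executed randomizations still produce groups unequal enough to invite alternative explanations — which is why trialists check baseline covariates rather than trusting the coin flip blindly.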
Increasingly, “meta-analysis” — combining the results of similar investigations into a seemingly coherent whole — has become fashionable in the statistical world. There are threats to a valid meta-analysis, however. “In principle, combining smaller studies to create a larger sample would allow the tests to detect such small effects. But statistical techniques for doing so are valid only if certain criteria are met. For one thing, all the studies conducted on the drug must be included — published and unpublished. And all the studies should have been performed in a similar way, using the same protocols, definitions, types of patients and doses. When combining studies with differences, it is necessary first to show that those differences would not affect the analysis...”
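When those criteria are met, the basic mechanics are straightforward. A minimal fixed-effect (inverse-variance) pooling sketch, with three entirely hypothetical small studies standing in for the real thing:

```python
import math

# Fixed-effect meta-analysis: pool effect estimates from several studies,
# weighting each by the inverse of its variance (1 / se^2).
def pool(studies):
    weights = [1 / se ** 2 for _, se in studies]
    pooled = sum(w * effect for (effect, _), w in zip(studies, weights)) / sum(weights)
    pooled_se = 1 / math.sqrt(sum(weights))
    return pooled, pooled_se

studies = [(0.30, 0.25), (0.10, 0.20), (0.22, 0.30)]  # (effect, standard error)
est, se = pool(studies)
print(round(est, 3), round(se, 3))  # 0.187 0.139 — tighter than any one study
```

The pooled standard error is smaller than any single study's, which is the whole appeal — and also the danger: the precision is only real if the studies are genuinely comparable and none were left in the file drawer.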
The author sees hope in the Bayesian approach, with its notion of probability as “degree of belief”. Bayesian statistics, once anathema to a frequentist camp wedded to the “objective” ideal of standard statistics, is now becoming increasingly popular in response to both traditional stats failings and advances in computational efficiency. And the data deluge amply supports the estimation of Bayesian priors, which, along with the sequential updating of posterior probabilities, fits nicely the adaptive, learning mentality of BI.
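That sequential-updating rhythm is easy to sketch with a conjugate Beta prior on a rate — yesterday's posterior becomes today's prior. The batch data below are invented for illustration:

```python
# Bayesian sequential updating with a conjugate Beta prior on a rate:
# each batch of (successes, failures) simply adds to the Beta parameters.
def update_beta(alpha, beta, successes, failures):
    return alpha + successes, beta + failures

a, b = 1, 1  # uniform Beta(1, 1) prior: no initial opinion about the rate
for successes, failures in [(3, 17), (5, 15), (2, 18)]:  # arriving data batches
    a, b = update_beta(a, b, successes, failures)
    print(f"posterior mean: {a / (a + b):.3f}")
# 0.182, 0.214, 0.177 — belief revised as each batch lands
```

Each new batch of evidence revises the estimate rather than restarting the analysis — the adaptive, learn-as-you-go posture that fits BI so well.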
The article's Bayesian illustration of identifying steroid users in baseball is, almost comically, a sign of the times. If a test for steroid use correctly identifies 95 percent of users (and falsely flags 5 percent of non-users), and, based on previous testing, experts have established that about 2 percent of professional baseball players use steroids, what is the probability that a player testing positive is actually a steroid user? The quick answer of .95 is, fortunately, dramatically incorrect. The Bayesian calculation of the correct probability is as follows: pr(user | +test) = pr(user) × pr(+test | user) / [pr(user) × pr(+test | user) + pr(~user) × pr(+test | ~user)] = (.02 × .95) / (.02 × .95 + .98 × .05) ≈ .28. Let's not be too quick to indict the boys of summer!
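The calculation above is a one-liner in code. A minimal Bayes' rule sketch, with the prior, sensitivity, and false positive rate taken from the article's example:

```python
# Bayes' rule for the steroid-testing example: what fraction of positive
# tests come from actual users, given a low base rate of use?
def posterior(prior, sensitivity, false_positive_rate):
    p_positive = prior * sensitivity + (1 - prior) * false_positive_rate
    return prior * sensitivity / p_positive

p = posterior(prior=0.02, sensitivity=0.95, false_positive_rate=0.05)
print(round(p, 2))  # 0.28
```

With a 2 percent base rate, the 5 percent of false alarms among the 98 percent of clean players swamps the true positives — hence only a .28 chance a flagged player actually used.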
The author's critiques of current statistical practice are certainly valid: there are a lot of bad statistical analyses in the scientific and business worlds today. In my assessment, the biggest problems with statistics are not the methods themselves, but their inappropriate application/interpretation by inexperienced or biased practitioners. My antidote for cavalier business statistical practices? A course with esteemed Vanderbilt professor, R expert and statistical practitioner par excellence Frank Harrell.
Steve also blogs at Miller.OpenBI.com.