I’ve just started my holiday shopping. I probably do two-thirds of my purchases online with Amazon, satisfying family requests for books, DVDs, games and electronics. I get my money’s worth from a Prime account, generally including one or more books for myself with each gift order.
On the recommendation of a colleague, I added “Thinking Statistically,” a 40-page “book” by Uri Bram, to my latest batch of gifts. TS took me all of an hour to complete, but I found it both an enjoyable read and a refreshing review. Bram applies statistical concepts to everyday-life situations and succeeds in illustrating how we often err in cognition by not using statistical logic correctly. This little book sits between mathematically oriented statistical texts and the pioneering work on judgment and decision making by psychologists such as Nobel Laureate Daniel Kahneman.
“Thinking Statistically” comprises just three chapters, one each for selection bias, endogeneity and Bayes’ rule. Followers of Open Thoughts on Analytics will note that bias and Bayes are frequent topics of the blog.
Selection bias occurs when the sample of observations we wish to make inferences about is systematically different from the population it purports to represent. One way to protect against that risk is to ensure the sample is drawn randomly from the known population. If a random sample is impractical, we can often “fix” bias by adjusting for systematic differences from the population. As Bram puts it, “The real problems occur when our sample is biased and we fail to account for that.” Surveys run by BI vendors on their own web sites that point to high demand for their products are illustrations of selection bias, as are analyses with significant missing data that blithely assume such data are missing at random. Takeaways? Be very attentive to analytical designs, such as the randomized experiment, that attenuate selection bias.
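The self-selected-survey problem is easy to see by simulation. The sketch below uses hypothetical numbers of my own choosing (a 0–10 satisfaction scale where the chance of answering a vendor survey rises with satisfaction); none of it comes from the book:

```python
import random

random.seed(42)

# Hypothetical population: 10,000 people whose true average satisfaction
# with a product is around 5 on a 0-10 scale.
population = [random.gauss(5, 2) for _ in range(10_000)]

# A random sample estimates the population mean well.
random_sample = random.sample(population, 500)

# A self-selected sample (think: a vendor's web survey) over-represents
# enthusiasts: assume the chance of responding is roughly satisfaction/10.
biased_sample = [x for x in population if random.random() < x / 10]

def mean(xs):
    return sum(xs) / len(xs)

print(f"population mean:    {mean(population):.2f}")
print(f"random sample mean: {mean(random_sample):.2f}")
print(f"biased sample mean: {mean(biased_sample):.2f}")
```

The biased sample's mean lands well above the population's, and no amount of extra self-selected respondents fixes that; only the random sample converges to the truth.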
Endogeneity is another form of bias that manifests in predictive modeling from non-randomized “observational” analyses. In most “supervised” situations, the response or dependent variable is presumed to be a function of a number of feature or independent variables plus a random disturbance. If the disturbance is in fact not random, that is, if it is correlated with either included or omitted features, then bias in parameter estimates is likely, potentially invalidating the model.
Bram cites the simple endogeneity-plagued example of predicting college G.P.A. as a function of ability and effort. That model omits an important predictor of G.P.A.: an index of the ease of courses taken. Without that indicator, it could well be the case that “the difference between G.P.A. values predicted ... and actual GPA outcomes is not at all random” but in fact strongly related to the ease index. Statisticians and econometricians have proposed many solutions to the endogeneity problem over the years. Some involve estimating multi-stage models that “cleanse” the effects of the offending variables.
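Omitted-variable endogeneity is also easy to demonstrate by simulation. Bram's example supplies the variable names; the data-generating process and coefficients below are mine, invented purely for illustration:

```python
import random

random.seed(0)
N = 20_000

# Hypothetical process: GPA = 0.5*ability + 0.4*ease + noise, where
# weaker students tend to pick easier courses, so the omitted "ease"
# variable is (negatively) correlated with the included "ability".
ability = [random.gauss(0, 1) for _ in range(N)]
ease    = [-0.5 * a + random.gauss(0, 1) for a in ability]
gpa     = [0.5 * a + 0.4 * e + random.gauss(0, 0.5)
           for a, e in zip(ability, ease)]

def slope(x, y):
    """Simple-regression slope: cov(x, y) / var(x)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var = sum((xi - mx) ** 2 for xi in x)
    return cov / var

# Regressing GPA on ability while omitting ease biases the estimate:
# expected slope = 0.5 + 0.4 * cov(ability, ease)/var(ability)
#                = 0.5 + 0.4 * (-0.5) = 0.3, not the true 0.5.
print(f"estimated ability coefficient: {slope(ability, gpa):.2f}")
```

The estimate converges to roughly 0.3 no matter how large the sample, because the disturbance (which absorbs the omitted ease index) is correlated with the included predictor, which is precisely the bias the paragraph above describes.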
As Daniel Kahneman argues convincingly in his tour de force, “Thinking, Fast and Slow,” humans are both prone to systematic biases in judgment and decision making and lousy Bayesian probability practitioners. Bayes’ rule states that the probability of a hypothesis given, or conditioned on, the evidence, P(H|E), is equal to the probability of the hypothesis times the probability of the evidence given the hypothesis, all divided by the probability of the evidence: P(H|E) = P(H)*P(E|H)/P(E). By the law of total probability, P(E) can in many cases be expanded as P(H)*P(E|H) + P(~H)*P(E|~H). In Bayesian parlance, P(H|E) is the posterior probability, P(H) the prior probability or base rate, and P(E|H)/P(E) the normalized likelihood.
Proper application of Bayes’ rule trips up even the most educated. Consider the following example. Suppose a friend gets the unfortunate news she’s tested positive for a serious disease. WebMD research shows that the incidence of the disease indicated by the test is 1/1000, or 0.1%. You also discover that 99% of those with the disease return a positive test, and that only 5% of those who don’t have the disease test positive.
Given those numbers, if you think the prospects for your friend are bleak, you have lots of company, including the brightest medical students. Proper application of Bayes’ rule, fortunately, reveals a much less ominous scenario for your friend. The probability that she has the disease given the positive test, P(H|E), is .001*.99/(.001*.99 + .999*.05) = .019, or about 2%. The statistical gods are looking over her!
What trips up non-Bayesians is the low prior or base rate of 0.1%. Change it to 1% instead and the conditional probability of disease is a much higher 16.7%. Ratchet the base rate up to 10%, and the conditional probability jumps to almost 69%. The moral of the story: be attentive to application of Bayes’ rule and especially mindful of base rates.
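The whole calculation fits in a few lines, which makes the base-rate sensitivity easy to explore (the function name and parameter defaults below are mine; the 99% sensitivity and 5% false-positive rate are the figures from the example):

```python
def posterior(prior, sens=0.99, fpr=0.05):
    """P(disease | positive test) via Bayes' rule.

    prior = base rate P(H), sens = P(E|H), fpr = P(E|~H);
    the denominator expands P(E) by total probability.
    """
    return prior * sens / (prior * sens + (1 - prior) * fpr)

for p in (0.001, 0.01, 0.10):
    print(f"base rate {p:>5.1%} -> P(disease | positive) = {posterior(p):.1%}")
```

Running it reproduces the figures above: roughly 2% at the 0.1% base rate, 16.7% at 1%, and almost 69% at 10%.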