Monte Carlo, Resampling and BI – Part 1
About five years ago, a Newsweek review of the then-new iPod raised a concern that the shuffle feature might not be random. The author noted that, of the 3,000 songs in his playlist, 3 tracks from the same album ended up in his 120-song autofill on more than one occasion: “This seemed to defy the odds,” he opined.
Now I'm sure there were many algebraic calculations by probabilists hither and yon on the likelihood of those iPod occurrences after the article's publication. Several professors from the highly rated statistics department at Iowa State University, in contrast, offered a different yet quite simple analysis, using computer simulation rather than mathematics to arrive at their conclusion. Assuming that the 3,000-tune playlist was apportioned across 250 “albums”, with 12 songs per album, the authors deployed the R statistical package to generate 10,000 trials of 120-case random samples against a 250*12 vector consisting of 12 1's, 12 2's, ..., 12 250's – tracking how many trials yielded three or more hits on the same “album”. Just under 9,500 of the 10,000 trials met the criterion – almost 95%! Rather than an aberration, the 3+ album runs were the norm. And rather than torture themselves with probability calculations, the professors had simply finessed an estimate using easily accessible computer power.
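The professors worked in R; as a rough sketch of the same experiment, here is a Python version. The function name, the seed, and the choice to sample without replacement are my assumptions rather than details from their analysis, so the estimate it prints may differ somewhat from the figure quoted above.

```python
import random
from collections import Counter


def estimate_album_runs(trials=10_000, albums=250, songs_per_album=12,
                        picks=120, threshold=3, seed=2024):
    """Estimate how often a random autofill holds `threshold`+ songs from one album."""
    rng = random.Random(seed)
    # Library vector: album labels 1..250, each repeated 12 times (3,000 songs).
    library = [album for album in range(1, albums + 1)
               for _ in range(songs_per_album)]
    hits = 0
    for _ in range(trials):
        # Draw 120 songs; sampling without replacement is an assumption here.
        autofill = rng.sample(library, picks)
        # Count songs per album and look at the most represented album.
        if max(Counter(autofill).values()) >= threshold:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    share = estimate_album_runs()
    print(f"Share of trials with a 3+ album hit: {share:.3f}")
```

Changing `threshold` to 4 or `picks` to a smaller autofill size is an easy way to see how quickly the "surprising" runs become rare.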
Humans are often even more befuddled by conditional probability – the likelihood of event E given the occurrence of another event, or given access to additional information. A baseball player might hit .300 for a season, but only .150 when he's behind in the count with no balls and 2 strikes. His probability of getting a hit conditioned on an 0-2 count is .150.
Consider the following hypothetical medical situation. You've just tested positive for a very serious disease that afflicts 1 person in 1,000. You've been told the false positive rate – those who test positive but don't have the disease – is 5 per 100. The false negative rate – those who have the disease but test negative – is 1%. What are the chances you have the dreaded disease, given you've tested positive?
Behavioral economists Tversky and Kahneman found that even among highly educated medical professionals such as students and staff at Harvard Medical School, “the most common response, given by almost half of the participants, was 95%,” – very much the wrong answer. Fortunately for you, the correct use of the under-appreciated Bayes' rule for determining conditional probability leads to a more sanguine conclusion. The probability of disease, given the positive test, is in this case: .001*.99/(.001*.99+.05*.999) = .0194, or about 2%! The numerator is the chance of having the disease and testing positive; the denominator adds the chance of being healthy and testing positive anyway. The crazy guesstimates are tripped up by the low incidence of disease in contrast to the number of false positives.
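For readers who would rather check the arithmetic than trust it, the Bayes' rule calculation can be written as a few lines of Python. This is only a sketch; the function and parameter names are mine, chosen for illustration.

```python
def posterior_given_positive(prevalence, false_negative_rate, false_positive_rate):
    """P(disease | positive test) via Bayes' rule."""
    sensitivity = 1.0 - false_negative_rate                     # P(positive | disease)
    true_positive = prevalence * sensitivity                    # P(disease and positive)
    false_positive = (1.0 - prevalence) * false_positive_rate   # P(healthy and positive)
    return true_positive / (true_positive + false_positive)


if __name__ == "__main__":
    p = posterior_given_positive(prevalence=0.001,
                                 false_negative_rate=0.01,
                                 false_positive_rate=0.05)
    print(f"{p:.4f}")  # prints 0.0194
```

Raising `prevalence` to 0.1 while leaving the test unchanged pushes the answer well above one half, which shows how much of the result is driven by the rarity of the disease.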
If you have trouble with probability calculations like this, author and educator Julian Simon shares your pain. In 1997, Simon wrote an easy-to-read book, Resampling: The New Statistics, that uses simulation techniques like the one deployed in the iPod example to tackle a broad range of probability and statistics problems. There've been many new developments in simulation capabilities over the twelve years since the publication of Simon's book, but its message is still germane: whether driven by coin tossing, balls in urns, random number tables, or random number generation in computer languages, simulation techniques can be used as a foundation for solving probability and statistical problems, often replacing arduous mathematics with computer finesse.
Call it simulation, Monte Carlo, resampling – whatever – I like the thinking, as do many statistics teaching professionals. And though there's not a consensus of definitions on the different flavors of simulation in the statistical world, my favorite, the Monte Carlo method, is generally characterized by:
- A defined domain of inputs
- A process for generating inputs randomly from this domain using a probability distribution
- A calculation on the generated inputs
- An aggregation of the calculated results across a large number of iterations
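To make those four ingredients concrete, here is a minimal Python sketch. The helper `monte_carlo` and the coin-toss question (the chance of at least two heads in three fair tosses, which is exactly 0.5) are my own illustrative choices, not part of any standard library.

```python
import random


def monte_carlo(draw, calculate, iterations, seed=0):
    """Average calculate(draw(rng)) over `iterations` randomly generated inputs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(iterations):
        inputs = draw(rng)            # generate inputs randomly from the domain
        total += calculate(inputs)    # calculation on the generated inputs
    return total / iterations         # aggregation across all iterations


def three_tosses(rng):
    # Domain: sequences of three fair coin tosses (True means heads).
    return [rng.random() < 0.5 for _ in range(3)]


def at_least_two_heads(tosses):
    # Score 1 when the event of interest occurs, 0 otherwise.
    return 1.0 if sum(tosses) >= 2 else 0.0


if __name__ == "__main__":
    estimate = monte_carlo(three_tosses, at_least_two_heads, iterations=100_000)
    print(f"Estimated probability: {estimate:.3f}")
```

Keeping the random draw and the calculation as separate functions means a different problem only needs a different pair of functions passed in.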
As an illustration of the method, consider a simple algorithm for simulating the likelihood of disease given a positive test result with probabilities characterized in the example above:
Repeat the process 100,000 times:
    Randomly sample a number between 1 and 1000
    If that number is the one indicating disease
        Randomly sample a number between 1 and 100
        If that number is the one designating false negative
            Increment the false negative count
        Otherwise
            Increment the disease count
    Otherwise
        Randomly sample a number between 1 and 100
        If that number is one of the five indicating false positive
            Increment the false positive count
        Otherwise
            Increment the ok count
At the conclusion of the “experiment” of 100,000 iterations, there are 4 accumulated counts: ok, false negative, disease and false positive. The probability of disease given a positive test, expressed as a percentage, is estimated by 100*disease/(disease + false positive). In my latest run of the simulation, the result was 100*102/(102+4934) = 2.03%. What most people making an off-hand estimate miss is that the false positives outnumber the true disease cases by about an order and a half of magnitude, making the computed ratio small. The false negative probability, .001*.01 = .00001, is inconsequential.
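The algorithm translates almost line for line into Python. The sketch below reads each pair of increments in the pseudocode as an if/otherwise branch, so that the disease count holds diseased people who test positive; that reading, along with the function names and the seed, is my assumption. Being a simulation, its counts will not match the 102 and 4934 from my run exactly.

```python
import random


def simulate_disease_test(iterations=100_000, seed=7):
    """Tally the four outcomes of the disease/test experiment."""
    rng = random.Random(seed)
    counts = {"ok": 0, "false_negative": 0, "disease": 0, "false_positive": 0}
    for _ in range(iterations):
        if rng.randint(1, 1000) == 1:        # 1 person in 1,000 has the disease
            if rng.randint(1, 100) == 1:     # 1% of the diseased test negative
                counts["false_negative"] += 1
            else:
                counts["disease"] += 1       # diseased and correctly flagged
        else:
            if rng.randint(1, 100) <= 5:     # 5 per 100 healthy people test positive
                counts["false_positive"] += 1
            else:
                counts["ok"] += 1
    return counts


def percent_diseased_given_positive(counts):
    """100 * disease / (disease + false positive), as in the text."""
    positives = counts["disease"] + counts["false_positive"]
    return 100 * counts["disease"] / positives


if __name__ == "__main__":
    counts = simulate_disease_test()
    print(counts)
    print(f"{percent_diseased_given_positive(counts):.2f}%")
```

Because only about 100 of the 100,000 iterations land on the disease branch, the estimate wobbles noticeably from seed to seed; more iterations tighten it.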
For those interested in a worked-out example of the simulation approach to solving probability puzzles, there's sample code for the “Smith College diploma problem” by the authors of the new book SAS and R. And for those looking to test their simulation programming mettle solving tough probability problems, I'd recommend Frederick Mosteller's classic: Fifty Challenging Problems in Probability with Solutions.
Next week we'll look at examples of Monte Carlo using sample data as the basis for statistical estimates with techniques like permutation tests and the bootstrap.
Steve Miller also blogs at miller.openbi.com.