Editor's note: Read Part 1 of this blog here.
When I was a statistics student “back in the day,” I learned several important precepts that I can recall as if yesterday. The first was the unbridled joy of working with bell curve-distributed data. Bell curve – normal or Gaussian – distributions not only seemed applicable to any number of measurable traits, they were also quite tractable mathematically. Indeed just two parameters, the mean and variance, described all that needed to be known about a normal distribution.
The second was the power of stepwise regression (SR) as the primary tool for predictive modeling. Just get your data set, identify the independent and dependent variables, and let SR loose. Test the regression coefficients on the same data used to train and don't worry about overfitting. Statistical significance is your friend.
Finally, probability and statistics were much more mathematical than empirical disciplines. One made assumptions about the behavior of her models, then derived the appropriate results mathematically. Theory trumped applied; real stats guys wrote proofs. Computers, mostly mainframe, simply handled the computations carefully articulated by the math. If the assumptions for the models weren't correct, well …
The commonality behind these observations? If all you have is a hammer, everything looks like a nail. As it turns out, though, the above dictums were often limiting or even incorrect. As many investment management firms can attest, assuming normally distributed portfolio returns is a big – and costly – mistake. “Black Swans” and other extreme outcomes are much more common than a Gaussian distribution would predict. The tails of real-world returns are a good deal “fatter” than those of a normal distribution.
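To put a rough number on “fatter,” here is a minimal sketch that compares the chance of a move beyond k standard deviations under a normal distribution with the same chance under a Student-t distribution with 3 degrees of freedom, rescaled to unit variance. The choice of the t distribution is my own illustrative stand-in for a fat-tailed return model, not something taken from a particular portfolio.

```python
import math

def normal_tail(k):
    """P(|Z| > k) for a standard normal variable."""
    return math.erfc(k / math.sqrt(2))

def t3_unit_variance_tail(k):
    """P(|T| > k) where T is Student-t with 3 df, rescaled to variance 1.

    Assumption for illustration only: the t(3) distribution stands in for
    fat-tailed returns. The t(3) CDF has a closed form, and dividing by
    sqrt(3) (its standard deviation) simplifies the expression to this.
    """
    cdf = 0.5 + (k / (1 + k * k) + math.atan(k)) / math.pi
    return 2 * (1 - cdf)

for k in (2, 4, 6):
    n_tail = normal_tail(k)
    t_tail = t3_unit_variance_tail(k)
    print(f"beyond {k} sd: normal={n_tail:.2e}  t(3)={t_tail:.2e}  "
          f"ratio={t_tail / n_tail:.1f}")
```

At two standard deviations the two models roughly agree; at four, the fat-tailed model makes an extreme move on the order of a hundred times more likely, and the gap keeps widening from there.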
The state-of-the-art stepwise regression I learned from the guy who wrote the book is now almost completely discredited. Today's techniques are much more guarded, relying on algorithms that “shrink” significant coefficients toward zero, and on a less credulous train/tune/test division of data for model validation.
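A toy sketch of both ideas, under loud assumptions: the data are simulated, the model is a one-predictor ridge regression with no intercept (chosen only because its shrunken slope has a one-line formula), and the penalty values are arbitrary. It is meant to show the shape of the workflow, not any particular package's method.

```python
import random

def ridge_slope(xs, ys, lam):
    """No-intercept, one-predictor ridge slope: sum(xy) / (sum(x^2) + lam).

    lam = 0 gives ordinary least squares; larger lam shrinks toward zero.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def mse(xs, ys, slope):
    """Mean squared prediction error of y = slope * x."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Simulated data (hypothetical): true slope 0.5 plus unit-variance noise.
rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(90)]
ys = [0.5 * x + rng.gauss(0, 1) for x in xs]

# Three-way division: fit on train, pick the penalty on tune, report on test.
train, tune, test = slice(0, 30), slice(30, 60), slice(60, 90)

lambdas = [0.0, 1.0, 10.0, 100.0]
slopes = {lam: ridge_slope(xs[train], ys[train], lam) for lam in lambdas}
best_lam = min(lambdas, key=lambda lam: mse(xs[tune], ys[tune], slopes[lam]))
test_error = mse(xs[test], ys[test], slopes[best_lam])

for lam in lambdas:
    print(f"lambda={lam:>5}: slope={slopes[lam]:.3f}")
print(f"chosen lambda={best_lam}, held-out test MSE={test_error:.3f}")
```

The point of the third slice is that the test error is computed on data that played no part in either fitting the slope or choosing the penalty, which is exactly what testing on the training data fails to do.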
Perhaps the biggest changes in statistics over the last 30 years, though, have come from the ascent of computing and the emergence of computational statistics as a serious branch of inquiry. The work of distinguished Stanford statistician and bootstrap originator Brad Efron has helped drive the evolution of statistics from primarily a mathematical discipline to one with a central focus on computation.
In an interview two years ago, Efron opined: “Computation is now the third leg of a statistical triangle that also includes applications and mathematics … At this time, the bootstrap is a standard tool in the statistician’s arsenal, opening up predictions, standard errors, confidence intervals, etc. to ready computation – allowing practitioners to bypass the often-arduous mathematics. It provides for immediate statistical gratification and for quick communication. The computer does the theory; the statistician needn’t worry about the math.” Much more computer savvy than mathematically sophisticated, I'm a poster child for using simulation and Monte Carlo techniques, such as the bootstrap, permutation tests and cross-validation, to solve applied problems in both probability and statistics.
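For readers who have not seen the bootstrap in action, a minimal sketch follows. It resamples a small data set with replacement, recomputes the median each time, and reads a standard error and an approximate 95% percentile interval straight off the resampled medians. The data values, the number of resamples and the function name are all made up for illustration.

```python
import random
import statistics

def bootstrap(sample, stat=statistics.median, n_boot=2000, seed=0):
    """Bootstrap standard error and approximate 95% percentile interval."""
    rng = random.Random(seed)
    n = len(sample)
    # Each replicate: draw n values with replacement, recompute the statistic.
    estimates = sorted(stat(rng.choices(sample, k=n)) for _ in range(n_boot))
    std_error = statistics.stdev(estimates)
    lower = estimates[int(0.025 * n_boot)]
    upper = estimates[int(0.975 * n_boot) - 1]
    return std_error, (lower, upper)

# Hypothetical monthly returns in percent, including one extreme month.
returns = [1.2, -0.4, 0.8, 2.1, -1.5, 0.3, 0.9, -7.8, 1.7, 0.5, -0.2, 1.1]

se, (lo, hi) = bootstrap(returns)
print(f"median={statistics.median(returns):.2f}  "
      f"bootstrap SE={se:.2f}  95% interval=({lo:.2f}, {hi:.2f})")
```

There is no textbook formula for the standard error of a median that is as easy to apply as this loop, which is the “immediate statistical gratification” Efron describes.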
In Part 9 of “The Flaw of Averages,” author Sam Savage presents the probability management approach to risk modeling, replacing “Steam Era” statistics with “recent advances in both computation and data storage ... to the point that probability distributions … may now be manipulated like everyday numbers.” (Read part 1 of my Flaw of Averages column here.)
Common starting points for both probability management and computational statistics are statistical graphics packages available today that would make “Exploratory Data Analysis” author John Tukey proud. I would counter Savage's illustration of good graphics from SAS JMP software with the lattice package of the R project and the live visualizations from Spotfire. I'd also supplement study of guru Edward Tufte with the more practical approaches of William Cleveland.
The fundamental enabler of probability management is what “The Flaw of Averages” (FOA) calls interactive simulation “in which … thousands of Monte Carlo trials occur simultaneously when some input to the model is changed … Tied to computer graphics, these applications are providing Mindles that would make John Tukey proud.” Interactive simulation is “the new lightbulb for illuminating uncertainty.” I would add that simulation is now at the heart of many applied statistical techniques. And the current platform that, for Savage, best supports interactive simulation? None other than Microsoft Excel with add-on packages such as Risk Solver.
If interactive simulation is the lightbulb of probability management (PM), then scenario libraries are the power grid. The strength of scenario libraries lies in SLURP – scenario library unit with relationships preserved – where the simulation of the sum equals the sum of the simulations. FOA describes an example of a bank with two divisions, Real Estate Investment and Home Loans, both exposed to the housing market (HM). An analyst would first generate a distribution of HM conditions, a stochastic information packet or SIP, through a Monte Carlo simulation. She would then “feed” that distribution to each of the two divisions, resulting in two output SIPs.
“The final step is just to add the output SIPs of the two divisions together to create the SIP of total profit … In this example, we consolidated the distributions across two divisions and then rolled up the sum to headquarters.” The difference between the rolled up versus independent division assessment of risk: “the chance of losing money is not 1 in 18 after all, but 1 in 3, six times greater!”
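The mechanics of that roll-up can be sketched in a few lines. Everything specific here is hypothetical: the housing-market SIP is a simple standard normal index, and the two profit functions are invented linear responses, so the resulting risk figures will not match the 1 in 18 and 1 in 3 quoted from the book. What the sketch does show is the structure: both divisions consume the same trials, the output SIPs are added trial by trial, and shuffling one SIP (which breaks the preserved relationship) makes the bank look safer than it is.

```python
import random

N_TRIALS = 10_000

def housing_market_sip(n, rng):
    """Input SIP: n simulated housing-market conditions (hypothetical index)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def real_estate_profit(hm):
    """Invented profit response of the Real Estate Investment division."""
    return 1.0 + 1.0 * hm

def home_loans_profit(hm):
    """Invented profit response of the Home Loans division."""
    return 2.0 + 1.0 * hm

def chance_of_loss(sip):
    """Fraction of trials in which profit is negative."""
    return sum(1 for profit in sip if profit < 0) / len(sip)

rng = random.Random(42)
hm_sip = housing_market_sip(N_TRIALS, rng)

# Feed the SAME trials to both divisions, so trial i means the same
# housing market for each of them (relationships preserved).
re_sip = [real_estate_profit(h) for h in hm_sip]
hl_sip = [home_loans_profit(h) for h in hm_sip]

# Roll-up: the SIP of total profit is the trial-by-trial sum.
total_sip = [a + b for a, b in zip(re_sip, hl_sip)]

# Treating the divisions as independent: shuffle one SIP before adding.
shuffled_hl = rng.sample(hl_sip, len(hl_sip))
independent_total = [a + b for a, b in zip(re_sip, shuffled_hl)]

coherent_risk = chance_of_loss(total_sip)
independent_risk = chance_of_loss(independent_total)
print(f"chance of loss, relationships preserved: {coherent_risk:.3f}")
print(f"chance of loss, divisions independent:   {independent_risk:.3f}")
```

Because both divisions suffer in the same bad housing markets, the coherent roll-up shows a noticeably higher chance of loss than the shuffled version, which is the same direction of error the book's example warns about.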
While interactive simulation and scenario libraries provide the foundation for PM, one problem remains: how to store the voluminous simulation/computations for easy subsequent access. The answer for Savage is Distribution Strings (DIST), “which encapsulates the 1,000 or even 10,000 numbers in a SIP into a single data element.” The challenge for a new data structure like a DIST is, of course, development of a standard that is recognized industrywide. Fortunately, that development is underway, with DIST 1.0 established in July 2008, supported by, among others, Microsoft, SAS and Oracle. And software implementations based on DISTs are now available from vendors such as Risk Solver, Analytica, Crystal Ball and JMP from SAS. I'm just beginning to investigate a package for incorporating DISTs into R.
The potential for a methodology like PM with DISTs to revolutionize risk management seems very high. “We are seeing increasing interest in enterprise wide risk simulation … Companies are realizing that backward-looking analysis of historical data is insufficient on its own to drive company strategy. They are asking instead for forward-looking planning and analysis through simulation.” It appears that most of the tools needed to support that effort are available now.