When I first entered the working world 30 years ago, I encountered two roadblocks to my enthusiasm for mathematical and statistical optimization techniques in business. The first was a data management problem. Collecting, integrating, organizing and manipulating data was a thorny, sometimes intractable task that consumed almost all analytical energy. Technology to facilitate data management, both hardware and software, was just beginning to evolve from the mainframe/COBOL/network database paradigm. Machine cycles were rationed, storage was scarce and expensive, programming was low level and data was unreliable. Probably 98 percent of statistical effort revolved around building trustworthy data sets. While data quality issues persist to the present, many of the other hardware and software problems have been solved. In fact, the ascent of mini/micro/personal computers with UNIX and Windows, as well as the emergence of relational databases, should probably be heralded as fundamental enablers of modern business intelligence.
A second impediment was more subtle. I came out of school very excited about the rigorous statistical techniques I'd learned. Linear models, multiple and logistic regression, categorical data analysis, time series models, econometric models, and multivariate techniques such as discriminant analysis and canonical correlation certainly worked well in research settings; why was it so difficult to translate them to the new demands of business? Was there something fundamentally different about analysis in a business setting? Was the rigor of research missing in business? My conclusion was that I was dealing at that point with a different problem domain. The statistical models I'd learned focused on testing or confirming hypotheses, whereas in the business world I needed approaches for developing or formulating hypotheses. Techniques to help discover relationships were missing. Thankfully, help was on the way.
In the late 1970s and early '80s, a new set of methods started to emerge from the statistical world. Championed by John Tukey of Princeton, this collective approach became known as exploratory data analysis (EDA). The differences between EDA and "traditional" statistical inference techniques were quite provocative at the time. Whereas traditional techniques were "top down" - quantitatively testing models built on stringent underlying assumptions about the data - EDA was more "bottom up" - working with data informally and graphically for discovery, with little in the way of preconceptions. Traditional or confirmatory statistics was all about models, equations, estimates, confidence intervals and formal tests; EDA focused on flexible and detailed examination of data, especially visual, with no underlying assumptions and no clearly articulated outputs. Unencumbered by assumptions, EDA techniques discovered patterns through clues in the data, suggesting models for subsequent analysis while uncovering departures from critical model assumptions. Traditional or confirmatory methods stressed evaluating the evidence from tests of statistical models. In a sense the approaches competed, but, more importantly, traditional methods and EDA came to be viewed as reinforcing complements. An approach that combines iterations of both philosophies is now considered de rigueur.
Five themes, the famous "R's," came to distinguish the EDA approach and now showcase EDA's links to BI. First is revelation, the use of basic graphs and displays to examine data. EDA has popularized individual distribution displays such as stem-and-leaf and probability plots, as well as summary graphics such as box-and-whiskers. EDA has also highlighted techniques of combining graphs for enhanced visual appeal. At times EDA and graphical analysis are equated. Though that equation is simplistic, the emphasis on visualization and perception is a significant differentiator of EDA and a major contribution to modern business intelligence (BI).
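As a rough illustration of one of the displays mentioned above, a stem-and-leaf display can be sketched in a few lines of Python. The function name `stem_and_leaf` and the sample `ages` are hypothetical, and the sketch assumes non-negative integers split at the tens digit; it is a minimal example rather than a full implementation.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens and above) to its sorted leaves (units digit).

    Assumes non-negative integers; real displays also handle scaling
    and split stems, which this sketch leaves out.
    """
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        stems[stem].append(leaf)
    return dict(stems)

# Hypothetical sample data, invented for illustration only.
ages = [23, 25, 31, 34, 34, 38, 42, 47, 51]
display = stem_and_leaf(ages)

# Print one row per stem, e.g. "3 | 1 4 4 8".
for stem in sorted(display):
    leaves = " ".join(str(leaf) for leaf in display[stem])
    print(f"{stem} | {leaves}")
```

Each row keeps every raw value visible while still showing the shape of the distribution, which is the point of the display.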
Second is re-expression, the determination of the optimal scale for showing a variable's DNA. A data element might be transformed by logarithm or power functions, for example, to dampen variation, change a multiplicative scale to an additive one, or change an exponential relationship to a linear one. The transformed variable generally behaves more "regularly" than the original, and is often more suitable for specific methods.
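To make the idea of re-expression concrete, here is a small Python sketch using a made-up series that doubles each period. The names `reexpress_log`, `successive_differences` and `sales` are assumptions for illustration. On the raw scale the period-to-period steps keep growing; on the log scale they are constant, so the multiplicative pattern becomes additive.

```python
import math

def reexpress_log(values):
    """Return the natural log of each value (values must be positive)."""
    return [math.log(v) for v in values]

def successive_differences(values):
    """Return the step from each value to the next."""
    return [b - a for a, b in zip(values, values[1:])]

# Hypothetical series that doubles every period (exponential growth).
sales = [100.0, 200.0, 400.0, 800.0, 1600.0]

raw_steps = successive_differences(sales)                 # steps keep growing
log_steps = successive_differences(reexpress_log(sales))  # steps are constant

print(raw_steps)
print(log_steps)
```

Every step on the log scale equals log(2), the log of the growth factor, so a straight line now describes the series.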
Residuals, the differences between actual values and those predicted by a fitted model, reflect EDA's obsession with goodness of fit and the viability of underlying assumptions. An examination of residuals through various displays often "reveals" a very compelling story about the fit and assumptions of a model.
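The sketch below shows the kind of story residuals can tell. It fits an ordinary least-squares line to a small, deliberately curved data set (the data and the helper names `fit_line` and `residuals` are invented for illustration) and then lists the residuals. This is a minimal example under those assumptions, not a substitute for a statistics package.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def residuals(xs, ys, intercept, slope):
    """Actual value minus fitted value for each observation."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Hypothetical data that actually follows a curve (y = x squared).
xs = [0, 1, 2, 3, 4]
ys = [0, 1, 4, 9, 16]

intercept, slope = fit_line(xs, ys)
resid = residuals(xs, ys, intercept, slope)
print(resid)  # positive at both ends, negative in the middle
```

The residuals sum to zero, as they must for a least-squares line, but their U-shaped pattern signals that a straight line is the wrong model for these data.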
Resistance has to do with how methods react to extreme values or outliers in the data. A resistant statistic is little affected by outliers; its value will not change dramatically with the inclusion or exclusion of a few values, even extreme ones. Order statistics such as the median and interquartile range are examples of resistant estimators. The mean, standard deviation, correlation coefficient and other "moment" statistics, on the other hand, are not resistant and can change dramatically with just a few disparate observations. Because data quality is a constant concern in statistics and business intelligence, the use of resistant methods has become an especially high priority for BI.
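A quick way to see resistance in action is to add one bad record to a small sample and compare summaries before and after. The values below are made up for illustration, with `1200` standing in for a hypothetical data-entry error; the functions come from Python's standard `statistics` module.

```python
import statistics

# Hypothetical clean sample, then the same sample with one bad record.
clean = [10, 12, 11, 13, 12]
dirty = clean + [1200]  # assumed data-entry error

mean_clean = statistics.mean(clean)
mean_dirty = statistics.mean(dirty)
median_clean = statistics.median(clean)
median_dirty = statistics.median(dirty)
stdev_clean = statistics.stdev(clean)
stdev_dirty = statistics.stdev(dirty)

print("mean:  ", mean_clean, "->", mean_dirty)
print("median:", median_clean, "->", median_dirty)
print("stdev: ", stdev_clean, "->", stdev_dirty)
```

The median stays at 12 in both cases, while the mean and standard deviation are pulled far from anything typical of the sample by the single extreme value.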
Robustness concerns the sensitivity of statistical models to violations of their underlying assumptions. A statistic or procedure is robust if assumptions can be relaxed without invalidating results. Unfortunately, lack of robustness appears more the rule than the exception with many traditional statistical models. Fortunately, however, in large part because of the visibility engendered by the focus on robustness, newer techniques often come with few restrictive underlying assumptions.
EDA has proven itself a bellwether in the evolution of business intelligence. With its bottom-up emphasis on the R's, EDA should be at least partially credited with many of the techniques that are at the core of today's discovery-focused BI. Indeed, many current methods are exclusively exploratory, with the intent of discerning relationships for subsequent testing - and with no assumptions about the underlying distribution of data. The prevalence of order statistics and quantile distributions in BI is due in large part to the compelling argument for resistant methods. The popularization of robust predictive, clustering, and mining techniques also pays homage to EDA. Perhaps EDA's biggest contribution to BI is its insistence on graphs and visual displays. EDA has introduced a number of raw data and summary displays that have become mainstays of basic data analyses. And, of course, each of these and other EDA graphs can be viewed across perceptually pleasing dimension or panel variables for further insight. These core contributions ensure that EDA's influence on BI will stand the test of time.