It's time for the OpenBI Forum (Forum) to catch our breath for a holiday column after two high-energy discussions with Gary King, even as we anticipate a follow-up interview in the winter months. Appropriate for an end-of-year article is a summary of our accomplishments in 2006 and plans for 2007.
The Forum's initial charter was really one without boundaries - a blank slate for chronicling the BI world. We promised to discuss BI in light of business, information technology, quantitative methods, social science, philosophy and even portfolio management. The Forum wished to provide something for both techs and quants as well as for business and management, with both theory and practical applications. For our work in 2006, I would give us a grade of incomplete - a nice start but an unresolved finish.
The Forum has written on open source BI (not very surprising, given our company focus!), the R statistical package and several applications of R's graphics capabilities. We wrote a tongue-in-cheek column on performance management (Gary Cokins needn't fret) and paid homage to exploratory data analysis (EDA), the precursor of much of the best of BI. And we ended the year with the first two installments in a three-part interview series with Gary King of Harvard on BI and the quantitative social sciences.
A theme common to every column is that OpenBI Forum is not on a traditional business intelligence (BI) thought leadership track. Having been a part of the BI community for more years than I care to acknowledge, I'm a big fan of many of BI's thought leaders, almost incredulous at how they can continue to deliver such consistent high quality - column after column, book after book, year after year. The Forum feels, however, that with the explosive growth of BI has come a bit of stagnancy in much of the "literature" that proliferates in the many BI channels, creating an "echo chamber" effect of increasing sameness. At the same time, the growing importance of BI for business performance has never been more uniformly confirmed. The OpenBI Forum thinks there's untapped potential in many of the "cousin" disciplines of BI, especially the academic worlds of open source, statistics, business, quantitative social science, computer science, management science, decision science, information science, etc., that can advance BI further and faster than it moves today. Our goal in 2007 is to help expand the reach of BI to this outside world, finding innovative ideas that spring from a cross-discipline perspective.
The wisdom that Gary King offers BI is, the Forum believes, a great example of such outreach. A major focus of Gary's work is on methodologies, particularly quantitative ones, for social science research. These techniques help to ensure the validity of study designs so that researchers can be confident in their assertions and findings. Using methods developed by Gary and his colleagues makes it easier to prove or disprove hypotheses and theories about human behavior. Techniques such as randomized field experiments, the systematic treatment of missing data, statistical handling of ecological inference problems arising from group data, the statistical treatment of rare events and knowledgeable survey designs help enhance the quality and validity of research.
The explosion in the use of quantitative techniques for decision-making noted by Gary is, of course, just as beneficial for business. After all, business is a social endeavor, and most of BI focuses on predicting and understanding behavior. Certainly, the validity of intelligence findings is just as pertinent for BI as it is for academic study. How often have we ignored the problems of missing information in our data marts, simply making the dangerous assumption that cases with missing data look like those without? How often have we constructed customer surveys that may, in fact, have been badly flawed, providing us with spurious information? How often have we missed opportunities to conduct randomized experiments, the platinum standard of designs? In the end, BI analysts are searching for the same thing as their academic cousins: confidence in the validity of their findings predicting human behavior.
The OpenBI Forum closes out 2006 the same way it started - with a few graphs. The Forum looks forward to getting started again in early 2007!
Time as an Ecological Group
In our November interview, Gary King discussed the ecological inference problem and the difficulties it can bring to analysis. An ecological fallacy occurs when a relationship between variables at a group level is imputed to the individuals within the groups - erroneously. Gary gives the example of using ZIP code sales information as the foundation for a marketing campaign. Given an association of lucrative sales in high income ZIP codes, a marketing organization might conclude that residents with higher incomes were the buyers and should be targeted, when in fact it is the lower income families within the ZIP code that are the prime buying candidates. Since the marketers have buying data at only the ZIP level, they cannot with certainty infer behavior to individuals, despite the correlation of strong sales with high income ZIP codes. The challenge for marketers in situations like this is analyzing information at a group level with statistical methods that can sidestep some of these problems and produce the best individual targets for their scarce campaign dollars.
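A toy illustration of the ZIP code example may help. The sketch below is written in Python rather than the R used for the Forum's graphs, and every figure in it is invented: the ZIP labels, household incomes and purchase counts are hypothetical. It shows how a strongly positive relationship between income and sales at the ZIP level can coexist with a negative relationship inside every ZIP.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: (household income in $000s, purchases) per ZIP code.
households = {
    "ZIP A": [(30, 2), (35, 2), (40, 1), (45, 1)],
    "ZIP B": [(50, 5), (60, 4), (70, 2), (80, 1)],
    "ZIP C": [(70, 12), (90, 8), (150, 1), (170, 1)],
}

# Group level: what a marketer holding only ZIP-level figures would see.
zip_mean_income = [sum(i for i, _ in h) / len(h) for h in households.values()]
zip_total_sales = [sum(s for _, s in h) for h in households.values()]
group_r = pearson(zip_mean_income, zip_total_sales)

# Individual level: the income/purchase relationship inside each ZIP.
within_r = {
    z: pearson([i for i, _ in h], [s for _, s in h])
    for z, h in households.items()
}

print(f"ZIP-level correlation of income and sales: {group_r:.2f}")
for z, r in within_r.items():
    print(f"{z}: household-level correlation {r:.2f}")
```

In this made-up data the ZIP-level correlation is strongly positive while the correlation within each ZIP is negative. The sketch only demonstrates that the fallacy is possible; it is not the ecological inference method Gary describes.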
Stock market investors may well be guilty of similar mistaken thinking in their buy/sell behavior, with time as the key grouping variable. With information on the performance of portfolios over five- and 10-year periods, investors sometimes mistakenly conclude that all sub-periods behave similarly and project this into the future, when in fact the performance can look quite different. The time periods surrounding the Internet bubble burst of March 2000 offer a provocative illustration. Consider first the graph in Figure 1, produced with the R lattice package. The panels in the graph depict the growth of $1 invested from July 22, 1993 through December 12, 2006 with different company size and value portfolios. The graphs are ordered left to right by company size; within each panel are color-coded "growth," "neutral" and "value" portfolio performance sketches. Note the bubble in the center of each, especially the larger growth portfolios. Note also how growth companies dominated performance in the first half of each panel, then gave way to value in the second half. Similarly, it is apparent that larger companies had more success early in the time period, while smaller companies had better returns later on. With this sequence of graphs, however, the magnitude of difference in performance over time is not readily apparent.
Now look at Figure 2, which divides the performance into two equal time frames, one before the bubble burst and the other after - thus showing time itself as a grouping variable. These separate panels show the contrast in performance far more starkly than the panels in Figure 1. The top row clearly shows the market euphoria of the mid to late nineties and the dominance of both large and growth companies in that period. The second row offers a sobering reminder that times change in the market. Four of the growth portfolios are still under water almost seven years later. Along with much more modest returns in the second time frame is a reversion that has value and small company performance in the lead. Finally, the impact of the return differences over the two time periods is multiplicative, not additive. The lesson for investors is that time itself may lead to fallacious reasoning on portfolio performance: what appears true for a given time period may in fact have very different looks in sub-periods.
A cautionary tale.
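The point that sub-period returns combine multiplicatively rather than additively can be made concrete with a little arithmetic. The sketch below is in Python, not the R behind the figures, and the two returns are invented for illustration; they are not taken from the portfolios in Figure 2. The function name `growth_of_one_dollar` is likewise just a label chosen here.

```python
from math import prod

def growth_of_one_dollar(period_returns):
    """Compound a sequence of period returns (0.10 means +10%) from $1."""
    return prod(1.0 + r for r in period_returns)

# Hypothetical sub-period returns, not the portfolios in Figure 2.
before_burst = 2.00   # +200% in the first time frame
after_burst = -0.40   # -40% in the second time frame

compounded = growth_of_one_dollar([before_burst, after_burst])
naive_sum = 1.0 + before_burst + after_burst  # wrong: treats returns as additive

print(f"compounded value of $1: {compounded:.2f}")
print(f"additive (incorrect) value: {naive_sum:.2f}")
```

With these assumed numbers, $1 grows to $3.00 and then loses 40 percent of that larger amount, ending at about $1.80 rather than the $2.60 that simple addition would suggest. A loss in the second period is applied to everything accumulated in the first.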
Which Major College Football Conference is Best - Academically?
I came home one Saturday afternoon in late August to find my son and a cadre of his 10th grade jock buddies "discussing" the new college football season. Each was actively promoting his own chosen team(s), while denigrating those of the others. And if and when it became clear that one choice dominated another, the response was generally one of, "What do you expect? That school has such low academic standards, they admit anyone with a pulse!" In a particularly trashy exchange, one of the kids asked me to adjudicate - which was the better school? Rather than engage in a no-win proposition, I gave the group the 2007 US News America's Best Colleges guide and let them settle the issue themselves.
About two months later, my daughter asked if I could help in a little data analysis project to complete the stats module for her math class. The students were working on descriptive statistics - mean, median, percentiles, etc. - and had used stem and leaf plots as well as box and whiskers to help summarize data. Stats guy and sports fan that I am, I recommended they take data from the latest Colleges guide and compare major football conferences on the academic ratings of member schools using a graphical summarization technique for their project.
We first decided on the relevant football conferences to compare, settling on the ACC, Big East, Big 10, Big 12, Pac 10 and SEC. Over the past 50+ years, all Division I champions with the exception of Notre Dame for several years and Brigham Young for one year came from schools in these conferences. Of course the composition of the conferences changes often, so we locked in on what they look like today. Also, the Big East has different members for basketball than for football: Notre Dame, Georgetown, Villanova, Marquette, Seton Hall, DePaul and Providence are basketball members but not football members, and thus are not part of this analysis.
We used the Peer Assessment Scores (1-5 scale) of each school as detailed in US News (though many have little confidence in these numbers) as a measure of school quality, and sought to summarize conference "performance" in a meaningful way. We decided on a modified dimensional box plot to tell our story. Figure 3 shows the graph, also a lattice developed in R. (Passionate college football readers: please don't shoot the messenger!)
Conferences are presented from top to bottom based on their median peer ratings. Within each conference box plot "cartridge" is a well-summarized distribution of scores. Look, for example, at the Big 10. The dot represents the conference mean, the vertical blue line, the median. Surrounding the median is the first rectangle that shows 25 percent of the distribution - the 37.5th through 62.5th percentiles. The second inclusive rectangle, known as the interquartile range, holds 50 percent of the ratings, from the 25th through the 75th percentiles. Similarly the third rectangle includes percentiles 12.5 through 87.5, the fourth includes 5 through 95, and the final line includes the entire range of observations, 0 through 100. While we like the range values presented here, the procedure is flexible, allowing the analyst to choose ranges based on needs. There's also an option to overlay individual values within the charts, but we felt that would clutter the presentation in this case.
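For readers who want to see the numbers behind such a plot, the nested rectangles described above are simply pairs of percentiles. The sketch below is in Python, not the R lattice code used for Figure 3, and it relies on several assumptions: the scores are invented for a hypothetical conference, the helper names `percentile` and `nested_ranges` are chosen here, and percentiles are computed by linear interpolation, which may differ slightly from the rule the plotting software applies.

```python
from statistics import mean, median

def percentile(values, p):
    """Percentile p (0-100) by linear interpolation between sorted values."""
    ordered = sorted(values)
    rank = (p / 100.0) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (rank - lower) * (ordered[upper] - ordered[lower])

# Central coverage (percent) -> (low percentile, high percentile), as in the text.
BANDS = {25: (37.5, 62.5), 50: (25, 75), 75: (12.5, 87.5), 90: (5, 95), 100: (0, 100)}

def nested_ranges(values, bands=BANDS):
    """Return {coverage: (low value, high value)} for each central band."""
    return {
        cover: (percentile(values, lo), percentile(values, hi))
        for cover, (lo, hi) in bands.items()
    }

# Invented peer scores for one hypothetical conference (1-5 scale).
scores = [2.6, 2.9, 3.1, 3.3, 3.4, 3.6, 3.7, 3.9, 4.1, 4.4, 4.7]

print(f"mean {mean(scores):.2f}, median {median(scores):.2f}")
for cover, (low, high) in nested_ranges(scores).items():
    print(f"central {cover}%: {low:.2f} to {high:.2f}")
```

Passing a different `bands` dictionary changes the ranges drawn, which mirrors the flexibility noted above: the analyst chooses the coverage levels that suit the question.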
In a comparison of the Big 10 and Pac 10, it's clear that the Pac 10 presents a much wider range of ratings than the Big 10, both high and low. Indeed, a distinguishing characteristic of the Big 10 compared with the other conferences is the small range of ratings coupled with a relatively high minimum. Though one could argue over different ways of determining which conference is "best" - median, top five, lowest quartile, etc. - the data are summarized very elegantly for both within and between conference comparisons and can support a variety of different perspectives.
While this application is admittedly somewhat frivolous, it illustrates the power of a graph that both neatly summarizes data and allows for cross-group comparison. Add in the ability to generate such graphs "dimensioned" by values in yet other variables, and you can begin to see the opportunities for BI. In 2007, the OpenBI Forum will present real customer applications of such visuals.
Have a Happy Holiday!