One of the big changes in OpenBI's business over our eight years of BI and analytics consulting has been the evolution of “reporting”. In 2006, print-like reports were the norm, with OLAP and visualization secondary. Now OLAP and graphical dashboards predominate. Indeed, many customers have pretty much abandoned traditional “pixel-perfect”, control-break reporting in favor of visual drill/slice-and-dice, dashboards and statistical graphs.
Nobody's happier about that development than me. I love exploring data with Tableau and generating statistical graphics with R's lattice and ggplot2 packages. There are advantages and disadvantages to both Tableau and R: Tableau's easier to use and is much more visually appealing; lattice/ggplot2's programmability and links to R functions offer significant statistical benefits for R analysts. And of course there's the Tableau-R integration.
Recently I've been deploying scatter plot matrices in both Tableau and R. Sploms are collections of two-way scatters among three or more variables. Consider the simple Tableau illustration in Figure 1. The three variables in the matrix, I3000, I3000G and I3000V, represent daily percent changes in three Russell 3000 stock indexes over a period of almost 19 years. The matrix is symmetric: the upper right portion is simply a reflection of the lower left. The upper left to lower right diagonal depicts the perfect straight-line scatters of the variables with themselves. The diagonal and half of the matrix could thus be removed with no loss of information. What are the scatters saying? In investment speak, I3000V and I3000G appear less correlated than the other pairs, so perhaps including the two in a portfolio would provide diversification.
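To make the symmetry point concrete, here is a small base R sketch. The three series are simulated stand-ins that I've named after the indexes in Figure 1; they are not the actual Russell data, and the noise levels are arbitrary assumptions.

```r
# Simulated stand-ins for the three index series (NOT the real Russell data).
set.seed(42)
n <- 500
market <- rnorm(n)
returns <- data.frame(
  I3000  = market + rnorm(n, sd = 0.2),
  I3000G = market + rnorm(n, sd = 0.6),
  I3000V = market + rnorm(n, sd = 0.6)
)

cors <- cor(returns)

# The correlation matrix mirrors the splom: symmetric, with ones on the diagonal.
is_sym   <- isSymmetric(cors)
diag_one <- isTRUE(all.equal(unname(diag(cors)), rep(1, ncol(returns))))

# So one triangle carries all of the pairwise information.
unique_pairs <- cors[lower.tri(cors)]

# Base R can draw just the lower half of the scatter plot matrix.
pairs(returns, upper.panel = NULL, pch = ".")
```

With p variables there are only choose(p, 2) distinct pairs, which is why dropping one triangle loses nothing.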
Those who've read OpenBI's recent “Data Science and NCAA Bracketology Part 2” blog were also exposed to a splom programmed with the R lattice graphics package. In this visual, we investigate all two-way scatters of nine different college basketball ratings indexes used to rank teams. As Figure 2 illustrates, those indexes are highly correlated, with tight, positively sloped scatters.
This bracketology graph is, however, pretty far removed from the default splom. lattice provides the capability, through “panel” functions, of re-programming individual scatters. In this case, we eliminate the redundant upper half of the matrix and replace the uninformative self-scatter diagonal with useful density plots that show the distributions of individual indexes. In addition, we add a “smoother” curve to each scatter and report the correlation coefficients as well. We've been told the visual works well: lots of information, but consumable and not overly complex.
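For readers who haven't worked with lattice panel functions, here is a minimal sketch of the idea. It is not the code behind Figure 2: the data are four numeric columns from R's built-in mtcars set standing in for the ratings indexes, and the function names scatter_panel and density_diag are my own for illustration.

```r
library(lattice)

# Stand-in data: four numeric columns from the built-in mtcars data set.
ratings <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Off-diagonal panel: points, a loess smoother, and the correlation in a corner.
scatter_panel <- function(x, y, ...) {
  panel.splom(x, y, ...)
  panel.loess(x, y, col = "red")
  lims <- current.panel.limits()
  panel.text(lims$xlim[1] + 0.05 * diff(lims$xlim),
             lims$ylim[2] - 0.08 * diff(lims$ylim),
             labels = sprintf("r = %.2f", cor(x, y)),
             adj = c(0, 1), cex = 0.8)
}

# Diagonal panel: a density curve rescaled to fit the panel, plus the usual label.
density_diag <- function(x, ...) {
  ylim <- current.panel.limits()$ylim
  d <- density(x, na.rm = TRUE)
  scaled <- ylim[1] + 0.9 * diff(ylim) * d$y / max(d$y)
  panel.lines(d$x, scaled)
  diag.panel.splom(x, ...)
}

p <- splom(~ratings,
           lower.panel = scatter_panel,
           upper.panel = function(...) NULL,  # leave one triangle empty
           diag.panel  = density_diag,
           pscales     = 0)                   # suppress axis ticks on the diagonal
print(p)
```

The extra arguments to splom are handed on to lattice's panel.pairs, which is what lets each triangle and the diagonal be drawn by a different function.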
One early response to the blog noted that a variant of our splom could have been produced with less programming using R's ggplot2 and GGally packages. I love the layering metaphor of ggplot2 and was indeed able to produce a very similar result with less code, though I was stumped on managing the size of labels. It's nice having two highly functional R graphics packages to work with.
By its nature, the scatter plot matrix is used to show relationships among numeric variables. What happens when one or more of the attributes is categorical, or a factor in R terms? GGally provides the capability to handle factor variables directly, though not necessarily the way I'd like to see them, while lattice must be programmed.
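As a rough illustration of the GGally route, the sketch below runs ggpairs on the built-in iris data, which has four numeric columns and one factor. The particular panel choices are mine for demonstration, and it assumes the GGally package is installed; it is not the plot I produced for the bracketology data.

```r
library(GGally)

# iris mixes four numeric columns with one factor (Species).
# "continuous" applies to numeric-numeric cells, "combo" to numeric-factor
# cells, and "discrete" to factor cells.
p <- ggpairs(
  iris,
  lower = list(continuous = "smooth", combo = "facethist"),
  upper = list(continuous = "cor", combo = "box_no_facet"),
  diag  = list(continuous = "densityDiag", discrete = "barDiag")
)
p
```

Entries left out of each list fall back to the package defaults, so a bare ggpairs(iris) is enough to see how GGally treats the factor out of the box.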
So I've decided to build an R-based splom function (either in lattice or GGally or both) that handles both numeric and factor variables simultaneously. Done well, this could be a very handy exploration tool.
Numeric-numeric scatters will look much like those in Figure 2. My thinking now is that for factor-factor pairs, I'll use mosaic plots for crosstabs, with frequency dot plots along the diagonal. Factor-numeric pairs will perhaps be represented with strip plots.
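The plan boils down to choosing a panel type from the types of the two variables in each cell. Here is a small base R sketch of that dispatch; panel_plan is a hypothetical helper of my own, it only labels the cells rather than drawing anything, and it assumes any non-numeric column should be treated as a factor.

```r
# Hypothetical helper: label the panel type each cell of a mixed splom would get.
# Assumption: every non-numeric column is treated as a factor.
panel_plan <- function(data) {
  is_num <- vapply(data, is.numeric, logical(1))
  p <- ncol(data)
  plan <- matrix("", p, p, dimnames = list(names(data), names(data)))
  for (i in seq_len(p)) {
    for (j in seq_len(p)) {
      plan[i, j] <- if (i == j) {
        # Diagonal: density for numerics, frequency dot plot for factors.
        if (is_num[i]) "density" else "dotplot"
      } else if (is_num[i] && is_num[j]) {
        "scatter"
      } else if (!is_num[i] && !is_num[j]) {
        "mosaic"
      } else {
        "strip"
      }
    }
  }
  plan
}

# Demo data: two numeric and two factor columns built from mtcars.
demo <- data.frame(
  mpg  = mtcars$mpg,
  wt   = mtcars$wt,
  cyl  = factor(mtcars$cyl),
  gear = factor(mtcars$gear)
)
plan <- panel_plan(demo)
print(plan)
```

A real implementation would map each label to a lattice or GGally panel function, but the lookup itself stays this simple.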
I'll write a follow-up blog when I have something to show.