I recently received an email from a young man who played on the little league baseball team I coached 12 years ago. Mike was one of my favorites, and we always catch up when we happen to meet in town. He must have a great memory: the last time I saw him, he recalled that I'd told the team way back when that I did quant work for a living.

Mike’s just now completing his master’s in a science field, putting the finishing touches on a capstone project paper. He wanted to know whether I’d be willing to review it and tell him if the work was up to standard. Flattered and determined to stay relevant with Millennials, I immediately agreed.

The next morning in my email was a note introducing the included Word doc and, to my surprise, a spreadsheet with the underlying data. The 25-page paper was nicely done and well written, following an outline of executive summary, introduction, review of literature, hypotheses, description of data, methods and analytics, statistical results and conclusions. The data set consisted of 15 variables and about 1500 cases. The statistical results section showcased a wealth of means, standard deviations and correlation coefficients, along with several multiple regression models with asterisks denoting “significant” findings. I’d have been proud to submit a paper of this quality when I was in grad school.

After I finished reading, I just had to take a look at the data, so I loaded it into both Tableau and R and began poking around. What a difference powerful visualization tools like these make for researchers! In just minutes I was able to start making some sense of a topic and data set that were completely foreign.

Once I’d built the R data frame, I randomly sub-divided it into “exploratory” and “test” subsets of about 1000 and 500 cases respectively. Using R’s lattice graphics, I then generated histograms for all categorical variables and combination histogram/density plots for the numeric variables in the exploratory subset. From there, I looked at density/histograms broken down by categorical variables, seeking evidence of “by group” interactions.
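
As a rough illustration of that kind of random split, here is a minimal sketch in Python rather than R. The function name `split_rows`, the seed, and the row ids standing in for the roughly 1500 cases are all assumptions made for the example, not the code actually used on Mike’s data.

```python
import random

def split_rows(rows, n_explore, seed=0):
    """Randomly partition rows into an exploratory subset and a test subset."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(rows)          # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)
    return shuffled[:n_explore], shuffled[n_explore:]

# Hypothetical stand-in for the data set: one integer id per case.
rows = list(range(1500))
explore, test = split_rows(rows, n_explore=1000, seed=42)
print(len(explore), len(test))
```

The point of fixing the split up front is that the test subset stays untouched while patterns are hunted for in the exploratory subset.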

Next up were scatter plots of numeric variables, first simple and then by group. The trellis and grouping features of lattice make these more complicated graphs easy to generate. Throw in the capability to fit local polynomial surfaces, and you have the makings of powerful statistical visualizations. One telling attempt at the end involved no fewer than five variables, but was comprehensible due to judicious use of color, trellises and fitted curves. The patterns I thought I found with the exploratory data set I pretty much confirmed with the test set.

What the visualizations suggested to me is that the features Mike’s analyses hypothesized and tested as significant were indeed so, but that some of the relationships might in fact be non-linear and there may well be interaction effects in the data not captured in the regression models. And I suspect that, because the models were trained and tested on the same data, they might be somewhat overfit, with the actual relationships not as strong as those reported.

These observations are in no way a critique of Mike’s research. My assessment is that he’s done an outstanding job on the capstone project, producing methodical research, careful analysis and a well-written document communicating his work. I’d be willing to bet Mike dutifully followed the script of how to do research articulated by his graduate program. And I’d be shocked if he didn’t get an “A” in the class. In the end, I’m gratified there’s another talented data-driver set to join the work force!

I can’t help, however, but think about this in the context of the debate between theory-driven and statistical learning approaches to predictive modeling that I reported on recently. It just seems that trying to make a priori theoretical sense of the relationships of over a dozen attributes to an outcome is too complicated an undertaking. For observational analyses like this one, perhaps a conservatively implemented statistical learning approach that listens to the data might be a better choice. Rather than the analyst, let the data do the theorizing.

Count me unapologetically in the data-driven, statistical learning camp. Especially for non-experimental analyses, my take is to use theory to suggest features of interest, but deploy powerful graphical techniques and SL methodologies like those presented in Applied Predictive Modeling to home in on relationship details. Rigorously confirm the analyses by testing the findings via cross-validation and held-out data. If, later on, the opportunity presents itself to implement a theory-driven experiment to demonstrate cause and effect, so much the better.
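
To make the cross-validation idea concrete, here is a small sketch in Python, under loud assumptions: the data are synthetic, the model is a one-predictor least-squares line, and the helper names (`fit_line`, `mse`, `k_fold_mse`) are invented for the example. It compares the error of a model scored on the same data it was fit to against the average error on folds held out from fitting.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def mse(model, xs, ys):
    """Mean squared prediction error of a fitted line on (xs, ys)."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def k_fold_mse(xs, ys, k=5, seed=0):
    """Average MSE over k folds, each scored by a model fit without it."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k disjoint folds covering all cases
    scores = []
    for fold in folds:
        held = set(fold)
        train = [i for i in idx if i not in held]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        scores.append(mse(model, [xs[i] for i in fold], [ys[i] for i in fold]))
    return sum(scores) / k

# Synthetic data (an assumption for illustration): a linear signal plus noise.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

in_sample = mse(fit_line(xs, ys), xs, ys)   # fit and scored on the same data
cv = k_fold_mse(xs, ys, k=5)                # scored only on held-out folds
print(f"in-sample MSE: {in_sample:.3f}, 5-fold CV MSE: {cv:.3f}")
```

With a single predictor the gap between the two numbers is small; it tends to widen as more variables, interactions and non-linear terms are added, which is when an in-sample score flatters the model most.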