While recently conducting my periodic review of the Social Science Statistics Blog from Harvard’s Institute for Quantitative Social Science, I came across several intriguing presentations, including one entitled “Detecting Novel Bivariate Relationships in Large Data Sets” by Ph.D./M.D. student David Reshef.
In the brief abstract, Reshef introduces the maximal information coefficient (MIC) as a measure of covariation between two variables that captures “a wide range of associations … and assigns similar scores to relationships with similar noise levels.” He adds that “MIC belongs to a larger class of maximal information-based nonparametric exploration (MINE) statistics for identifying and classifying relationships.”
The SSSB directed me to the MINE website which, in turn, led to the seminal “Detecting …” article in Science, accessible on the Web. That article describes the motivation for the work as the existence of data sets with “hundreds of variables, which may contain important, undiscovered relationships. There are tens of thousands of variable pairs – far too many to examine manually. If you do not already know what kinds of relationships to search for, how do you efficiently identify the important ones? … One way to begin exploring a large data set is to search for pairs of variables that are closely associated. To do this, we could calculate some measure of dependence for each pair, rank the pairs by their scores, and examine the top-scoring pairs.”
The MIC technique purportedly does precisely that, born from the idea “that if a relationship exists between two variables, then a grid can be drawn on the scatterplot of the two variables that partitions the data to encapsulate that relationship.” MIC, in turn, gives rise to MINE statistics that “can be used not only to identify interesting associations, but also to characterize them according to properties such as nonlinearity and monotonicity.”
My statistical appetite whetted, I revisited the MINE website and downloaded the data sets referenced in the paper along with Java and R implementation code. After getting the scripts to work on my notebook, I ran through the baseball data discussed in the paper, then exercised the engine on several small, “usual suspect” R data sets. Output from R MINE execution at this point is simply a text file detailing the sorted ordering of MIC coefficients ranging from 0 (no signal) to 1 (perfect signal). Also noted are correlation, non-linearity and non-monotonicity figures, as well as a few additional MINE calculations. In my experiments, the results were consistent with findings from other predictive models. The features MINE ranked highly were indeed important, while the non-linearities suggested confirmed earlier analyses.
I next attempted to “trick” the algorithm, feeding it a data set generated by functions of random variables that tied the identified response to trigonometric, logarithmic, power and exponential features. Again, MINE was up to the challenge, calling out my ruse and correctly denoting the strongest relationships while differentiating non-linear and non-monotonic components. Impressive.
I was disappointed, though, when I submitted for analysis a random sample of 100,000 records from my PUMS census data. While the algorithm correctly identified the important predictors of salary, the computations took over 8 minutes. Worse, the run times appear to be linearly related to the number of variables and cases. I had to kill a job that attempted to process 1,000,000 records. Detecting novel bivariate relationships? Yes. In large data sets? Not yet.
I guess I shouldn’t have been surprised. The R MINE code at this point is proof of concept only. I’m sure someone connected to the research will, in time, build a full-fledged R package that’ll robustly interface with the R environment and address the computational complexity of MIC/MINE algorithms. Performance against data having both a large number of records and many features must be enhanced for MINE to be truly “large data set” friendly. Perhaps code optimization with C++ would be of significant benefit.
Based on testing to date, I think MINE can play a role in my evolving predictive modeling methodology, able to calibrate the strength of relationships between outcomes and potential features. I also like that it can partition the identified relationships among linear, non-linear and monotonic components. What it doesn’t do, however, is provide a means to predict new responses. So MINE must be coupled with a full-blown modeling package like general additive models for a complete solution.
I plan on testing MIC/MINE in a future predictive modeling exercise that isn’t automated by regularization techniques. If I’m partitioning a data into training, tuning and testing subsets, I’ll take a sample of maybe 20,000 records from training and run the MIC procedure to calculate statistics of each feature with the designated responses. Those computations should provide a preliminary feel for feature importance as well as indicate which where I must acknowledge non-linearities. The findings on feature performance would then indicate the choice of model type and feature set as a starting point for subsequent training, tuning and testing.