A few days ago, I ran across an interesting blog post titled “The Big Data Brain Drain: Why Science is in Trouble”, by Jake Vanderplas, a post-doctoral fellow in Astronomy at the University of Washington.
Vanderplas’s thesis is that scientific research is increasingly about the analytics of big data, and that “the skills required to be a successful scientific researcher are increasingly indistinguishable from the skills required to be successful in industry.” Moreover, “The unfortunate result is that some of the most promising upcoming researchers are finding no place for themselves in the academic community, while the for-profit world of industry stands by with deep pockets and open arms.”
Vanderplas argues that “in a wide array of academic fields, the ability to effectively process data is superseding more classical modes of research.” The challenge for students working on their PhDs, accordingly, is to become computationally competent, just as the challenge for academia is to accommodate new-age analytics. To do so, academia must change its reward structure.
The author proposes modifications to the academic career model consistent with the enhanced focus on big data and computation. He starts with an obsession with reproducibility, whose point of departure is open, documented code. He then proposes changes to traditional tenure-track evaluation criteria to include the development of cross-disciplinary scientific software tools. Finally, Vanderplas promotes an increase in post-doctoral fellowship pay, so that newly minted PhDs are not easy prey for a business world increasingly turning to data science.
My take is that Vanderplas is spot on about the increasing use of big data sets and sophisticated computation in university research. It’s certainly true in the social sciences, my old stomping ground. Gary King, Professor of Government and Director of the Institute for Quantitative Social Science at Harvard, is comfortable tagging much of the institute’s current work as data science. And a group of prestigious academics has coined the term “computational social science” to connote the evolving big data methodology in their discipline. The same goes for the physical sciences, which are well ahead of their social science brethren in computation and big data analytics.
OpenBI’s 2013 college recruiting provides additional affirmation of the accelerating academic emphasis on big data and computation, much to our delight. Just four years ago, I voiced my frustration with the absence of computational skills in many of the students I met, even those with strong math and engineering backgrounds. I’d gladly trade a third year of mathematical analysis, I then mused, for semester courses in programming and numerical methods.
This year, however, almost everyone I spoke with had at least some programming and computation background, whether acquired at university or self-taught. And analytics was high on everyone’s ideal job description list, even for those majoring in theoretical math, the physical sciences and engineering.
In addition, I interviewed several science PhD candidates who, confirming Vanderplas’s observations, developed computational skills doing research for their dissertations. These students became enamored with data and analytics, deciding to pursue data science work in the business world as a trade up from low-wage post-doc positions. Academia’s loss is the business world’s gain!
Not all are happy with the big data/computation direction of scientific research. Some traditionalists scoff at the “haphazard” data mining that’s seemingly supplanting the rigorous, top-down scientific method, in many cases replacing traditional statistical methods with “mindless” machine learning algorithms. Science, they argue, should be top-down and hypothesis-driven, not bottom-up mining.
I don’t buy it. Yes, the new methods often consist of big data set exploration. But with a solid collection design and rigorous cross-validation or train/tune/test data partitioning, the problems of overfitting models and making non-reproducible causal claims can be minimized. The “exploratory” findings can be subjected to more rigorous experimental or quasi-experimental testing in subsequent research. Today’s bottom-up is the foundation for tomorrow’s top-down.
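For readers who haven’t seen the partitioning mentioned above in practice, here is a minimal sketch in plain Python. It is only an illustration, not anything taken from Vanderplas’s post: the function names, the 60/20/20 proportions and the fixed seed are all my own assumptions. The first function makes a single train/tune/test split; the second produces k-fold cross-validation index sets in which every observation is held out exactly once.

```python
import random


def train_tune_test_split(rows, tune_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle rows once, then cut them into train, tune and test partitions.

    The fractions and the seed are illustrative defaults, not recommendations.
    """
    if not 0 < tune_frac + test_frac < 1:
        raise ValueError("tune_frac + test_frac must be between 0 and 1")
    shuffled = list(rows)
    # A seeded generator makes the split reproducible from run to run.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_tune = int(len(shuffled) * tune_frac)
    test = shuffled[:n_test]
    tune = shuffled[n_test:n_test + n_tune]
    train = shuffled[n_test + n_tune:]
    return train, tune, test


def k_fold_indices(n, k=5, seed=0):
    """Yield (train_idx, held_out_idx) pairs for k-fold cross-validation."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [idx[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, held_out


if __name__ == "__main__":
    train, tune, test = train_tune_test_split(range(100))
    print(len(train), len(tune), len(test))
    for train_idx, held_out in k_fold_indices(10, k=5):
        print(sorted(held_out))
```

The discipline matters more than the code: models are fit on the training partition, choices among models are made on the tuning partition, and the test partition is touched only once, at the end, so the reported performance is not contaminated by the exploration that preceded it.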