My company, Inquidia Consulting, is currently engaged in or completing several predictive analytics (PA) and data science (DS) projects. While we distinguish PA from DS, there's often no hard dividing line between the two with our customers. Indeed, though we demur, some now consider data science to be any application of statistical methods to business problems.
For Inquidia, both PA and DS generally involve statistics and machine learning of some sort, often “climaxing” with predictive models trained and validated on existing data. The ultimate goal is to deploy the models to make go-forward predictions in a business process.
Inquidia's PA work is usually more narrowly focused than its DS cousin, as often as not a particular modeling task with the relevant data identified in advance for a relatively short-term project. And the PA customer may suggest “theories” on what the final models might look like for us to test. R, Python and SAS are our preferred PA platforms.
DS projects, in contrast, are more comprehensive but nebulous, with substantial computation, data integration, and wrangling; big (and perhaps unstructured) data; and exploration challenges that precede theorizing and subsequent modeling. In many cases, DS work is shaped more by data programming than by modeling. The Cloud, Redshift, Hadoop/Impala, Spark, R and Python are Inquidia's usual-suspect DS platforms.
The difference between statistics or predictive analytics and data science in the business world is not unlike the difference in the comparable Master's curricula that have sprouted up over the last 10 years. Especially in the early days, many PA programs were computationally light. Now the norm is a more rigorous data and programming focus to complement the standard BI, statistics and predictive modeling coursework. In fact, DS is often replacing PA as the go-to moniker for new programs, just as it is for our business customers. I'm all for this evolution, and am on record as willing to trade two math/stat courses for one computational course in Inquidia's university hires.
The distinction between PA and DS is well noted by eminent Columbia statistician Andrew Gelman: “There’s so much that goes on with data that is about computing, not statistics. I do think it would be fair to consider statistics (which includes sampling, experimental design, and data collection as well as data analysis (which itself includes model building, visualization, and model checking as well as inference)) as a subset of data science… To put it another way: you can do tech without statistics but you can’t do it without coding and databases. But in recent years, lots of tech companies have made use of statistical methods (including various statistical ideas that have been developed in the computer science literature). So, from the industry perspective, the new part of data science is the statistics. Statistics is the least important part of data science, hence it is the part most recently added, hence it is the part that is getting the most attention right now.”
My involvement with Inquidia's recent PA and DS work has prompted several observations that may be obvious but are nonetheless worth sharing.
“It's the data” is a bromide that fits PA as well as DS. Yes, PA data is generally better curated than DS data, but that doesn't mean it's correct. Inquidia insists on reviewing the scripts that produce the PA data sets and almost always finds problems. If there are no errors in feature construction, there may well be problems with selection filters. And it always pays to confirm there's a grouping of variables that uniquely identifies each record in the modeling data sets. Duplicate records are not the modeler's friend.
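The uniqueness check above is a one-liner in pandas. A minimal sketch, using a toy data set with hypothetical column names (in practice the frame would come from the data-prep scripts under review):

```python
import pandas as pd

# Toy modeling data set; column names are hypothetical illustrations.
df = pd.DataFrame({
    "customer_id": [101, 101, 102, 103, 103],
    "snapshot_month": ["2016-01", "2016-02", "2016-01", "2016-01", "2016-01"],
    "spend": [50.0, 60.0, 20.0, 75.0, 75.0],
})

# Candidate key: the grouping of variables that should uniquely
# identify each record in the modeling data set.
key = ["customer_id", "snapshot_month"]

# keep=False flags every row whose key combination appears more than once.
dupes = df[df.duplicated(subset=key, keep=False)]
print(f"{len(dupes)} rows share a duplicated key")
print(dupes)
```

If `dupes` is non-empty, the data-prep scripts need another look before any modeling begins.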
Algorithms are important, but thoughtfully constructed features (independent variables) are more so. Invest in feature formulation. Look for plausible “causal” hypotheses linking such predictors to the dependent variables. Also, strive for data designs that function like randomization to eliminate competing hypotheses. The more your data collection resembles an experiment, the better. At a minimum, stress control of feature variation, with a wide distribution of values.
Don't obsess over the latest and greatest algorithms too early in the modeling process. For supervised learning, ordinary linear and logistic regression work just fine at the beginning. Both are quite flexible and implemented efficiently in most statistical packages. And while you may see predictive lift from, say, gradient boosting over basic regression, it likely won't be a dramatic difference. Your effort may well be better spent constructing superior variables.
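The regression-first advice is easy to act on. A minimal sketch, assuming scikit-learn and a synthetic classification data set purely for illustration, that cross-validates a logistic regression baseline alongside a gradient boosting model:

```python
# Compare a logistic regression baseline against gradient boosting.
# The data are synthetic; on real projects, swap in the modeling data set.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "gbm": GradientBoostingClassifier(random_state=0),
}

# 5-fold cross-validated AUC for each model.
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {scores[name]:.3f}")
```

If the boosted model's AUC is only marginally higher than the baseline's, the bigger payoff is usually in the features, not the algorithm.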