Solving a problem with data science means more than just making a chart. It means identifying a real business problem, wrangling the relevant data, finding the right model, and delivering actionable insights to decision makers.
Let’s consider a workforce management example. Increasingly, organizations are using advanced analytics to optimize their workforces, inspecting historical data to drive employee engagement and productivity.
For instance, consider an organization trying to prevent timecard fraud. Specifically, they want to catch employees or managers who might be editing timecards to manipulate overtime pay. While some timecard manipulations might be easy to catch, others are the result of subtle changes that can be lost in a sea of hundreds of thousands of legitimate transactions.
This example is a classic “needle in a haystack” problem, but we can solve it by carefully working through the steps below.
Identify the business problem and wrangle the data
The first step in a data science project is to identify the business problem to be solved and the data that can help solve it. Remember, “data exploration” is an important part of the process, but it’s very rarely an end in itself. Instead, focus on the business goal. In our case, that goal is “Find departments with timecard fraud”.
Now we need to identify the actionable insights we want to produce. Decision makers need insights that “tip the scales” and point to a specific course of action, not just “cool stats”. A data scientist must reach out to domain experts and decision makers to find out what kinds of data and insights actually affect their decisions. For instance, in our case the experts might tell us that they usually look for repeated evidence of fraud.
Finally, we need to make sure we have all the data we need and the technology to process it. In our example we need the audit trail and the Big Data infrastructure to wrangle and process it. Once these challenges are resolved, the modeling phase can begin to turn raw data into insights.
Choose your model(s)
When choosing a model, it is usually best to proceed hierarchically based on the business requirements and properties of the models.
At the top level, is the problem a fit for sophisticated machine learning models or simpler statistical approaches? If it is machine learning, do you have a specific outcome of interest (like fraudulent transactions) already marked in the data? If so, you can use “supervised” learning techniques; otherwise, you might need to rely on “unsupervised” methods.
Drilling down in this manner ensures you’re choosing a model aligned with both your business goals and data constraints.
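To make the supervised versus unsupervised decision concrete, here is a minimal sketch in Python. It assumes hypothetical timecard records stored as dictionaries with an `is_fraud` field that may be missing. The field name, the function names, and the z-score threshold are illustrative assumptions, and the z-score check is only one simple unsupervised option, not a recommended fraud model.

```python
from statistics import mean, pstdev


def choose_approach(records, label_key="is_fraud"):
    """Return 'supervised' only if every record carries a label.

    `label_key` is a hypothetical field name used for illustration.
    """
    if records and all(r.get(label_key) is not None for r in records):
        return "supervised"
    return "unsupervised"


def zscore_outliers(values, threshold=2.0):
    """Simple unsupervised fallback: indexes of values far from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical counts of timecard edits per manager in one week.
edits_per_manager = [2, 3, 2, 4, 3, 2, 3, 30]
suspicious = zscore_outliers(edits_per_manager)
```

If the audit trail has no confirmed fraud cases marked, a check like `zscore_outliers` at least surfaces the managers whose edit counts look unusual, which domain experts can then review.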
Finally, test many different models on the actual data. Only data, not intuition, can validate a model. When you see the output, you need to ask yourself two questions:
• How accurate and precise is my model?
• Is the model producing actionable intelligence?
Standard measures of model fit and validation can give you quantitative answers to the first question, but keeping the business problem in mind is important when interpreting these numbers. In cases of fraud detection, even a 10% increase in prediction accuracy might mean millions of dollars in savings.
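As an illustration of the quantitative side, the sketch below computes accuracy, precision, and recall from lists of actual and predicted fraud flags. The function names and the sample flags are hypothetical. The article does not say which measures the author used, so treat these three as common choices rather than the prescribed ones.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for boolean fraud flags."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fp = sum(1 for a, p in pairs if not a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    tn = sum(1 for a, p in pairs if not a and not p)
    return tp, fp, fn, tn


def summarize(actual, predicted):
    """Return accuracy, precision, and recall, guarding against empty denominators."""
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical labels: True means the timecard was fraudulent / was flagged.
actual = [True, False, False, True, False]
predicted = [True, True, False, False, False]
scores = summarize(actual, predicted)
```

Because fraud is rare, a model that flags nothing can still score high on accuracy, so precision and recall are usually worth reading alongside it.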
For the second question, data scientists need to make a more qualitative assessment: Do the results of this model tell me what to do? Are the results enough to “tip the scales” of a decision maker? If so, the model might be right, but now the insights need to be packaged.
Package actionable insights
Sometimes, the outputs of a model may seem actionable to a data scientist, but decision makers would find them hard to interpret. For instance, suppose a model predicts the expected number of fraudulent timecards in a department as well as the expected variance.
Repackaging this information in a form like “There’s an 80 percent chance this department has more than five fraudulent transactions” will make the insight clearer to decision makers.
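One way to do this kind of repackaging is sketched below. It assumes the number of fraudulent timecards is roughly normally distributed around the model’s expected value. That distributional assumption, the function names, and the sample numbers are illustrative and do not come from the example above.

```python
import math


def prob_more_than(threshold, expected, variance):
    """P(X > threshold) under a normal approximation (an assumption)."""
    if variance <= 0:
        # No uncertainty: the outcome is exactly the expected value.
        return 1.0 if expected > threshold else 0.0
    z = (threshold - expected) / math.sqrt(variance)
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))


def describe(department, threshold, expected, variance):
    """Turn a mean and variance into a sentence a decision maker can act on."""
    pct = round(100 * prob_more_than(threshold, expected, variance))
    return (
        f"Chance that {department} has more than {threshold} "
        f"fraudulent transactions: {pct} percent"
    )


# Hypothetical model output for one department.
message = describe("Department A", 5, expected=8.0, variance=4.0)
```

For small counts, a discrete distribution such as the Poisson may fit better than the normal approximation, so the choice of distribution is itself a modeling decision to validate against the data.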
For important strategic decisions, the packaging also needs to allow consumers to explore the data on their own terms. Dashboards and other interactive tools allow users to “slice and dice” data and take more surgical actions. For instance, after finding out a department has a high probability of fraud, decision makers might want to view this trend over different timescales to see if it is a repeated pattern.
All of these phases (identifying the problem, wrangling the data, choosing a model, and packaging the insights) are crucial components of a data science project. By chaining them together, you can create highly valuable data-driven solutions to real business problems.
About the author: Dr. Thomas Walsh is a data scientist with Kronos Incorporated, where he applies machine learning and Big Data techniques to workforce management problems. He received his Ph.D. in computer science from Rutgers University and previously held research positions at MIT and the University of Kansas.