Differences Between Statistics and Data Mining
From a business perspective, it doesn't really matter what you call it: statistics, data mining or predictive analytics. Competitive advantage comes from making better decisions faster and more confidently.
A deceptively simple question triggers lively debate among analytical professionals: What is the difference between statistics and data mining?
Wikipedia defines statistics as, "A mathematical science pertaining to collection, analysis, interpretation and presentation of data." Statistics is used to draw valid conclusions and make reasonable decisions on the basis of such analysis. Wikipedia further states that predictive analytics encompasses a variety of statistical techniques that process current and historical data in order to make predictions about future events.
I contend that data mining is a form of predictive analytics that uses a variety of techniques to explore massive amounts of data to identify relationships between hundreds of data elements - relationships that could not be uncovered through simple queries or reports. Data mining methodologies overlap with those in analytical disciplines such as statistics (simulation, principal components, Bayesian methods), forecasting (regression, time-series analysis) and operations research (clustering, neural networks, genetic algorithms).
Problems such as predicting customer behavior, identifying fraud and optimizing goods in a supply chain often require a combination of analytical disciplines, business knowledge and data management expertise to solve.
Where is Data Mining Being Used in Business Today?
Data mining has had its broadest success in the area of modeling customer behavior. Data mining techniques can be used to measure customer profitability, predict churn and acquisition rates, and model acquisition costs.
Leading retail firms use data mining to profile stores and merchandise to better align their customers' purchasing patterns with store inventory. Banks and telecommunications firms are targeting customers for additional products and services. Specialized data mining models are used by many financial institutions to grant or deny credit to applicants. These businesses benefit from more responsive and targeted interactions with customers and, ultimately, from higher profits and reduced risk.
Online retailer 1-800-Flowers.com uses a data-driven decision-making process for managing customer relationships. Collecting data at all customer contact points, the company turns that data into knowledge for understanding and anticipating customer behavior, meeting customer needs, building more profitable customer relationships and gaining a holistic view of a customer's lifetime value.
In order to increase response rates and identify profitable customers, 1-800-Flowers.com relies on data mining technologies to discover trends, explain outcomes and predict results. Because the company is able to access better customer information, it has reduced the amount of time it needs to spend on the phone with its customers.
CIO Enzo Micali views information technology as an invaluable element of 1-800-Flowers.com's corporate success. The company has a multitiered information delivery framework that puts strategic information directly into the hands of business users. Micali explains, "The decision process for CRM [customer relationship management] permeates our entire organization - on the back end gathering data from multiple operational systems and on the front end using the data to make better, more reliable decisions. Customer data, accessible through our company intranet, can be securely viewed at many different levels, including departmental views, which present data for unique divisional needs and common views which show a general snapshot of customers, including order history and household data across the whole family of our brands."
Why are Businesses Turning to Predictive Analytics?
Two primary drivers have emerged: competitive advantage and compliance. Businesses need to be more nimble in reacting to changes in their environment, and many believe that a data-driven decision-making process that includes predictive analytics will enable high-quality, consistent, repeatable and auditable decisions.
Professor Tom Davenport, director of research for the School of Executive Education at Babson College, recently published the results of a research study titled "Competing on Analytics," based on discussions with C-level executives and directors at more than 30 industry-leading and globally competitive organizations. "The net takeaway of the study is this: The ability to make business decisions based on tightly focused, fact-based analysis is emerging as a measurable competitive edge in the global economy," Davenport says. "Organizations that fail to invest in the proper analytic technologies will be unable to compete in a fact-based [data-driven] business environment."
What is Data-Driven Decision-Making?
Data-driven decision-making is a process that requires collaboration and a variety of skills across all levels of the enterprise. Predictive modeling is only a small piece of the process. A large piece of the process revolves around the data: data acquisition, data quality, data manipulation and data distribution. In fact, good decisions cannot be made without reliable, high-quality data.
The Data Story
From my perspective, the most difficult part of the process is the mathematical formulation of a model that describes the problem you are trying to solve. Often, the analytic methods used will depend on what data is available. Working together, IT and the analytic teams need to identify where the data resides within the organization and what format it is in (relational data tables, spreadsheets, enterprise resource planning [ERP] systems). Are there multiple instances of the data that don't match? Is the data complete? Is additional data from external sources (demographic or socioeconomic data) needed?
Figure 1: Steps to Data-Driven Decision-Making
Exploratory data analysis on a subset of the data, together with an examination of the metadata, is a common practice for understanding the data. Summary statistics and visualization can be key methods for identifying anomalies that need to be addressed before a more in-depth modeling exercise. Data may need to be converted or transformed for use in predictive modeling. Measurement data may need to be standardized. Individual transactions may need to be summarized into new variables representing rates, counts or indicators. Data may need to be reformatted from product or transactional data into customer-focused data. Assumptions about the underlying distribution of the data need to be tested for statistical validity.
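The transformations described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the transaction records, the field names (order_count, avg_amount, return_rate, has_return) and the helper functions are all hypothetical, and only the standard library is used. It rolls transaction rows up into one customer-focused record with a count, a rate and an indicator, then standardizes a measurement as a z-score.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical transaction records: (customer_id, amount, was_returned)
transactions = [
    ("c1", 20.0, False),
    ("c1", 40.0, True),
    ("c2", 10.0, False),
    ("c2", 30.0, False),
    ("c2", 80.0, False),
]


def summarize_by_customer(rows):
    """Roll transaction rows up into one record per customer."""
    grouped = defaultdict(list)
    for customer_id, amount, was_returned in rows:
        grouped[customer_id].append((amount, was_returned))

    summary = {}
    for customer_id, items in grouped.items():
        amounts = [amount for amount, _ in items]
        returns = sum(1 for _, returned in items if returned)
        summary[customer_id] = {
            "order_count": len(items),            # count
            "avg_amount": mean(amounts),          # measurement
            "return_rate": returns / len(items),  # rate
            "has_return": returns > 0,            # indicator
        }
    return summary


def standardize(values):
    """Convert values to z-scores (mean 0, population std. deviation 1)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: no spread to scale by.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


customers = summarize_by_customer(transactions)
ids = sorted(customers)
z_scores = standardize([customers[c]["avg_amount"] for c in ids])
for customer_id, z in zip(ids, z_scores):
    customers[customer_id]["avg_amount_z"] = z
```

In practice this step would usually run inside a database or a data preparation tool against far larger volumes; the point here is only the shape of the output, one row per customer rather than one row per transaction.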
In the predictive modeling phase, trade-offs need to be considered between the speed of modeling, the accuracy of the model and how easily the model is understood. Business users need to trust the results of the analysis, regardless of their knowledge of analytical methods. Many software packages provide only a few simple methods with limited options, while others provide a wide variety. In general, more flexible modeling strategies lead to better predictions, which in turn can improve bottom-line revenue.
No single method works best in all cases. One widely accepted strategy is to try all the most common modeling methods (decision trees, regression and neural networks) and compare them to determine the best model. A common criterion for evaluation is a comparison of the expected profits or losses to actual profits or losses obtained from model results. This criterion enables you to make cross-model comparisons and assessments independent of all other factors.
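A profit-based comparison of this kind might look like the following sketch. Everything specific in it is an assumption made for illustration: the profit and cost figures, the holdout outcomes and the three sets of model scores are invented, and the rule "contact a customer only when expected profit is positive" is one simple decision policy among many. For each model it totals the profit the model expected and the profit actually realized on the same holdout customers.

```python
# Hypothetical profit matrix for a marketing offer: contacting a customer
# who responds earns 50, contacting one who does not costs 5, and not
# contacting a customer earns or costs nothing.
PROFIT_IF_RESPONDS = 50.0
COST_IF_NO_RESPONSE = -5.0


def expected_profit(p_respond):
    """Expected profit of contacting a customer with response probability p."""
    return p_respond * PROFIT_IF_RESPONDS + (1 - p_respond) * COST_IF_NO_RESPONSE


def evaluate(scores, outcomes):
    """Contact only customers with positive expected profit.

    Returns (total expected profit, total actual profit) over the
    customers the model chose to contact.
    """
    expected_total = 0.0
    actual_total = 0.0
    for p_respond, responded in zip(scores, outcomes):
        ep = expected_profit(p_respond)
        if ep > 0:
            expected_total += ep
            actual_total += PROFIT_IF_RESPONDS if responded else COST_IF_NO_RESPONSE
    return expected_total, actual_total


# Hypothetical holdout data: did each of five customers actually respond?
outcomes = [True, False, False, True, False]

# Hypothetical predicted response probabilities from three candidate models.
model_scores = {
    "decision_tree": [0.8, 0.2, 0.05, 0.6, 0.05],
    "regression": [0.5, 0.5, 0.5, 0.5, 0.5],
    "neural_network": [0.9, 0.05, 0.05, 0.9, 0.05],
}

results = {name: evaluate(scores, outcomes) for name, scores in model_scores.items()}

# Pick the model with the highest actual profit on the holdout data.
best_model = max(results, key=lambda name: results[name][1])

for name, (expected, actual) in results.items():
    print(f"{name}: expected={expected:.1f} actual={actual:.1f}")
print("best:", best_model)
```

Because every model is scored in the same currency, the comparison does not depend on how each method measures fit internally. A large gap between a model's expected and actual totals is itself a warning sign that its probabilities are poorly calibrated.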
Delivering the output from the best model to the business user is a key consideration for IT staff. Output from the models can be sophisticated or simple. Output may be fed programmatically into real-time systems, such as database engines, message queues or Web services, triggering real-time alerts or product recommendation offers to call center staff. Alternatively, a set of reports (documents, spreadsheets or presentations) could be generated either statically or dynamically on demand in a Web portal or a dashboard. Ultimately, the information needs to be accessible where and when it is needed, in a context relevant to the decision-maker.
Data-driven decision-making can be used throughout the enterprise to model customer, supplier and operational processes. The models are corporate assets that may have significant financial impact, particularly in the areas of marketing, risk assessment and operations. They must be continually assessed and validated for their accuracy over time.
IT staff will be tasked with managing the data and models throughout the lifecycle (development, test/stage, deploy, track, retire), including version control and change management for audit reporting purposes.
Storing model packages with their metadata allows automated model scheduling, including exception reports and model tracking reports. A common metadata repository provides the ability to perform impact analysis - to analyze and evaluate changes in data definitions or model specifications across the organization before an actual change breaks existing applications.
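Impact analysis over a metadata repository amounts to walking a dependency graph. The sketch below assumes a deliberately simplified repository, a dictionary mapping each data element or model to the items that consume it; the element, model and application names are hypothetical, and a real repository would hold far richer metadata. Given a proposed change, it lists everything downstream that should be reviewed before the change is made.

```python
from collections import deque

# Hypothetical metadata repository: each key is consumed by the items
# listed as its value (data element -> models -> applications).
DEPENDENTS = {
    "customer.income": ["churn_model_v2", "credit_model_v1"],
    "customer.tenure": ["churn_model_v2"],
    "churn_model_v2": ["retention_dashboard", "call_center_alerts"],
    "credit_model_v1": ["credit_decision_service"],
}


def impact_of(changed_item, dependents=DEPENDENTS):
    """Return every downstream item affected by a change, breadth-first."""
    seen = set()
    affected = []
    queue = deque([changed_item])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


# Which models and applications need review if the definition of
# customer.income changes?
income_impact = impact_of("customer.income")
print(income_impact)
```

The same traversal works in the other direction if the mapping is inverted, which answers the audit question of which data elements a given report or decision ultimately relies on.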
Of course, decision-making is an ongoing cycle. Information gleaned from one iteration of the cycle should be fed back into the process to make it better the next time.
Data mining and statistics are powerful tools that enable organizations to make more structured, repeatable decisions. The decision-making process begins with data access, data exploration and transformation, followed by predictive modeling. The process concludes with the delivery of information to decision-makers throughout the enterprise, enabling them to take action. From a business perspective, it doesn't really matter what you call it: statistics, data mining or predictive analytics. Competitive advantage comes from making better decisions faster and more confidently.