- There is no "best" tool; no single tool is best for everyone.
- The most useful tools are those that best facilitate the greatest number of tasks in the kinds of data mining applications you need to perform.
Major Data Mining Tasks
In the past, data mining tool development has focused primarily on providing powerful analytical algorithms. However, the analytical "engines" handle only a small part of the complete task load in a data mining project. As most data miners know, 70 to 90 percent of a data mining project is consumed by data preparation, yet tools for data preparation have taken a backseat in most data mining tool evolution. Finally, you must be able to evaluate models properly in order to compare them and commend the best one to marketing staff.
Data Preparation Tasks
Common data preparation tasks include:
- Data assessment to determine:
- Missing values (blanks, spaces, nulls)
- Outlier values
- Collinearity assessment (related to correlations between predictor variables)
- Frequencies of multiple codes in a given variable
- Merging multiple datasets
- Mapping metadata (field names and types) from various input formats into a common format for analysis
- Transforming contents of similar variables into a common format
- Changing data types from numerical to categorical (by binning and classification) and from categorical to numerical for use with algorithms that have specific input requirements
- Splitting variable codes into separate fields, and combining multiple fields into a single field
- Deriving new variables from existing variables. Most data miners discover that some of the most predictive variables are those that they derive themselves.
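A few of the tasks above can be sketched in plain Python. This is a minimal illustration, not a production pipeline; the record layout and field names (age, region, spend) are hypothetical.

```python
# Hypothetical customer records; field names are illustrative only.
records = [
    {"age": 25,   "region": "N",  "spend": 120.0},
    {"age": 40,   "region": "S",  "spend": 300.0},
    {"age": None, "region": "S",  "spend": 80.0},
    {"age": 62,   "region": None, "spend": 500.0},
]

# Data assessment: count missing values (None) per field.
missing = {f: sum(1 for r in records if r[f] is None)
           for f in ("age", "region", "spend")}

def age_band(age):
    """Numerical -> categorical: bin age into three groups."""
    if age is None:
        return "unknown"
    if age < 30:
        return "young"
    if age < 50:
        return "middle"
    return "senior"

for r in records:
    # Transform a numeric variable into a categorical one by binning.
    r["age_band"] = age_band(r["age"])
    # Derive a new variable from existing variables.
    r["spend_per_year"] = r["spend"] / r["age"] if r["age"] else None
```

The derived `spend_per_year` ratio stands in for the kind of analyst-created variable that often turns out to be among the most predictive.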
Most data mining tool sets only "minor" on these important data mining tasks. This evaluation will "major" on the ability of common data mining tools to facilitate these tasks.
In addition to facilitating these data preparation tasks, a good data mining tool for direct marketing should include tools for evaluating the models created by the modeling exercise.
Model Evaluation Tools
In analytical theory, the best model is one that has the greatest accuracy in predicting all classification states of the target variable and is acceptably robust in its ability to perform well on the validation data set. That means we must consider the combined accuracy of predicting responders and nonresponders. This approach is called the Global Accuracy method. Most data mining tools use this method to identify the "best" model. However, there is a "fly" in this ointment. Embedded in the theory behind the Global Accuracy evaluation method is the assumption that the costs of all types of classification errors are the same. This approach works well in the classroom, but it does not work well in CRM data mining operations, particularly those that drive direct mail (DM) campaigns. In fact, this is one of the major reasons why many CRM initiatives to support DM campaigns have failed to produce much business value in the past. Models have been evaluated largely on a basis that is only partly relevant to the only things marketers care about: maximizing positive customer response and minimizing the cost of doing so. Most data mining tools focus on the combined accuracy of prediction but ignore the cost element entirely.
In DM campaigns, the cost of mailing to a prospect that does not respond (referred to as a "false-positive" error) is rather small; but the potential cost of not mailing to a prospect that would have responded ("false-negative" error) can be rather large (reflected in the lifetime value of membership fees not paid and other services not purchased). This means that DM model evaluation methods should focus on minimizing the false-negative errors, rather than the false-positive errors. Because marketers care only about response rates and costs, a mailing to the top three deciles that hits 60 percent of the responders is likely to satisfy both concerns. Mailing to the non-responders (false-positive errors) in the top three deciles is an acceptable cost to the direct marketer for the sake of contacting 60 percent of the total responders available in the target area. This situation represents a 100 percent lift over random expectation and is much more cost-effective than a mass mailing approach.
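The cost asymmetry can be made concrete with a small sketch. The unit costs below are assumed for illustration only; real values come from the campaign's economics. The point is that two models with identical global accuracy can have very different business costs.

```python
# Hypothetical unit costs; real values come from campaign economics.
COST_FP = 0.50    # mailing to a non-responder (false positive), assumed
COST_FN = 50.00   # lifetime value lost by skipping a responder (false negative), assumed

def campaign_cost(actual, predicted):
    """Total misclassification cost of a list of mail/no-mail decisions."""
    cost = 0.0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 0:        # false positive: wasted mail piece
            cost += COST_FP
        elif p == 0 and a == 1:      # false negative: missed responder
            cost += COST_FN
    return cost

# Two models with identical global accuracy (four of six correct)...
actual  = [1, 1, 0, 0, 0, 0]
model_a = [1, 0, 1, 0, 0, 0]    # one false negative, one false positive
model_b = [1, 1, 1, 1, 0, 0]    # no false negatives, two false positives
# ...but model_b is far cheaper under the asymmetric cost structure.
```

Under these assumed costs, model_a's single missed responder costs 100 times more than one of model_b's wasted mail pieces, even though a global accuracy report would rank the two models as equals.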
Most data mining tools employ the global accuracy method for model evaluation, and you may be forced to accept it when identifying the "best" model through the tool's reporting capabilities. The best model among many built with different algorithms should not be chosen by comparing the accuracy reports of each tool. Rather, evaluation should focus on how well the model clusters the positive responders in the top deciles of a scored list sorted on the prediction probability. Even classification algorithms can output classification probabilities; the actual classification (e.g., 0 or 1) is a highly summarized expression of the classification probability (e.g., <0.5 = 0; ≥0.5 = 1). Here lies much of the true "gold" hidden in the capability set of the tool. The naive CRM data miner will focus on the classification and its accuracy, but the true "gold" of CRM data mining must be expressed in terms of probabilities for retention, purchase, and new-customer acquisition.
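A minimal illustration of the point (the probabilities are hypothetical): the hard 0/1 classification is just a thresholded summary of the probability, while sorting on the probability preserves the ranking that a scored list needs.

```python
# Hypothetical predicted response probabilities for five prospects.
probs = [0.91, 0.35, 0.62, 0.08, 0.77]

# The hard classification summarizes each probability at a 0.5 cutoff...
labels = [1 if p >= 0.5 else 0 for p in probs]

# ...while sorting prospects by probability preserves the full ranking
# needed for decile analysis (prospect indices, best first).
ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
```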
A cumulative lift table (e.g., Table 1) must be inspected to determine how effective the model is in clustering true positives in the upper deciles. This table can be created as follows:
- Sort the prediction probabilities in descending order.
- Divide the sorted list into 10 equal segments (deciles).
- Count the number of actual hits (actual responders in the modeling dataset) in each decile.
- Calculate the random expectation per decile by dividing the total number of actual responders by 10; that is, 10 percent of the total responders are expected in each decile. If the number of hits in a decile exceeds the random expectation, the model provides a lift in that decile (over random expectation).
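The steps above can be sketched in a few lines of Python. The scored list here is synthetic and seeded for reproducibility: responses are generated so that higher predicted probabilities really do respond more often, which is what a useful model produces.

```python
import random

random.seed(7)

# Synthetic scored list of (predicted probability, actual response).
# Illustrative data only: responses are drawn so that higher predicted
# probabilities respond more often.
n = 1000
scored = []
for _ in range(n):
    p = random.random()
    actual = 1 if random.random() < p else 0
    scored.append((p, actual))

# Step 1: sort by predicted probability, descending.
scored.sort(key=lambda row: row[0], reverse=True)

# Step 2: divide the sorted list into 10 deciles.
decile_size = n // 10

# Step 3: count actual hits in each decile.
hits_per_decile = [
    sum(actual for _, actual in scored[d * decile_size:(d + 1) * decile_size])
    for d in range(10)
]

# Step 4: random expectation per decile = total responders / 10.
total_hits = sum(hits_per_decile)
expected = total_hits / 10
for d, hits in enumerate(hits_per_decile, 1):
    print(f"decile {d:2d}: hits={hits:3d}  lift={hits / expected:.2f}")
```

A lift above 1.00 in the upper deciles indicates the model is concentrating responders where the mailing budget will actually be spent.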