As more and more organizations integrate their disparate data sources into "one single source of truth" by leveraging integrated enterprise resource planning (ERP) systems, operational data stores (ODS), data warehouses and federated solutions, it quickly becomes apparent that the anticipated big-bang ROI never materializes. Campaign management applications, supply chain management systems and financial management suites are often layered on top to provide effective environments for the storage, access, semi-analysis and visualization of data. Scorecards and dashboards also provide significant insight into the health of an organization by tracking key performance indicators (KPIs). In reality, however, this is only the first step in understanding, quantifying and managing the business. Much work and effort still needs to be invested in understanding the business drivers and their critical success factors. Predictive analytics, with its portfolio of statistical and data mining models, can play a significant role in identifying and understanding the critical business drivers and root causes that facilitate realization of the big-bang ROI.
In my March column we profiled a portfolio of predictive analytic models that could be leveraged to improve understanding of the business drivers (see Figure 1 in my March column). The business analysis challenges were divided into five distinct categories: classification, clustering, association, estimation and description. This month, I will focus on the classification category and the decision tree models specifically.
Profiling the Decision Tree
Decision trees start with a training set of sample cases where the target field (i.e., good credit risk individuals versus bad credit risk individuals) is predefined. As an example, all the attributes associated with a sample customer case, such as age, income level, house ownership and credit card spending, are analyzed to determine which attributes best differentiate the good credit risks from the bad credit risks. At each branch, decision rules are developed to classify the customer cases into the optimum splits - this process is known as recursive partitioning. When compared to other predictive analytic methods, the real strength of decision trees lies in the capability to easily convert the simple branching heuristics into if-then rules that are easy to generate and understand. This also facilitates the tremendous number of calculations necessary when millions of cases need to be classified and scored. Some of the important components of the decision tree method include:
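To make the if-then conversion concrete, here is a small sketch using scikit-learn (an assumed tooling choice, not something prescribed by the method itself). The credit attributes and training cases are invented for illustration; the point is simply that a fitted tree can be printed back as readable branching rules.

```python
# Hypothetical illustration: fit a small decision tree on synthetic
# credit-risk attributes, then print its branches as if-then style rules.
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic training cases: [age, income, owns_house, card_spending]
X = [[25, 30000, 0, 5000], [45, 80000, 1, 2000],
     [35, 50000, 1, 1000], [22, 20000, 0, 8000],
     [50, 90000, 1, 1500], [28, 25000, 0, 7000]]
y = ["bad", "good", "good", "bad", "good", "bad"]  # predefined target field

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
rules = export_text(
    tree, feature_names=["age", "income", "owns_house", "spending"])
print(rules)  # each indented branch reads as an if-then rule on one attribute
```

Scoring millions of cases then amounts to walking each case down these few comparisons, which is why the rule form scales so well.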
The Topology. The decision tree is a flow chart-like tree structure that starts at a root node and branches down through intermediate nodes to the lowest-level leaf nodes, which represent the separate classes. Each branch represents the outcome of a test (and hence a splitting criterion); branches are created until all cases are split into separate classes. Different algorithms and tree building, pruning and stopping heuristics are employed to optimize the resulting tree model. The various components of a decision tree are illustrated with a credit risk example in Figure 1.
Figure 1: Decision Tree Topology (Credit Rating Example)
The Strategy. The decision tree starts as a single node and branches out until all the heterogeneous (very different) cases have been segmented into more homogeneous (very similar) individual classifications or leaves. Throughout the tree building process, separate decisions are made related to the specific algorithm to be used, the essential heuristics for tree splitting and the appropriate measures for tree pruning. The results for the decision tree model using the training data are applied to the validation data set to optimize model performance using variance reduction and misclassification minimization techniques.
The Algorithms. The development of a decision tree typically leverages one of three different algorithms - CART (Classification and Regression Trees), CHAID (Chi-Squared Automatic Interaction Detector) or C4.5/C5.0. There are several important dimensions that dictate the preference of one algorithm over another. The first consideration is whether the target variables are categorical or continuous in nature. Another important option focuses on the splitting methodology employed during the decision tree building process. This can include either binary or multiway splitting, using such concepts as entropy reduction, information gain and the chi-square test to implement the splits. A more extensive profile of the alternative algorithms is shown in Figure 2.
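The categorical-versus-continuous distinction can be sketched quickly: CART-style trees handle both target types, which scikit-learn exposes as two separate estimators (an assumed illustration; the one-feature data below is synthetic).

```python
# CART handles both target types: a classification tree for a categorical
# target and a regression tree for a continuous one. Synthetic data.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[1], [2], [8], [9]]  # one numeric input attribute
clf = DecisionTreeClassifier().fit(X, ["bad", "bad", "good", "good"])
reg = DecisionTreeRegressor().fit(X, [1.0, 1.2, 8.1, 8.3])

print(clf.predict([[1.5]]))  # categorical target -> a class label
print(reg.predict([[8.5]]))  # continuous target -> a numeric estimate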
Figure 2: Decision Tree Algorithms
Building the Decision Tree
Building a decision tree involves the process of adding branches and leaves to improve the predictive performance of the decision tree for the training sample data. This is an important concept - there is a difference between training samples and validation samples, although both are selected from the same population of data. At the start of the decision tree building process, the total population of case data is divided into two separate data sets - training samples and validation samples. The decision tree rules are first developed on the training set of sample cases by starting at the root node and finding the input variable that does the best job of splitting the cases into branches and leaves from among the available attributes. The branching process can be approached from two alternative scenarios. The first approach grows the full-blown tree by developing formalized branching rules from the attributes and then prunes back to the best explanatory set of attributes. The second approach entails implementing stopping rules during the tree building process. Tree building and its components are discussed in the section below.
The Tree Building Steps. These are the five steps to follow:
- The decision tree starts at a single root node using the training sample data.
- If all the sample cases are of the same class, the node becomes a leaf.
- If the sample cases are in different classes, measures such as entropy reduction and information gain are used to split the sample cases into classes.
- A branch is created for each known value of the test attribute with each case sample partitioned by the splitting rules.
- Steps 2 - 4 are repeated until the stopping rules kick in (see below).
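The five steps above can be sketched as a short recursive routine. This is a minimal illustration, assuming binary splits on numeric attributes and entropy reduction as the splitting measure; production algorithms such as CART or C4.5 add many refinements.

```python
# Minimal sketch of recursive partitioning: entropy-based binary splits
# on numeric attributes, repeated until a stopping rule kicks in.
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def build(rows, labels, depth=0, max_depth=3):
    # Step 2: if all cases share one class (or depth hits the preset
    # limit), the node becomes a leaf holding the majority class.
    if len(set(labels)) == 1 or depth >= max_depth:
        return Counter(labels).most_common(1)[0][0]
    best = None  # (information gain, attribute index, threshold)
    for j in range(len(rows[0])):        # Step 3: find the best split
        for t in sorted({r[j] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[j] <= t]
            right = [l for r, l in zip(rows, labels) if r[j] > t]
            if not left or not right:
                continue
            gain = entropy(labels) \
                - (len(left) / len(labels)) * entropy(left) \
                - (len(right) / len(labels)) * entropy(right)
            if best is None or gain > best[0]:
                best = (gain, j, t)
    if best is None:                     # no attribute can split further
        return Counter(labels).most_common(1)[0][0]
    _, j, t = best
    li = [i for i, r in enumerate(rows) if r[j] <= t]
    ri = [i for i, r in enumerate(rows) if r[j] > t]
    # Step 4: a branch per test outcome; Step 5: recurse on each partition.
    return (j, t,
            build([rows[i] for i in li], [labels[i] for i in li],
                  depth + 1, max_depth),
            build([rows[i] for i in ri], [labels[i] for i in ri],
                  depth + 1, max_depth))
```

Scoring a new case is then just a walk down the nested tuples, comparing the indicated attribute to the stored threshold at each branch.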
The Splitting Rules. The goal of the splitting rules is to build a decision tree that assigns a class (or probability of membership in a class) to the target field of a new sample case based on the values of the input attributes. At each node, the tree is split into child branches according to the attribute that is most definitive in separating the case samples into a class where a single group predominates. Splits on numeric values take the form of (X less than constant value) for one child and (X greater than or equal to constant value) for the other child. For categorical values, splits can take on the value of the category itself (e.g., male/female).
The Splitting Measures. The measure that is used to evaluate a potential split is often referred to as purity. High purity means that members of a single class predominate, while low purity means that a mix of different classes is present. Entropy measures the level of purity for a split. For example, if the left child split contains nine good credit risks and only one bad risk while the right child split contains one good credit risk and nine bad risks, then the purity is high and the entropy for each node is 0.47. If the split was ten good credit risks and no bad risks for the left child, and no good credit risks with ten bad credit risks for the right child, then the entropy is 0. Alternatively, if both the first child and second child contained five good risks and five bad risks, then the purity is very low and the entropy measures 1.0 - in essence, random behavior. The bottom line is as follows: the closer the entropy is to 0, the greater the purity. This entropy value is often weighted by the proportion of records in each child to create a measure called information gain. Another potential splitting criterion is called the Gini score. See Figure 3 for a comparison of value levels.
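The purity numbers quoted above are easy to verify. A short sketch, assuming two classes and the standard formulas (entropy as -sum(p log2 p) over class proportions p, Gini as 1 - sum(p squared)):

```python
# Verify the purity figures cited in the text for three candidate splits.
import math

def entropy(counts):
    """Entropy of a node given per-class case counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def gini(counts):
    """Gini score of a node given per-class case counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(round(entropy([9, 1]), 2))  # nine good risks, one bad: high purity
print(entropy([10, 0]))           # perfectly pure node
print(entropy([5, 5]))            # five and five: random behavior
print(gini([5, 5]))               # Gini at its two-class maximum
```

Both measures agree on the ordering of the three examples; they differ mainly in scale and in how sharply they penalize mixed nodes.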
Figure 3: Purity Measures
The Stopping Rules. The stopping rules determine when the recursive splitting process stops. This can be governed by predetermined user rules, such as the depth of the tree reaching some preset limit or the number of cases in a leaf reaching some preset lower bound. The recursive splitting process itself can also be brought to a halt when all the remaining samples for a given leaf belong to the same class or no remaining attributes exist to further partition the data cases. These rules and measures allow the development of either a full-blown decision tree or some subset of the full-blown decision tree. The decision tree still needs to be optimized for the more general population of sample cases rather than just the training set of cases. The next section describes pruning as an appropriate technique for addressing this issue.
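In practice, these user-defined stopping rules map directly onto tool parameters. A sketch using scikit-learn (an assumed tooling choice, with synthetic data): the depth limit and the per-leaf lower bound are the two preset rules described above.

```python
# Stopping rules as preset parameters: a depth limit and a minimum
# number of cases per leaf halt tree growth early. Synthetic data.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X, y)
stopped = DecisionTreeClassifier(
    max_depth=4,          # preset limit on tree depth
    min_samples_leaf=20,  # preset lower bound on cases per leaf
    random_state=0).fit(X, y)

print(full.get_depth(), stopped.get_depth())  # stopped tree is shallower
```

The unconstrained tree keeps splitting until each leaf is pure, which is exactly the full-blown tree the pruning discussion below is concerned with.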
Pruning the Decision Tree
Pruning is the process of removing branches and leaves to improve the predictive performance of the decision tree. This results in a subset of the full-blown decision tree. It may sound strange that a subset of the decision tree is more predictive than the full-blown decision tree, but keep in mind that the full-blown tree has trained on a very specific set of case data which may possess some anomalies due to noise or outliers that are not applicable to the generalized population of the sample case data. Although the tree finds general patterns at the big nodes, it finds patterns specific to the training set in the smaller nodes. This is known as "overfitting" the data and should be avoided at all costs. Pruning provides an approach and methodology to create a more pared-down decision tree that reflects case data drawn from the same total population rather than just the specific training data.
The Pruning Methodology. Pruning can be approached from two distinct perspectives. The tree can be pruned in the build stage - this is called prepruning and can be implemented by using the stopping rules mentioned above to halt tree growth. The other approach - postpruning - involves removing branches from a full blown tree by pruning those branches providing the least amount of predictive power per leaf node. Some of the more prevalent measures and techniques for pruning the tree model are discussed next.
The Pruning Measures. The measure that is used to identify the leaves with the least predictive power in CART is called the adjusted error rate. The decision tree model developed in the build stage with the training data is now applied to the validation data set. The results of this analysis help identify those branches where the misclassification rates are not low enough to overcome some minimal hurdle level. These misclassification levels (or error rates) are then normalized by the number of leaves in the tree to create the adjusted error rate. Plotting the adjusted error rates versus the number of leaves in the tree indicates the optimal tree model(s) that should be selected for the training data. Alternative approaches are used for CHAID (the chi-square test) and C4.5/C5.0 (the error rate at each node).
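CART-style postpruning can be sketched with scikit-learn's cost-complexity pruning path, where a per-leaf penalty (ccp_alpha) plays a role analogous to the adjusted error rate's normalization by leaf count. This is an assumed, simplified stand-in for CART's exact procedure, on synthetic data; the key idea from the text survives intact: the validation set, not the training set, picks the subtree.

```python
# Postpruning sketch: grow candidate subtrees along the cost-complexity
# pruning path, then keep the one that scores best on validation data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=8, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=1)

path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_train, y_train)
candidates = [
    DecisionTreeClassifier(ccp_alpha=max(a, 0.0),  # clip numeric noise
                           random_state=1).fit(X_train, y_train)
    for a in path.ccp_alphas]
best = max(candidates, key=lambda t: t.score(X_valid, y_valid))
print(best.get_n_leaves(), round(best.score(X_valid, y_valid), 3))
```

Because an alpha of zero (the unpruned full tree) is always among the candidates, the selected subtree can never do worse on validation data than the full-blown tree.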
Evaluating the Decision Tree
All of the previously mentioned pruning was completed using the training set of data. The last step is to run the validation data through the pruned decision tree and its various subtrees to calculate their misclassifications and respective error rates. The error rates for both the training and validation data are then plotted versus the tree depth to determine the appropriate pruning for the validation data (see Figure 4). After selecting the pruning level, the model results can then be used to score all the customer cases.
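This evaluation step can be sketched as a small loop that produces the plot-ready numbers: training and validation error rates at each depth, with the pruning level chosen where validation error bottoms out. The data and depth range here are assumed for illustration.

```python
# Evaluation sketch: training vs. validation error rates by tree depth,
# from which the pruning level (best depth) is selected. Synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, n_features=10, random_state=2)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=2)

errors = {}  # depth -> (training error rate, validation error rate)
for depth in range(1, 11):
    t = DecisionTreeClassifier(max_depth=depth,
                               random_state=2).fit(X_train, y_train)
    errors[depth] = (1 - t.score(X_train, y_train),
                     1 - t.score(X_valid, y_valid))

best_depth = min(errors, key=lambda d: errors[d][1])  # lowest validation error
print(best_depth, errors[best_depth])
```

Plotting both error curves reproduces the characteristic picture: training error keeps falling as the tree deepens, while validation error flattens or rises once the tree starts overfitting.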
Figure 4: Decision Tree Model Optimization
As you can see, decision trees can be extremely useful in classifying customers, suppliers and employees into meaningful classes within the general population. Decision tree models provide major opportunities to create targeted promotional programs, mitigate financial risk and identify customer/employee care opportunities that can have significant impact on the company's ROI.