APR 1, 2006 1:00am ET

The Power of Metrics

As more and more organizations integrate their disparate data sources into "one single source of truth" by leveraging integrated enterprise resource planning (ERP) systems, operational data stores (ODS), data warehouses and federated solutions, it quickly becomes apparent that the anticipated big-bang ROI never materializes. Campaign management applications, supply chain management systems and financial management suites are often layered on top to provide effective environments for the storage, access, semi-analysis and visualization of data. Scorecards and dashboards also provide significant insight into the health of an organization by tracking key performance indicators (KPIs). In reality, however, this is only the first step in understanding, quantifying and managing the business. Much work and effort still need to be invested in understanding the business drivers and their critical success factors. Predictive analytics, with its portfolio of statistical and data mining models, can play a significant role in identifying and understanding the critical business drivers and root causes that facilitate realization of the big-bang ROI.

In my March column, I profiled a portfolio of predictive analytic models that could be leveraged to improve understanding of the business drivers (see Figure 1 in my March column). The business analysis challenges were divided into five distinct categories: classification, clustering, association, estimation and description. This month, I will focus on the classification category and, specifically, on decision tree models.

Profiling the Decision Tree

Decision trees start with a training set of sample cases where the target field (e.g., good credit risk individuals versus bad credit risk individuals) is predefined. As an example, all the attributes associated with a sample customer case, such as age, income level, home ownership and credit card spending, are analyzed to determine which attributes best differentiate the good credit risks from the bad credit risks. At each branch, decision rules are developed to classify the customer cases into the optimum splits - this process is known as recursive partitioning. Compared to other predictive analytic methods, the real strength of decision trees lies in how readily the simple branching heuristics convert into if-then rules that are easy to generate and understand. This also eases the tremendous number of calculations necessary when millions of cases need to be classified and scored. Some of the important components of the decision tree method include:

The Topology. The decision tree is a flowchart-like tree structure that starts at a root node and branches down through intermediate nodes to the lowest-level leaf nodes, which represent separate classes. Each branch represents an outcome of a test (and hence a splitting criterion), and branches are created until all cases are split into separate classes. Different algorithms and tree building, pruning and stopping heuristics are employed to optimize the resulting tree model. The various components of a decision tree are illustrated with a credit risk example in Figure 1.

Figure 1: Decision Tree Topology (Credit Rating Example) 
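
The topology described above, and the way each root-to-leaf path reads as an if-then rule, can be sketched in a few lines of Python. This is a minimal illustration only: the Node class, the to_rules function and the small credit-risk tree are all invented for the example and are not taken from Figure 1 or from any particular product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One node of a decision tree (hypothetical structure for illustration)."""
    attribute: Optional[str] = None   # attribute tested here; None for a leaf
    branches: Dict[str, "Node"] = field(default_factory=dict)  # outcome -> child
    label: Optional[str] = None       # class assigned at a leaf


def to_rules(node: Node, conditions: Optional[List[str]] = None) -> List[str]:
    """Walk from the root to each leaf and emit one if-then rule per path."""
    conditions = conditions or []
    if node.label is not None:        # leaf node: close out the rule
        premise = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {premise} THEN {node.label}"]
    rules: List[str] = []
    for outcome, child in node.branches.items():
        rules += to_rules(child, conditions + [f"{node.attribute} = {outcome}"])
    return rules


# Invented example: the root tests income, one intermediate node tests
# home ownership, and the leaves carry the credit-risk class.
tree = Node(attribute="income", branches={
    "high": Node(label="good risk"),
    "low": Node(attribute="owns_home", branches={
        "yes": Node(label="good risk"),
        "no": Node(label="bad risk"),
    }),
})

rules = to_rules(tree)
for rule in rules:
    print(rule)
```

Because every leaf yields exactly one rule, scoring a new case amounts to finding the single rule whose conditions it satisfies.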

The Strategy. The decision tree starts as a single node and branches out until all the heterogeneous (very different) cases have been segmented into more homogeneous (very similar) individual classifications or leaves. Throughout the tree building process, separate decisions are made related to the specific algorithm to be used, the essential heuristics for tree splitting and the appropriate measures for tree pruning. The decision tree model developed on the training data is then applied to the validation data set to optimize model performance using variance reduction and misclassification minimization techniques.

The Algorithms. The development of a decision tree typically leverages one of three different algorithms - CART (Classification and Regression Trees), CHAID (Chi-Squared Automatic Interaction Detector) or C4.5/C5. There are several important dimensions that dictate the preference of one algorithm over another. The first consideration is whether the target variables are categorical or continuous in nature. Another important consideration is the splitting methodology employed during the decision tree building process. This can include either binary (two-way) or multiway splitting, using such concepts as entropy reduction, information gain and the chi-square test to implement the splits. A more extensive profile of the alternative algorithms is shown in Figure 2.

Figure 2: Decision Tree Algorithms
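
To make the entropy-based splitting measures concrete, here is a small Python sketch of information gain, which is the reduction in entropy achieved by a split. The function names and the four sample cases are assumptions made for illustration. The sketch shows the general idea only and is not the exact splitting criterion of CART, CHAID or C4.5/C5.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target before a split minus the weighted entropy after it."""
    before = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return before - after


# Invented sample cases: income separates the classes perfectly,
# while home ownership tells us nothing about credit risk.
cases = [
    {"income": "high", "owns_home": "yes", "risk": "good"},
    {"income": "high", "owns_home": "no", "risk": "good"},
    {"income": "low", "owns_home": "yes", "risk": "bad"},
    {"income": "low", "owns_home": "no", "risk": "bad"},
]

gain_income = information_gain(cases, "income", "risk")
gain_home = information_gain(cases, "owns_home", "risk")
print(gain_income, gain_home)
```

An attribute with a higher gain leaves purer groups behind, so in this invented data income would be chosen as the splitter ahead of home ownership.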

Building the Decision Tree

Building a decision tree involves adding branches and leaves to improve the predictive performance of the decision tree on the training sample data. This is an important concept - there is a difference between training samples and validation samples, although both are selected from the same population of data. At the start of the decision tree building process, the total population of case data is divided into two separate data sets - training samples and validation samples. The decision tree rules are first developed on the training set of sample cases by starting at the root node and searching the available attributes for the input variable that does the best job of splitting the cases into branches and leaves. The branching process can be approached in two alternative ways. The first approach grows the full-blown tree by developing formalized branching rules from the attributes and then prunes back to the best explanatory set of attributes. The second approach entails implementing stopping rules during the tree building process. Tree building and its components are discussed in the section below.
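
The division of the case population into training and validation samples might look like the following Python sketch. The split_cases function, the 70/30 proportion and the fixed random seed are all assumptions chosen for the example; the column does not prescribe a particular ratio.

```python
import random


def split_cases(cases, validation_fraction=0.3, seed=0):
    """Shuffle a copy of the cases and split it into (training, validation)."""
    shuffled = list(cases)                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)       # seeded so the split is repeatable
    cut = int(round(len(shuffled) * (1 - validation_fraction)))
    return shuffled[:cut], shuffled[cut:]


# Ten placeholder cases drawn from the same (invented) population.
cases = [{"case_id": i} for i in range(10)]
training, validation = split_cases(cases)
print(len(training), len(validation))
```

Shuffling before cutting keeps both samples representative of the same population, which is what makes the validation set a fair check on rules learned from the training set.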

The Tree Building Steps. These are the five steps to follow:

  1. The decision tree starts at a single root node using the training sample data.
  2. If all the sample cases are of the same class then the node becomes a leaf.
  3. If the sample cases are in different classes, measures such as entropy reduction and information gain are used to select the test attribute that best splits the sample cases into classes.
  4. A branch is created for each known value of the test attribute with each case sample partitioned by the splitting rules.
  5. Steps 2 - 4 are repeated until the stopping rules kick in (see below).
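
The five steps above can be sketched as a short recursive function in Python. This is a simplified illustration under stated assumptions, not the procedure of any specific algorithm: the build_tree, partition and entropy names are invented, attributes are assumed to be categorical, and the only stopping rules are "no attributes left" and a hypothetical max_depth limit.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def partition(rows, attribute):
    """Group the cases by each known value of the attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row)
    return groups


def build_tree(rows, attributes, target, max_depth=3):
    """Step 1: called on the full training sample, this node is the root."""
    labels = [row[target] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Step 2: a single class makes a leaf. The other two tests are the
    # (assumed) stopping rules referred to in step 5.
    if len(set(labels)) == 1 or not attributes or max_depth == 0:
        return majority

    # Step 3: the lowest weighted entropy after the split is the same
    # choice as the highest information gain.
    def weighted_entropy(attribute):
        groups = partition(rows, attribute).values()
        return sum(len(g) / len(rows) * entropy([r[target] for r in g])
                   for g in groups)

    best = min(attributes, key=weighted_entropy)
    remaining = [a for a in attributes if a != best]
    # Step 4: one branch per known value; step 5: recurse on each subset.
    return {best: {value: build_tree(subset, remaining, target, max_depth - 1)
                   for value, subset in partition(rows, best).items()}}


# Invented training sample for the credit-risk example.
training = [
    {"income": "high", "owns_home": "yes", "risk": "good"},
    {"income": "high", "owns_home": "no", "risk": "good"},
    {"income": "high", "owns_home": "no", "risk": "good"},
    {"income": "low", "owns_home": "yes", "risk": "good"},
    {"income": "low", "owns_home": "no", "risk": "bad"},
    {"income": "low", "owns_home": "no", "risk": "bad"},
]

tree = build_tree(training, ["income", "owns_home"], "risk")
print(tree)
```

The result is a nested dictionary in which each key is a test attribute, each inner key is one of its values, and each string is a leaf class.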
