BACKGROUND: American Express Company is a global travel, financial and network services provider. The company serves individuals with charge and credit cards, travelers cheques and other stored-value products. It also offers financial planning, brokerage services, mutual funds, insurance and related investment products.

PLATFORMS: American Express runs CART, TreeViewer and TreeCoder in a Windows NT and UNIX (Sun Solaris 2.6) environment.

PROBLEM SOLVED: More art than science, data mining is largely an atheoretical practice. Results depend entirely on the data and its characteristics (missing values, outliers, "dirty" data, etc.). Consequently, the data miner (or modeler) must arbitrarily choose the "best" model and is never certain of the precision of the parameter estimates (e.g., weights in a neural net). As the first theoretically founded data mining algorithm, CART may signify the eventual transformation of data mining from an art to a science. CART will always produce an optimal model. It is not data dependent and, as such, does not overfit the data or build a decision tree when none exists in the data. Further, CART handles missing values in a theoretically correct manner. With CART, the data miner is far more confident that the CART model is the best model possible. Because CART is supported by a rich and rigorous theoretical foundation, it is used when the results really matter (for example, to target likely responders in direct mail campaigns, to segment customers and to identify Web site visitors who are most likely to abandon shopping carts). The theory underlying CART, coupled with the accessibility of advanced model-building controls, makes it ideal for identifying and accurately classifying very rare events, such as online fraud or attempts to hack into a network. Additionally, CART frequently identifies complex interactions that can then be incorporated into predictive models. For example, one interaction term identified by CART as a dominant driver accounted for approximately 70 percent of the predictive accuracy of a model built with logistic regression.
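
As a rough illustration of how a tree-identified interaction might be carried into a regression model, the sketch below turns one path through a tree into a 0/1 indicator that could be supplied to a logistic regression as an extra predictor. The field names (tenure_months, monthly_spend), the thresholds and the helper interaction_indicator are all hypothetical; the review does not describe the actual interaction term used.

```python
import operator

# Comparison operators a tree path can use (illustrative subset).
OPS = {"<=": operator.le, ">": operator.gt}

# Hypothetical path through a tree: both conditions must hold.
PATH = [("tenure_months", "<=", 12), ("monthly_spend", ">", 500.0)]


def interaction_indicator(record, conditions):
    """Return 1 if the record satisfies every condition on the path, else 0.

    A missing field is treated as not satisfying the condition; this is a
    simplification made for the sketch.
    """
    for field, op, threshold in conditions:
        value = record.get(field)
        if value is None or not OPS[op](value, threshold):
            return 0
    return 1


# Made-up records, used only to show the indicator being computed.
customers = [
    {"tenure_months": 6, "monthly_spend": 800.0},
    {"tenure_months": 6, "monthly_spend": 200.0},
    {"tenure_months": 36, "monthly_spend": 900.0},
]
flags = [interaction_indicator(c, PATH) for c in customers]
```

The resulting column of flags would sit alongside the original predictors, so the regression can pick up the joint effect that neither variable captures alone.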

PRODUCT FUNCTIONALITY: The desktop version of CART provides a seamless interchange of data in a user-friendly, point-and-click interface. The UNIX version features a thin-client architecture that enables users to navigate output interactively. Both Windows and UNIX versions generate cut-and-paste model source code. CART handles dirty data with a modest amount of data preparation and features an automatic procedure for handling missing data that has proven to be highly effective across a variety of problems.
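
The CART methodology is usually described as handling missing values through surrogate splits: when the primary splitting variable is missing for a record, a backup variable that tends to split the data similarly is used instead. The following is a minimal sketch of that idea, not Salford's implementation. The function route, the field names and the thresholds are hypothetical, and every split is assumed to send a record left when its value is at or below the threshold.

```python
def route(record, primary, surrogates, default_left=True):
    """Decide whether a record goes to the left child of a node.

    primary and each surrogate are (field, threshold) pairs. The first
    field that is present decides the direction; if every field is
    missing, the record follows the default direction.
    """
    for field, threshold in [primary, *surrogates]:
        value = record.get(field)
        if value is not None:
            return value <= threshold
    return default_left


# Hypothetical node: split on income, fall back to age if income is missing.
primary_split = ("income", 50)
surrogate_splits = [("age", 30)]

goes_left_with_income = route({"income": 40, "age": 45}, primary_split, surrogate_splits)
goes_left_without_income = route({"income": None, "age": 45}, primary_split, surrogate_splits)
```

In the second call the income value is missing, so the age surrogate decides the direction and no record has to be dropped or imputed beforehand.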

STRENGTHS: Theoretically sound and accurate models can be quickly developed and implemented. TreeCoder automatically generates the SAS code required for many implementations, and C-based source code will be available in future releases.

WEAKNESSES: Classification problems involve "yes/no" or "A or B or C" type categorical target variables, whereas regression problems involve continuous target variables such as dollars spent and insurance loss. CART is a superior performer for classification trees. However, neither CART nor any other decision tree is especially good on regression problems. To address this weakness, Salford Systems recently introduced MARS, a new data mining tool that shares CART's scientific pedigree but excels at regression.
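
The distinction between the two problem types shows up in how a tree scores a candidate node. The sketch below assumes the textbook choices, Gini impurity for a categorical target and variance (squared error around the mean) for a continuous one; the review does not say which criteria were used at American Express, and the sample values are invented.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node holding categorical target values."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def variance(values):
    """Mean squared deviation of a node holding continuous target values."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


# Classification target: "yes/no" responses (invented).
mixed_node = gini(["yes", "yes", "no", "no"])
pure_node = gini(["A", "A", "A"])

# Regression target: dollars spent (invented).
spend_spread = variance([10.0, 20.0])
```

A classification tree looks for splits that push the Gini value of the child nodes toward zero, while a regression tree looks for splits that shrink the variance. A regression tree then predicts a single constant per leaf, which is one common explanation for why trees approximate smooth continuous relationships only coarsely.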

SELECTION CRITERIA: When results matter, CART is my tool of choice. With clean and robust data, theory dictates and experience confirms that CART classifications are at least as accurate as those from regression-based techniques (and numerous other tree-based classifiers). With dirty data, theory dictates and experience confirms that CART classifications are superior to (more accurate than) results from regression-based techniques (and numerous other tree-based classifiers).

DELIVERABLES: CART provides a vast and rich array of graphical and numerical reports on variable importance, splitters (predictors), optimal trees and various descriptive statistics. For presentation purposes, CART provides the ability to cut and paste output directly into PowerPoint and other high-end presentation programs. For deployment purposes, TreeCoder provides fast and easy installation of the model.

VENDOR SUPPORT: As with any memory-based reasoning technique, successful deployment and utilization of the tool is frequently dependent upon the ability of the vendor to optimize and configure the software subject to memory constraints. Salford provides optimization subroutines within the CART software as well as custom optimization support.

DOCUMENTATION: Quantitative professionals should find the documentation, training and literature available through Salford Systems to be both rewarding and intellectually stimulating.
