
What Is A Data Mining Product?

Published June 1, 2003, 1:00 a.m. EDT

In all our discussions of data mining, we have not yet addressed which attributes matter most when buying a data mining product. In this column, I will describe the components of a general-purpose (horizontal) tool and assess the current state of the practice.

Horizontal data mining tools are aimed at data mining analysts who may be statisticians, business analysts or experts in a particular business domain. Leading examples of these tools include SPSS's Clementine, SAS's Enterprise Miner, Insightful's Insightful Miner, Salford's CART and MARS, and a host of other tools.

All these tools must have the following components:

  1. An interface that supports user interaction (either a procedural language or a GUI).
  2. A way to read data into the product.
  3. One or more algorithms for building models.
  4. A way to view and evaluate the results of the model-building effort.

Interface

Most vendors make at least a token effort to produce a GUI, though few stand out. Clementine and Insightful Miner have similar GUIs that facilitate the model-building process. SAS's JMP Version 5 is usually considered a statistics tool, but the addition of a decision tree and a neural net warrants its inclusion in any discussion of data mining tools. It has an unusual but effective interface.

The importance of a good interactive GUI was brought home to me recently in a meeting where we were using a tool with a good GUI to examine data and try alternative models. The client was excited about how this kind of interactive exploration and model building could improve the way his company's domain experts work with their statisticians. While I have always viewed a good GUI as an important productivity tool in and of itself, his observation underscored how it can change the nature of the way data is used in a business.

While data mining GUIs have steadily improved, there is still room for considerable improvement.

Import

For the most part, data import facilities are pretty decent now. Most major database management systems (DBMSs) are accessible either directly or through ODBC, although I am still amazed when I cannot easily import an Excel spreadsheet or an ASCII file without writing a procedure.
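
To make the point concrete, here is a minimal sketch of that import step in Python using the open-source pandas library. It is not the import facility of any product named in this column, and the file names and table name are hypothetical.

    import pandas as pd
    import sqlite3  # stand-in for a DBMS reachable through a driver or ODBC

    # Flat files: an ASCII/CSV file and an Excel spreadsheet should load
    # without writing a procedure.
    customers = pd.read_csv("customers.csv")
    orders = pd.read_excel("orders.xlsx")

    # A DBMS table, read through a database connection.
    with sqlite3.connect("warehouse.db") as conn:
        transactions = pd.read_sql("SELECT * FROM transactions", conn)

    print(customers.shape, orders.shape, transactions.shape)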

Algorithms

The biggest strength of data mining products is the quality and variety of their algorithms. While vendors like to argue about whose is best (and there are significant differences), the truth is that the main products all have at least a few good algorithms, including decision trees and neural nets. Linear and logistic regression have become quite common, not only for their own utility but also to increase the appeal of data mining software to statisticians. Clustering algorithms are also normally part of the package.
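
As an illustration only, here is a minimal sketch of that algorithm mix (a decision tree, a logistic regression and a clustering step) in Python with the open-source scikit-learn library, standing in for the model-building components of the tools named above. The data is synthetic.

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans
    from sklearn.model_selection import train_test_split

    # Synthetic data in place of a real mining data set.
    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Decision tree and logistic regression for prediction.
    tree = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
    logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # Clustering (unsupervised) as part of the same package.
    clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_train)

    print("tree accuracy:", tree.score(X_test, y_test))
    print("logit accuracy:", logit.score(X_test, y_test))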

Evaluation

Data mining products generally provide some measures of goodness of fit where appropriate. Most of the products have lift or gains curves and return on investment (ROI) curves, while a few offer receiver operating characteristic (ROC) curves. What surprises me is that few products include plots of residuals or make the residuals easily available for graphing. A residual is the difference between the actual value and the value the model predicts. Residuals provide valuable insight into the quality of a model and can suggest further actions needed to improve it.
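
As a rough illustration of these measures, the sketch below computes an ROC curve and the raw residuals for a simple model using scikit-learn and NumPy rather than any of the products discussed here. Plotting those residuals against the predictions is exactly the diagnostic I am asking vendors to make easy.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]

    # ROC curve points and the area under the curve.
    fpr, tpr, _ = roc_curve(y_test, scores)
    print("AUC:", roc_auc_score(y_test, scores))

    # Residuals: actual value minus predicted probability.
    residuals = y_test - scores
    print("mean residual:", residuals.mean())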

Beyond these four key components, there are additional functions that are extremely important yet all too often neglected in products; they are optional only in the minds of the product creators:

  • Data preprocessing and transformation.
  • Data exploration (graphics and query).
  • Model deployment.

Transformation

Transformation is the weakest part of every data mining product I know. Most of the better GUI-based products support basic data manipulation functions such as sampling, merging, filtering or balancing a data set. They also support field-level transformations such as defining a new field (preferably with an expression builder) or excluding columns. Often, however, these graphical tools are too limited in their functionality. While some products (SAS, Insightful and, to a lesser extent, SPSS) have a powerful procedural language that in principle will let you do any transformation you want, in practice these languages are complicated to use, tedious to learn and lack the features of modern programming languages, such as good debugging tools. Being forced to write your transformations in SQL, or resorting to another product to write them, is not a good solution.
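
For contrast, here is what those transformations look like in a general-purpose language: a minimal sketch in Python with pandas covering merging, filtering, a derived field, sampling and class balancing. The column names are hypothetical, and this is not the transformation facility of any product named above.

    import pandas as pd

    customers = pd.DataFrame({"id": [1, 2, 3, 4], "income": [30, 60, 90, 120]})
    orders = pd.DataFrame({"id": [1, 2, 3, 4], "spend": [5, 10, 40, 80],
                           "churned": [0, 0, 1, 1]})

    data = customers.merge(orders, on="id")               # merge two sources
    data = data[data["income"] > 40]                      # filter rows
    data["spend_ratio"] = data["spend"] / data["income"]  # define a new field
    sample = data.sample(frac=0.5, random_state=0)        # draw a sample

    # Balance the classes by downsampling the larger class to the smaller one.
    n = data["churned"].value_counts().min()
    balanced = data.groupby("churned", group_keys=False).sample(n=n, random_state=0)
    print(balanced)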

Exploration

Data exploration is also not where it needs to be yet, although the new versions of Clementine (7.0 and later) and Insightful Miner (3.0 and later) have vastly improved on their predecessors and, along with SAS JMP, now offer pretty good graphical exploration. Writing code to produce graphs, no matter how good the resulting graph, is unsatisfactory because it removes the interactivity that effective exploration requires.
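
To show why, here is a minimal sketch of scripted (non-interactive) exploration in Python with pandas and matplotlib, the very approach the paragraph above argues against: every new view of the data means another edit-and-rerun cycle. The data is made up.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Made-up data standing in for a real mining data set.
    data = pd.DataFrame({"income": [30, 60, 90, 120, 45, 75],
                         "churned": [0, 0, 1, 1, 0, 1]})

    # One histogram per edit-and-rerun cycle.
    data["income"].hist(bins=5)
    plt.title("Income distribution")
    plt.show()

    # A summary table by class, available only after another run of the script.
    print(data.groupby("churned")["income"].describe())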

Model Deployment

Model deployment has vastly improved. Most modern products allow a model to be exported for inclusion in an application, and many will also capture the transformations necessary to execute the model. By this I mean that the predictor variable in a model may not be database field A or field B, but rather the ratio of A to B; thus, the quotient must be calculated before you build the model. When a model is exported to be included in an application or to score data, all the transformations performed on the data should be part of the exported model, rather than requiring the user to write a procedure to perform them.
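
The sketch below illustrates this point with Python and scikit-learn (not the export mechanism of any product named here): the ratio of hypothetical fields A and B is computed inside a Pipeline, so exporting the fitted pipeline exports the transformation along with the model.

    import pickle
    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import FunctionTransformer
    from sklearn.linear_model import LogisticRegression

    def ratio_of_a_to_b(X):
        # X holds the raw database fields [A, B]; the model's real
        # predictor is the quotient A / B.
        return X[:, [0]] / X[:, [1]]

    # Synthetic data for fields A and B and a target driven by their ratio.
    rng = np.random.default_rng(0)
    X = rng.uniform(1.0, 10.0, size=(200, 2))
    y = (X[:, 0] / X[:, 1] + rng.normal(0.0, 0.5, 200) > 1.0).astype(int)

    model = Pipeline([
        ("ratio", FunctionTransformer(ratio_of_a_to_b)),
        ("logit", LogisticRegression(max_iter=1000)),
    ]).fit(X, y)

    # The fitted pipeline carries the transformation with the model, so the
    # scoring application does not have to re-implement it.
    with open("exported_model.pkl", "wb") as f:
        pickle.dump(model, f)

When the scoring application loads this exported object and calls predict on the raw A and B columns, the ratio is computed automatically, which is the behavior the better deployment facilities now provide.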

The state of the practice for data mining tools has come a long way since my company issued our first tool evaluation in 1996. Today there is a wide range of tools to fit the variety of data mining organizational requirements. In subsequent columns, we'll look at other categories of tools.
