What does "data mining" really mean? As a consultant and instructor in this field, I see a lot of confusion among people looking to extract the most value from their databases.

Usually they assume that data analysis is the same thing as data mining. In the traditional query-driven approach, the analyst generates a series of questions based on his or her domain knowledge, perhaps guided by an idea (hypothesis) to be tested. The answers to these questions are used to deduce a pattern or verify the hypothesis about the data.

Businesses that rely on queries, reports and OLAP systems often consider these activities to be data mining, but at best they are only the first step. They run into trouble when they try to generalize from the information they've uncovered and use it as a guide to future behavior. A description is not the same as a prediction.

Data mining uses a variety of data analysis tools to discover patterns and relationships in data that can be used to make reasonably accurate predictions. It is a process, not a particular technique or algorithm. I want to emphasize that the goal of data mining is prediction, generalizing a pattern to other data. Exploring and describing the database is merely the starting point.

The traditional approach falls short on several counts when it comes to making useful predictions. First, the analyst may fail to select the most appropriate attributes (columns in the database). It may be easy to decide that annual purchases is a more significant number than customer ID; but when you're dealing with 5 million cases, each of which has 200 attributes, it is extremely difficult to identify everything that is important.

As databases grow larger and more complex (e.g., 50 million cases, each with 2,000 attributes), it becomes virtually impossible for any individual to know the data well enough to say with confidence which variables affect behavior. The difficulty is exacerbated by the fact that the best predictors may not be individual attributes, but rather combinations of attributes.
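To make the point about combinations concrete, here is a minimal sketch on invented toy data. The attribute names (`is_urban`, `owns_car`) and the rule that generates buyers are assumptions made purely for illustration; nothing here comes from a real database. Each attribute looks useless when examined alone, yet the pair predicts purchases perfectly.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy data: (is_urban, owns_car, bought). All names are invented.
# By construction, buyers are exactly the rows where the two attributes differ.
records = [
    (urban, car, int(urban != car))
    for urban, car in product([0, 1], repeat=2)
    for _ in range(25)
]

def purchase_rate_by(key_fn, rows):
    """Return {group_key: share of rows in that group that bought}."""
    totals = defaultdict(int)
    buyers = defaultdict(int)
    for row in rows:
        key = key_fn(row)
        totals[key] += 1
        buyers[key] += row[2]
    return {key: buyers[key] / totals[key] for key in totals}

# One attribute at a time: every group buys at the same rate.
rate_by_urban = purchase_rate_by(lambda r: r[0], records)
rate_by_car = purchase_rate_by(lambda r: r[1], records)

# Both attributes together: the groups separate completely.
rate_by_pair = purchase_rate_by(lambda r: (r[0], r[1]), records)

print("by is_urban:", rate_by_urban)
print("by owns_car:", rate_by_car)
print("by pair:    ", rate_by_pair)
```

An analyst who queried one column at a time would see a flat 50 percent purchase rate everywhere and might discard both attributes. Real data is rarely this clean, but the same effect appears in weaker forms.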

Because data mining is essentially an iterative process, quantitative results go through a reality check and are revised as needed until a meaningful predictive model evolves. The knowledge of the domain expert guides the analysis of the data and the manipulation of variables.

Data mining also addresses another failing of the descriptive approach. Even after a pattern is unearthed through a series of queries, the analyst can't be sure whether that pattern holds true for anything other than the collection of data used to find it. The analyst may try to identify potential buyers of a certain product after building a profile of customers who have already bought that product, but will this profile apply to people who are not yet customers?

For example, analysis may show that 75 percent of purchasers of a certain retail product are male. Therefore, the retailer decides to target men as the likeliest potential buyers in the future. However, if the store's overall customer distribution is 75 percent male and 25 percent female, there's not much new information in the fact that 75 percent of this particular product's buyers are male. Data mining might reveal that education and age are better predictors of buying behavior than gender. Perhaps this product will be especially popular with a particular demographic segment of women, implying a very different promotional strategy than initially planned.
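One common way to express this comparison is a lift ratio: the segment's share of buyers divided by its share of all customers. The sketch below is illustrative only. The 75 percent figures follow the example above, while the absolute counts and the second segment are invented assumptions.

```python
def lift(segment_buyers, total_buyers, segment_customers, total_customers):
    """Share of buyers in the segment divided by share of customers in it.

    Cross-multiplied so whole-number counts give an exact ratio.
    A value near 1 means the segment tells you nothing new.
    """
    return (segment_buyers * total_customers) / (total_buyers * segment_customers)

# Figures matching the text: 75 percent of customers and of buyers are male.
# The absolute counts are invented for illustration.
total_customers, male_customers = 10_000, 7_500
total_buyers, male_buyers = 400, 300
male_lift = lift(male_buyers, total_buyers, male_customers, total_customers)

# Hypothetical segment (not from the text): 500 customers in the segment,
# and 60 of the 400 buyers belong to it.
segment_lift = lift(60, total_buyers, 500, total_customers)

print(f"male lift: {male_lift:.2f}")
print(f"hypothetical segment lift: {segment_lift:.2f}")
```

A lift of 1 for men confirms that gender adds no information here, while a segment with a lift well above 1 would be worth a closer look.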

Data mining methodology, on the other hand, tries to verify that the patterns you find can be used for prediction (i.e., that they are applicable beyond the original database). It does this using a variety of techniques, such as dividing the database and developing a predictive model on one portion that is then tested on the other portion. Data mining can assess both the mathematical accuracy and the potential costs and revenues of a particular predictive pattern. (If it costs $100 each to reach the ideal buyer for your $25 product, you might want to modify your marketing plan.)
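The divide-and-test idea can be sketched in a few lines. Everything below is an assumption made for illustration: the customers are synthetic, the "model" is a single age cutoff rather than a real mining algorithm, and the 50/50 split is just one possible choice. The closing lines restate the $100 versus $25 arithmetic from the text, ignoring product cost and repeat sales.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

def make_customer():
    """Hypothetical synthetic customer: older people buy more often."""
    age = rng.randint(18, 80)
    bought = rng.random() < (0.7 if age >= 50 else 0.2)
    return age, bought

customers = [make_customer() for _ in range(2000)]

# Divide the database: one half to develop the model, one half to test it.
rng.shuffle(customers)
split = len(customers) // 2
train, test = customers[:split], customers[split:]

def accuracy(threshold, rows):
    """Share of rows where 'age >= threshold' matches the actual outcome."""
    hits = sum((age >= threshold) == bought for age, bought in rows)
    return hits / len(rows)

# Develop the model on the training half only: pick the best age cutoff.
best_threshold = max(range(18, 81), key=lambda t: accuracy(t, train))

# Test it on the half the model has never seen.
train_accuracy = accuracy(best_threshold, train)
test_accuracy = accuracy(best_threshold, test)

print(f"cutoff chosen on training data: age >= {best_threshold}")
print(f"training accuracy: {train_accuracy:.3f}")
print(f"holdout accuracy:  {test_accuracy:.3f}")

# The cost side, using the figures from the text:
# $100 to reach each buyer of a $25 product.
cost_to_reach_buyer = 100
product_price = 25
net_per_buyer = product_price - cost_to_reach_buyer  # negative means a loss
print(f"net per buyer: {net_per_buyer} dollars")
```

If the holdout accuracy falls far below the training accuracy, the pattern probably describes only the data used to find it. A model can also be accurate and still lose money, which is why the cost check sits alongside the accuracy check.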

Clearly, there is more to data mining than just summarizing and querying the database, but running algorithms should only require 10 to 20 percent of a project's time and resources. The bulk of the effort needs to be spent on data preparation, which includes building the data mining database, exploring the data and transforming the data for mining. As predictive models are generated, they need to be evaluated to ensure that they are meaningful. The ultimate results can be very rewarding.

This column will offer a series of short explorations of important and interesting data mining issues, based on the questions and concerns of my consulting clients and the people who attend my classes on data mining. I recognize that there is a diversity of approaches and opinions within the data warehousing, data mining and business intelligence communities. Therefore, I invite you to share your ideas with me. Please e-mail (feedback@twocrows.com) your comments, questions or suggestions about subjects you would like me to address.
