This is one article of a series appearing in this issue categorizing the users of the corporate information factory.

Gold has been highly valued from the earliest times, and mining for it dates back to the Egyptian civilizations and before. It is rare to find gold in the form of a nugget. The largest known gold nugget was discovered accidentally in Victoria, Australia, in 1869, and weighed 156 pounds. It was called the Welcome Stranger. Since the advent of the information age, data mining has taken on great importance, and comparisons of data mining to gold mining are easy to make. Certainly data miners today are looking for that nugget of information, that Welcome Stranger, that will achieve competitive advantage for their organizations.

The implementation of the corporate information factory has unleashed a "gold rush" of sorts. Data miners are in a frenzy to equip themselves with tools to help extract precious nuggets of information from mountains and mountains of data found in data warehouses, data marts and exploration warehouses. Once settled with appropriate tools, miners are patient in their search, determined to find the few insights that can make their organizations rich. They survey those mountains, analyzing the data to make sure they get meaningful correlations. Miners are thorough. If the data looks odd for some reason, they check it out, since data outliers might indeed be just what they're looking for (finding an instance of fraud, for example).

Figure 1: Mining Activities
Classification: Assign records to one of a predefined set of classes.
Estimation: Determine values for an unknown continuous variable.
Prediction: Classify records according to some predicted future behavior or estimated future value.
Affinity Grouping: Determine which things go together, as in a shopping basket.
Clustering: Segment a heterogeneous population of records into a number of more homogeneous subgroups.
Description: Describe a complex database to increase understanding of the underlying data.

Miners may develop hypotheses for customer segmentation and one-to-one marketing. They categorize customers according to behavior patterns, so they can tailor services to customers based on their needs. They may look to uncover consumer buying trends, so their companies can make better inventory, stocking and shipping decisions. They do a lot of predictive modeling to anticipate customer demands, patterns and behaviors and to discover trends or patterns to help the business boost profits.

Patient and thorough, data miners are also basically curious people. Sometimes they start with a hypothesis that they develop themselves or that is developed by an "explorer" within the organization (see Bill Inmon's article, "Explorers in the DSS/Data Warehouse Environment"). In this case, they are looking to prove or disprove the hypothesis. Other times miners have no hypothesis; they are simply letting the tool discover, in an undirected manner, patterns or trends that may be of importance to the organization. Either way, miners are looking for those nuggets. And they may have to analyze lots of raw ore to get anything of value. In fact, the more data, the better! Miners are looking for relationships between or among data elements: dependent variables and possible predictors of customer behavior.

Mining Activities

An excellent book that covers what data miners do is Data Mining Techniques: For Marketing, Sales, and Customer Support by Michael J. A. Berry and Gordon Linoff (1997, John Wiley and Sons). Much of the information on data mining tasks and techniques covered in this article is derived from this highly readable (and essential) book. As Berry and Linoff reveal, data miners approach decision-making with six basic activities in mind: classification, estimation, prediction, affinity grouping, clustering and description (see Figure 1).

Classification examines the features of a newly presented object and assigns it to one of a predefined set of classes. Examples of classification are assigning keywords to articles as they come off the news wire; classifying credit applicants as low, medium or high risk; and spotting fraudulent insurance claims.

Second is estimation. Given some input data, miners come up with a value for some unknown continuous variable (such as income, height or credit card balance). Examples are estimating the number of children in a family, estimating the lifetime value of a customer or estimating the probability that someone will respond to a balance transfer solicitation.
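
As a rough illustration of estimation, the sketch below fits a straight line to a handful of made-up income and balance figures and uses it to estimate the balance for a new customer. The data, the function name fit_line and the choice of ordinary least squares are assumptions made for this example; they are not taken from Berry and Linoff.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical history: annual income (in $1000s) and credit card balance ($).
incomes = [20.0, 40.0, 60.0, 80.0]
balances = [500.0, 900.0, 1300.0, 1700.0]

slope, intercept = fit_line(incomes, balances)

# Estimate the unknown balance for a customer earning $50,000.
estimate = slope * 50.0 + intercept
print(f"estimated balance: {estimate:.2f}")
```

Real estimation work would use many more input fields and far more records, but the idea is the same: learn a relationship from known cases, then apply it to fill in a continuous value that is not known.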

Prediction is the same as classification or estimation except that the records are classified according to some predicted future behavior or estimated future value. Prediction is one of the most important activities performed by miners. Examples include predicting the size of the balance that will be transferred if a credit card prospect accepts a balance transfer offer, predicting which customers will leave within the next six months and predicting which telephone subscribers will order a value-added service.

Affinity grouping involves determining which things go together, as in analyzing what items are purchased together in a shopping basket. The task of affinity grouping is to generate rules from data. If two items occur together frequently enough, we generate association rules. Examples are determining what things go together in a shopping cart at the supermarket to plan arrangement of items on store shelves or in a catalog, identifying cross-selling opportunities or designing attractive packages or groupings of products and services.
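
The rule-generation step described above can be sketched in a few lines. The example counts how often pairs of items share a basket and keeps the pairs that clear two thresholds, usually called support and confidence. The baskets, the thresholds and the function name association_rules are hypothetical, and real tools handle larger item sets and far more data.

```python
from itertools import combinations


def association_rules(baskets, min_support=0.5, min_confidence=0.6):
    """Return (if_item, then_item, support, confidence) for frequent item pairs."""
    n = len(baskets)
    item_count = {}
    pair_count = {}
    for basket in baskets:
        items = sorted(set(basket))
        for item in items:
            item_count[item] = item_count.get(item, 0) + 1
        for pair in combinations(items, 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1

    rules = []
    for (a, b), count in pair_count.items():
        support = count / n  # share of all baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_count[lhs]  # share of lhs baskets that also hold rhs
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules


# Hypothetical supermarket baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "beer"],
]

rules = association_rules(baskets, min_support=0.6)
for lhs, rhs, support, confidence in rules:
    print(f"if {lhs} then {rhs} (support {support:.2f}, confidence {confidence:.2f})")
```

Raising the support threshold trims rules that rest on only a few baskets; raising the confidence threshold trims rules that hold only some of the time.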

Clustering involves segmenting a heterogeneous population into a number of more homogeneous subgroups or clusters. Unlike classification, there are no predefined classes and no examples. An example of clustering is segmenting a retail customer base into groups of people with similar buying habits.

The final activity miners undertake is description, which involves increasing our understanding of the people, products or processes that produced the data in a complicated database. Description is a component in several other types of mining activities.

Mining Techniques

Data miners apply mining techniques to their tasks or activities. Again, Berry and Linoff identify seven major techniques: market basket analysis, memory-based reasoning, cluster detection, link analysis, decision trees and rule induction, artificial neural networks, and genetic algorithms (see Figure 2).

Market basket analysis is a form of clustering used to find groups of items that tend to occur together in a transaction (or market basket). Widely used by grocery stores and other retail outlets, market basket analysis is actionable in planning store layouts, bundling products, offering coupons, and so forth.

Memory-based reasoning is a directed mining technique that uses known instances as a model to make predictions about unknown instances. An example might be an insurance claims database where the claims were adjusted after investigation. To determine if a new claim warrants investigation, similar claims (neighbors) in the database would be identified, and the "investigate further" vs. "pay immediately" decision would be based on how those neighbors were handled.
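
A minimal sketch of the claims example follows, using a nearest-neighbor vote. It assumes each claim has been reduced to two numeric features already scaled to the range 0 to 1; the features, the sample history and the function name knn_decision are hypothetical.

```python
import math
from collections import Counter


def knn_decision(known_claims, new_claim, k=3):
    """Return the most common outcome among the k known claims nearest to new_claim."""
    by_distance = sorted(known_claims, key=lambda claim: math.dist(claim[0], new_claim))
    votes = Counter(outcome for _, outcome in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical history: (scaled claim amount, scaled policy age) -> how the claim was handled.
known_claims = [
    ((0.10, 0.90), "pay immediately"),
    ((0.20, 0.80), "pay immediately"),
    ((0.15, 0.70), "pay immediately"),
    ((0.90, 0.10), "investigate further"),
    ((0.80, 0.20), "investigate further"),
    ((0.85, 0.05), "investigate further"),
]

print(knn_decision(known_claims, (0.82, 0.12)))
print(knn_decision(known_claims, (0.12, 0.85)))
```

The choice of distance measure and of k matters a great deal in practice; Euclidean distance and k=3 are used here only to keep the sketch short.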

Cluster detection is the building of models that find data records that are similar to each other. Cluster detection is inherently undirected, since the goal is to find previously unknown similarities in the data. An example of clustering would be customer segmentation based on behavior patterns.
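
To make the idea concrete, the sketch below groups customers by a single behavior measure using k-means, one common clustering algorithm (the article does not name a specific one). The spending figures, the starting centers and the function names are assumptions for illustration.

```python
def assign(values, centroids):
    """Place each value in the cluster whose centroid is nearest."""
    clusters = [[] for _ in centroids]
    for value in values:
        nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
        clusters[nearest].append(value)
    return clusters


def k_means(values, centroids, iterations=10):
    """Alternate between assigning values and moving each centroid to its cluster mean."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = assign(values, centroids)
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, assign(values, centroids)


# Hypothetical monthly spend per customer, in dollars.
monthly_spend = [12, 15, 14, 90, 95, 88, 300, 310]

centroids, clusters = k_means(monthly_spend, centroids=[12, 90, 300])
for center, members in zip(centroids, clusters):
    print(f"segment around {center:.1f}: {members}")
```

No labels are supplied; the segments emerge from the data alone, which is what makes the technique undirected. It is then up to the miner to decide what, if anything, each segment means for the business.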

Link analysis follows the relationships between records to develop models based on patterns in the relationships. It is useful in analyzing relationships between customers where the marketing focus is on customers, households and economic marketing units instead of specific components. A well-known example of link analysis is the MCI "friends and family" promotion, which was based on customer relationships.

Decision trees and rule induction are very important mining techniques used today. Decision tree techniques include classification and regression trees (CART) and chi-squared automatic interaction detection (CHAID), among others. Used for classification, these techniques divide the records in the training set into disjoint subsets, each of which is described by a simple rule on one or more fields. Decision trees are widely used and can be easily explained based on the criteria used to divide data onto the limbs of the tree.
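
The core step of tree building, choosing a rule that divides the training records into purer subsets, can be sketched as follows. The example scores candidate split points on one field with Gini impurity, the measure commonly associated with CART. It finds a single split rather than a full tree, and the records, field name and function names are hypothetical.

```python
def gini(labels):
    """Gini impurity: 0.0 when every label in the subset is the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(records, field):
    """Return (threshold, weighted impurity) of the best 'field <= threshold' rule."""
    values = sorted({row[field] for row, _ in records})
    best = None
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [label for row, label in records if row[field] <= threshold]
        right = [label for row, label in records if row[field] > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(records)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical credit applicants: income in $1000s and known risk class.
applicants = [
    ({"income": 20}, "high risk"),
    ({"income": 30}, "high risk"),
    ({"income": 40}, "high risk"),
    ({"income": 60}, "low risk"),
    ({"income": 80}, "low risk"),
    ({"income": 100}, "low risk"),
]

threshold, impurity = best_split(applicants, "income")
print(f"rule: income <= {threshold} (weighted impurity {impurity:.2f})")
```

A full tree repeats this search inside each resulting subset, which is why the finished model reads as a chain of simple rules.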

Artificial neural networks are simple models of neural interconnections in brains, adapted for use on digital computers. A neural network learns from a training set, generalizing patterns inside it for classification and prediction. Once "trained," the neural network operates on very large volumes of data in a fraction of the time it would take for a human to do the same work. Neural networks are widely used in fraud detection activities.
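
The notion of learning from a training set can be shown with a single artificial neuron, known as a perceptron. This is far simpler than the multilayer networks used in commercial tools, and the training data (two yes/no warning signs that together mark a transaction as suspicious) and the function names are invented for the example.

```python
def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of the inputs is positive, otherwise 0."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, epochs=20, rate=1.0):
    """Nudge the weights after each wrong answer until a full pass has no errors."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            if error:
                errors += 1
                weights = [w + rate * error * x for w, x in zip(weights, inputs)]
                bias += rate * error
        if errors == 0:
            break
    return weights, bias


# Hypothetical training set: (large amount?, unfamiliar merchant?) -> suspicious?
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(samples)
predictions = [predict(weights, bias, inputs) for inputs, _ in samples]
print(predictions)
```

Once the weights are set, scoring a new record is just a multiplication and a sum, which is why a trained network can work through large volumes of data quickly.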

Genetic algorithms are the most futuristic of data mining techniques. They apply the mechanics of genetics and natural selection to a search used for finding the optimal sets of parameters that describe a predictive function. They are fairly "Darwin-istic," since they involve the "survival of the fittest" parameters.
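
A toy version of that search is sketched below. Each candidate is a pair of parameters (slope and intercept) for a simple predictive function; in every generation the fitter half survives unchanged and breeds children by mixing parents' parameters and adding a small random mutation. The data, population size, mutation scale and function names are all assumptions for illustration.

```python
import random


def evolve(fitness, population, generations=40, mutation_scale=0.3, seed=0):
    """Return the fittest candidate found and the best fitness seen in each generation."""
    rng = random.Random(seed)
    best_per_generation = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        best_per_generation.append(fitness(ranked[0]))
        survivors = ranked[: len(ranked) // 2]  # survival of the fittest half
        children = []
        while len(survivors) + len(children) < len(ranked):
            mother, father = rng.sample(survivors, 2)
            # Crossover: take each parameter from one parent, then mutate it slightly.
            child = tuple(
                rng.choice(genes) + rng.gauss(0.0, mutation_scale)
                for genes in zip(mother, father)
            )
            children.append(child)
        population = survivors + children
    return max(population, key=fitness), best_per_generation


# Hypothetical observations generated from y = 3x + 2.
points = [(x, 3.0 * x + 2.0) for x in range(5)]


def fitness(params):
    """Higher is better: the negative squared error of the line on the observations."""
    slope, intercept = params
    return -sum((slope * x + intercept - y) ** 2 for x, y in points)


start = random.Random(1)
population = [(start.uniform(-10, 10), start.uniform(-10, 10)) for _ in range(20)]

best, history = evolve(fitness, population)
print(f"best parameters: slope={best[0]:.2f}, intercept={best[1]:.2f}")
```

Because the survivors are carried forward untouched, the best fitness never gets worse from one generation to the next; how close the search gets to the true parameters depends on the population size, mutation scale and number of generations chosen.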

Technology Requirements

No matter what tasks miners are attempting or which techniques miners typically use, they utilize the data warehouse itself or data marts in their work. In addition, they would be the primary users of an exploration warehouse, if one exists. The database design scheme that best serves the needs of miners is highly normalized in the data warehouse or data mart environment. Aggregated or summarized data usually is not useful to the miner, since he or she is after detailed data to support trends or patterns. The exploration warehouse, usually a fairly flat file structure which may be preconditioned for analysis, is designed specifically for use by miners. Remember that the top three types of data of interest to miners are detail, detail and detail. The summarization of information so useful to farmers, for example, is of little use to miners, since they are looking for specific instances of data correlations.

How much history do miners need? As much as they can get their hands on. Gold miners would say that the more ore they have access to, the more potential for finding gold. Similarly, the more data miners can get their hands on, the more potential for finding information of value to the organization. The directed queries miners run are very analytical and reasonably predictable. While miners don't have any idea what patterns they may find, their queries can be anticipated to some degree, in that they are looking specifically to predict the size of the balance that will be transferred if a credit card prospect accepts a balance transfer offer, for example.

Mining Tools

Some tools miners use are statistical languages, core data mining tools, query tools and data visualization tools. Statistical languages provide exploratory analysis for expert analysts and statisticians. As with traditional programming, a user needs to understand a "language," that is, a specific syntax for extracting the data that is needed. In addition to understanding a language, the user needs to understand how to condition data (what to do with missing values, for example), as well as how to structure and manage queries.
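
One simple way to condition data with missing values is to fill each gap with the average of the known values for that field. The sketch below does this for a small set of made-up customer records; the field names and the function name impute_mean are hypothetical, and mean substitution is only one of several options (dropping the record or flagging the gap are others).

```python
def impute_mean(rows, field):
    """Return copies of rows with missing values of field replaced by the field's mean."""
    known = [row[field] for row in rows if row[field] is not None]
    mean = sum(known) / len(known)
    return [
        {**row, field: mean if row[field] is None else row[field]}
        for row in rows
    ]


# Hypothetical customer records; None marks a missing income (in $1000s).
customers = [
    {"id": 1, "income": 40.0},
    {"id": 2, "income": None},
    {"id": 3, "income": 60.0},
]

conditioned = impute_mean(customers, "income")
print(conditioned)
```

The original records are left untouched, so the miner can compare results with and without the substituted values.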

Core data mining tools provide exploratory analysis to expert analysts and statisticians in a graphical, user-friendly format. No language or syntax needs to be understood, since a point-and-click interface is usually provided via a desktop client tool. Core data mining tools discover hidden patterns, trends, relationships and predictive indicators. Even with these graphical tools, however, users still need to understand how to condition data and how to structure and manage queries. A core data mining tool may provide the ability to use multiple techniques (for example, clustering algorithms, decision trees and neural networks); it is incumbent on the miner to understand the differences between techniques and which ones are best applied to which business situations.

Query tools, which provide easy access to detailed data, can also be used by miners to extract data and look for nuggets. Query tools are directed; that is, they require the end user to have a pretty good idea of what they are looking for. These tools are useful either to develop a particular idea or hypothesis or to prove or disprove a hypothesis.

Data visualization tools present data graphically to improve its comprehension. They enable the understanding of massive amounts of data, or data with complex interrelationships, by utilizing the three spatial dimensions (length, width and depth) in addition to color, brightness, texture, etc., to depict data "dimensions" such as product, channel, geography and time, for example. While data visualization tools are not data mining tools per se, they assist the miner in visualizing factors most predictive of a certain desired situation. They can also help communicate the results of complex mining algorithms (clustering, for example) to people who are not as statistically oriented, since the "picture" tells the story.

Mining is not easy. There are pitfalls in determining what activities need to be done and which techniques and tools need to be utilized. Lots of false starts, dead-end paths and erroneous or meaningless findings can be encountered. But for organizations that stick with data mining, the rewards can be great. Miners can be expert marketers, risk controllers, logistics specialists or statisticians. In any of these roles, miners are looking for that Welcome Stranger, the nugget of information that may prove to be a valuable driver for their business. When they find it, they have discovered "gold" in "them thar hills." And it's cause for celebration!
