The term data mining often raises a variety of questions about the technology, techniques and tools. This article is designed to answer some of those questions.
What is Data Mining?
Quite simply, data mining is the iterative process of extracting patterns from your business and customer interactions, everything from bar code swipes to Web site "hits," for the purpose of improving your firm's bottom line. Data mining is about answering questions such as:
- Who are my most profitable customers?
- How do I optimize my inventory?
- How do I increase my market share?
- Who are my Website visitors?
- What clients are likely to defect to my rivals?
Above all, data mining is about leveraging artificial intelligence technology toward a strategic objective: competitive intelligence. It's about increasing your market share by knowing who your most valuable clients are, defining their features and then using that profile to target new customers.
Who Needs Data Mining?
As a company you continuously use databases and spreadsheets in your business transactions. As you accumulate and store this data, you are archiving knowledge which can help your company be more competitive in the future. Data mining technology can be leveraged to transform this raw data into a business advantage for your firm. The relevancy of the technology to your company is that it can uncover hidden business opportunities. Data mining is about using pattern-recognition technologies to obtain a competitive advantage and make more effective decisions. It's about making data-driven decisions based on the historical business patterns that you accumulate daily as you interact with your customers.
Statistics vs. Data Mining
Traditionally, the goal of identifying and utilizing information hidden in data has been achieved through the use of query generators and data interpretation systems, such as SPSS or SAS, the traditional tools of database analysis. (Incidentally, both SAS and SPSS have developed data mining products to add to their product base.) This involves a user formulating a theory about a possible relation in a database and converting that hypothesis into a query. It is a manual, user-driven, top-down approach to data analysis. The difference with data mining is that the interrogation of the data is done by the data mining algorithm--rather than by the user. In other words, data mining is a data-driven, self-organizing, bottom-up approach to data analysis--whereas statistical analysis is user-driven or verification-driven.
Another difference is that data mining extends statistical approaches by allowing the automated examination of a large number of hypotheses and the segmentation of databases. Data mining has major advantages over statistics as databases grow, simply because manual approaches to data analysis become impractical as data sets increase in size and complexity. For example, suppose there are 100 attributes in a database to choose from and you don't know which are significant. Even with this small problem, there are 100 x 99 = 9,900 ordered pairs of attributes to consider. Taking the attributes three at a time, there are 100 x 99 x 98 = 970,200 ordered triples. Now consider having to analyze thousands of transactions on a daily basis, as in retail, and it's obvious that manual methods of data analysis cannot keep up without the power of today's CPUs. Data mining offers a solution to this problem by automating the search for key relevant customer attributes. More importantly, because the search is driven by the data rather than by a user's hypothesis, data mining is less exposed to the analyst's preconceptions.
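The arithmetic above can be checked with a short sketch. It assumes Python 3.8 or later for `math.perm` and `math.comb`; the variable names are illustrative only. The figures quoted in the text count ordered selections of attributes, and the sketch also shows the smaller counts you get if the order of the attributes is ignored.

```python
import math

n_attributes = 100

# Ordered selections (permutations): the figures quoted in the text.
ordered_pairs = math.perm(n_attributes, 2)      # 100 * 99
ordered_triples = math.perm(n_attributes, 3)    # 100 * 99 * 98

# Unordered selections (combinations): attribute order is ignored.
unordered_pairs = math.comb(n_attributes, 2)
unordered_triples = math.comb(n_attributes, 3)

print(ordered_pairs, ordered_triples)      # 9900 970200
print(unordered_pairs, unordered_triples)  # 4950 161700
```

Either way, the number of candidate attribute groupings grows far faster than the number of attributes, which is the point of the example.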
OLAP vs. Data Mining
The main difference between OLAP and data mining is how they operate on the data. OLAP tools provide multidimensional data analysis--that is, they allow data to be broken down and summarized (such as by regional sales). For example, OLAP typically involves the summation of multiple databases into highly complex tables. OLAP tools deal with aggregates--OLAP technology basically comes down to operating on data via addition. For example, OLAP can tell you the total number of widgets sold in all the ZIP codes in the country. Data mining, on the other hand, is about ratios, patterns and influences in a data set. As such, data mining is division. Data mining can tell you about the factors influencing the sales of the widgets in those ZIP codes. This is not to say that OLAP and data mining cannot be used together to gain a powerful insight into your company databases, customer information file, data marts and data warehouse. In fact, aggregate and inductive analyses can complement each other. For example, a data mining analysis can discover a significant relationship in a set of attributes. OLAP can then expand on this and generate a report detailing the impact of the discovery.
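The "addition versus division" contrast can be illustrated with a minimal sketch. The sales records, the `promotion_shown` field and the helper `mean_sold` are all hypothetical, invented purely for illustration. Real OLAP and data mining tools do far more than this.

```python
from collections import defaultdict

# Hypothetical sales records: (zip_code, promotion_shown, widgets_sold)
sales = [
    ("94501", True, 12),
    ("94501", False, 3),
    ("10001", True, 9),
    ("10001", False, 4),
    ("60601", True, 11),
    ("60601", False, 2),
]

# OLAP-style aggregate: addition along one dimension (ZIP code).
total_by_zip = defaultdict(int)
for zip_code, _, sold in sales:
    total_by_zip[zip_code] += sold

# Mining-style ratio: average sales with the promotion vs. without it.
def mean_sold(promo):
    values = [sold for _, shown, sold in sales if shown == promo]
    return sum(values) / len(values)

lift = mean_sold(True) / mean_sold(False)

print(dict(total_by_zip))
print(round(lift, 2))
```

The totals answer "how many widgets were sold where," while the ratio hints at a factor (here, the assumed promotion flag) that may influence those sales.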
The Core Technologies
Data mining originates from three branches of artificial intelligence--neural networks, machine-learning and genetic algorithms--which are designed to emulate human perception and learning. Incorporated in today's data mining tools, neural nets, genetic algorithms and symbolic classifiers are being used to deal with large data sets, enabling business users to extract powerful code and business rules for classification of potential new customers and competitive strategies directly from company databases without the overhead of traditional statistics. Most modern data mining tools incorporate a back-propagation neural network, a machine-learning algorithm such as C5.0, or both. The network reduces a data set to a set of weights, while the algorithm splits it into clusters, in both cases on the basis of a desired output. For example, in a customer database that output might be a field identifying buyers versus non-buyers--with the data being split into subsets by the algorithm or the network, not the user.
Types of Tools
There are various technologies involved in the mining of data. Similarly, there are various formats by which data mining tools extract their results from databases. The following is a brief listing of these tools and tool boxes, which are a new generation of data mining suites that combine several technologies.
These data mining tools incorporate proprietary and machine-learning algorithms such as CART, CHAID, ID3 and C4.5, which perform somewhat the same process on a data set: they segment it into statistically significant clusters or classes based on a desired output. Some of these tools also generate "decision trees" which split a database into classes. Almost all produce IF/THEN rules, segmenting a data set into classes which can point out important intervals (ranges) and attributes (features). This group of tools includes:
A neural network is a processing algorithm whose design was motivated by the design and functioning of the human brain and its components. Most neural networks have some sort of "training" rule whereby the weights of connections are adjusted on the basis of presented patterns. In other words, neural networks "learn" from examples. These data mining tools are self-adjusting in that they train themselves on a data set in order to construct a set of weights and a model for classification. Care must be taken in their use, since they require considerable adjustment of the settings, such as the selection of a topology and the setting of learning rates. They can also work only with numeric data. This group of tools includes:
- BrainMaker
- NeuralWorks Professional II/PLUS
- MATLAB NN Toolbox
- PRW
- ModelQuest
- SPSS Neural Connection
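To make the idea of a "training" rule concrete, here is a minimal sketch of a single artificial neuron trained with the classic perceptron rule. This is a deliberate simplification and not the back-propagation method used by the commercial tools listed above. The function names, the learning rate of 0.5 and the tiny training set are assumptions chosen for illustration.

```python
def train_perceptron(samples, learning_rate=0.5, epochs=20):
    """Train one neuron: nudge weights whenever a pattern is misclassified."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            activation = bias + sum(w * x for w, x in zip(weights, inputs))
            predicted = 1 if activation > 0 else 0
            error = target - predicted
            # Adjust each connection weight in proportion to the error.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


def predict(weights, bias, inputs):
    """Classify one pattern with the trained weights."""
    activation = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation > 0 else 0


# Hypothetical patterns: (was_contacted, purchased_recently) -> is_buyer
patterns = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(patterns)
print([predict(weights, bias, inputs) for inputs, _ in patterns])
```

Note that the inputs are already numeric (0 or 1), which reflects the restriction mentioned above, and that the learning rate and number of passes are exactly the kind of settings that need tuning on real data.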
Visualization tools can be used to detect and uncover large patterns hidden within heterogeneous data sources. Visualization tools use abstract representations in interactive, immersive, 3-D, virtual environments to display large quantities of data. Visualization is a method for exploring trends within a database that is usually accomplished by navigating data landscapes and visually orienting the data to reveal hidden biases. Visualization systems are geared toward support of real-time applications, since parametric values can be displayed as an animated or simulated dimension of data. This group of tools includes:
- IBM Visualization Data Explorer
- SPSS Diamond
These are powerful data mining suites, incorporating both neural and polynomial networks, as well as machine-learning and genetic algorithms. These tool boxes recognize that no single technology can answer all the questions from the data, and they are designed to provide a hybrid solution. Built for providing business results, most of them are able to generate C code directly from their predictive models. Most of these modern, high-end data mining suites range in price from $20,000 to $675,000, depending on platform, scale and number of technology modules selected. Examples of these tool boxes are:
- Clementine
- IBM Intelligent Miner
- IDIS Data Mining Suite
- Partek
- Darwin
- Pilot Discovery Server
- Hyperparallel Discovery
- SAS Data Mining Software
Networks or Symbolic Classifiers?
There are two major types of autonomous data mining technologies and tools: networks and symbolic classifiers, also known as rule-induction or decision tree programs. Both types of tools automatically interrogate the data for patterns and clusters. They both segment a data set into significant groups or classes. Although different in design, both of these types of data mining tools are based on inductive theory (learning from example) and perform somewhat the same process on a database. They partition or classify independent variables on the basis of their relation to a dependent variable, or desired output. These tools are based on a concept known as supervised learning.
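One way a symbolic classifier can decide which independent variable best partitions the data is to measure how much each one reduces uncertainty about the desired output. The sketch below uses information gain, the entropy-based criterion associated with ID3. Other algorithms use different measures, and the customer records and field names here are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting on `attribute`."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical customer records; "buys" is the dependent variable.
customers = [
    {"residence": "own", "married": "yes", "buys": "yes"},
    {"residence": "own", "married": "no", "buys": "yes"},
    {"residence": "rent", "married": "yes", "buys": "no"},
    {"residence": "rent", "married": "no", "buys": "no"},
]

gains = {name: information_gain(customers, name, "buys")
         for name in ("residence", "married")}
best = max(gains, key=gains.get)
print(best, gains)
```

In this made-up sample, splitting on residence separates buyers from non-buyers perfectly, while marital status tells us nothing, so the algorithm rather than the user picks residence as the first split.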
Neural networks require an extensive amount of experimentation and testing--such as the setting of the right number of nodes, stopping criteria, learning rate, momentum coefficients and hidden weights. However, when used in conjunction with a genetic algorithm to optimize these settings, neural nets can be extremely accurate. Another limitation of networks is that they work only with numbers or binary data. The data often must be normalized and scaled, involving pre-processing transformations. The results are in the format of formulas or a set of weights.
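The pre-processing mentioned above can be sketched as follows. This is only one common approach, assuming min-max scaling for numeric fields and one 0/1 column per category for non-numeric fields. The function names and sample values are illustrative, not taken from any particular tool.

```python
def min_max_scale(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        # A constant field carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def one_hot(values):
    """Turn a categorical field into one 0/1 column per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


# Hypothetical customer fields.
incomes = [20000, 50000, 80000]
residence = ["rent", "own", "rent"]

scaled = min_max_scale(incomes)
encoded = one_hot(residence)   # column order: own, rent

print(scaled)
print(encoded)
```

After these transformations every field is numeric and on a comparable scale, which is what a network needs before training can begin.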
Symbolic classifiers such as C4.5 or C5.0 from the field of machine-learning and statistical algorithms such as CHAID (Chi-squared Automatic Interaction Detection) and CART (Classification and Regression Trees) offer a more viable option for data mining when an understanding of patterns is required. For example, these tools can segment your company data warehouse into customers who will buy and those who won't--based on an analysis of their attributes (features) and intervals (ranges). From a sample of historical transactions, these types of data mining tools can find the key features distinguishing your high-value clients:
IF Customer contacts=2 or 3
AND Recent purchases=0, 1 or 2
THEN Customer Will Buy 74.5%
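A rule of this form is easy to apply once it has been induced. The sketch below encodes the rule's two conditions and measures, on a small hypothetical history, what share of matching customers actually bought. The records and field names are invented, so the resulting percentage is not the 74.5% figure from the example rule.

```python
def rule_matches(customer):
    """IF contacts = 2 or 3 AND recent purchases = 0, 1 or 2."""
    return (customer["contacts"] in (2, 3)
            and customer["recent_purchases"] in (0, 1, 2))


def rule_confidence(customers):
    """Share of customers matching the rule who actually bought."""
    matched = [c for c in customers if rule_matches(c)]
    if not matched:
        return None
    return sum(c["bought"] for c in matched) / len(matched)


# Hypothetical historical transactions.
history = [
    {"contacts": 2, "recent_purchases": 1, "bought": True},
    {"contacts": 3, "recent_purchases": 0, "bought": True},
    {"contacts": 3, "recent_purchases": 2, "bought": True},
    {"contacts": 2, "recent_purchases": 2, "bought": False},
    {"contacts": 5, "recent_purchases": 1, "bought": True},
    {"contacts": 1, "recent_purchases": 4, "bought": False},
]

confidence = rule_confidence(history)
print(f"{confidence:.1%} of matching customers bought")
```

Because the rule is expressed in terms of named fields and ranges, a business user can read it, check it against new data and act on it directly.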
A major, multinational, four-year empirical study entitled "StatLog (The Comparative Testing of Statistical and Logical Learning Algorithms on Large-Scale Applications to Classification, Prediction and Control)" sponsored by the European Commission compared various state-of-the-art classification systems including statistics, neural networks and symbolic classifiers on twelve large real-world data sets from the fields of image analysis, medicine, engineering and finance. "StatLog" found that there was no single superior method of classification or prediction. Instead, it concluded that the accuracy of any data mining tool, whether it be neural networks, regression statistics or symbolic classifiers, was heavily dependent on the structure of the data set being analyzed. "StatLog" also found that symbolic classifiers were superior in accuracy over neural networks on skewed, nonparametric data sets and databases containing a high number of categorical data fields, such as those commonly found in customer information files:
- MARRIED: No or Yes
- RESIDENCE: Rent or Own
- ZIP CODE: 94501-1122
- PURCHASE LEVEL: Low, Middle, High
- CATALOG NUMBER: 8M79Y
Corporate Use of Data Mining
A recent survey by the META Group found that large corporations were using data mining for strategic planning, to increase market share and to gain competitive intelligence. Large corporations recognize that their transactional data contains untapped knowledge about the company and its customers--information capable of giving them a competitive edge in a market-saturated world. First used by financial services organizations and retailers, data mining is today's secret weapon for virtually every industry. Data warehouses are now being used in conjunction with OLAP and data mining to drive decisions and improve business processes, such as:
- identifying new clients likely to buy their products,
- anticipating demands on inventory,
- predicting customer buying habits,
- mapping market developments,
- calibrating customer loyalty,
- finding perpetrators of fraud, and
- becoming more competitive.
Savvy corporations are beginning to use this intelligence to develop marketing strategies, target mailings, adjust inventories, minimize risk and eliminate wasteful spending based on an analysis of their data. They are increasing the return on their investment in current resources and improving their business advantage.