The reason that data alone is not knowledge, but merely data, is it lacks structure, organization, direction, coherence, point or conceptual focus. Just as an interpretation without supporting data would be empty, data without a unifying interpretation is meaningless and leaves the data collector blind. The challenge for the data miner – and what distinguishes the data miner from the data hacker – is to systematically discover and verify meaning amidst apparent disorder and occasionally even chaos.

One important goal of data mining is to discover, define and determine the relationship between variables. These variables, in turn, represent the levers and mechanisms that move business operations. A variety of pitfalls relating to extraneous, hidden and distorter variables can result in misunderstandings and inaccurate conclusions.

For example, people within the same age group who are in hospitals experience a higher death rate than those not in hospitals. Hospital stay is the independent, upstream variable; death rate is the downstream, dependent variable. Age is the candidate control variable, but it is not a very good one. Age is an extraneous variable here. When the relationship between hospital admission and outcome (death) is controlled by diagnosis code, then the relationship is clarified. The death rate for people in the hospital for a given diagnosis code is less than that for people not admitted to the hospital. Diagnosis code is actually the independent variable here.

A hidden variable is exemplified in the case where investment grade corporate bond prices move with the price of Treasury bills. The hidden variable is the market's willingness to incur risk at all. This surfaces when Russia issues a moratorium on payments of its corporate debts, as it did in August 1998 by suspending the convertibility of the ruble, in which case T-bills soared and bonds plunged.

A distorter variable is exemplified by the apparent theory that more married people commit suicide than single people. However, when the population is segmented by age, it is found that in each age group, the single suicides outnumber the married. Thus, this actually supports the thesis that marriage reduces suicide. A distorter variable – in this case age – reveals that the correct interpretation is exactly the opposite of what was suggested by the original data.

One execution of undirected knowledge discovery showed that wealthy customers with children that had graduated from college were starting to draw on home equity lines of credit. One might expect those with children of college age to do so – but why those with children who had graduated? This example might be dismissed as useless or unintelligible. However, further investigation indicated these people were starting home-based businesses. This was an actionable result.

The recommendation? In order to understand the relationship between independent and dependent variables, the data miner must take account of other related variables (test factors) and stratify, control, decompose and associate them with the independent and dependent variables in turn. "There are no spurious relationships – only spurious interpretations," thus, a reflection circa 1968 from an early data miner.

If we don't quite know what we're looking for – as in undirected knowledge discovery – then we might not recognize it when it shows up. The job is to define and progressively refine the business result being sought through data mining. Management judgment, business experience and insight into the dynamics of the market are required to sort these out and to prioritize what anomalies are worthy of further resources, effort and investigation. Plan on leveraging management's understanding of the business and its operations as a data mining resource.

In other words – and this cannot be said enough – if neural networks and genetic algorithms can learn from experience (or at least from an encoding of experience), then management must do so as well. When management learns from experience and makes its experience available in a non- dogmatic, supportive way to the application of the technology to the business, then win/win results are enabled.

This is where data mining is distinguished from data hacking. The data miner has a depth of understanding, experience, and perspective and has systematically organized them; the data hacker has, well, data.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access