Shahla Butler would like to thank her colleagues, Tricia Spencer and Troy Haines, for contributing this month's column.

Information technologies have historically operated on numbers: from generating monthly sales reports to performing multidimensional analysis to, most recently, applying sophisticated data mining and data visualization algorithms to predict customer behavior and discover previously unknown information.

There has been good reason for this emphasis on numerical data. Numerical data provides the vital signs of an organization, and formal statistical analysis (and theory) is based entirely on numerical themes. However, this emphasis on numerical data and numerical data analysis is changing. Today we recognize that data is not just numbers. Most data within an organization (up to 90 percent), and certainly most data outside it, is non-numeric. Letters from customers, voice recordings, image and video libraries, technical documentation, graphic attachments to word processing documents and presentations, archived e-mail and electronic discussions increasingly make up the bulk of the data being stored and maintained. One only has to look at the Web to see that its content is predominantly unstructured text, image, audio and video data.

Recently there has been an increasing interest in analyzing non-numeric data by applying techniques from data mining and artificial intelligence to solve real-world business and consumer problems. Every one of us has experienced the poor performance of search engines on the Web. A significant amount of work being performed today attempts to improve the accuracy and relevance of results returned from simple queries. There is also the knowledge management (KM) community pushing very hard for solutions that help organize information (mostly non-numeric) and aid in retrieval of strategic information. Information retrieval based on the content of a document, image or video is the direction we are heading.

Then there is the business intelligence (BI) and analytical piece of the puzzle. Does non-numeric data contain hidden, previously unknown information that will yield competitive advantages to those capable of efficient extraction and use? What information about a customer can be gleaned from their customer service records and informal correspondence? Can this information be combined with more traditional analysis that is being performed (for example, estimating a customer's propensity to purchase a product)? These are certainly open questions.

You may ask: if non-numeric mining has such high potential value, why aren't we already using it? As with any emerging technology, there are daunting challenges that can slow its adoption. Many of the issues encountered in data mining (i.e., mining numeric data) apply to non-numeric mining and must be dealt with before industry hype leads to inflated expectations.

Currently, both numeric and non-numeric mining products make assumptions about the input data format that result in significant pre-processing time. Many organizations probably have knowledge captured in many diverse formats, and data manipulation will be required to prepare the input data for further analysis. A tougher problem is that of standardizing the meaning of the data. Just as with the elements of a data warehouse, bringing in text from multiple sources guarantees that terms will be used in different ways with different definitions. The meaning of the input must be standardized; i.e., metadata must be defined.
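As a minimal sketch of what standardizing terms across sources could look like, the Python fragment below maps source-specific wording onto one canonical vocabulary before any mining takes place. The glossary contents and the function name standardize_terms are hypothetical, chosen only for illustration; a real glossary would come from an organization's own metadata definitions.

```python
import re

# Hypothetical glossary: maps source-specific terms to one canonical term.
CANONICAL_TERMS = {
    "client": "customer",
    "account holder": "customer",
    "purchaser": "customer",
    "e-mail": "email",
}

def standardize_terms(text, glossary=CANONICAL_TERMS):
    """Lowercase the text and replace each glossary variant with its canonical term."""
    result = text.lower()
    # Replace longer variants first so multi-word phrases win over their parts.
    for variant in sorted(glossary, key=len, reverse=True):
        pattern = r"\b" + re.escape(variant) + r"\b"
        result = re.sub(pattern, glossary[variant], result)
    return result
```

Calling standardize_terms on text from two departments that say "client" and "account holder" would yield "customer" in both cases, so later analysis counts them as the same concept. Simple whole-word replacement is only a starting point; it does not resolve terms whose meaning depends on context.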

On the output side of the equation, non-numeric mining can result in unintelligent result sets. Mixed in among a large volume of possible results, there will be a few key findings that are strategic and important. In the analytical context, interactive visualization and sorting mechanisms (for example, sorting by level of "interestingness") are necessary to navigate and interpret the results and to gain the insight that provides business value.
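One way to picture sorting by "interestingness" is the sketch below. The column does not define the measure, so this example assumes a simple stand-in: how much more frequently a term appears in a target set of documents (say, complaint letters) than in a background set. The function name rank_terms_by_lift and the smoothing choice are illustrative assumptions, not a description of any particular product.

```python
from collections import Counter

def rank_terms_by_lift(target_docs, background_docs, top_n=3):
    """Rank terms by how much more often they occur in target_docs than in background_docs."""
    target = Counter(w for d in target_docs for w in d.lower().split())
    background = Counter(w for d in background_docs for w in d.lower().split())
    t_total = sum(target.values())
    b_total = sum(background.values())
    scores = {}
    for term, count in target.items():
        t_rate = count / t_total
        # Add-one smoothing (an assumption) avoids division by zero
        # for terms that never appear in the background documents.
        b_rate = (background[term] + 1) / (b_total + 1)
        scores[term] = t_rate / b_rate
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
```

Terms with a score well above 1 are over-represented in the target documents and would surface at the top of a sorted result list, while common terms shared by both sets sink toward the bottom.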

Realizing the full benefits of mining technologies is hard work. The reality is that most current tools require analytical skills to understand the modeling process, business skills to interpret and operationalize the results, and database management skills to understand the strengths and limitations of the data. We see this requirement applying to both numeric and non-numeric data mining.

Non-numeric data mining is the next new "hot" area of data analysis; and in the next two to three years, we will be seeing an increasing number of successful analytical applications driven from non-numeric data. Text mining is already grabbing considerable attention and will probably be the first widely successful application of non-numeric mining. The potential impact of non-numeric mining on the information technology landscape is tremendous, especially when it begins showing up on the consumer Internet and Web space as well as corporate knowledge management intranets.
