
Conducting Your First Text Mining Project, Part 1

  • February 01 2004, 1:00am EST

In my January column, I discussed the trend in business intelligence (BI) to support unstructured as well as structured data. Statistical software vendors, long the bastion of structured data analysis, are early proponents of expanding the scope of analysis to include free-form text. Now that we have data mining, text mining and other BI tools at our disposal, how do we get started?

The first step is obvious: identify a business issue that lends itself to structured analysis, such as customer reaction to a new product line or assessment of customer comments in call centers. For the first project, look at quantifiable questions, such as how many customers complained about shipping delays or how many customers commented positively on the quality of their purchase. To keep things manageable, limit the scope to free-form text collected along with structured data: survey results, call center records or customer relationship management (CRM) databases.

Let's assume you are working with the CRM database of a consumer electronics manufacturer. Your job is to investigate the types of problems customers have when they first purchase a digital camera. Structured data from warranty registrations provides basic information, such as name and address. This is combined with third-party demographic data to create a broad set of basic customer information. Because you are interested in customers' initial experience and the problems they have using digital cameras, the set of customers is limited to those who contacted the call center within 15 days of purchasing the camera.
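The 15-day selection can be sketched in a few lines of Python. This is only an illustration: the record layout, the customer IDs and the function name `early_callers` are hypothetical, not taken from any particular CRM product.

```python
from datetime import date

# Hypothetical records: (customer_id, purchase_date, first_call_date or None).
records = [
    ("C001", date(2004, 1, 2), date(2004, 1, 10)),
    ("C002", date(2004, 1, 5), date(2004, 2, 20)),
    ("C003", date(2004, 1, 7), None),              # never called
    ("C004", date(2004, 1, 9), date(2004, 1, 24)),
]

def early_callers(rows, window_days=15):
    """Return IDs of customers who called within window_days of purchase."""
    selected = []
    for customer_id, purchased, called in rows:
        if called is None:
            continue
        # Calls dated before the purchase are ignored as well.
        if 0 <= (called - purchased).days <= window_days:
            selected.append(customer_id)
    return selected

print(early_callers(records))  # ['C001', 'C004']
```

Whether day 15 itself counts is a judgment call; this sketch treats the window as inclusive.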

This brings us to the second step: review the unstructured text available for analysis to identify a fixed set of attributes that can be extracted from the comments. Some of the call center records will include comments from the customers that provide detail not captured by the structured attributes in a CRM system. These are generally short and simple comments such as: "the battery does not last long enough," "flash is erratic" and "outside shots are hazy but inside shots are OK." Once the most relevant topics are identified, map them to yes/no attributes such as "battery problem," "flash problem" and "picture quality problem."
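To make the target of this step concrete, here is a minimal Python sketch of what a record looks like once a comment has been mapped to yes/no attributes. The attribute names follow the three examples above; the helper names `make_record` and `count_flagged` are hypothetical, and the labels here are assigned by hand.

```python
# The fixed set of derived yes/no attributes identified in this step.
ATTRIBUTES = ("battery_problem", "flash_problem", "picture_quality_problem")

def make_record(comment, flagged=()):
    """Pair a comment with one True/False value per derived attribute."""
    unknown = set(flagged) - set(ATTRIBUTES)
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")
    return {"comment": comment, **{name: name in flagged for name in ATTRIBUTES}}

# Hand-labeled examples, using the sample comments quoted above.
labeled = [
    make_record("the battery does not last long enough", ["battery_problem"]),
    make_record("flash is erratic", ["flash_problem"]),
    make_record("outside shots are hazy but inside shots are OK",
                ["picture_quality_problem"]),
]

def count_flagged(records, attribute):
    """Answer a quantifiable question: how many records carry this attribute?"""
    return sum(1 for record in records if record[attribute])

print(count_flagged(labeled, "battery_problem"))  # 1
```

Hand-labeled records of this kind also serve as the positive and negative examples that a statistical approach needs.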

The third step of the process is to identify text patterns that correspond to each of the derived attributes. We can use a number of approaches here, and this is where the art of text mining comes into play. We'll keep it simple and look at two broad approaches: a statistical approach and a linguistic approach.

In the statistical approach, we identify a set of positive examples in the CRM database for each attribute, such as "battery problem." We then eliminate commonly used words, known as stop words, from the comments. The remaining words are then statistically analyzed to determine which terms are good indicators of the attribute. The simplest analysis only looks at word occurrences. Other techniques look at word pairs or triplets. For example, from "the battery does not last long enough," we could extract word pairs "battery does," "does not," "not last," "last long" and "long enough." (The word "the" is common and thus is removed; the word "not" is common but carries important meaning and therefore is not removed.) These pairs are called 2-grams or, more generally, n-grams. The same technique can be applied to characters as well as words, although character n-grams typically use several characters.
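Stop-word removal and n-gram extraction are easy to try out. The following Python sketch reproduces the word pairs from the battery example; the stop-word list is a small assumed sample (real lists are much longer), and the function names are hypothetical.

```python
import re

# Assumed, deliberately tiny stop-word list. "not" is left out on purpose
# because it carries meaning in complaints.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to"}

def word_ngrams(comment, n=2, stop_words=STOP_WORDS):
    """Drop stop words, then return every run of n adjacent words."""
    words = [w for w in re.findall(r"[a-z']+", comment.lower())
             if w not in stop_words]
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(text, n=3):
    """Return every run of n adjacent characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(word_ngrams("the battery does not last long enough"))
# ['battery does', 'does not', 'not last', 'last long', 'long enough']
print(char_ngrams("battery"))
# ['bat', 'att', 'tte', 'ter', 'ery']
```

Passing `n=3` to `word_ngrams` yields the word triplets mentioned above.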

Whether words, word n-grams or character n-grams are used, the analysis is basically the same. We use statistics to identify patterns that occur frequently in the positive examples and infrequently in the negative examples. When those patterns are found in other records, we assume the attribute (e.g., "battery problem") is present; otherwise, we assume it is absent.
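One very simple way to do this, shown here as a Python sketch, is to compare how often a term appears in positive versus negative comments and keep the terms with a large gap. This is an assumed stand-in for the statistics a real tool would apply, not a description of any product's method. The example comments beyond those quoted earlier, the `min_gap` threshold and all function names are hypothetical, and both example lists are assumed to be non-empty.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "is", "a", "my", "are", "but", "too"}  # assumed sample list

def terms(comment):
    """Return the set of non-stop words in a comment."""
    return {w for w in re.findall(r"[a-z']+", comment.lower())
            if w not in STOP_WORDS}

def document_frequency(comments):
    """Count, for each term, how many comments contain it at least once."""
    counts = Counter()
    for comment in comments:
        counts.update(terms(comment))
    return counts

def indicator_terms(positives, negatives, min_gap=0.5):
    """Keep terms far more common in positive than in negative examples."""
    pos_df = document_frequency(positives)
    neg_df = document_frequency(negatives)
    selected = set()
    for term, count in pos_df.items():
        gap = count / len(positives) - neg_df[term] / len(negatives)
        if gap >= min_gap:
            selected.add(term)
    return selected

def has_attribute(comment, indicators):
    """Flag a new comment if it contains any indicator term."""
    return bool(terms(comment) & indicators)

positives = ["the battery does not last long enough",
             "battery dies too fast",
             "battery drains overnight"]
negatives = ["flash is erratic",
             "outside shots are hazy but inside shots are OK",
             "flash does not fire"]

indicators = indicator_terms(positives, negatives)
print(indicators)                                       # {'battery'}
print(has_attribute("my battery is dead", indicators))  # True
print(has_attribute("flash is erratic", indicators))    # False
```

Note that "does" and "not" appear in both sets of examples, so their gap is zero and they are not selected. Swapping the single-word `terms` function for word pairs or character n-grams leaves the rest of the sketch unchanged.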

Linguistic approaches are different. Rather than treating the text as a string of characters, linguistic approaches identify characteristics of words, such as their part of speech and, to some extent, their meaning. In our example, "the battery" would be tagged as a noun phrase and "does not last" would be tagged as an active verb phrase. Phrases such as "the battery" and "the power system" are treated identically for our purposes; similarly, "does not last" and "dies" are equivalent. The rule in this approach is that when a power system phrase appears near a poor performance phrase, the record is flagged as having a battery problem. This level of analysis may be overkill for many problems; however, when text is long and covers two or more topics, linguistic approaches can render more precise distinctions than statistical approaches alone.
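The proximity rule can be illustrated with a short Python sketch. It is a rough stand-in only: real linguistic tools use part-of-speech taggers and phrase recognizers, whereas this sketch assumes two small hand-built phrase lists. The lists, the four-word distance limit and the function names are all hypothetical.

```python
import re

# Assumed phrase lists standing in for real linguistic analysis.
POWER_SYSTEM = ("battery", "power system")
POOR_PERFORMANCE = ("does not last", "dies")

def phrase_positions(words, phrases):
    """Return the word indexes at which any of the phrases starts."""
    positions = []
    for phrase in phrases:
        parts = phrase.split()
        for i in range(len(words) - len(parts) + 1):
            if words[i:i + len(parts)] == parts:
                positions.append(i)
    return positions

def battery_problem(comment, max_distance=4):
    """Flag a comment when a power-system phrase starts within
    max_distance words of a poor-performance phrase."""
    words = re.findall(r"[a-z']+", comment.lower())
    subjects = phrase_positions(words, POWER_SYSTEM)
    complaints = phrase_positions(words, POOR_PERFORMANCE)
    return any(abs(s - c) <= max_distance
               for s in subjects for c in complaints)

print(battery_problem("the battery does not last long enough"))  # True
print(battery_problem("The power system dies after an hour"))    # True
print(battery_problem("the battery is fine but the flash is erratic "
                      "and the zoom motor dies when extended"))  # False
```

The third comment shows why proximity matters in longer text: it mentions both a battery and something that "dies," but the two are far apart and belong to different topics.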

The final step is to apply data mining techniques to the expanded set of structured attributes: the originally structured attributes and those derived from free-form text.
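As a small illustration of this final step, the Python sketch below joins structured attributes with text-derived flags and computes a complaint rate per group. A cross-tabulation like this is only the simplest stand-in for data mining techniques; the customer IDs, the `age_band` field, its values and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical structured attributes, keyed by customer ID.
structured = {
    "C001": {"age_band": "18-34"},
    "C002": {"age_band": "35-54"},
    "C003": {"age_band": "18-34"},
    "C004": {"age_band": "35-54"},
}
# Hypothetical attributes derived from free-form comments.
derived = {
    "C001": {"battery_problem": True},
    "C002": {"battery_problem": False},
    "C003": {"battery_problem": True},
    "C004": {"battery_problem": True},
}

def combine(structured_rows, derived_rows):
    """Join the two attribute sets for customers present in both."""
    return {cid: {**structured_rows[cid], **derived_rows[cid]}
            for cid in structured_rows if cid in derived_rows}

def rate_by(rows, group_field, flag_field):
    """Share of records in each group where the yes/no flag is set."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for row in rows.values():
        totals[row[group_field]] += 1
        flagged[row[group_field]] += int(row[flag_field])
    return {group: flagged[group] / totals[group] for group in totals}

expanded = combine(structured, derived)
print(rate_by(expanded, "age_band", "battery_problem"))
# {'18-34': 1.0, '35-54': 0.5}
```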

When undertaking your first text mining project, keep these basics in mind. You just might find the process isn't all that different from the analysis you do today. Next month, I will discuss tools that provide the capabilities described here.
