Business intelligence and statistical analysis techniques are running out of steam. Or at least that appeared to be the case.

Fireman's Fund Insurance Company, for example, tried a wide range of analytic techniques to understand rising homeowner claims and suspicious auto claims, but could not find predictive patterns in the data. The insurance company's team of analysts, led by one of the authors (Ellingsworth), realized the problem was not with their techniques, but with their data. The analysts were dealing with new types of claims that were not fully described by the structured data collected by the company. Fortunately, the additional information was available in adjuster notes and other free-form texts.

To satisfy the accuracy needs of the modeling programs, the company used basic text mining techniques to isolate new attributes from the text and then combined those with previously available structured data to expand the total amount of relevant usable information. The thinking was that if business intelligence techniques seem inadequate, one should just build a better mousetrap. Fireman's Fund subsequently discovered that success might just mean paying closer attention to the supply chain of information where basic data features originate.

In this article, we will describe a basic text mining technique, term extraction, and discuss how it was successfully used at Fireman's Fund to gain insights into urgent business problems. We will also provide some tips that may be of value when introducing text mining to your own organization.

Term Extraction

Term extraction is the most basic form of text mining. Like all text mining techniques, this one maps information from unstructured data into a structured format. The simplest data structure in text mining is the feature vector, or weighted list of words. The most important words in a text are listed along with a measure of their relative importance. For example, consider the following hypothetical claims adjuster notes:

"The claimant is anxious to settle; mentioned his attorney is willing to negotiate. Also willing to work with us on loss adjustment expenses (LAE) and calculating actual cash value. Unusually familiar with insurance industry terms. Claimant provided unusual level of details about accident, road conditions, weather, etc. Need more detail to calculate the LAE."

This text reduces to a list of terms and weights as shown in Figure 1. This list of terms does not capture the full meaning of the text, but it does identify the key concepts mentioned. To identify key terms, text mining systems perform several operations. First, commonly used words (e.g., the, and, other) are removed. Second, words are stemmed, or replaced by their roots. For example, phoned and phoning are mapped to phone. This makes it possible to measure how often a particular concept appears in a text without worrying about minor variations, such as plural versus singular forms of words.

Figure 1: Example List of Terms and Weights
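The two preprocessing steps just described, stop-word removal and stemming, can be sketched in a few lines. The stop-word list and suffix rules below are illustrative stand-ins for the much fuller lists and algorithms (e.g., Porter stemming) used in real text mining tools.

```python
import re

# Illustrative stop-word list; production systems use far larger ones.
STOP_WORDS = {"the", "and", "is", "to", "a", "of", "with", "his"}

def crude_stem(word):
    """Strip a few common suffixes so 'phoned' and 'phoning'
    reduce to the same root ('phon')."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) > 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize, drop stop words, and stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

# preprocess("The claimant phoned his attorney")
#   -> ['claimant', 'phon', 'attorney']
```

Note that the crude stemmer maps variants to a shared root ("phon") rather than a dictionary word; that is enough for counting, since all that matters is that variants of a concept collapse to one token.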

The final step calculates the weight for each remaining term in a document. There are many methods for calculating these weights, but the most common algorithms use the number of times a word appears in a document (the term frequency, or tf factor) and the number of documents in the collection that contain the word (the document frequency, which is inverted to form the idf factor).1 A large term frequency increases the weight of a term, while a large document frequency (the term appearing in many documents across the collection) lowers it. The general assumption behind this calculation is that terms that appear frequently in a document describe distinguishing concepts unless those terms appear frequently across all texts in the collection.
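The tf*idf weighting described above can be sketched as follows, with idf taken as log(N/df). This is the textbook form; commercial products vary the formula (normalization, smoothing) in product-specific ways.

```python
import math
from collections import Counter

def tfidf(documents):
    """documents: list of token lists.
    Returns one {term: weight} dict per document, weight = tf * log(N/df)."""
    n = len(documents)
    df = Counter()                      # document frequency per term
    for doc in documents:
        df.update(set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)               # term frequency in this document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["claimant", "attorney", "settle"],
        ["claimant", "road", "weather"],
        ["claimant", "attorney", "negotiate"]]
w = tfidf(docs)
# "claimant" occurs in all three documents, so idf = log(3/3) = 0 and its
# weight is zero; "settle" is unique to one document and gets the full
# idf of log(3).
```

This makes the assumption at the end of the paragraph concrete: a term frequent in one document but rare in the collection ("settle") is weighted highest, while a term frequent everywhere ("claimant") is weighted to zero.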

For another example, consider a workers' compensation claims system. As with other insurance applications, it would track demographics about claimants, the location of the accident, the type of accident, etc. It may also include Boolean indicators for common conditions involved in past claims, such as a slippery floor. There are practical limits to the number of such indicators, however, so free-form text is used for additional details.

Narratives could be used to describe activity prior to the accident, unusual environmental conditions, distracting factors, etc. Term extraction could identify key terms in each narrative (e.g., turning, bending, twisting prior to the accident; leaks, ambient temperature, wind conditions in the environment conditions notes; and noise, foot traffic and other distracting factors in the final narrative). By mapping the free-form text to a feature vector, the text is modeled in the same attribute/value model used by structured data and thus lends itself to analysis using traditional business intelligence tools such as ad hoc reports, OLAP analysis, data mining and predictive modeling.
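The mapping from feature vectors to the attribute/value model can be sketched as a small flattening step: each distinct term becomes a column, and each narrative becomes a fixed-width row that reporting, OLAP, and modeling tools can consume. The narratives and terms below are hypothetical.

```python
def to_table(vectors):
    """vectors: list of {term: weight} dicts, one per narrative.
    Returns (columns, rows) with one column per distinct term and
    zeros where a term is absent from a narrative."""
    columns = sorted({t for v in vectors for t in v})
    rows = [[v.get(t, 0) for t in columns] for v in vectors]
    return columns, rows

narratives = [{"twisting": 2, "bending": 1},   # activity-prior-to-accident notes
              {"leak": 1, "wind": 1}]          # environmental-conditions notes
columns, rows = to_table(narratives)
# columns -> ['bending', 'leak', 'twisting', 'wind']
# rows    -> [[1, 0, 2, 0], [0, 1, 0, 1]]
```

Once in this shape, the text-derived columns can be joined to the structured claim record like any other attributes.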

Applications of text mining are not limited to claims processing. Many business transaction applications, such as customer relationship management, e-mail responses, clinical records and enterprise resource planning (ERP), include both structured data (such as numeric measures and coded attributes) and free-form annotations. CRM systems may track detailed descriptions of customer complaints, doctors may note variations in symptoms or special instructions in a patient's chart and ERP systems might track notes on problems in production runs. Free-form notes are used frequently because we cannot always determine all the attributes relevant to a business process.

In some cases, relevancy changes with time. When suits were brought against Firestone for faulty SUV tires, Fireman's Fund turned to free-form text analysis to determine if any of their claims related to the litigation. Unpredictable cases such as this are candidates for text mining-based analysis.

Fireman's Fund Matches Techniques to Problems

Mastering information is a critical competency for success in the insurance industry. As part of an internal consulting group, Ellingsworth is often faced with making new headway on old problems. These problems typically take the form of predicting expected claims and understanding why outcomes vary from those predictions. Only by understanding why outcomes diverge from predictions can the team craft a set of alternative management solutions.

Text mining helps Fireman's Fund in at least three ways: extracting entities and objects for frequency analysis, identifying files with particular attributes for further statistical analysis, and creating entirely new data features for predictive modeling. The first method was used in the Firestone case.

The second method was used when the insurer saw the cost of homeowners' claims soaring in a single state. When the traditional reports failed to provide clarity, the frontline staff was polled to provide suggestions. They indicated that a new type of claim was emerging which involved mold. The effect trailed the occurrence, meaning that by the time it became a serious issue, many cases were already on the books.

Once the company realized the potential liability, it began to examine past claims in an effort to identify claims that required central tracking. Unfortunately, no structured code existed for categorizing and tracking mold risk. The level of effort required to manually examine cases from the prior two years to tag them for this risk was unreasonable. However, by using a handful of known examples, analysts identified patterns in claims using text mining techniques and were able to search for additional files with those patterns. This first-pass filtering was not perfect, but it did yield a much smaller list of files that could be manually coded. While pattern matching based on unstructured data works in some cases, other business problems require more integration of structured and unstructured data.
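The article does not name the exact matching method used for this first pass; one common way to implement such a filter is to score each claim's term vector against a handful of hand-picked seed examples with cosine similarity and keep only the claims that clear a threshold. The vectors and threshold below are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def first_pass(seeds, claims, threshold=0.3):
    """Indices of claims whose best match against any seed clears the
    threshold -- a much shorter list left for manual coding."""
    return [i for i, c in enumerate(claims)
            if max(cosine(s, c) for s in seeds) >= threshold]

seeds = [{"mold": 1.0, "moisture": 0.5}]        # known mold-related examples
claims = [{"mold": 0.8, "leak": 0.2},           # likely mold-related
          {"collision": 1.0}]                   # unrelated
# first_pass(seeds, claims) -> [0]
```

As the article notes, such filtering is not perfect; its value is shrinking thousands of files to a list small enough to review by hand.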

Some Text Mining Tools and Vendors

For more information on commercially available text mining tools, consult:

  • SAS Text Miner
  • IBM Intelligent Miner for Text
  • Insightful Miner for Text
  • Megaputer Text Analysis

Analysts with Fireman's Fund ran into a wall when trying to build a model to predict suspicious claims in third-party automobile accidents. After modeling with all the available structured data, the models were only marginally useful, and the team was eager to try new approaches. During a test and validation iteration, the analysts observed an interesting phenomenon: investigators were reading the full claim file to further categorize cases flagged by the model, assessing the behaviors of the claimants and the facts of the claim scenario. Specific recurring themes in the story of a claim turned out to be the investigators' triggers for further research. That insight prompted the analysts to expose those behavioral features and add them to the modeling process. The result was a model that could identify useful referrals and that stayed current as new information was added to the files in unstructured form over the life of the claim.
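The final step, feeding both data sources to one model, can be sketched as appending text-derived indicators to the structured attributes of each claim. The structured fields and theme vocabulary below are hypothetical, chosen to echo the adjuster notes earlier in the article.

```python
def combine_features(structured, claim_terms, theme_vocab):
    """Return one feature row for a claim: structured values followed by
    a 0/1 indicator for each recurring theme found in the narrative."""
    return structured + [1 if t in claim_terms else 0 for t in theme_vocab]

theme_vocab = ["attorney", "negotiate", "familiar"]   # recurring story themes
structured = [42, 3, 1]      # e.g., claimant age, prior claims, coverage code
row = combine_features(structured, {"attorney", "familiar"}, theme_vocab)
# row -> [42, 3, 1, 1, 0, 1]
```

Because the indicators are recomputed from the notes, rescoring a claim as adjusters add text keeps the model's referrals up to date, which is the property the Fireman's Fund team was after.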

Lessons Learned

Text mining has succeeded at Fireman's Fund because the company focused on business fundamentals. If you are hitting the wall with structured data analysis, consider these tips.

First, focus on enhancing the gains of high economic value projects that are already in place. Marginal improvements through the intelligent use of unstructured data can improve ROI. With these near-term identifiable wins, you can fund further research.

Second, consider which projects failed due to lack of detailed data. Can text mining and term extraction in particular create useful data features that allow you to discover heretofore unknown analytical insights?

Third, remember the keys to success in any information technology project: people, process, technology, philosophy and environment. This is a specialized area, and few organizations are equipped with the right talent to succeed without investing in the ongoing education of their business intelligence analysts (assuming they have them). The processes of information extraction and text categorization are supported by many software vendors. However, the creation of company-specific resources, such as a robust predictive taxonomy, requires at least several iterations with subject-matter experts and automated tools.

Fourth, look for approaches that embed ongoing feedback. Such feedback provides a chance for continued improvement and also permits monitoring for drift in vocabulary and for detecting new topics of interest.

Finally, watch for key indicators of projects to avoid. These include:

  • Lack of an executive sponsor.
  • Lack of a method to show the value to the sponsor.
  • Lack of in-house resources.
  • A determination to "do it all yourself."
  • Fear of finding a qualified consultant.

Text mining is a powerful technique for expanding the range of data we can analyze. Often, the information we need to understand a business process is available to us; we just are not looking in the right spot for it. As Fireman's Fund has shown, text mining complements existing techniques. Solutions to apparently impenetrable problems are found when both structured and unstructured data are used. Sometimes you need more than just a better mousetrap – you need better mice.
