Enterprise content management (ECM) is a broad domain that covers document management, information retrieval and portals. While these are the most widely recognized elements of ECM, a fourth thread, information extraction, is beginning to emerge.

Information extraction is the process of identifying essential pieces of information within a text, mapping them to standard forms and extracting them for use in later processing. At this point, information extraction tools work best at finding the names of persons, places and things; dates and times; and monetary amounts within single documents. These elements, collectively known as named entities, are mapped to a standard form so that their relative frequency in the document can be determined. For example, a news article containing "George Bush," "George W. Bush" and "Bush" would yield a single named entity, "George W. Bush," occurring three times. The relative frequency of these entities is then used to determine the most important named entities in the document. Because the basic operation of information extraction is looking for patterns, the same techniques can be used in a number of applications.
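The George W. Bush example can be sketched in a few lines of Python. This is only an illustration of mapping surface forms to a standard form and counting them; the alias table and the sample sentence are assumptions made up for the sketch, and a real tool would draw its aliases from much larger name databases.

```python
import re
from collections import Counter

# Hypothetical alias table: each surface form maps to one standard form.
ALIASES = {
    "George W. Bush": "George W. Bush",
    "George Bush": "George W. Bush",
    "Bush": "George W. Bush",
}

def count_entities(text, aliases=ALIASES):
    """Count standard-form entities found in text."""
    # Try longer aliases first so "George W. Bush" is not also counted as "Bush".
    forms = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(f) for f in forms) + r")\b")
    return Counter(aliases[m.group(0)] for m in pattern.finditer(text))

# Invented sample text for illustration.
article = (
    "George Bush spoke on Tuesday. George W. Bush later met reporters, "
    "and Bush declined further comment."
)
counts = count_entities(article)
print(counts.most_common())
```

Sorting the counts then gives a rough ranking of the most important entities in the document.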

There are four main reasons to perform information extraction: to improve information retrieval, to extract structured data elements from unstructured text, to reformat content and to mine text.

Enterprise-scale search engines allow users to specify criteria based upon a fixed set of parameters (date of creation, author and category). Some of these parameters, such as creation dates, are easily determined during indexing; and some, such as category, can be determined by a statistical analysis of text. While categorizers can identify merger and acquisition (M&A) news stories with reasonable accuracy, they cannot pinpoint M&A stories about deals worth more than $50 million. This is where categorization plus information extraction is needed to reach the next level of precision in searching.
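As a rough sketch of how categorization and extraction might combine, the Python below filters stories that a categorizer has already labeled, keeping those that mention a dollar amount above a threshold. The story records, the "M&A" label and the amount pattern are all assumptions for illustration. The sketch also treats any amount in the story as the deal value, which a real extractor would need to refine.

```python
import re

# Matches amounts such as "$75 million", "$12.5 million" or "$1,200".
AMOUNT = re.compile(
    r"\$\s?(\d+(?:,\d{3})*(?:\.\d+)?)\s*(million|billion)?", re.IGNORECASE
)
SCALE = {"million": 1_000_000, "billion": 1_000_000_000}

def extract_amounts(text):
    """Return every dollar amount in text as a float number of dollars."""
    amounts = []
    for number, unit in AMOUNT.findall(text):
        value = float(number.replace(",", ""))
        amounts.append(value * SCALE.get(unit.lower(), 1))
    return amounts

def large_deals(stories, category="M&A", threshold=50_000_000):
    """Keep stories in the given category that mention an amount over threshold."""
    return [
        s for s in stories
        if s["category"] == category
        and any(a > threshold for a in extract_amounts(s["text"]))
    ]

# Invented stories; "category" stands in for the output of a categorizer.
stories = [
    {"id": 1, "category": "M&A",
     "text": "Acme agreed to buy Widget Co. for $75 million in cash."},
    {"id": 2, "category": "M&A",
     "text": "Beta Corp. acquired a small rival for $12.5 million."},
    {"id": 3, "category": "Earnings",
     "text": "Gamma reported revenue of $2 billion."},
]
matches = large_deals(stories)
print([s["id"] for s in matches])
```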

When dealing with unstructured texts of limited scope, such as customer e-mails, resumes or financial reports, information extraction techniques can identify and tag typical pieces of information. For example, customer e-mails often contain information about a product, price, delivery date and billing; resumes have contact information and educational history; financial reports contain company names, dates and boilerplate text that identifies sections of structured reports such as SEC 10-K reports. Information extraction techniques can identify, to varying degrees of accuracy, these recurring types of information and map them to a relational format suitable for use with ad hoc query tools.
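A minimal sketch of this mapping for a customer e-mail might look like the following. The product catalog, the field patterns, the table layout and the sample message are all hypothetical; the point is only that recurring fields can be pulled out of free text and loaded into a table that ordinary SQL can query.

```python
import re
import sqlite3

# Hypothetical product catalog and field patterns, chosen only for illustration.
PRODUCTS = ["laser printer", "flatbed scanner"]
PRICE = re.compile(r"\$(\d+(?:\.\d{2})?)")
ISO_DATE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")

def extract_record(email_text):
    """Pull product, price and delivery date out of one e-mail body."""
    lowered = email_text.lower()
    product = next((p for p in PRODUCTS if p in lowered), None)
    price = PRICE.search(email_text)
    date = ISO_DATE.search(email_text)
    return {
        "product": product,
        "price": float(price.group(1)) if price else None,
        "delivery_date": date.group(1) if date else None,
    }

# Invented sample message.
email = (
    "Hello, the laser printer I bought for $129.99 was promised "
    "for 2024-03-15 but has not arrived."
)
record = extract_record(email)

# Load the extracted fields into a relational table and query it ad hoc.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (product TEXT, price REAL, delivery_date TEXT)")
conn.execute(
    "INSERT INTO emails VALUES (:product, :price, :delivery_date)", record
)
rows = conn.execute(
    "SELECT product, delivery_date FROM emails WHERE price > 100"
).fetchall()
print(rows)
```

Fields that cannot be found are stored as NULL, which reflects the varying accuracy of extraction on real messages.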

Another application is reformatting content. HTML content can be mapped to XML schemas. This would be especially useful for static HTML which contains both data and formatting information.
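One way to picture this kind of reformatting is the sketch below, which uses Python's standard html.parser and xml.etree modules to pull cell text out of a static HTML table and emit it as XML elements. The element names (products, product, name, price) and the sample HTML are assumptions for illustration, not a real schema.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class CellCollector(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            if not self.rows:  # tolerate a cell that appears without a row
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Formatting tags such as <b> or <font> are ignored; only text is kept.
        if self._in_cell:
            self.rows[-1][-1] += data.strip()

def html_table_to_xml(html, fields, record_tag="product"):
    """Map each table row with the expected number of cells to one XML record."""
    parser = CellCollector()
    parser.feed(html)
    root = ET.Element(record_tag + "s")
    for row in parser.rows:
        if len(row) != len(fields):
            continue  # skip header rows and malformed rows
        record = ET.SubElement(root, record_tag)
        for name, value in zip(fields, row):
            ET.SubElement(record, name).text = value
    return ET.tostring(root, encoding="unicode")

# Invented static HTML that mixes data with formatting.
html = (
    "<table><tr><th>Item</th><th>Price</th></tr>"
    '<tr><td><b>Widget</b></td><td><font color="red">$19.99</font></td></tr>'
    "</table>"
)
xml_out = html_table_to_xml(html, fields=["name", "price"])
print(xml_out)
```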

Text mining, the process of detecting patterns within and across text documents, depends upon information extraction techniques. By identifying key entities in a text, one can find correlations between terms and identify unsuspected links between related topics. For example, the connection between migraines and magnesium deficiencies was discovered by applying text-mining techniques to abstracts of online medical articles. This type of text mining is especially relevant to research-intensive industries such as pharmaceuticals and genomics. In many organizations, data-mining and text-mining techniques can be used in combination to analyze databases that contain both coded data and free-form text. From CRM to electronic patient records, notes fields are used to save relevant but unusual or unanticipated information that does not neatly fit into coded data elements. This is exactly the type of information we want to find; and without text-mining techniques, we'll miss it.
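The simplest form of correlating terms across documents is counting which extracted entities appear together. The Python sketch below does only that, on toy data invented for illustration; it is not a reconstruction of how the migraine and magnesium link was actually found, and real text-mining tools apply far more sophisticated statistics.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs_entities):
    """Count how many documents mention each pair of entities together."""
    pairs = Counter()
    for entities in docs_entities:
        # Use a sorted set so each pair is counted once per document
        # and always appears in the same order.
        for a, b in combinations(sorted(set(entities)), 2):
            pairs[(a, b)] += 1
    return pairs

# Toy data: entities assumed to have been extracted from three abstracts.
abstracts = [
    ["migraine", "magnesium", "stress"],
    ["migraine", "magnesium"],
    ["stress", "sleep"],
]
pairs = cooccurrence(abstracts)
print(pairs.most_common(1))
```

Pairs that co-occur more often than their individual frequencies would suggest are candidates for a closer look by a domain expert.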

Here are a few things to keep in mind when considering information extraction tools. First, statistical techniques are not sufficient for high-quality information extraction. These tools need gazetteers and other databases of information about names of persons and places. Specialized dictionaries, sometimes called authority files, might be needed to support industry-specific terminology. Some tools are incorporating WordNet, a publicly available lexical database of English from Princeton University, to improve semantic analysis. Choose a tool that is flexible enough to adapt to the demands of your domain.

Second, tools in this category vary widely in functionality. Visual Text, from Text Analysis, is an integrated development environment for building rule-based information extraction and assumes the user has at least a passing familiarity with parsing techniques. Megaputer's TextAnalysis is best for text mining across a wide range of documents. ClearForest's ClearTags is designed for marking up unstructured texts. Whiz Bang Labs develops custom information extraction solutions. Understanding your functional requirements should make the choice of tool clear.

Finally, this is an emerging industry. With the exception of IBM and Insightful, most of the offerings in this area are from vendors specializing in information extraction. This offers the potential for some cutting-edge technology from young, nimble firms. However, they have not had time to develop the kind of track record that some customers require.
