The stereotypical enterprise content management (ECM) application combines elements of document management, Web content management, search and taxonomies, but that is about to change. These techniques are sufficient if you are interested in information retrieval, or simply identifying and presenting a set of articles, documents and Web pages about a particular topic to a user. In many cases, however, solving the information retrieval problem still leaves the user with an unmanageable amount of data. Many of us in ECM expend a great deal of effort developing techniques and domain-specific heuristics to improve the effectiveness of our information retrieval applications. However, these efforts will never address one fundamental and growing problem: even if we correctly retrieve only relevant content, there is still too much information for users to analyze. The next step in the evolution of ECM is the adoption of information extraction techniques, which provide users with distilled information, not just documents.

Consider the problems in medical research and bioinformatics. Technical advances in experimental instruments in these fields have created vast amounts of new information that is published in scientific journals. Much of the information is available online from sources such as Medline, a database of scientific abstracts. With sophisticated search techniques, users can find the abstracts relevant to their work; but they are still left with the task of culling through those documents to find particular pieces of information, such as "protein X activates Y" and "molecule A binds to B at location C." Information extraction techniques that identify patterns such as these allow us to create structured representations of the relationships between objects, such as proteins and genes. Once we have structured representations, we can apply many of the same analytic techniques that have been used in decision support and business intelligence, such as visualization and link analysis.
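To make the idea concrete, here is a minimal sketch of pattern-based relation extraction. The patterns and example sentences are illustrative placeholders, not real biomedical findings; production systems rely on much richer linguistic analysis than simple regular expressions.

```python
import re

# Each pattern captures a subject and object around a relation phrase.
# These are toy patterns for illustration only.
PATTERNS = [
    (re.compile(r"(?P<a>\w+) activates (?P<b>\w+)"), "activates"),
    (re.compile(r"(?P<a>\w+) binds to (?P<b>\w+)"), "binds_to"),
    (re.compile(r"(?P<a>\w+) inhibits (?P<b>\w+)"), "inhibits"),
]

def extract_relations(text):
    """Return (subject, relation, object) triples found in the text."""
    triples = []
    for pattern, relation in PATTERNS:
        for match in pattern.finditer(text):
            triples.append((match.group("a"), relation, match.group("b")))
    return triples

abstract = "Protein RAS activates RAF. Aspirin binds to COX1."
print(extract_relations(abstract))
# [('RAS', 'activates', 'RAF'), ('Aspirin', 'binds_to', 'COX1')]
```

The triples this produces are exactly the structured representations the article describes: rows that can be loaded into a database and analyzed with conventional decision-support tools.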

The most recent KDD Cup data mining competition sponsored by the Association for Computing Machinery (ACM) posed a problem dealing with mining facts from biological research, indicating the need to address both structured and unstructured data sources with analytic techniques. The fact that two commercial ventures, ClearForest Corporation, the winner of the competition, and Verity, an honorable mention, finished in the top ranks along with academic researchers demonstrates the commercial availability of state-of-the-art information extraction.

With a database of facts extracted from text, the tasks performed by researchers and other knowledge workers change. Users no longer search for documents; they search for connections between facts. With a fact database, one can search for a series of links between two entities: for example, A causes B, which inhibits C, which lowers levels of D. The relationship between dietary magnesium deficiencies and migraines was discovered using just such a method with information extraction techniques and medical research abstracts. Of course, this approach is not limited to scientific work. Law enforcement, government agencies and financial services have all used information extraction to manage and analyze large volumes of unstructured data.
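Searching for a series of links between two entities amounts to a path search over the fact database. The sketch below uses a breadth-first search over a handful of invented triples (the facts here are placeholders, not the actual magnesium-migraine literature) to find a chain connecting A to D.

```python
from collections import deque

# Illustrative fact triples; a real fact database would hold
# millions of these, extracted from text.
FACTS = [
    ("A", "causes", "B"),
    ("B", "inhibits", "C"),
    ("C", "lowers", "D"),
    ("A", "binds", "E"),
]

def find_chain(facts, start, goal):
    """Breadth-first search for a chain of facts linking start to goal."""
    edges = {}
    for subj, rel, obj in facts:
        edges.setdefault(subj, []).append((rel, obj))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None  # no chain of facts connects the two entities

print(find_chain(FACTS, "A", "D"))
# [('A', 'causes', 'B'), ('B', 'inhibits', 'C'), ('C', 'lowers', 'D')]
```

Breadth-first search returns the shortest such chain, which is usually what an analyst wants to see first; longer or alternative chains can be enumerated with the same machinery.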

Will information extraction provide an adequate return on investment for your organization? To answer that, consider three factors.

First, information extraction techniques work best with a reasonably homogeneous set of documents, such as scientific abstracts, patent applications, news stories and SEC filings. The specific topics can vary widely even within these limited groups of content, but the types of information extracted are well focused. From scientific abstracts, we can extract relationships between chemical compounds or anatomical relationships. From patent applications, we can find researchers and related patents. From news stories we can find companies, information about earnings reports and new business relationships.

Second, the payoff with information extraction comes when the volume of information is too great to manage manually and the cost of missing information is high. Missing a change in a competitor's sales promotion is not nearly as important as missing an experimental result that could save the cost of early drug trials.

Third, information extraction techniques are not perfect. Facts will be missed, particularly when the text is as linguistically complex as that in many scientific papers. Erroneous facts may be extracted as well. Getting high-quality results from information extraction programs often requires human review and editing.

As we develop better tools and techniques for retrieving information, we realize that even if we had a perfect search tool, we would still have too much information to process. The next step in the evolution of enterprise content management is underway in a few very narrow domains; however, as the tools mature and the need grows, expect to see wider adoption of information extraction.
