The lines that divide business intelligence (BI) and content management are blurring. BI has traditionally been the province of data warehouses, star schemas and numeric data. Content management coexisted with business intelligence in portal applications designed to provide a single point of access for a wide range of information. The problem was, and continues to be, a lack of integration between the two. The situation is understandable. Business intelligence focused on providing information on the aggregate state of an organization while content management focused on collecting and making accessible unstructured assets. Web-based applications replaced their client/server counterparts and eventually led to portals as the linchpin tying BI and content management applications together along with external sources. Now we have well-developed tools in all three areas. Names such as Business Objects, Documentum and Plumtree are as familiar in many organizations as IBM, Sybase and Oracle. The next step in the evolution of these tools is the closer integration of database, content management, business intelligence and portal technologies, and that will be the focus of this monthly column in DM Review.

The first thing to understand about incorporating unstructured content into existing BI infrastructures is that there is no single tool or technique that will meet every need. Instead, a range of applications is now available that tackle the problems of unstructured text from a variety of vantage points. Here are some of the best known, or soon to be best known, vendors and their approaches.

Autonomy sees the challenge of managing unstructured text as a pattern recognition problem. Rather than try to analyze the content of text, Autonomy's tools break text down into small segments that can be compared, counted and manipulated using a combination of Bayesian inference and information theory. Bayesian inference estimates the likelihood of a fact based upon previously seen data. For example, if a user searches for the term "bank" and most instances of "bank" occur in documents about financial institutions, then it is most likely that a user searching for that term is interested in finance and not river banks. Information theory provides the basis for determining how much information can be conveyed in a message or, in our case, a document. Statistical tools, such as Autonomy's, do not depend upon any language-specific knowledge.
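
The "bank" example can be made concrete with a few lines of Python. This is only a sketch of Bayes' rule applied to made-up document counts, not a description of how Autonomy's products are implemented; the counts, the evidence word "fishing" and the function names are all assumptions chosen for illustration.

```python
from fractions import Fraction

# Hypothetical counts: previously seen documents containing "bank", by topic.
SEEN = {"finance": 90, "river": 10}

# Hypothetical counts: of those documents, how many also mention "fishing".
WITH_FISHING = {"finance": 1, "river": 8}


def prior(seen):
    """P(topic): the share of previously seen documents on each topic."""
    total = sum(seen.values())
    return {topic: Fraction(n, total) for topic, n in seen.items()}


def posterior(seen, with_evidence):
    """P(topic | evidence), using Bayes' rule:
    P(topic | evidence) is proportional to P(evidence | topic) * P(topic).
    """
    p = prior(seen)
    unnormalized = {
        topic: Fraction(with_evidence[topic], seen[topic]) * p[topic]
        for topic in seen
    }
    z = sum(unnormalized.values())
    return {topic: value / z for topic, value in unnormalized.items()}


if __name__ == "__main__":
    print(prior(SEEN))                    # estimate from the query "bank" alone
    print(posterior(SEEN, WITH_FISHING))  # estimate once "fishing" is also seen
```

With the query term alone, the estimate simply follows the previously seen data and favors finance. Adding one more observed word shifts the estimate toward river banks, and at no point does the calculation need to know anything about English.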

InXight's LinguistX Platform uses information about the structure and properties of language to analyze text. The LinguistX Platform is used in other InXight products such as Thing Finder and Categorizer as well as Oracle's Open Text (formerly Oracle interMedia Text). Statistical techniques are still used even in language-based tools, but they are not the sole means of analysis. The benefit of this is better precision and recall when searching for content, because known rules about language are used in addition to pattern analysis to disambiguate terms and measure similarity. For example, a search for the "Society for Archeology" should not return the "Society for Architecture" as a similar match simply because of shared sequences of letters. The downside is that linguistic analysis is more complex than pattern recognition, so language-based tools take longer to process text.
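
To see why shared letter sequences alone can mislead, consider a minimal Python sketch that scores two names by the overlap of their character trigrams. The trigram size and the Jaccard measure are assumptions chosen for illustration; this is not how LinguistX or any other product named here works, and it shows only the surface-pattern pitfall, not the linguistic analysis that avoids it.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams in text, ignoring case."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def char_similarity(x, y, n=3):
    """Surface similarity based purely on shared letter sequences."""
    return jaccard(char_ngrams(x, n), char_ngrams(y, n))


if __name__ == "__main__":
    a = "Society for Archeology"
    b = "Society for Architecture"
    print(char_similarity(a, b))                        # 0.5
    print(char_similarity("Archeology", "Architecture"))  # 0.125
```

The two society names share half of their trigrams, almost all of them from the common prefix "Society for Arch", even though the words that carry the meaning have little in common. A tool that recognizes "archeology" and "architecture" as distinct terms can discount that accidental overlap.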

Megaputer's TextAnalyst supports information retrieval like Autonomy or Oracle Open Text, but its most distinguishing feature is its navigation. TextAnalyst allows users to find key terms and their relationships to other terms. For example, while conducting a competitive intelligence analysis of a competitor's patent portfolio, a user could quickly see the relationship between key technologies by examining the co-occurrence of representative terms.
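
Co-occurrence itself is simple to compute. The following Python sketch counts how often pairs of terms appear in the same document; the sample texts, the term list and the function name are hypothetical, matching is limited to single whitespace-separated words, and nothing here reflects how TextAnalyst is actually implemented.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(documents, terms):
    """Count, for each pair of terms, the documents that contain both."""
    wanted = {t.lower() for t in terms}
    counts = Counter()
    for doc in documents:
        # Terms of interest that appear in this document, in a stable order
        # so that each pair is always recorded as the same tuple.
        present = sorted(wanted & set(doc.lower().split()))
        counts.update(combinations(present, 2))
    return counts


if __name__ == "__main__":
    # Hypothetical patent abstracts, invented for this example.
    abstracts = [
        "laser diode with optical amplifier",
        "optical amplifier for fiber networks",
        "laser diode packaging method",
        "fiber laser with optical pumping",
    ]
    pairs = cooccurrence(abstracts, ["laser", "optical", "fiber"])
    for pair, n in pairs.most_common():
        print(pair, n)
```

Pairs with high counts suggest technologies that tend to be discussed together, which is the kind of relationship an analyst would then explore interactively.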

Autonomy, InXight and Megaputer all use different approaches to analyzing text, but they all work from the same basic principle: there is a discernible pattern in text that corresponds to its information content. By analyzing the text and making those patterns explicit, one can develop more effective information retrieval processes. Autonomy seeks statistical patterns, InXight exploits a range of linguistic patterns, and Megaputer bases its analysis on morphological preprocessing and neural-net processing. Clearly, the structure of text can be found and manipulated in a number of ways. For us, the questions are: Which techniques work best in which situations? What are the performance implications? How well do these techniques operate? Can we combine several techniques to offset the weaknesses of individual methods? Of course, these questions are all derived from the one question that really matters: How do we deliver the information to decision-makers when they need it and in the form they can use?

Business intelligence, content management and even knowledge management are overlapping domains without firm boundaries. Data warehousing has primarily focused on structured data, but that is changing as Richard Hackathorn documented in his article "The State of the BI Marketplace" (DM Review, April 2001). "Document Warehousing and Content Management" will examine one aspect of the changing nature of data warehousing: the inclusion of unstructured text into the business intelligence arena. Next month's column will examine how one data warehousing vendor, Oracle, is addressing the need for content with the new Ultra Search application available in Oracle9i.
