Document Warehousing & Content Management

  • October 01 2001, 1:00am EDT

Oracle has unleashed the marketing machine to trumpet the virtues of Oracle9i. Improvements such as multiple block sizes and more dynamic parameters may make DBAs' lives a little easier, and performance improvements in partitions, function-based indexes and materialized views are welcomed by data warehouse designers. A significant improvement has also been made in the content management arena. A new tool, Oracle Ultra Search, gives developers dealing with distributed content the equivalent of an extract, transform and load tool for unstructured text.

Ultra Search offers the ability to dynamically monitor several different types of unstructured content sources and catalog meta data about documents in a centralized repository. In its most basic form, it is akin to a Web search engine for the enterprise. Users specify content sources (the Web, file systems, e-mail servers or databases), how often to check each source and a few other parameters that control the depth of search and the file types to examine. The Ultra Search crawler then compiles meta data about each document and stores that information in a database. (The content itself stays in its original repository, so this tool should not be considered a traditional document management system.) Of course, Oracle provides full-text indexing on documents through Oracle Text.
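The crawl-and-catalog idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Ultra Search's interface: the function name crawl_source, the doc_catalog table and the max_depth and extensions parameters are all hypothetical, a local file system stands in for the content source, and SQLite stands in for the central repository. Note that only meta data is stored; the files stay where they are.

```python
import os
import sqlite3
from datetime import datetime, timezone


def crawl_source(root, conn, max_depth=2, extensions=(".txt", ".html")):
    """Walk one file-system source and catalog meta data only (no content).

    Hypothetical sketch: max_depth is the number of subdirectory levels
    to descend below root; extensions lists the file types to examine.
    Returns the number of documents cataloged.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS doc_catalog ("
        "path TEXT PRIMARY KEY, file_type TEXT, "
        "size_bytes INTEGER, modified TEXT)"
    )
    root = os.path.abspath(root)
    base_depth = root.rstrip(os.sep).count(os.sep)
    cataloged = 0
    for dirpath, dirnames, filenames in os.walk(root):
        depth = dirpath.rstrip(os.sep).count(os.sep) - base_depth
        if depth >= max_depth:
            dirnames[:] = []  # do not descend below the configured depth
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            if ext not in extensions:
                continue  # skip file types the user did not ask for
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            modified = datetime.fromtimestamp(
                info.st_mtime, tz=timezone.utc
            ).isoformat()
            # Re-crawling the same path refreshes its catalog entry.
            conn.execute(
                "INSERT OR REPLACE INTO doc_catalog VALUES (?, ?, ?, ?)",
                (path, ext, info.st_size, modified),
            )
            cataloged += 1
    conn.commit()
    return cataloged
```

Calling crawl_source("/shared/docs", sqlite3.connect("catalog.db")) on a schedule would approximate the "how often to check the source" setting; the path and database name here are placeholders.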

So, what's the big deal? Crawlers have been around almost as long as the Web; the content management space is filled with vendors offering Web-based tools to manage intranets; and search tools such as Autonomy are available for enterprise-scale operations. The significance of this tool, and of others such as IBM's Enterprise Information Portal and InStranet's InStranet 2000, is that incorporating such a tool for managing distributed content signals that unstructured content is recognized as an essential element of enterprise information assets, one that needs management as much as structured data does. Relational databases and the repository model are as much fixtures of current development practices as portals and Java 2 Enterprise Edition (J2EE) architecture. To successfully control the information assets of an organization, we need to handle unstructured text in a structured manner: in a relational database with enterprise-level tools.

There are three broad levels for structuring text, or content in general. At the first level, content, such as a word processing document, is treated as a binary large object and stored along with simple identifying attributes such as a document ID. This level of management works best for vertical applications with simple storage and retrieval requirements.

At the second level of structuring, additional attributes about the type of content are gathered. Typically, these include file type, creation and modification dates, author and access control attributes. With this additional information, more flexible retrieval is possible. However, for the most part, we are still dealing with superficial features.

The third level of structuring is the most useful because it gets inside the document to answer the question, "What is this document about?" Third level structures include full-text indexes, thematic or topical indexes, and summaries. We've had tools to solve parts of the structuring problem, such as full-text indexing programs, thesauri for describing relationships between terms and linguistic tools for creating summaries and extracting key features.
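The three levels can be made concrete with a small Python sketch. Everything here is hypothetical and chosen for illustration: the Document class, its field names and the two helper functions are assumptions, and the third level is reduced to a toy inverted index rather than the thematic indexes or summaries a real product would build.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Document:
    # Level 1: opaque content plus a simple identifying attribute.
    doc_id: int
    content: bytes
    # Level 2: descriptive attributes about the content.
    file_type: str = "unknown"
    author: str = "unknown"
    modified: str = ""


def build_full_text_index(docs):
    """Level 3: map each lowercased term to the doc_ids that contain it."""
    index = defaultdict(set)
    for doc in docs:
        text = doc.content.decode("utf-8", errors="ignore")
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc.doc_id)
    return index


def search(index, query):
    """Return the doc_ids that contain every term in the query."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

With only the first two levels, retrieval is limited to lookups such as "documents by this author"; the index is what allows a query about the document's subject, for example search(index, "sales proposal").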

Of course, we've had database management systems to manage the output of any of these as well. The problem has been lack of integration. Oracle Text and similar tools made significant inroads into integrating unstructured text into OLTP and decision support systems by providing both a storage/retrieval mechanism and content meta data extraction tools, such as theme identification and summary generation programs. That state is analogous to data warehousing five to seven years ago, when we had the means to store and aggregate numeric data but few options other than custom programs for extracting, cleansing and loading the data. With tools such as Ultra Search, we are seeing the emergence of content management tools analogous to data warehousing extract, transform and load tools.

This emergence implies two things. First, vendors understand that organizations need to manage unstructured text that is distributed throughout the enterprise, not just what is intentionally published to the intranet in a dedicated content management system. Customer service representatives need access not just to sales records and return merchandise authorizations, but also to customer e-mails, policy memos and other documents about customers. Users will want functionally related information (e.g., sales figures, product descriptions and marketing material) accessible from a single point. This leads to the second point: Both structured and unstructured data need to be integrated along functional lines. When developing a sales proposal, we do not think linearly. First, we think about numeric measures such as past sales and moving averages, and then we think about unstructured information such as conditions, past contracts and competitor offerings. Decision support and content management systems require integration to support the dynamic way users think about problems.
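As a rough illustration of a single access point organized along functional lines, the Python sketch below returns a customer's sales rows and the meta data of related documents in one call. The schema is entirely assumed: the sales and customer_docs tables, their columns, the sample rows and the function names are hypothetical, with SQLite standing in for whatever database holds both the structured data and the document catalog.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with made-up sample data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (customer_id INTEGER, order_date TEXT, amount REAL)"
    )
    conn.execute(
        "CREATE TABLE customer_docs "
        "(customer_id INTEGER, doc_type TEXT, location TEXT)"
    )
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [(42, "2001-07-15", 1200.0), (42, "2001-09-01", 800.0),
         (7, "2001-08-20", 50.0)],
    )
    conn.executemany(
        "INSERT INTO customer_docs VALUES (?, ?, ?)",
        [(42, "e-mail", "mail://inbox/1001"),
         (42, "contract", "file:///contracts/acme.doc"),
         (7, "memo", "file:///memos/returns.txt")],
    )
    conn.commit()
    return conn


def customer_view(conn, customer_id):
    """Return structured sales rows and related document meta data together."""
    sales = conn.execute(
        "SELECT order_date, amount FROM sales "
        "WHERE customer_id = ? ORDER BY order_date",
        (customer_id,),
    ).fetchall()
    documents = conn.execute(
        "SELECT doc_type, location FROM customer_docs "
        "WHERE customer_id = ? ORDER BY location",
        (customer_id,),
    ).fetchall()
    return {"sales": sales, "documents": documents}
```

The point of the design is that the caller asks one question about one customer and never needs to know that the answers come from two different kinds of store.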
