As enterprise information portals (EIPs) grow to truly enterprise levels, finding relevant information becomes more challenging. Searches of large content repositories, whether we are talking about the Web or an EIP, often suffer two related problems. First, users' queries often return irrelevant information because the chosen keywords have multiple meanings. Second, keyword searching assumes users know the terms that reflect the information they are looking for, a questionable assumption, especially when researching new topics. There is no single solution to providing high-quality universal searches across an enterprise. We will do better to look for improvements by combining tools rather than trying to squeeze marginal improvements out of a single technique. Taxonomy-generation tools, for example, complement search engines and should be considered the second step toward providing high-quality searches in an enterprise information portal.

Taxonomy-generation tools create a Yahoo!-like directory structure for navigating content in a portal or intranet. The process of categorizing content with a taxonomy can start with a predefined set of categories such as those found in an industry thesaurus or an internal organization structure. Some tools, such as Verity's K2 Enterprise, provide automated methods for initial taxonomy construction based on hierarchical clustering, metadata extraction and other techniques. Key terms are associated with categories and provide the link between content and its place in the taxonomy. The final, and recurring, step is analyzing content to determine the most relevant terms and placing the document into the appropriate place in the taxonomy.
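To make the link between key terms and categories concrete, here is a minimal sketch in Python. It is not how any particular product works: the `TAXONOMY` table, the category names, the key terms and the simple overlap score are all assumptions chosen for illustration.

```python
import re

# Hypothetical taxonomy: category path -> key terms (illustrative values only).
TAXONOMY = {
    "Manufacturing/Quality Control": {"defect", "inspection", "tolerance", "sampling"},
    "Marketing/Promotions": {"discount", "campaign", "brochure", "launch"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def categorize(text):
    """Return (category, weight) pairs sorted by descending weight.

    The weight is the fraction of a category's key terms found in the text,
    an assumed scoring rule that stands in for a real categorizer.
    """
    words = set(tokenize(text))
    scores = [
        (category, len(terms & words) / len(terms))
        for category, terms in TAXONOMY.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


doc = "Inspection results show the defect rate is within tolerance."
print(categorize(doc))
```

A real tool would weight terms by relevance rather than count them equally, but the shape is the same: terms extracted from the document are matched against the terms attached to each category.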

With a taxonomy in place, users will have an easier time finding information. First, users do not have to come up with keywords to find information. Someone looking for best practices in quality control will not need to know specific terms about a manufacturing process or statistical measures; he or she only needs to know enough to drill down through a set of choices to find what he or she needs. A second benefit is that taxonomies can categorize all content in a portal or intranet. There are limits to the throughput of categorizers, but crawling the most important areas of an intranet will ensure the taxonomy indexes the most relevant documents and content. This, in turn, will improve the chance users will find what they are looking for in their searches.

It's now time for some truth in advertising. Automatic taxonomy-generation tools are not so automatic. Categorizations vary in accuracy. Scalability is a concern. Improvements in the quality of the taxonomy will take time and require methodical evaluations and adjustments. Let's look at these individually.

Creating a taxonomy requires initial investment in defining the basic taxonomy structure. With the structure in place, the terms that link categories to documents will almost certainly require editing. For example, documentation about a product could end up in the same place as promotional material about the item even though the taxonomy has separate categories for technical documents and marketing material. To prevent this type of misclassification, the categorization rules need to be revised to include filter terms that discriminate between technical and marketing information.
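The effect of filter terms can be sketched as follows. The `RULES` structure, the product name "widget" and the specific filter terms are hypothetical; the point is only that both categories share a key term and the filter terms tell them apart.

```python
import re

# Hypothetical rules: both categories share the product name as a key term,
# so filter ("exclude") terms are what separate them.
RULES = {
    "Technical Documents": {
        "key": {"widget"},
        "exclude": {"discount", "offer", "sale"},
    },
    "Marketing Material": {
        "key": {"widget"},
        "exclude": {"installation", "specification", "voltage"},
    },
}


def matching_categories(text):
    """Return categories whose key terms appear and whose filter terms do not."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [
        category
        for category, rule in RULES.items()
        if rule["key"] & words and not rule["exclude"] & words
    ]


print(matching_categories("Widget installation guide and voltage specification"))
print(matching_categories("Limited offer: 20% discount on every widget"))
```

Without the filter terms, both documents would land in both categories, because each mentions the product by name.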

Categorization is rarely a black or white proposition. Weights associated with assigned categories reflect the relative confidence that the content actually falls into that category. This information is essential for establishing business rules for managing a taxonomy. For example, any document placed into a category with a weight greater than 0.8 is automatically published, anything below 0.5 is rejected and everything in between is sent to a human for a decision. One approach to improving the quality of results is to explicitly support business rules and workflow in the categorization process. Another technique is to use multiple categorization algorithms to avoid the particular shortcomings of individual algorithms.
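The threshold rule in the example above could be written roughly like this. The function name `route` and the returned labels are hypothetical, and treating a weight of exactly 0.8 or 0.5 as a case for human review is an assumption, since the rule as stated does not cover the boundaries.

```python
def route(weight, publish_above=0.8, reject_below=0.5):
    """Decide what to do with a document given its category weight.

    Weights above `publish_above` are published automatically, weights below
    `reject_below` are rejected, and everything else (including the boundary
    values, by assumption) goes to a human reviewer.
    """
    if weight > publish_above:
        return "publish"
    if weight < reject_below:
        return "reject"
    return "review"


for weight in (0.93, 0.65, 0.31):
    print(weight, route(weight))
```

Keeping the thresholds as parameters lets the people who manage the taxonomy tune them per category as they learn how reliable the weights are.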

Any enterprise-class tool must scale, and taxonomy tools are no exception. One possible bottleneck is the clustering algorithm. Fortunately, a new breed of algorithms based upon support vector machines (SVMs) is offering much faster categorizations than some of the more traditional techniques. In addition, taxonomy tools should allow for incremental additions of documents without having to reanalyze content already categorized.

Finally, we need to remember Deming's warning about quality: if we do not measure it, we cannot improve it. The quality of a taxonomy is measured by how well it categorizes content compared to a knowledgeable human. To improve the quality of the taxonomy and its categorization of content, we need to methodically sample and evaluate what has been automatically categorized, especially after significant changes to the categorization rules.
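One simple way to take that measurement is to draw a random sample of categorized documents, have a knowledgeable person label them, and compute how often the two agree. The sketch below assumes labels are stored as plain dictionaries keyed by document ID; the function names and sample data are hypothetical.

```python
import random


def draw_sample(doc_ids, sample_size, seed=0):
    """Pick a reproducible random sample of document IDs for human review."""
    rng = random.Random(seed)
    return rng.sample(sorted(doc_ids), sample_size)


def agreement_rate(auto_labels, human_labels):
    """Fraction of human-reviewed documents where the tool chose the same category."""
    if not human_labels:
        raise ValueError("no human-reviewed documents to compare against")
    matches = sum(
        1
        for doc_id, category in human_labels.items()
        if auto_labels.get(doc_id) == category
    )
    return matches / len(human_labels)


# Hypothetical data: categories assigned by the tool and by a reviewer.
auto = {"d1": "Technical", "d2": "Marketing", "d3": "Technical", "d4": "Legal"}
human = {"d1": "Technical", "d2": "Technical", "d4": "Legal"}

print(draw_sample(auto.keys(), 2))
print(round(agreement_rate(auto, human), 2))
```

Running the same comparison before and after a change to the categorization rules shows whether the change actually helped.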

The taxonomy-generation market is relatively young, but companies such as Semio, Stratify, Quiver and SmartLogik are offering products with a range of functionality. Portal and search vendors, such as Verity, are incorporating taxonomy tools into flagship products. It is not clear what this market will look like a year from now. However, one thing is certain: taxonomies are as essential as search engines in enterprise information portals.
