A growing niche in the content management market is a class of tools that can index and organize distributed content across a range of platforms and make it accessible through a variety of methods. These enterprise-class tools support three distinct models of access: search, navigation and collaboration. In this month's column, we will examine the benefits and drawbacks of the first two types of access, the various approaches to implementing these processes and, most importantly, how to evaluate tools in each category. Next month's column will focus on collaboration.
The first access method is the common search technique. With these tools, users discover relevant content by specifying keywords, phrases and Boolean operators. We all know how effective this can be. Vendors have developed proprietary methods for improving search relevancy by applying some basic rules about word forms, looking at recurring patterns in text and using other statistical analysis methods. While these techniques help, we can't seem to shake one fundamental problem. Regardless of how we try to search, when we improve the chances of finding all relevant content (increasing recall), we tend to increase the number of irrelevant hits (decreasing precision). Similarly, when eliminating irrelevant hits, we tend to miss relevant content.
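To make the tradeoff concrete, here is a minimal sketch of how the two measures can be computed for a single query. The document IDs, the ranked result list and the set of relevant documents are all made up for illustration; in practice the relevance judgments would come from people reviewing results for a sample of real queries.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = share of retrieved documents that are relevant
    recall    = share of relevant documents that were retrieved
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def tradeoff(ranked, relevant, cutoffs):
    """Measure precision and recall when only the top k results are kept."""
    return [(k, *precision_recall(ranked[:k], relevant)) for k in cutoffs]


# Hypothetical ranked results for one query, best match first.
ranked = ["d1", "d2", "d7", "d3", "d9", "d4"]
# Hypothetical set of documents a reviewer judged relevant.
relevant = {"d1", "d2", "d3", "d4"}

for k, p, r in tradeoff(ranked, relevant, [2, 4, 6]):
    print(f"top {k}: precision={p:.2f} recall={r:.2f}")
```

In this toy data, returning more results raises recall from 0.50 to 1.00 while precision falls from 1.00 to 0.67, which is the pattern described above.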
Consequently, the first step in evaluating the effectiveness of a search tool is to understand its rate of precision and recall and how improving one measure affects the other. The speed of indexing content and query response time are also important factors in choosing an enterprise search tool. If you do not have the time or resources for a detailed comparison of enterprise content tools, skip the search tool evaluation and concentrate on evaluating the navigation and organization components instead. There is more variation among vendors' offerings in this area than in the older and better-understood search arena.
In the navigation and organization group, we find categorizers, taxonomy builders and clustering tools. The benefit of these tools is that they allow users to search at higher levels of abstraction. Categorizers assign predefined labels to content, which lets users search with a small set of labels. It's the categorizer, not the user, that must keep track of all the different ways to describe objects in the organization. With a taxonomy, a customer can find a product by browsing a Yahoo!-like directory without having to guess at distinguishing keywords. Clustering brings the added benefit of finding content similar to something the user has already found.
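The "find similar content" idea rests on some measure of how alike two documents are. The sketch below uses cosine similarity over raw word counts, which is one common choice and not necessarily what any particular vendor uses. It shows only the similarity step, not a full clustering algorithm, and the sample documents are invented.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two texts, using raw word counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(found, candidates):
    """Return the candidate document closest to one the user already found."""
    return max(candidates, key=lambda doc: cosine(found, doc))


# Hypothetical content items.
found = "printer cartridge replacement"
candidates = [
    "office chair with armrests",
    "laser printer toner cartridge",
    "printer paper",
]

print(most_similar(found, candidates))
```

A real tool would typically weight words by how rare they are and handle word forms, but the principle of ranking content by closeness to a known item is the same.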
Categorizers and taxonomies either use manually crafted rules or, more commonly now, learn classification rules from examples. There is no single approach to learning from examples that works best in all situations, so vendors are turning either to a combination of algorithms, as in Stratify's case, or to a combination of automatic rule induction and manually crafted business rules, the method adopted by Quiver and Verity.
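As a rough illustration of the second hybrid style, the sketch below checks a few hand-written keyword rules first and falls back to a model learned from labeled examples. It is a toy, not a description of how Quiver, Verity or any other product works: the rule table, the category names and the word-overlap scoring are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical manually crafted business rules: keyword -> category.
MANUAL_RULES = {"invoice": "finance", "resume": "hr"}


def train(examples):
    """Learn per-category word counts from (text, label) pairs."""
    model = {}
    for text, label in examples:
        model.setdefault(label, Counter()).update(text.lower().split())
    return model


def categorize(text, model):
    """Apply manual rules first, then fall back to the learned model."""
    words = text.lower().split()
    for keyword, label in MANUAL_RULES.items():
        if keyword in words:
            return label
    # Score each category by how often its training words appear in the text.
    scores = {label: sum(counts[w] for w in words) for label, counts in model.items()}
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] == 0:
        return "uncategorized"
    return best


# Hypothetical training examples.
examples = [
    ("quarterly budget forecast", "finance"),
    ("server outage report", "it"),
    ("network server upgrade", "it"),
]
model = train(examples)

print(categorize("invoice for server", model))   # manual rule wins
print(categorize("server upgrade plan", model))  # learned model decides
print(categorize("lunch menu", model))           # nothing matches
```

Letting the manual rules override the learned model is one design choice among several; a tool could equally use the rules only when the model is unsure.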
You should consider three criteria when evaluating categorization and taxonomy tools: the accuracy of classification, the number of examples required to train the categorizer and the speed of training.
The accuracy of categorization depends upon the quality of the training examples and the effectiveness of the underlying algorithm. Tool developers cannot control the quality of examples, but they do choose their algorithms. While many of us are more interested in integration, customization and cost, we have to pay attention to what occurs under the hood. Some vendors discuss their algorithms prominently, while others tuck the details away in technical white papers. In either case, make vendors provide comparative results between their algorithms and their competitors' algorithms. It is easy for vendors to say they have the best categorizer. Make them prove it.
With enough training examples, categorizers will reach acceptable levels of performance. The question is how many examples are required: 1,000 or 10,000? Compiling training examples can be time-consuming, so consider the hidden cost of staff resources when implementing and maintaining categorizers. Also, compare tools with regard to the time it takes to execute the training cycle. This depends on both the number of examples and the underlying algorithm. Again, make vendors provide some concrete numbers.
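One way to get such numbers yourself is to train on progressively larger slices of your examples and record accuracy and training time at each size. The harness below is a sketch under simplifying assumptions: the word-count classifier stands in for whatever tool is being evaluated, and the handful of training and test examples are invented, so the timings it reports are meaningless except as a demonstration of the method.

```python
import time
from collections import Counter


def train(examples):
    """Stand-in trainer: per-category word counts from (text, label) pairs."""
    model = {}
    for text, label in examples:
        model.setdefault(label, Counter()).update(text.lower().split())
    return model


def predict(model, text):
    """Pick the category whose training words overlap the text most."""
    words = text.lower().split()
    scores = {label: sum(c[w] for w in words) for label, c in model.items()}
    return max(scores, key=scores.get) if scores else None


def learning_curve(train_set, test_set, sizes):
    """Return (examples used, accuracy, training seconds) for each size."""
    rows = []
    for n in sizes:
        start = time.perf_counter()
        model = train(train_set[:n])
        elapsed = time.perf_counter() - start
        correct = sum(predict(model, text) == label for text, label in test_set)
        rows.append((n, correct / len(test_set), elapsed))
    return rows


# Hypothetical labeled examples, held-out test items kept separate.
train_set = [
    ("budget forecast revenue", "finance"),
    ("server outage network", "it"),
    ("invoice payment revenue", "finance"),
    ("network router upgrade", "it"),
]
test_set = [("revenue payment due", "finance"), ("router outage", "it")]

for n, accuracy, seconds in learning_curve(train_set, test_set, [1, 2, 4]):
    print(f"{n} examples: accuracy={accuracy:.2f}, training took {seconds:.6f}s")
```

Run against a real tool and a realistic test set, the point where accuracy stops improving indicates roughly how many examples you need to compile, and the timing column shows what each retraining cycle will cost.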
There is no silver bullet in enterprise content management. No single algorithm works best in all situations, but we are seeing a trend toward hybrid approaches that combine either multiple algorithms or automatic and manual methods. Fully automated methods may not reach necessary accuracy levels without large numbers of high-quality training examples. Manually crafted business rules come with obvious overhead. When evaluating these tools, it is essential to understand the tradeoffs that vendors make, such as higher accuracy at the expense of slower training cycles. Make sure your objectives align with the strengths of the tool you implement.