Automatic Classification: Moving to the Mainstream
Information Management Magazine, April 2003
This is the third in a series of articles discussing various aspects of unstructured data.
As depicted in Figure 1, a number of "content intelligence" techniques are available to work with semistructured documents. The last two techniques in the list, basic search and advanced search (both discussed in previous articles of this series), have been broadly integrated into enterprise applications.

Figure 1: Content Intelligence
Classification systems are the next frontier. Just as a schema represents the contents of a database and enables queries, an information hierarchy, or taxonomy, organizes a repository of semistructured documents. This lets the user navigate a hierarchical structure of categories analogous to the folder systems of Microsoft Windows and the Macintosh OS, which research has shown to be highly effective for locating information.
Classification has long been used in libraries to organize books, periodicals and other texts, as well as to organize technical collections. In 1995, however, classification leapt into public view. That year, Yahoo! introduced a Web site, later known as a Web portal, which organized a broad variety of information into categories. Soon, portals covering every conceivable subject area appeared. Unlike many of the other portals, Yahoo!'s directory of Web sites spans thousands of categories and relies on teams of human editors who manually classify news and information. For most organizations, this manual approach is economically infeasible. This, in turn, has stimulated a flurry of research into automatic classification methods. That research has resulted in a new generation of technologies and products; some of the leading vendors offering these products are identified in Figure 2.

Figure 2: Vendors of Automatic Classification Systems
Yet the question remains: Will this new generation of products be able to carry automatic classification into the mainstream enterprise market? Before we attempt to answer this question, let's first look at the latest technology. We'll then examine several representative applications as well as the tools used to construct them.
From Words to Concepts
The goal of all text classification is to assign documents to one or more content categories. While categories are generally predefined, they may be automatically generated based on the content. Any type of document containing text can be classified, including traditional documents such as reports and memos as well as e-mails, Web pages, call-center notes and other less traditional types. Classification is performed either on a document repository, such as a library, or on a stream of incoming documents, such as those arriving from a news agency or field sales staff.
In the past few years, vendors have introduced the ability to extract and classify concepts rather than words. This has required considerable language processing. As illustrated in Figure 3, words are first stemmed; that is, they are reduced to their root form. Next, stop words are eliminated. These are words such as "a," "an," "in" and "the," which add little semantic information. Then, words with similar meanings are equated using a thesaurus. In the example, the words IBM, Big Blue and International Business Machines are treated as equivalent.

Figure 3: Classification Example
Finally, the classifying tool will use statistical or language-processing techniques to identify noun phrases, or concepts, such as "Polaris missile" or "red bicycle." In the example, six noun phrases are identified. Further, using a thesaurus or lexicon, these noun phrases are reduced to three distinct concepts that will be associated with the document. In the example, there are three instances of IBM, two instances of acquisition and one instance of Widget, Inc.
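The steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample sentence, the stop-word list, the synonym table and the crude suffix-stripping stemmer are all hypothetical stand-ins (the text of Figure 3 is not reproduced here), and a simple thesaurus lookup takes the place of the statistical or language-processing noun-phrase detection a commercial tool would use.

```python
import re
from collections import Counter

# Hypothetical stop-word list; real tools ship far larger ones.
STOP_WORDS = {"a", "an", "in", "the", "that", "it", "has"}

# Hypothetical thesaurus: each concept maps to phrases treated as equivalent.
CONCEPT_SYNONYMS = {
    "IBM": ["IBM", "Big Blue", "International Business Machines"],
    "acquisition": ["acquisition", "acquire", "acquired"],
    "Widget, Inc.": ["Widget Inc"],
}


def stem(word):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, split into words, stem, then drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    stems = [stem(w) for w in words]
    return [s for s in stems if s not in STOP_WORDS]


# Index the thesaurus by normalized phrase so it matches normalized text.
LOOKUP = {
    tuple(normalize(phrase)): concept
    for concept, phrases in CONCEPT_SYNONYMS.items()
    for phrase in phrases
}
MAX_LEN = max(len(key) for key in LOOKUP)


def extract_concepts(text):
    """Count concepts, preferring the longest matching phrase at each position."""
    tokens = normalize(text)
    counts = Counter()
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            concept = LOOKUP.get(tuple(tokens[i:i + n]))
            if concept is not None:
                counts[concept] += 1
                i += n
                break
        else:
            i += 1  # no phrase starts here; move on
    return counts


# Made-up sample text, not the text shown in Figure 3.
SAMPLE = (
    "IBM announced that it has acquired Widget, Inc. Big Blue said the "
    "acquisition strengthens International Business Machines in services."
)
print(extract_concepts(SAMPLE))
```

On this sample sentence the sketch finds six phrase matches that collapse to three concepts: three instances of IBM, two of acquisition and one of Widget, Inc.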
Approaches to Classification
As illustrated in Figure 4, there are four main approaches to classification:

Figure 4: Text Classification Approaches
(Figure courtesy of Stratify Inc.)
Manual: Often used in library and technical collections as well as in call centers and forms-processing environments, manual classification requires individuals to assign each document to one or more categories. These individuals are usually domain experts who are thoroughly versed in the category structure or taxonomy being used. Manual classification can achieve a high degree of accuracy – although even domain experts will occasionally disagree on how to categorize a document. However, manual classification is more labor-intensive and therefore more costly than automated techniques.
Rule-Based: In this form of classification, keywords or Boolean expressions are used to categorize a document. This is typically used when a few words can adequately describe a category. For example, if a collection of medical papers is to be classified according to a disease, then a medical thesaurus that lists each disease together with its scientific, common and alternative names can be used to define the keywords for each category.
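A minimal sketch of the medical example might look like the following. The category names and synonym lists are hypothetical placeholders for entries that would come from a real medical thesaurus, and the whole-word matching is one plausible choice rather than a description of any vendor's product.

```python
import re

# Hypothetical excerpt of a medical thesaurus: category -> accepted names.
DISEASE_KEYWORDS = {
    "influenza": ["influenza", "flu", "grippe"],
    "hypertension": ["hypertension", "high blood pressure"],
    "myocardial infarction": ["myocardial infarction", "heart attack"],
}


def classify(text, keywords=DISEASE_KEYWORDS):
    """Return every category with at least one keyword present in the text."""
    matches = []
    for category, names in keywords.items():
        for name in names:
            # \b ensures "flu" does not match inside "influenza" or "fluid".
            pattern = r"\b" + re.escape(name) + r"\b"
            if re.search(pattern, text, re.IGNORECASE):
                matches.append(category)
                break  # one hit is enough for this category
    return sorted(matches)


print(classify("Seasonal flu outcomes in patients with high blood pressure"))
```

A paper can land in several categories at once, and one that mentions none of the listed names is left unclassified, which is why this approach suits categories that a few words can adequately describe.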
Another example involves e-mail systems, which typically provide rule-based methods for routing messages to specific mailboxes. The e-mails are routed based either on the sender's name or on the occurrence of specific words in the subject line. For example, the occurrence of the word "remove" would cause the sender's name to be dropped from an e-mail list.
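The e-mail case can be sketched the same way. Everything named here (the Email record, the rule table, the mailbox names and the addresses) is a hypothetical illustration, not the interface of any particular mail system.

```python
import re
from dataclasses import dataclass


@dataclass
class Email:
    sender: str
    subject: str


# Hypothetical rules: (field to inspect, value to look for, destination).
# Rules are tried in order; the first match wins.
RULES = [
    ("subject", "remove", "unsubscribe"),
    ("sender", "billing@example.com", "accounts"),
    ("subject", "invoice", "accounts"),
]


def route(email, rules=RULES, default="inbox"):
    """Return the mailbox chosen by the first matching rule."""
    for field, word, mailbox in rules:
        value = getattr(email, field).lower()
        if field == "sender":
            matched = value == word  # exact sender match
        else:
            # whole-word match in the subject line
            matched = re.search(r"\b" + re.escape(word) + r"\b", value) is not None
        if matched:
            return mailbox
    return default


def process(email, mailing_list):
    """Route the message; a "remove" request drops the sender from the list."""
    mailbox = route(email)
    if mailbox == "unsubscribe":
        mailing_list.discard(email.sender)
    return mailbox


subscribers = {"ann@example.com", "bob@example.com"}
print(process(Email("bob@example.com", "Please remove me"), subscribers))
print(subscribers)
```

Because the first matching rule wins, rule order matters: a message from the billing address whose subject contains "remove" would be treated as an unsubscribe request in this sketch.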