Editor's Note: This article is featured in the 2001 Resource Guide, a supplement to the December issue of DM Review.
Although data warehouses are widely adopted, most fail to tap the business intelligence potential of text. To date, the focus has been on developing data warehouses geared to support primarily numeric data, and the payoff has been enormous. Enterprises now have at their disposal a suite of proven practices and methodologies along with mature tools for number-centric data warehousing. It is now time to focus on the business intelligence value of text and the role of text mining techniques in harnessing this relatively untapped source of business intelligence.
Why Bother with Text?
There are two primary reasons to take on the challenges of text for business intelligence. First, there is far too much critical information that remains inaccessible in documents. Business intelligence systems driven by data warehouses excel at telling us what happened when, but they are not very good at answering why. We can easily discover that a product's sales margins decreased by 15 percent in the last quarter in the southeast region without knowing the cause. Did a competitor release a higher quality, lower price alternative? Were the margins sacrificed on this product as part of a cross- selling campaign? Did the manufacturer license another distributor in the southeast, thus creating new competition?
The answers to these and other questions are buried in documents ranging from e-mails, status memos, news stories and press releases to complex documents such as marketing campaigns, contracts, regulatory agency filings and government reports. To extend the depth of business intelligence, text must be considered.
The second reason to address text directly is that traditional document and text management tools are inadequate to meet the demands of business intelligence. File systems provide crude searching and pattern matching utilities. Document management systems work well with homogeneous collections of documents but not with the heterogeneous mix that knowledge workers face every day. Even the best Internet search tools suffer from poor precision and recall. (Precision is a measure of how many documents returned from a search actually meet the intended query criterion. Recall measures the percentage of documents returned versus how many should have been returned.) Finally, documents are spread across platforms in different formats and languages with little useful meta data about the content of the documents. This same type of dispersion of data is a driving factor in the development of many data warehouses. Business intelligence users need, and have become accustomed to, an integrated view of their organization without regard to the original source or distribution of the raw data. Logically, text is just another medium for conveying information and, thus, belongs within the realm of business intelligence systems.
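The precision and recall measures defined above are easy to make concrete. The sketch below computes both for a single search result, using hypothetical document-id sets (the function name and data are illustrative, not from any particular tool):

```python
def precision_recall(returned, relevant):
    """Compute precision and recall for one search.

    returned: set of document ids the search returned
    relevant: set of document ids that actually meet the query criterion
    """
    hits = returned & relevant  # documents both returned and relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# A search returns 4 documents; 2 of them are among the 5 relevant ones.
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 5, 6, 7})
print(p, r)  # 0.5 precision, 0.4 recall
```

Poor Internet search tools tend to fail in both directions at once: many returned documents are off-topic (low precision) while many relevant documents are never returned (low recall).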
However, text is different. It is not structured like the numeric measures we are accustomed to dealing with. Or is it? Although text is often described as unstructured, that is far from the truth. Language is richly structured at multiple levels as linguists have aptly discovered. Structural principles are found in the formation of words (morphology), the creation of grammatical sentences (syntax) and the representation of meaning (semantics). Even higher levels of structure can be found in discourses and conversations as described by speech act theory. If we can analyze the structure of language, we can extract the information conveyed by text. Fortunately, after decades of foundational work in computational linguistics, tools are now available to delve into the complex structures of text and extract vital business information.
Text Mining: The Basics
Text mining is the study and practice of extracting information from text using the principles of computational linguistics. Certainly, AWK, grep and other pattern matching tools can extract information from text files, but these do not fall within the realm of text mining tools. For our purposes, the key areas of text mining include:
- Feature extraction
- Thematic indexing
- Clustering
- Summarization
These four techniques are essential because they solve two key problems with using text in business intelligence: they make textual information accessible, and they reduce the volume of text that must be read by end users before information is found.
Feature extraction deals with finding particular pieces of information within a text. The target information can be of a general form such as type descriptions or business relationships. Identifying Alpha Industries as a corporation is an example of the former, while Alpha Industries, a wholly owned subsidiary of Beta Enterprises, and Margaret Johnson, president and CEO of Gamma Group, Inc., are examples of business relationships. Feature extraction can also be pattern-driven. For example, applications analyzing merger and acquisition stories may extract names of the companies involved, cost, funding mechanisms and whether or not regulatory approval is required.
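Pattern-driven feature extraction can be sketched with ordinary regular expressions, though commercial tools use far richer linguistic analysis. The patterns and sample story below are illustrative only, built around the business relationships mentioned above:

```python
import re

# Hypothetical patterns for two relationship types discussed above:
# "X, a wholly owned subsidiary of Y" and "P, president and CEO of O".
SUBSIDIARY = re.compile(
    r"(?P<child>[A-Z][\w ]+?), a wholly owned subsidiary of (?P<parent>[A-Z][\w ]+)")
OFFICER = re.compile(
    r"(?P<person>[A-Z][\w. ]+?), (?P<title>president and CEO|CEO|president) of (?P<org>[A-Z][\w ]+)")

def extract_relationships(text):
    """Return (relation, subject, object) triples found in the text."""
    features = []
    for m in SUBSIDIARY.finditer(text):
        features.append(("subsidiary-of", m.group("child"), m.group("parent")))
    for m in OFFICER.finditer(text):
        features.append(("officer-of", m.group("person"), m.group("org")))
    return features

story = ("Alpha Industries, a wholly owned subsidiary of Beta Enterprises, "
         "announced that Margaret Johnson, president and CEO of Gamma Group, "
         "will join its board.")
print(extract_relationships(story))
```

Real text mining tools go well beyond such surface patterns, using morphological and syntactic analysis to catch the many ways a relationship can be phrased.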
Thematic indexing uses knowledge about the meaning of words in a text to identify broad topics covered in a document. For example, documents about aspirin and ibuprofen might both be classified under pain relievers or analgesics. Thematic indexing such as this is often implemented using multidimensional taxonomies. A taxonomy, in the text mining sense, is a hierarchical knowledge representation scheme. This construct, sometimes called an ontology to distinguish it from navigational taxonomies such as Yahoo's, provides the means to search for documents about a topic instead of documents with particular keywords. For example, an analyst researching mobile communications should be able to search for documents about wireless protocols without having to know key phrases such as wireless application protocol (WAP).
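The core mechanism is simple: walk a term up the hierarchy to find the broader topics it falls under. The toy taxonomy below is a sketch using the examples above; the terms and structure are illustrative, not from any real ontology:

```python
# A toy taxonomy, sketched as a child -> parent mapping.
TAXONOMY = {
    "aspirin": "analgesics",
    "ibuprofen": "analgesics",
    "analgesics": "pharmaceuticals",
    "WAP": "wireless protocols",
    "wireless protocols": "mobile communications",
}

def themes(term):
    """Walk up the taxonomy, returning the term and all broader topics."""
    result = [term]
    while term in TAXONOMY:
        term = TAXONOMY[term]
        result.append(term)
    return result

print(themes("aspirin"))  # ['aspirin', 'analgesics', 'pharmaceuticals']
print(themes("WAP"))      # ['WAP', 'wireless protocols', 'mobile communications']
```

Indexing a document under every topic returned by such a walk is what lets the analyst find WAP documents by searching for wireless protocols.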
Clustering is another text mining technique with applications in business intelligence. Clustering groups similar documents according to dominant features. In text mining and information retrieval, a weighted feature vector is frequently used to describe a document. These feature vectors contain a list of the main themes or keywords along with a numeric weight indicating the relative importance of the theme or term to the document as a whole. Unlike data mining applications that use a fixed set of features for all analyzed items (e.g., age, income, gender), documents are described with a small number of terms or themes chosen from potentially thousands of possible dimensions. For example, a news story about Malaysia trade policies might have a feature vector as illustrated in Figure 1. Figure 2 provides an example of a feature vector for an article about the Euro.
Figure 1: Feature Vector for Story on Malaysian Trade Policies
Although the two vectors share a dimension in common, most are different. The result is that unlike the relatively dense dimensional models in OLAP applications, dimensional models for documents are extremely sparse.
Figure 2: Feature Vector for Article on Euro
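Because of this sparsity, document similarity is typically computed with measures such as the cosine of the angle between two feature vectors, treating absent dimensions as zero. The vectors below are hypothetical stand-ins in the spirit of Figures 1 and 2 (the themes and weights are invented for illustration):

```python
import math

# Hypothetical sparse feature vectors: each document keeps only a handful
# of weighted themes out of thousands of possible dimensions.
malaysia_story = {"Malaysia": 0.9, "trade policy": 0.8, "tariffs": 0.5, "currency": 0.3}
euro_article   = {"Euro": 0.9, "monetary policy": 0.7, "currency": 0.6}

def cosine(a, b):
    """Cosine similarity over sparse vectors; missing dimensions count as 0."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# The two vectors share only the "currency" dimension, so similarity is low.
print(round(cosine(malaysia_story, euro_article), 3))  # about 0.104
```

Representing each vector as a dictionary of theme weights, rather than a full array over every possible theme, is what keeps the sparse model tractable.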
There is no single, best way to deal with document clustering; but three approaches are often used: hierarchical clusters, binary clusters and self-organizing maps. Hierarchical clusters use a set-based approach. The root of the hierarchy is the set of all documents in a collection, and the leaf nodes are sets with individual documents. Intervening layers between the leaves and the root contain progressively larger sets of documents, grouped by similarity. Binary clusters are similar to k-means clustering in data mining. Each document is in one and only one cluster, and clusters are created to maximize the similarity measures between documents in a cluster and minimize the similarity measure between documents in different clusters. Self-organizing maps (SOMs) use neural networks to map documents from sparse high-dimensional spaces into two-dimensional maps. Similar documents tend to map to the same or nearby positions in the two-dimensional grid.
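The binary (partitional) idea can be sketched with a greedy single pass: each document joins the first cluster whose seed it resembles closely enough, or starts a new one. This is a rough stand-in for k-means-style methods, with invented documents and an arbitrary similarity threshold:

```python
import math

def cosine(a, b):
    """Cosine similarity over sparse feature vectors (dicts of weights)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def binary_cluster(docs, threshold=0.5):
    """Greedy one-pass partitioning: every document lands in exactly one
    cluster -- the first whose seed it resembles -- or seeds a new one."""
    clusters = []  # list of (seed_vector, [document names])
    for name, vec in docs.items():
        for seed, members in clusters:
            if cosine(vec, seed) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((vec, [name]))
    return [members for _, members in clusters]

docs = {
    "story1": {"Malaysia": 0.9, "trade policy": 0.8},
    "story2": {"Malaysia": 0.7, "tariffs": 0.6, "trade policy": 0.4},
    "story3": {"Euro": 0.9, "monetary policy": 0.7},
}
print(binary_cluster(docs))  # [['story1', 'story2'], ['story3']]
```

Production methods iterate, recompute centroids and choose the number of clusters more carefully; the sketch only shows the one-document-one-cluster property that defines the binary approach.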
The last text mining technique is summarization. The purpose of summarization is to describe the content of a document while reducing the amount of text a user must read. The main ideas of most documents can be described with as little as 20 percent of the original text. Little is lost by summarizing. Like clustering, there is no single summarization algorithm. Most use morphological analysis of words to identify the most frequently used terms while eliminating words that carry little meaning, such as the articles the, an and a. Some algorithms weight terms used in opening or closing sentences more heavily than other terms, while some approaches look for key phrases that identify important sentences such as in conclusion and most importantly.
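A minimal frequency-based extractive summarizer follows the recipe above: count content words, score each sentence by the average frequency of its terms and keep the top-scoring sentences in document order. The stopword list and sample text are illustrative:

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real systems use much larger ones
# plus morphological analysis to merge word forms.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "that", "for", "it"}

def summarize(text, n=2):
    """Extractive summary: keep the n sentences whose content words are
    most frequent in the document, preserving their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)
    def score(s):
        terms = [w for w in re.findall(r"[a-z']+", s.lower()) if w not in STOPWORDS]
        return sum(freq[t] for t in terms) / (len(terms) or 1)
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n])
    return " ".join(sentences[i] for i in keep)

text = ("Data warehouses store numeric data. "
        "Text mining extracts information from text. "
        "Text mining tools analyze text structure. "
        "The weather was pleasant.")
print(summarize(text, n=2))
```

The positional and cue-phrase refinements mentioned above would simply adjust the sentence scores, boosting opening sentences or those containing phrases like "in conclusion."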
With these techniques in hand, it is time to turn to the issue of integrating text in the data warehouse.
Extending the Warehouse
Extending the data warehouse to support documents and text mining will require new data structures as well as new tools.
Accommodating text in the warehouse requires support for the text itself along with its meta data. Storing documents is not a problem for RDBMSs that support binary large objects. Some, such as Oracle8i, provide direct support for documents in the warehouse.
Documents are meta data-intensive objects. In general, the data warehouse should support meta data about document source, analysis and content. Source meta data describes where a document originated and when it was loaded, along with quality and timeliness information. Analysis meta data drives the type of text mining performed on documents. For example, e-mails should not be summarized, but they are good candidates for clustering using self-organizing maps. Content meta data should include at least the attributes delineated in the Dublin Core, a meta data standard for Internet resources. The Dublin Core includes title, creator, subject, description, dates of publication, copyrights, format and relationships to other works. Content meta data will also include information mined during text analysis, such as features and business relationships mentioned in the text.
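The three categories of document meta data can be sketched as a single record. The field names below are illustrative (the content fields loosely follow the Dublin Core attributes), not a prescribed warehouse schema:

```python
from dataclasses import dataclass, field

@dataclass
class DocumentMetadata:
    # Source meta data: origin, load time, quality/timeliness notes
    origin: str
    loaded_at: str
    quality_note: str = ""
    # Analysis meta data: which text mining steps to apply
    mining_steps: list = field(default_factory=list)
    # Content meta data: Dublin Core-style attributes plus mined features
    title: str = ""
    creator: str = ""
    subject: str = ""
    description: str = ""
    extracted_features: list = field(default_factory=list)

# Per the guideline above, an e-mail is clustered rather than summarized.
email = DocumentMetadata(
    origin="mail server", loaded_at="2001-11-30",
    mining_steps=["clustering"],
    title="Re: Q3 margins", creator="jdoe@example.com")
print(email.mining_steps)  # ['clustering']
```

Keeping the analysis meta data alongside the document lets the load process choose mining steps per document type instead of hard-coding them in the tools.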
Working with text will require additional tools. Although some features are built into database systems, additional functionality is needed to take full advantage of text mining. IBM Intelligent Miner for Text (www.ibm.com) includes summarization, document classification and clustering tools. Oracle Intermedia Text (www.oracle.com) provides thematic indexing and summarization right in the database. Specialty tools, such as Megaputer's Text Analyst (www.megaputer.com), provide text mining functionality through COM objects for custom-built applications. Semio's (www.semio.co) taxonomy-generation tool can be used to automate the creation of ontologies while Mohomine's (www.mohomine.com) tool suite includes Web crawlers and document classifiers. Of course, it is the end user's needs that will ultimately drive the set of tools required for a particular application.
Text: The Next Dimension
If business information were an iceberg, text would be the bulk of the glacial object hidden below the surface and usually forgotten. Fortunately, things are changing. Commercial quality text mining tools are available, and database vendors are recognizing the need to manage text along with numeric data. The Internet provides a wealth of raw material to complement internal documents. Whether a user needs to understand why an anomalous pattern is showing up in the data warehouse, monitor market conditions or conduct competitive intelligence research, text is central to meeting those business intelligence needs. The time has come to accommodate documents within the workhorse of business intelligence: the data warehouse.