In my previous column (see the February 2004 issue of DM Review), we looked at the high-level aspects of a text mining project. This month, we'll turn our attention to assessing text mining tools available for the job.
The growing need for text analysis is leading to a proliferation of tools. Choosing among them does not need to be as difficult as it might initially look. Some key considerations to keep in mind are: extract, transform and load (ETL) capabilities; language requirements; preprocessing; interactive tools; and scalability.
Look for ETL capabilities in the tool. If you are dealing with descriptive notes in a database, you'll want a tool that accesses them directly. You can extract free-form text from the database, analyze it and merge the results with the associated structured data, but this becomes more of a challenge as the size of the database increases. If your text is in documents, look for file conversion modules that can extract text from your source formats (e.g., .doc, .pdf, .ps). Most file format conversion programs will not extract text embedded in graphics. This could be an issue for engineering and some scientific documents.
Many linguistic tools are designed for the English language. If you are working with other languages, make sure the tool can at least process the character set and break the text into lexical tokens. Inxight's core linguistic processing system, LinguistX, may provide the broadest language support, as it covers dozens of languages. Inxight's feature extraction tool, ThingFinder, supports English, French, German, Spanish, Arabic, Farsi and simplified Chinese.
Some form of preprocessing is required with most text. When analyzing database notes and comments, look for a tool that supports synonym lists. Analysis will be more effective if you can map terms such as "mgr," "mang" and "manager" to a single word. Most linguistic-oriented tools also support stemming, which identifies the root of a word. Noun-phrase extraction can also improve the quality of analysis by allowing you to focus on concepts rather than simple terms.
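To make this concrete, here is a minimal sketch of that kind of preprocessing pipeline. The synonym list and the crude suffix-stripping stemmer are illustrative stand-ins, not any particular vendor's implementation; commercial tools use full linguistic stemmers.

```python
import re

# Illustrative synonym list mapping note-taking shorthand to one canonical term.
SYNONYMS = {"mgr": "manager", "mang": "manager"}

def simple_stem(word):
    # Crude suffix stripping for illustration; real tools use proper
    # stemmers (e.g., the Porter algorithm) or dictionary-based lemmatizers.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [SYNONYMS.get(t, t) for t in tokens]  # normalize synonyms first
    return [simple_stem(t) for t in tokens]

print(preprocess("Mgr reported; managers reporting"))
# → ['manager', 'report', 'manager', 'report']
```

Note that all four surface forms collapse to just two concepts, which is exactly what improves downstream categorization and clustering.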
Interactive tools simplify the job of developing categories, synonym sets and stop word lists. Statistically oriented programs should also include visualization tools, such as dendrograms or tree graphs, for showing term co-occurrence or proximity within documents. Two- and three-dimensional maps are also useful for understanding how terms are related based on how frequently they occur together.
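The raw material behind those dendrograms and maps is simply a table of how often term pairs appear in the same document. A minimal sketch of that co-occurrence counting (with made-up example documents) might look like this:

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs):
    # Count, for every unordered pair of terms, how many documents
    # contain both. This matrix is what clustering and mapping
    # visualizations are typically built from.
    counts = Counter()
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            counts[(a, b)] += 1
    return counts

docs = [["engine", "noise", "report"], ["engine", "report"]]
print(cooccurrence(docs).most_common(1))
# → [(('engine', 'report'), 2)]
```

Pairs with high counts end up close together on a two- or three-dimensional map; low counts end up far apart.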
Scalability can quickly become a problem in text mining applications. Because individual words and phrases each constitute a feature, notes from a reasonably sized database or documents from a content management system can easily generate tens of thousands of unique features. Many of these are irrelevant for categorization and clustering. Feature reduction techniques can identify the most significant features, eliminate terms that are redundant (at least from a categorization perspective) and cut the feature set down to a more manageable size.
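One of the simplest reduction techniques is document-frequency pruning: drop terms that appear in too few documents to matter or in so many that they discriminate nothing. A minimal sketch, with illustrative thresholds and made-up documents:

```python
from collections import Counter

def reduce_features(docs, min_df=2, max_df_ratio=0.9):
    # df counts, for each term, the number of documents containing it.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    # Keep terms that are neither too rare nor near-ubiquitous.
    return {t for t, c in df.items() if c >= min_df and c / n <= max_df_ratio}

docs = [
    ["engine", "failure", "report"],
    ["engine", "noise"],
    ["wing", "report", "crack"],
    ["engine", "report", "vibration"],
]
print(sorted(reduce_features(docs)))
# → ['engine', 'report']
```

Here seven unique terms shrink to two; on a real corpus the same idea routinely cuts tens of thousands of features down to a few thousand.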
Text mining tools are available as standalone "best of breed" components or integrated into enterprise applications. The standalone text mining tool market includes vendors such as Inxight, ClearForest, Megaputer, Text Analysis, TEMIS and SimStat.
Statistical software vendors SAS and SPSS both offer integrated data mining and text mining tools. SAS uses Inxight's LinguistX and ThingFinder components. SPSS uses LexiQuest, a strong tool with a solid linguistic foundation. We can expect more business intelligence vendors to begin supporting unstructured data analysis in the future.
Choosing between standalone products and integrated packages is, of course, a matter of tradeoffs. The standalone tools may offer a greater degree of flexibility, with full access to application programming interfaces (APIs) and control over the flow of your text mining process. Some, such as Text Analysis International's VisualText, provide a comprehensive development environment for designing rule-based text processors. The additional cost for this benefit is the need to understand more of the details of text mining.
For those who really like to dig into the inner workings of text mining, open source tools such as GATE (the General Architecture for Text Engineering), the N-Grams Statistics Package and Perl modules such as Lingua provide a great deal of core functionality. You will still need to develop components to integrate with existing systems and may need to acquire some commercial tools, such as file format conversion programs. Verity KeyView and ReSoft International's KEYpak are two products that can fit the bill.