Text Analysis Systems

Published
  • August 01 2003, 1:00am EDT

I had planned to stop writing about homeland surveillance (and the government, no doubt, has the e-mails to prove it); but then a newspaper article about Pentagon advisor Richard Perle's conflicts of interest casually mentioned that Perle is a director of Autonomy Corporation. Autonomy, a well-known provider of text search and retrieval software, seemed an oddly pacific interest for the famously bellicose Perle. However, a little research revealed that Autonomy has long received significant revenue from government security agencies. More exploration found that Autonomy competitors including Inxight, Stratify, Attensity and ClearForest also have large intelligence contracts. Interesting. Just how does this technology fit into a surveillance infrastructure?

Or, to put first things first, what technology do these companies offer? The most common answer would probably be search engines – software to locate information in word-processing files, Web pages and other unstructured text formats. (Structured formats refer to databases and files where each element has a specified location. XML documents, which tag elements within an unstructured format, are sometimes called semi-structured.)

Yet search is just one reason to access text. Users also want to extract information, generate lists of related documents, visualize results and associate content with individuals. Therefore, a better label for these systems might be text analysis, indicating that they extract meaning from text in the same way that data analysis extracts meaning from data. In concrete terms, the core function of these systems is to attach labels to text so that the labels can be searched, sorted, grouped and otherwise processed like structured data. The labels might describe the whole document (an article about surveillance systems) or extract specific information (Richard Perle is a director of Autonomy Corporation).
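To make the idea of labels concrete, here is a minimal sketch of labeled documents being processed like structured data. The `LabeledDocument` class, the field names and the sample documents are all hypothetical illustrations, not any vendor's actual data model.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class LabeledDocument:
    """A piece of text plus the labels an analysis system attached to it."""
    doc_id: str
    text: str
    topics: set = field(default_factory=set)   # labels describing the whole document
    facts: list = field(default_factory=list)  # (subject, relation, object) triples


# Hypothetical sample data; the labels are assumed to have been assigned already.
DOCS = [
    LabeledDocument(
        "d1",
        "Richard Perle is a director of Autonomy Corporation.",
        topics={"surveillance systems"},
        facts=[("Richard Perle", "director_of", "Autonomy Corporation")],
    ),
    LabeledDocument("d2", "Search engines locate Web pages.", topics={"search engines"}),
    LabeledDocument("d3", "Agencies buy text analysis software.", topics={"surveillance systems"}),
]


def group_by_topic(documents):
    """Group document ids under each whole-document label."""
    groups = defaultdict(list)
    for doc in documents:
        for topic in sorted(doc.topics):
            groups[topic].append(doc.doc_id)
    return dict(groups)


def find_facts(documents, relation):
    """Return every extracted fact with the given relation, across all documents."""
    return [fact for doc in documents for fact in doc.facts if fact[1] == relation]
```

Once the labels exist, grouping and fact lookup need no further text processing: `group_by_topic(DOCS)` and `find_facts(DOCS, "director_of")` work only on the attached labels.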

Two different techniques are commonly used to assign the labels. Statistical techniques analyze the frequencies and patterns of words in a document; basically, they develop statistical profiles of documents in different categories. New documents are then analyzed and assigned to the categories their profiles most resemble. Semantic techniques use dictionaries and syntax rules to identify key words and relationships, which are then also used to assign documents to categories.
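The statistical approach can be sketched in a few lines. This example assumes a deliberately simple profile (raw word counts per category) and cosine similarity as the measure of resemblance; commercial products use far more elaborate models, and the training texts and function names here are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_profiles(training):
    """Build one word-frequency profile per category from classified texts."""
    profiles = {}
    for category, texts in training.items():
        counts = Counter()
        for text in texts:
            counts.update(tokenize(text))
        profiles[category] = counts
    return profiles


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, profiles):
    """Assign the text to the category whose profile it most resembles."""
    doc = Counter(tokenize(text))
    scores = {category: cosine(doc, profile) for category, profile in profiles.items()}
    return max(scores, key=scores.get), scores


# Hypothetical training set of previously classified documents.
TRAINING = {
    "search": ["search engines retrieve web pages",
               "the search index ranks documents by query"],
    "finance": ["the bank reported quarterly revenue",
                "investors bought shares and bonds"],
}
PROFILES = build_profiles(TRAINING)

label, scores = classify("which engines index web documents", PROFILES)
print(label, scores)
```

Because the profiles are only word counts, nothing in the sketch depends on the grammar of a particular language.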

Most text analysis systems are based on one method or the other, although vendors increasingly apply elements from both. They also typically employ supplemental techniques such as key words, rules and weights based on how often a document is used. In general, statistical methods are less language-dependent and more able to recognize complex concepts, while semantic methods are better at identifying specific facts and relationships. Both types of systems are often trained with previously classified documents during implementation. The classification scheme itself, called a taxonomy, is also usually provided by the user, although most systems can automatically generate a rough taxonomy when necessary.
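By contrast, a semantic extractor relies on a dictionary and a syntax rule. The sketch below assumes a tiny hand-built dictionary of roles and a single pattern of the form "PERSON is a ROLE of ORGANIZATION"; the `ROLES` table and `extract_facts` function are hypothetical, and a real system would need far broader rules.

```python
import re

# Hypothetical dictionary mapping role words to relation labels.
ROLES = {"director": "director_of", "chairman": "chairman_of", "founder": "founder_of"}

# A capitalized name: one or more capitalized words separated by spaces.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Syntax rule: "<person> is a/an/the <role> of <organization>".
PATTERN = re.compile(
    rf"(?P<person>{NAME}) is (?:a|an|the) (?P<role>{'|'.join(ROLES)}) of (?P<org>{NAME})"
)


def extract_facts(text):
    """Return (person, relation, organization) triples found by the rule."""
    return [
        (match["person"], ROLES[match["role"]], match["org"])
        for match in PATTERN.finditer(text)
    ]


sample = "Richard Perle is a director of Autonomy Corporation. Autonomy sells search software."
print(extract_facts(sample))
```

The rule pulls out a specific relationship, but it is tied to English word order and to the roles listed in the dictionary, which illustrates the trade-off described above.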

Once documents have been labeled, the results can be used to retrieve search results, list related documents, create document summaries, populate databases with extracted facts, find trends in document contents over time, profile user interests (based on what they read) and expertise (based on what they write), alert users to relevant new information, and create graphic displays of related items. More fundamentally, attaching consistent labels to documents from different sources enables users to integrate information that would otherwise be searched separately or not at all.

These capabilities have many commercial applications. Text analysis products power search engines, analyze customer comments, respond automatically to e-mails, build communities of users with shared interests, select the best offer for individual customers, gather prospect or competitor data from Web sites and generate personalized news reports. Intelligence agencies have similar requirements, and most of the text analysis systems used by these agencies apparently serve similar purposes – most, but not all. Text analysis systems can also be used for direct surveillance – reading and classifying personal messages. Although such surveillance can also be conducted by human monitors, the software makes it possible on a much larger scale. In fact, one of Autonomy's publicized features is a capability to transcribe verbal communications and then analyze the resulting text. The company points to non-surveillance applications such as indexing television broadcasts and capturing multimedia presentations. However, the surveillance possibilities are self-evident.

Still, how useful could it be to listen in on millions of conversations? Presumably any terrorists bright enough to be dangerous would be bright enough to communicate in code. Additionally, if the software can only assign documents to categories it has been trained to recognize, how could it recognize conversations about something new?

Interestingly, those problems may not be insurmountable. A document's failure to match any existing category may itself be significant. For example, a string of gibberish is worth examining more closely to see if it represents a message in cipher. This contrasts with pattern recognition software, such as fraud detection systems, which can only look for patterns defined in advance. Text analysis systems can also identify new categories by examining documents that are currently unclassifiable. Therefore, if a group of suspects suddenly starts talking about hunting baboons, the very oddity of the phrase could set off alarms – although other types of intelligence would still be needed to find what the suspects really meant.
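One way to picture this is a classifier that refuses to assign a category when the best match is too weak, then looks for terms that recur among the rejected documents. The sketch below is an assumption-laden illustration: the word-count profiles, the 0.2 threshold and the sample messages are all invented, and real products presumably use more sophisticated clustering.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn text into a word-count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical profiles for the categories the system already knows.
PROFILES = {
    "travel": vectorize("flight hotel airport ticket booking flight"),
    "sports": vectorize("match team score goal season team"),
}


def best_category(text, profiles, threshold=0.2):
    """Return (category, score); category is None when nothing matches well enough."""
    doc = vectorize(text)
    label, score = max(
        ((category, cosine(doc, profile)) for category, profile in profiles.items()),
        key=lambda pair: pair[1],
    )
    return (label if score >= threshold else None), score


def recurring_terms(texts, min_docs=2):
    """Words appearing in at least min_docs unclassifiable texts: a candidate new category."""
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(vectorize(text)))
    return sorted(word for word, count in doc_freq.items() if count >= min_docs)


messages = [
    "book the flight and the hotel",
    "we are hunting baboons on friday",
    "hunting baboons again next week",
]
unmatched = [m for m in messages if best_category(m, PROFILES)[0] is None]
print(unmatched)
print(recurring_terms(unmatched))
```

The sketch only surfaces the odd phrase; deciding what it means remains a job for an analyst.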

Of course, mass surveillance of this type would generate many false alarms, and serious conspirators could almost surely avoid detection. Surveillance limited to known suspects would be more effective, but this raises the question of how those suspects will be identified in the first place.
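The false-alarm problem is largely a matter of arithmetic. The following sketch uses purely hypothetical numbers (one million messages, ten of them genuine threats, a detector that catches 90 percent of threats and wrongly flags 1 percent of innocent messages); none of these figures comes from any real system.

```python
def alarm_breakdown(total_messages, true_threats, detection_rate, false_alarm_rate):
    """Return (true alarms, false alarms, share of alarms that are real)."""
    true_alarms = true_threats * detection_rate
    false_alarms = (total_messages - true_threats) * false_alarm_rate
    precision = true_alarms / (true_alarms + false_alarms)
    return true_alarms, false_alarms, precision


# Hypothetical inputs chosen only to illustrate the base-rate effect.
true_alarms, false_alarms, precision = alarm_breakdown(
    total_messages=1_000_000,
    true_threats=10,
    detection_rate=0.9,
    false_alarm_rate=0.01,
)
print(f"real alarms: {true_alarms:.0f}, false alarms: {false_alarms:.0f}, "
      f"share of alarms that are real: {precision:.4%}")
```

Under these assumptions, roughly 10,000 innocent messages are flagged for about nine real ones, so fewer than one alarm in a thousand is genuine. Narrowing the monitored population to known suspects raises the proportion of real threats, which is why targeted surveillance fares better.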

Text analysis could be extremely effective at spotting political or religious opinions, but the security value of doing this is questionable; real terrorists don't make speeches. Monitoring opinions also raises privacy and civil liberties issues, although these involve political rather than technical judgments.

In short, text analysis systems can clearly help surveillance organizations work more efficiently through better research, integration and collaboration. They may also have some value in performing automated surveillance, although this is probably less effective than claimed by their supporters or feared by their critics. While the potential for abuse is real, any transgressions are ultimately the responsibility of the people and agencies that use the systems, not the software itself.
