In early 2007, FirstData scooped up Intelligent Results, an innovator in the use of text mining for predictive analytics. Business Objects' acquisition of Inxight, described as a text mining and analysis company, followed in May 2007, and Reuters' acquisition of ClearForest came after that, solidifying the trend. Meanwhile, Bill Inmon is helping to catalyze the trend by launching a startup with nine patent filings and an innovative approach to textual extract, transform, and load (ETL). The approach leverages the standard relational database and token parsing, avoiding the pitfalls and complexities of natural language processing while delivering results similar in scope and value. None of this makes Business Objects a data aggregator or newswire service any more than it makes Reuters or FirstData a business intelligence (BI) company. However, it does show that the creation of information out of unstructured data represents the zone of proximal development in redefining the limits of what is possible using IT.
In both BI and content aggregation, it is what you don't know that can hurt you. The business cannot even express the terms of the relevant issue and remains inarticulate - until trouble strikes. Really big business problems have occurred when enterprises did not have the right answers because they were not even asking the right questions. It is this second-order ignorance - I do not know what I do not know - that is most dramatically demonstrated in front-page business meltdowns. Catastrophic failure in automotive tire tread separation, the backdating of options in the context of executive compensation, and distress in the sub-prime mortgage lending market are all similar in that the answers as well as the questions were outside the focus of business awareness, analysis and problem solving. The resulting surprises have been both costly and painful.
The point is not that text mining is a silver bullet that will guard against any random business risk. Rather, the point is that text mining is a powerful method of managing business risks as well as opening up new opportunities to profit from discovering the underlying mechanisms and causes that determine buying behavior, rule following and leading indicators of trends. Unstructured data is a vast realm where "I do not know what I do not know." Analysis of this data using text mining methods can provide early, leading indications of trends, actionable predictions about customer behavior and confirmation for structured variables in the environment such as cost, product returns or complaints. (This article will not touch on standards in the emerging market for text mining; however, in the interest of completeness, Unstructured Information Management Architecture (UIMA) is one that has legs and deserves mention.1)
However, before you hand off this article to your executive administrator with the instructions "Get me one of those," it would make sense to take a closer look under the hood at the challenges, trade-offs and promises of text mining.
Generally, "text mining" works with a data store of written statements. These may be case notes from a call center. The text may be helpdesk problem ticket narratives or email correspondence with customers and clients, either external or internal. For example, Intelligent Results developed a solution for a collection agent writing up what happened when the collector called the person in arrears to invite them to pay something on the overdue account. You get abbreviations such as "HG" - hung up - or more verbose explanations such as "lost job due to ill health, but now back in the market - recommended payment plan A." If the debtor is in jail, then he is not a good candidate for a payment plan for obvious reasons - no prospects of income. Further calls will be a waste of time and effort, and the debt is a bad one to be written off. On the other hand, if the person is a college graduate, but just down on his luck due to loss of employment, illness or other life misfortune, then the prospects of collecting in the future are good. Action is required.
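The collector-note example above can be sketched as a small rule-based triage. This is only an illustration, not a description of how Intelligent Results built its product: the names `ABBREVIATIONS`, `RULES`, `expand` and `triage`, the keyword lists and the three outcome labels are all assumptions made for the sketch. Only the "HG" abbreviation and the sample note come from the article.

```python
# Hypothetical abbreviation table; only "HG" (hung up) comes from the article.
ABBREVIATIONS = {"HG": "hung up"}

# Hypothetical keyword rules, checked in order; the first match wins.
RULES = [
    ("write_off", ("jail", "incarcerated")),
    ("follow_up", ("lost job", "ill health", "payment plan")),
]


def expand(note: str) -> str:
    """Replace known abbreviations with their long form and lowercase the note."""
    words = [ABBREVIATIONS.get(word, word) for word in note.split()]
    return " ".join(words).lower()


def triage(note: str) -> str:
    """Return the label of the first rule whose keyword appears in the note."""
    text = expand(note)
    for label, keywords in RULES:
        if any(keyword in text for keyword in keywords):
            return label
    # No rule matched: leave the note for a person to read.
    return "review"


if __name__ == "__main__":
    print(triage("debtor in jail"))
    print(triage("lost job due to ill health, but now back in the market"))
    print(triage("HG"))
```

A production system would learn such associations from the notes rather than hard-code them, but the sketch shows why even terse, abbreviated text can drive a decision about the next action.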
While it is an oversimplification, many text mining technologies go through the following series of steps in order to bring order and significance to what is otherwise a jumble of unstructured data. It remains true that if you can't structure it, you can't manage it. This process provides structure to the data, though not necessarily the kind of structure characteristic of the end result of a relational database. The first step is usually to determine the language of the text data. This makes a difference, for example, since in Spanish the adjectives sometimes follow the noun whereas in English they usually precede it. The next step consists of identifying and eliminating "noise words" such as "the" and "a" and a multiplicity of pronouns and adverbs, as well as proper names and places that, while significant, do not contribute to the generation of meaning at the appropriate level for the problem. This elementary data scrubbing is followed by tokenization, which breaks up the text into identifiable entities and actions by means of automated stringing and unstringing based on common delimiters between words. At this point, the tokens may be further analyzed by mapping them to a dictionary of key terms relating to a particular problem domain ("semantic ontology"), such as debt collection, complaint hotlines in a given industry such as automotive or retail, intellectual property ("patent") descriptions, biochemical reactions or law enforcement issues. The resulting semistructured information is subjected to further analysis by means of the statistical probability of occurrence of tokens and by classification and clustering algorithms. These functions associate related terms based on their frequency and nearness of occurrence in the text. Tagging or indexing of the tokens or associated clusters is useful for further analysis, including search and discovery.
Visualization of the resulting clusters, which often map to specific concepts (such as customer, product and promotion in retail, or disease, symptom and treatment in health care), is a common enhancement and usability differentiator.
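The steps described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not any vendor's implementation: it skips language detection and assumes English text, folds noise-word removal into the tokenizer for brevity, and stands in for clustering with simple frequency and nearness counts. The stop-word list, the `DOMAIN_TERMS` dictionary and all function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative noise-word list; real lists are longer and language-specific.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "is", "was", "for", "in", "on"}

# Hypothetical domain dictionary mapping surface terms to concepts.
DOMAIN_TERMS = {
    "refund": "complaint",
    "broken": "complaint",
    "late": "delivery",
    "shipping": "delivery",
}


def tokenize(text):
    """Lowercase, split on non-letter delimiters and drop noise words."""
    words = re.split(r"[^a-z]+", text.lower())
    return [w for w in words if w and w not in STOP_WORDS]


def map_concepts(tokens):
    """Map tokens found in the domain dictionary to their concept labels."""
    return [DOMAIN_TERMS[t] for t in tokens if t in DOMAIN_TERMS]


def term_frequencies(docs):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


def cooccurrence(docs, window=3):
    """Count token pairs whose positions differ by less than `window`."""
    pairs = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        for i, left in enumerate(tokens):
            for right in tokens[i + 1 : i + window]:
                if left != right:
                    pairs[tuple(sorted((left, right)))] += 1
    return pairs


if __name__ == "__main__":
    docs = [
        "The shipping was late",
        "Refund for the broken item",
        "Late shipping and a broken box",
    ]
    print(term_frequencies(docs).most_common(3))
    print(cooccurrence(docs).most_common(3))
```

Pairs with high co-occurrence counts are the raw material that a clustering algorithm would group into concepts, and that a visualization layer would then draw.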









