For many years now there have been rumblings, models, and potential use cases for crossing the tabular data world with that 80 percent of data everyone tells us is unstructured in enterprises. (That number is really not so shocking, given the verbosity and multiplicity of uses for unstructured content.)
But meanwhile, the data folks and the document/record/digital asset folks have pretty much gone about their separate business without a lot of yelling for a merger. Most of the text mining I have come across comes from scientific and medical research, where large but controlled sets of data were scanned for keywords and related terminology to describe trends and outcomes.
Back in April, analyst Ted Friedman told me none of Gartner's customers were asking about merging structured and unstructured at his company's last BI Summit. Maybe they were too busy taking care of old business, though Ted pointed out that he'd seen some interesting activity in law enforcement and defense intelligence, pulling odd objects off Web sites and news feeds into mashups and reports.
That doesn't surprise me either, but now I am hearing that text analytics can be sufficiently automated, driven by algorithms and parsed into fields, to provide a regular structured feed back to our structured repositories, marts and warehouses. Advocates say it's tricky but, yes, it's doable today.
At one level this sounds simple enough, since our brains are pretty good at comparing the macro and the micro worlds. We're all used to looking at reports and writing outside factors into our thinking. We factor attitudes and feedback into our reasoning. But the very idea of automating a merged process that is repeatable, extensible and somewhat reliable across such inputs struck me as profound, a watershed concept that would take text mining mainstream. You expect to see this sort of thing in a high-tech lab, but could our long-discussed goal really just be a matter of waiting for the technology and methodology to get good enough for daily usability?
I think the current answer depends on what questions you are asking and how sure you want to be about your results, but the topic came up in our last two DM Radio shows, one on innovation and another on text mining itself. Eric Martin of SPSS (and now IBM) said that text mining technology is now mature enough to support this kind of activity, and dropped the names of customers including a Swiss telecom that was parsing and merging unstructured data to significantly reduce customer churn and even predict behavior based on a mix of captive data and outside content.
The even bigger leap Martin claimed was that text mining can constructively capture customer sentiment in a qualitative way that can then be quantified, something analysts have flagged as a shortcoming of forms and, even worse, clickstream data, which detractors believe says more about behavior than attitude. I'm not here to dispute either point of view, but I would like to talk to Martin's customers.
Usama Fayyad, who once headed Microsoft's data mining work and later served as chief data officer for Yahoo, also appeared on our last show and was similarly upbeat on marrying structured and unstructured data. Two of the biggest drivers for this, he says, are contextual advertising and the wealth of information to be mined from social networking sites. As I suspected, marketing is behind the push for this kind of technology, which means it would like to tackle the mother lode of content, the Web itself, in some or all of its semantic abstractions.
"You need to extract the entities: who, what, where, how and why," Fayyad says, "and also listen for the tone of sentiment being expressed. You turn that into variables you look at in aggregate with other variables you have and make a much more informed decision." The rub, he says, is that analysts need to be fairly confident about what the algorithms are extracting from all this data. "These algorithms can do a very good job and they can also fall apart."
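To make Fayyad's point concrete, here is a minimal sketch of turning free-text comments into aggregate variables. Everything in it is an assumption for illustration: the keyword lists, the sample comments, and the crude capitalized-token "entity" heuristic stand in for the trained models a real text-mining stack would use, but the output shape, a tally you can join against structured data, is the same.

```python
import re
from collections import Counter

# Hypothetical keyword lists; a production system would use a trained
# sentiment model rather than hand-picked words.
POSITIVE = {"love", "great", "fast", "reliable"}
NEGATIVE = {"hate", "slow", "broken", "awful"}

def score_comment(text):
    """Return a (sentiment score, entities) pair for one free-text comment."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    # Crude entity spotting: multi-letter capitalized tokens in the raw text.
    entities = re.findall(r"\b[A-Z][a-zA-Z]+\b", text)
    return score, entities

# Hypothetical sample comments standing in for a Web or survey feed.
comments = [
    "I love the new handset, battery life is great",
    "Support was slow and the billing site is broken",
]

totals = Counter()
for c in comments:
    score, _ = score_comment(c)
    totals["positive" if score > 0 else "negative"] += 1

# The Counter is now a quantified variable, ready to aggregate
# alongside churn rates or other structured measures.
print(totals)
```

The fragility Fayyad warns about is visible even here: sarcasm, negation ("not great"), or an unlisted word silently breaks the score, which is why analysts need confidence in what the extraction is actually doing.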
A central distinction to me is the one between simple content extraction and the whiz-bang models that grow increasingly suspect as algorithms reach for more terms and semantic concepts. A desktop card reader can do a pretty good job of parsing a name, company and phone number. A customer survey can tabulate answers to structured questions and print out a semi-structured addendum from the comment and other boxes.
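The card-reader end of that spectrum is simple enough to sketch in a few lines. This is an illustrative toy, not any vendor's method: the card text and field layout are assumptions, and a real reader would need OCR plus much fuzzier heuristics, but it shows why confidence is high when the structure is this predictable.

```python
import re

# Hypothetical business-card text; a real reader would get this from OCR.
CARD = """Jane Smith
Acme Analytics Inc.
Tel: 555-0134"""

def parse_card(text):
    """Extract name, company and phone from a simply laid-out card."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    phone = re.search(r"(\d{3}-\d{4})", text)  # assumes a 7-digit local number
    return {
        "name": lines[0],            # assumes name on the first line
        "company": lines[1],         # assumes company on the second line
        "phone": phone.group(1) if phone else None,
    }

row = parse_card(CARD)
print(row)  # a tabular row, ready for a contacts table or warehouse load
```

Change the layout even slightly, swap the name and company lines, and the parse silently goes wrong, which is exactly the gap between this case and mining the open Web.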
The morass that is the social Web, on the other hand, can build confidence or be a confidence game. Marketers and other advocates of text mining are looking at the Internet and social Web sites as sources of buzz, leading indicators of sales based on the research customers do ahead of buying a product. Fayyad says this is a way to quickly determine if a product is fading or taking off.
Marrying all this back to one-to-one customer marketing is a task of another magnitude, but one advocates say is making its way back to the relational and tabular world of data. Fayyad is ready with optimism and caveats, as he expressed in an interesting interview with another journalist. "The algorithm," he said, "knows nothing about the data. The curse of dimensionality is never to be underestimated. Segmentation is a prerequisite to any analysis."
But in Fayyad's opinion, it is better to do something than to do nothing, based on confidence in what we think we can get a handle on. "I often say an ounce of knowledge is worth a ton of data. If you have a way of embedding that knowledge in your approach you can save a ton of work down the road."
Fair enough, but I'm looking for more validation. What are your thoughts on the prospects for parsing and automating feeds of unstructured data into a tabular repository or warehouse setting? How much do you think we will come to infer and predict from at-large Web behavior? Press the comment button below to share your story or opinion.