Free-form text data has long been a bane of our existence in data warehousing and business intelligence (BI). If only the operational systems that feed the warehouse required data to be codified at the point of entry, receiving and processing that data would be much easier. At the least, if pop-ups with suggested contextual data intervened at the point of entry to promote consistent, quality data, we wouldn't get such a mixed bag of data quality in the warehouse.
There are several reasons why this utopia will not happen any time soon and why we are going to see an increasing amount of free-form text flowing through the environment, including:
- Operational system change-out is very infrequent.
- Many operational systems are incapable of major changes such as those suggested here.
- Data validation at point of entry would slow the operational environment, and this would have a more serious impact on the business than even poor data quality.
- Forcing codification of data entry likewise could slow the operational environment.
- Data entry changes may require retraining of staff.
- Merrill Lynch and Gartner studies found that 85 to 90 percent of all corporate data is stored as text.
- The data size resulting from this high level of text is becoming less of a problem as processing capacity continues to double every 18 months.
- Database management systems are becoming increasingly able to deal with text as a data type.
- It makes sense for some text (such as e-mails, documents and call center logs) to remain free-form, because codification would take away some of the data's actual value.
- There's value in the subtleties of communication.
Each BI program is going to need a strategy for text data mining that codifies the data in various ways for various uses, but keeps the complete text online and accessible for other uses. For example, having the free-form text of customer interactions allows trending over time in the areas of complaints (and praise), warranty claims and error tracking, all of which are clearly inputs to product development and service allocation. Likewise, market outreach programs and focus groups are processes that generate rich data. When feedback is not restricted to a codified form, the subject can present, in her own words, what she thinks.
Text mining also allows sifting through data in legal, healthcare and other industries traditionally rich in documents and contracts. The unstructured text collected in these applications must go through a multistep process to get at the underlying intended meaning in the data. This may involve spelling korrection (er..., correction), removing "noise" words that don't add value (such as "a" and "the"), fixing grammar and codifying what remains.
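As a rough illustration of the first two steps, here is a minimal Python sketch of spelling correction followed by noise-word removal. The correction table, the noise-word list and the function name `clean` are hypothetical stand-ins for the kind of metadata a real tool would supply; grammar repair and codification are not attempted here.

```python
import re

# Hypothetical metadata, assumed for illustration only: a tiny table of
# known misspellings and a short list of "noise" words.
CORRECTIONS = {"korrection": "correction"}
NOISE_WORDS = {"a", "an", "the", "of"}


def clean(text):
    """Lowercase and tokenize the text, fix known misspellings, drop noise words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    corrected = [CORRECTIONS.get(token, token) for token in tokens]
    return [token for token in corrected if token not in NOISE_WORDS]


print(clean("The spelling korrection of a document"))
# → ['spelling', 'correction', 'document']
```

A lookup table only catches misspellings that have been seen before; a production tool would more likely lean on a full dictionary or a similarity measure.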
For example, consider this entry on a call center log: "The quality of your servise at Acme, Inc. is very poor. I've called 10 times today and got no one. I was calling about how to sinkrohnize between different Acme installs." After effective data text mining, this could become something more actionable such as:
1. Service quality poor. Called 10 times.
2. Calling about topic: synchronization.
An automated, multipass run through the text was required to create the actionable information. Manual review, tagging and on-the-fly codification are not required with data text mining. We can now increment the counts for 1) poor service complaints, and 2) calls about the synchronization feature.
We might allocate more resources to the synchronization feature to improve its usefulness and satisfaction. And, if we are characterizing and/or ranking customers, this could also factor into our processes.
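The codify-and-count step for the call center entry above might be sketched as follows in Python. Everything named here (the `CORRECTIONS` table, the `RULES` mapping, the code labels and the `codify` function) is a hypothetical assumption for illustration, not a description of any particular toolset. The matching is deliberately crude: a code is assigned when all of its required terms appear anywhere in the entry.

```python
import re
from collections import Counter

# Hypothetical metadata, assumed for illustration only.
CORRECTIONS = {"servise": "service", "sinkrohnize": "synchronize"}

# Each code is assigned when every one of its required terms is present.
RULES = {
    "complaint:poor_service": {"service", "poor"},
    "topic:synchronization": {"synchronize"},
}


def codify(entry):
    """Return the set of codes whose required terms all appear in the entry."""
    tokens = re.findall(r"[a-z']+", entry.lower())
    terms = {CORRECTIONS.get(token, token) for token in tokens}
    return {code for code, required in RULES.items() if required <= terms}


call_log = [
    "The quality of your servise at Acme, Inc. is very poor. "
    "I've called 10 times today and got no one. I was calling about "
    "how to sinkrohnize between different Acme installs.",
]

# Increment one counter per code found in each entry.
counts = Counter()
for entry in call_log:
    counts.update(codify(entry))

print(sorted(counts.items()))
# → [('complaint:poor_service', 1), ('topic:synchronization', 1)]
```

Matching on the whole entry sidesteps sentence splitting, which abbreviations such as "Inc." tend to break, at the cost of occasionally pairing terms that come from unrelated sentences.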
You could be coding for data text mining today, but perhaps you don't call it that. As with many important aspects of BI, toolsets are available to be integrated into your environment. Whether you build or buy, you will need a rich and targeted metadata layer as the foundation for correcting misspellings and grammar, removing the noise and otherwise refashioning the free-form text into something actionable.