Managing the Unmanageable: Text Analytics for Unstructured Information Management
Organizations differ widely in their management of unstructured data. In some cases, they have spent an exorbitant amount of money to leverage some of their unstructured text data – by investing and then reinvesting in search engine after search engine – hoping to improve the relevance of retrieved information.
At the other end of the spectrum, many organizations are ignoring their unstructured data altogether, continuing to drive their business with structured data only, which is about 20 percent of their total data. Others have not begun to analyze their unstructured data because they believe that data must be perfect before applying analysis to it. They are waiting to derive new insights while their competitors leap forward.
In an Information Management piece about the Gartner report “The 10 Myths and Realities of Master Data Management,” Gartner analyst Andrew White notes another trend, a fatigued acceptance of poorly managed data overall, because IT has simply gotten “pretty good” at dealing with it. But it is the organizations that leverage both their structured and unstructured data together in a real-time, managed environment that gain the ultimate competitive advantage: finding opportunities to optimize business processes and strategic decision-making.
Given the rise of social networking and the amount of data and processes that are managed outside of the structured IT purview, it’s no wonder that the estimated 80 percent of data that is unstructured or semistructured has been difficult for IT to prioritize alongside other ongoing and equally pressing issues.
Some organizations have become fairly adept at dealing with semistructured data. For example, in the financial and health care industries, informational advancements have been applied in dealing with data exchange standards by converting the XML or legacy formats into relational formats. Other organizations are starting to address unstructured data by using reference-style architectures to incorporate the unstructured data into structured processes. These efforts complement user interface efforts that provide a centralized portal to federated content, whether that be structured or unstructured information. And although these efforts are commendable, organizations that take a more holistic approach to data management and enterprise analytics by using the data to help prioritize activities within a cohesive data management strategy are the ones that achieve the results that fundamentally improve the bottom line.
Closely related to the management of unstructured data is the management of unstructured application programming interfaces or Web services. Even though cloud-based sources and social networking applications have APIs that are often based on Web services, each interface is different. As a result, interacting with these differing interfaces is typically highly problematic for an organization.
So what has actually changed in the last year or so? It’s interesting to do a Google search on a combination of terms like “unstructured data,” “integration” and “information management” because you’ll find articles that are five to 10 years old that could have been written today. The point is that although there is greater awareness of the issue and potential opportunity associated with unstructured data, real progress has been somewhat limited. It is not so much that the problems are not understood; it is a combination of organizations still struggling with managing structured data while others are having trouble prioritizing issues.
Unstructured data is rarely managed by IT as proactively as a structured, relational database or enterprise application. This is often because unstructured data has organically been retained by specialized business areas, creating shadow IT organizations controlled directly by the business user communities. This makes standardized data management problematic. It also illustrates the need for data management initiatives to be driven by a combination of business and IT employees.
Although it’s difficult to generalize, because there is a wide discrepancy between technology leaders and followers, it is typical to see one or more of the following initiatives as early forays into the realm of managing unstructured data:
- Integration of semistructured data, such as XML formats, industry formats, etc. The basic approach is to transform these formats into a common (structured) representation for processing and storage in a relational database system.
- Data quality efforts that leverage automated approaches for address and name standardization.
- Standardizing metadata so that it can be used to define relationships and associations that integrate unstructured data with structured data via a reference-style architecture.
- Basic search capabilities that provide the ability to find content in an unstructured environment.
- Content management systems used to manage document collections, largely composed of unstructured data assets.
Although these are in some cases good first steps and address some of the symptoms of dealing with stores of unstructured content, they fail to recognize the underlying information management requirements necessary to fully put this data to work for the organization.
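The first initiative in the list above, transforming a semistructured format into a common relational representation, can be sketched roughly as follows. This is a minimal illustration only: the claim records, the element names and the `claims` table are hypothetical, and an in-memory SQLite database stands in for whatever relational system an organization actually uses.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical claim records; element and attribute names are illustrative only.
SAMPLE_XML = """
<claims>
  <claim id="C-1001">
    <patient>Jane Doe</patient>
    <amount currency="USD">125.50</amount>
    <note>Patient reported recurring pain after procedure.</note>
  </claim>
  <claim id="C-1002">
    <patient>John Roe</patient>
    <amount currency="USD">980.00</amount>
    <note>Follow-up visit; no complications noted.</note>
  </claim>
</claims>
"""


def xml_to_rows(xml_text):
    """Flatten each <claim> element into a tuple matching the table columns."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for claim in root.findall("claim"):
        amount = claim.find("amount")
        rows.append((
            claim.get("id"),
            claim.findtext("patient"),
            float(amount.text),
            amount.get("currency"),
            claim.findtext("note"),  # free text is kept for later analysis
        ))
    return rows


def load_claims(conn, rows):
    """Create the (assumed) claims table and insert the flattened rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS claims ("
        "claim_id TEXT PRIMARY KEY, patient TEXT, amount REAL, "
        "currency TEXT, note TEXT)"
    )
    conn.executemany("INSERT INTO claims VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load_claims(conn, xml_to_rows(SAMPLE_XML))
total = conn.execute("SELECT SUM(amount) FROM claims").fetchone()[0]
print(total)
```

Note that the free-text `note` field survives the transformation untouched. The structured columns become queryable, but the commentary inside the note still needs the kind of analysis discussed in the rest of this article.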
So how can we approach this in a more effective way? What can be done to harness unstructured data in a managed process that retains the inherent insight provided by commentaries buried in document archives and spread across note collections and Web channels, which are simply too vast and disparate to manually read – even if you assume that human interpretation will be consistent across all materials and people? How do you help ensure business decisions are based on more than 20 percent of the available information? And how can we use this unstructured data to help prioritize enterprise data management needs? That is where text analytics come in.
The term “text analytics” addresses the analysis and use of unstructured (and the unstructured components of semistructured) textual data. Text analytics are methods used to decipher, model and structure the information contained in electronic textual data. Developed from a variety of disciplines, text analytics are used, among other things, to automate the reference and assessment of unstructured data in a data architecture context.
For data management activities this means that text analytics provide some of the core processing capabilities that are needed to evaluate and govern unstructured data. Natural language processing, linguistic rules and statistical models can be used to decipher the meaning of words and phrases contained in electronic text – the subjects of the text, the topical areas covered in the materials, the concepts, entities and their relationships. Once defined, the resultant models can then be applied to content that has not yet been examined, thus automatically generating the metadata from the content itself.
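As a rough sketch of that idea, the example below defines a very small rule-based "model" and applies it to text it has not seen before to produce metadata. The topic names and term lists are invented for illustration, and simple keyword rules stand in for the far richer linguistic and statistical models a real text analytics product would use.

```python
import re

# Hypothetical topic rules: each topic maps to terms that signal it.
TOPIC_RULES = {
    "billing": ["invoice", "refund", "overcharged", "payment"],
    "outage": ["down", "unavailable", "outage", "timeout"],
    "security": ["password", "breach", "phishing"],
}


def compile_model(rules):
    """Turn term lists into one case-insensitive whole-word pattern per topic."""
    return {
        topic: re.compile(
            r"\b(?:" + "|".join(map(re.escape, terms)) + r")\b",
            re.IGNORECASE,
        )
        for topic, terms in rules.items()
    }


def generate_metadata(model, text):
    """Apply the model to unseen text and return topic tags with match counts."""
    counts = {topic: len(pattern.findall(text)) for topic, pattern in model.items()}
    tags = sorted(topic for topic, n in counts.items() if n > 0)
    return {"topics": tags, "match_counts": {t: counts[t] for t in tags}}


model = compile_model(TOPIC_RULES)
doc = ("Customer was overcharged on the last invoice and the portal "
       "was down during payment.")
metadata = generate_metadata(model, doc)
print(metadata)
```

The separation between `compile_model` and `generate_metadata` mirrors the define-then-apply pattern described above: the rules are maintained in one place, and the same model can tag any number of new documents.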
In this approach, semantic information is managed centrally by a team that has the expertise and authorization to refine the models over time. Models are developed, deployed and monitored over time – just as they are in structured business analytics. Content tags and indexes can be defined by the models and can be used to create a semantic-based layer on top of existing technologies.
Ontologies can be used to reference these systems to one another by defining the conditions under which different metadata relationships are valid. One of the most powerful aspects for an organization in using ontologies across information silos is to document the subject matter expertise – what only years of organizational experience can teach – formalizing what content is relevant, when it is relevant, and to what else it refers. End consumers of information are then assured that they have comprehensive representation of all relevant information, and are not blindsided by missing key topics, dependent activities or influential events.
Also key to successful information management strategies are policy rules and governance activities, which can be aided by analyzing the contents. Finding confidential information or copyright notations in materials will influence access privileges and can affect where such documents are stored. Information management strategies that focus on the purpose of the data – which until you know what it is, may not be obvious (nor already defined in metadata) – are influenced by the information itself. The unstructured data – the documents themselves – can prioritize the associated governance activity.
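A minimal sketch of content-driven governance might look like the following. The marker phrases, the flag names and the storage tiers are all assumptions made for illustration; an actual policy would be defined by the organization's governance team and would likely rely on more than pattern matching.

```python
import re

# Illustrative markers only; real policies would come from the governance team.
CONFIDENTIAL = re.compile(
    r"\b(?:confidential|internal use only|do not distribute)\b",
    re.IGNORECASE,
)
# Matches a copyright sign, "(c)" or the word "copyright" followed by a year.
COPYRIGHT = re.compile(r"(?:\u00a9|\(c\)|\bcopyright\b)\s*\d{4}", re.IGNORECASE)


def governance_flags(text):
    """Return the governance-relevant flags found in a document's text."""
    flags = []
    if CONFIDENTIAL.search(text):
        flags.append("confidential")
    if COPYRIGHT.search(text):
        flags.append("copyrighted")
    return flags


def storage_policy(flags):
    """Map flags to a hypothetical storage tier and access level."""
    if "confidential" in flags:
        return {"store": "restricted", "access": "named-users"}
    if "copyrighted" in flags:
        return {"store": "licensed", "access": "employees"}
    return {"store": "general", "access": "employees"}


sample = "CONFIDENTIAL: Q3 pricing draft. Copyright 2011 Example Corp."
flags = governance_flags(sample)
print(flags, storage_policy(flags))
```

Here the document's own contents, rather than predefined metadata, determine where it is stored and who may access it.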
The same is true for information delivery. Entity and fact extraction is a common business application of text analytics.
For the business of IT, text mining and sentiment analysis are useful capabilities to analyze trouble tickets to aid in the identification of emerging issues and root cause analysis. Moreover, the free-format (unstructured) fields in transactional systems are often more valuable to operations than simple notifications based on thresholds. Analysis across the entire collection can result in undiscovered insights that were impossible to obtain from manual reviews, simply due to the volume and human inconsistency in interpretation.
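One simple way to surface emerging issues from trouble tickets is to compare how often terms appear in the free-text fields across two time periods. The sketch below assumes hypothetical ticket text, an ad hoc stopword list and arbitrary thresholds; it is a plain frequency comparison, not sentiment analysis or a tuned statistical model.

```python
import re
from collections import Counter

# A tiny, ad hoc stopword list chosen for this example only.
STOPWORDS = {"the", "a", "an", "is", "on", "to", "and", "of", "in",
             "for", "after", "when", "with", "not"}


def term_counts(tickets):
    """Count how many tickets mention each term (document frequency)."""
    counts = Counter()
    for text in tickets:
        terms = set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS
        counts.update(terms)
    return counts


def emerging_terms(previous, current, min_count=2, min_ratio=2.0):
    """Return terms whose share of tickets grew by at least min_ratio."""
    prev_counts, cur_counts = term_counts(previous), term_counts(current)
    results = []
    for term, n in cur_counts.items():
        if n < min_count:
            continue
        cur_share = n / len(current)
        # Add-one smoothing so terms unseen earlier do not divide by zero.
        prev_share = (prev_counts[term] + 1) / (len(previous) + 1)
        ratio = cur_share / prev_share
        if ratio >= min_ratio:
            results.append((term, round(ratio, 2)))
    # Strongest growth first; ties broken alphabetically.
    return sorted(results, key=lambda item: (-item[1], item[0]))


last_week = ["Printer is offline", "Password reset request",
             "Printer jam on floor two", "Email sync slow"]
this_week = ["VPN drops after login", "VPN timeout when remote",
             "Password reset request", "VPN client crash"]
emerging = emerging_terms(last_week, this_week)
print(emerging)
```

Because every ticket in the collection is scanned the same way, this kind of analysis avoids the volume limits and inconsistent interpretation that affect manual review.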
What are the considerations before you begin? First and foremost, recognize that you can begin applying text analytics anywhere you have unstructured content. A comprehensive, multistage information management plan that includes unstructured data will benefit the organization, but it doesn’t by itself provide a distinctive advantage. How information is used, what analysis is applied – and in this case, the integration of text analytics into information management processes – are the factors that lead to competency that cannot be replicated by competitors.