Taming the World of Unstructured Data

  • September 01 2007, 1:00am EDT

Today's information comes in many types, shapes and sizes. It can be created, stored, shared, consumed and destroyed in myriad ways. Arguably the greatest benefit of the Internet revolution has been its ability to support the near-instantaneous dissemination of information, which often results in the creation of even more information. A recent example is the phenomenon that is YouTube. By allowing users to quickly and easily create content (in YouTube's case, video) incorporating limitless contributions from others, the site has caused the sheer amount of content being produced to explode. Just how rapidly are these changes occurring? The most popular form of communication for many of today's workers - email - simply did not exist in a commercial sense 20 years ago.

Video and email have something else in common besides the rate at which they are being created: both are forms of unstructured or semistructured data. As opposed to structured data, which typically resides in a tightly controlled application, unstructured data consists of masses of (usually) computerized information that either have no data structure or have a data structure that is not easily readable by a machine. This latter factor has traditionally made unstructured data highly challenging to deal with in large quantities. Without the ability to automate the indexing, storage and handling of unstructured data, there is simply no effective way for computers to keep track of what is in each piece of it. For the average consumer, YouTube user or even rank-and-file employee, this does not present much of a problem because they can simply view a video or read an email and know what it is about. But today's businesses are experiencing significant heartache - and expending a great deal of money - in order to address the problem.

At roughly the same time information volumes (especially those of unstructured information) were exploding, two seemingly unrelated trends also gained critical momentum: compliance and e-discovery. Stringent statutory oversight of business was the byproduct of the morally bereft excesses of the Internet boom; names like Enron, WorldCom, HealthSouth, Adelphia and Tyco led to Sarbanes-Oxley, numerous Securities & Exchange Commission rules and many other compliance requirements. Corporations large and small are now required to institute extensive controls to ensure that (among other things) the data within their networks is known, tracked and accounted for at all times. Failure to implement and maintain such controls could lead to increased scrutiny, fines and, probably most harmful, bad publicity. Corporations now have to prove they are playing by the rules - but in order to do so, they need to be able to get a handle on the exploding volumes of data appearing on their networks every day.

Simultaneously, the nascent field of e-discovery is squeezing these same corporations from a different direction. Under the U.S. legal system, any party to litigation is obligated to share any relevant and nonprivileged information under its control with the other party or parties to litigation. When such information lived exclusively in paper format, the process was manageable; attorneys would receive and review all potentially relevant documents, decide what would be given to the other side, make hard copies of the pertinent documentation and proceed from there. This process has been referred to as "discovery" for eons. While the process itself did not change with the digitization of information, the scale of the review process grew exponentially: what was once 10,000 pages of documentation that would take a single review attorney perhaps a week to review became tens of millions of pages of documentation in myriad formats that would take a 20-person team a month to review. To make matters worse, the dollars associated with litigation eclipsed even those of compliance. Litigation alone could cost tens of millions of dollars, and an adverse judgment could exceed this number many times over.

Corporations were thus presented with a dubious choice, one that really wasn't a choice at all: attempt to get the unstructured data genie back in the bottle in favor of the old paper-based world or lean heavily on technological tools to implement an infrastructure better equipped to handle both structured and unstructured data.

Because it lacked a structure of its own and was traditionally not easily readable by a machine, unstructured data presented a particularly tough challenge for corporations trying to get their data under control for compliance and e-discovery purposes. A significant portion of the problem lay with the fact that the tools corporations tried to use to deal with this data were not particularly advanced or helpful. Categorization tools - which would extract key terminology in a piece of information while discerning its overall meaning - were rudimentary at best. And search tools - which would allow users to find specific pieces of information based on the information's content - were not easy to use and, as opposed to their cousins in the Web search world, ill-suited for meeting the scalability, security and relevance needs of enterprises.

Then something interesting happened: the search and categorization industry grew up. After a few false starts and some premature hype, search and categorization tools became easier to use and, more importantly, started delivering better results. Search and categorization tools eventually became the unifying force of information management within many enterprises and professional service firms as they could make sense of huge volumes of data in a relatively effective fashion. Furthermore, search and categorization technology began solving particularly thorny issues such as records management, compliance and e-discovery, which went a long way toward cementing the critical role that search is playing in today's enterprises. The following three brief case studies highlight the increasingly effective roles being played by search and categorization to resolve specific business issues.


Patents are the lifeblood of the pharmaceutical industry. Every additional month of patent protection - gained either by filing a patent sooner than a competitor or by extending the protection period - can translate into literally millions of dollars in additional revenue, and every month lost can cost as much. Likewise, reducing a competitor's patent protection can lower that competitor's revenue by a material amount. In an effort to stay on top of global patent activity, one of the largest pharmaceutical companies in the world faced a particularly challenging issue: how to keep its researchers abreast of relevant patent applications and issuances in myriad jurisdictions in near real time. Understanding a patent application and its relevance requires highly specialized training, which makes staffing a team of people to manually go through every patent filing prohibitively expensive.

The solution: deploy a sophisticated conceptual search and automatic categorization solution indexing federated content. This pharmaceutical company's technology of choice takes federated content from outside the enterprise (in this case, from patent-filing databases from jurisdictions all over the world) and classifies it according to the words and concepts contained in the filings themselves. Not only are researchers able to search this rapidly growing patent database for filings related to their particular field, but the system automatically categorizes every patent into the appropriate bucket based on what the patent is about. Thus, the researchers are able to stay abreast of every patent in their field in near real time without having to slog through the content themselves. The solution's conceptual search functionality - which relates each concept to the many different permutations of terminology that could be used to describe it - is particularly useful here, as it means researchers are not required to guess the right search terms in their queries.
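
To make the idea of conceptual search concrete, here is a minimal sketch in Python. It is not the pharmaceutical company's actual technology, which the case study does not name; the concept dictionary, the sample filings and the function names are all hypothetical. It only illustrates how mapping several term variants to one concept lets a query find filings that never use the query's exact words, and how the same mapping can place a filing into a bucket.

```python
import re

# Hypothetical concept dictionary: each concept maps to term variants that
# might describe it. A real system would curate or learn a far larger one.
CONCEPTS = {
    "statin": {"statin", "hmg-coa reductase inhibitor", "cholesterol-lowering agent"},
    "monoclonal antibody": {"monoclonal antibody", "mab", "immunoglobulin therapy"},
}

def concepts_in(text):
    """Return the set of concepts whose term variants appear in the text."""
    lowered = text.lower()
    found = set()
    for concept, variants in CONCEPTS.items():
        for variant in variants:
            # Word boundaries keep a short variant such as "mab" from
            # matching inside a longer, unrelated word.
            if re.search(r"\b" + re.escape(variant) + r"\b", lowered):
                found.add(concept)
                break
    return found

def search(query, filings):
    """Return titles of filings that share at least one concept with the query."""
    wanted = concepts_in(query)
    return [title for title, body in filings.items() if wanted & concepts_in(body)]

# Hypothetical sample filings, invented for illustration only.
FILINGS = {
    "Filing A": "A novel HMG-CoA reductase inhibitor with improved bioavailability.",
    "Filing B": "Humanized mAb targeting a tumor antigen.",
}

if __name__ == "__main__":
    print(search("statin", FILINGS))
    print(concepts_in(FILINGS["Filing B"]))
```

In this sketch a query for "statin" retrieves Filing A even though that filing never uses the word, which is the behavior the case study attributes to conceptual search. Calling `concepts_in` on a filing doubles as a crude categorizer.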


For the legal industry, time is money - literally. With associates' billing rates exceeding $250/hour and partners' upward of $500/hour, efficiency is critical. The challenge for law firms is that their incredibly valuable intellectual property (their work product and expertise) resides in multiple, separate repositories and applications, making information accessibility extremely difficult and time-consuming. Worse, particularly for large diversified firms bidding on new business, lawyers don't know the full breadth of expertise living within the firm and will either spend a significant amount of time figuring this out or will simply avoid bringing in new clients for fear that the firm won't be able to meet their extensive needs.

The solution: a search application that unifies access to all data within the firm in a single, easy-to-use interface, thereby giving access to all of the work product and expertise within that firm. This solution not only pulls information from the usual sources (file servers, databases and intranets) but also incorporates highly sensitive sources (e.g., time/billing systems and personnel records) and even external information feeds. And in order to meet the firm's stringent ethical and conflict-of-interest avoidance requirements, the system applies multiple levels of security to both the users of the system and the content residing in it. Thus, law firms have increasingly turned to this "Google for law firms" solution to make their practices far more efficient, thereby allowing them to raise their rates while actually improving their cost-effectiveness for clients.
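
The combination of unified search and per-user security can be sketched as follows. This is a simplified illustration under stated assumptions, not the firm's actual system: the `Document` type, the group names and the `secure_search` function are hypothetical, and a real deployment would draw permissions from the source repositories rather than hard-coding them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Document:
    title: str
    text: str
    allowed_groups: frozenset  # groups permitted to see this document

def secure_search(query, documents, user_groups):
    """Return titles matching every query term, restricted to documents
    the user is allowed to see (often called security trimming)."""
    terms = query.lower().split()
    results = []
    for doc in documents:
        # Check permissions first so restricted content never reaches the results.
        if not (doc.allowed_groups & user_groups):
            continue
        body = doc.text.lower()
        if all(term in body for term in terms):
            results.append(doc.title)
    return results

# Hypothetical unified index spanning ordinary and sensitive sources.
INDEX = [
    Document("Merger memo", "Draft merger agreement for client X",
             frozenset({"corporate"})),
    Document("Billing summary", "Merger matter billing hours by associate",
             frozenset({"finance"})),
    Document("Intranet FAQ", "How to file a merger conflict check",
             frozenset({"corporate", "finance", "litigation"})),
]

if __name__ == "__main__":
    print(secure_search("merger", INDEX, frozenset({"corporate"})))
```

Here a user in the hypothetical "corporate" group who searches for "merger" sees the memo and the intranet FAQ but not the billing summary, even though all three documents live in the same index.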


The Internet revolution has had a dramatic impact on many industries, and with the possible exception of the music industry, the publishing industry may have experienced the most upheaval. Content became free, and newspapers started losing a huge chunk of their classifieds revenue to Internet upstarts. While the Internet newbies knew how to get content to the masses nearly instantaneously, they were not themselves the creators of content the masses wanted to see - publishers still owned this important piece of the pie. But publishers needed as frictionless a distribution process as possible, and with the rapid growth of information, distribution was becoming increasingly difficult and expensive to handle manually. European media giant Bertelsmann is a perfect example. The company was being bombarded with tens of thousands of news stories every day and had no efficient method to get each story to the right editorial desk.

The solution: completely automate the routing of these news stories via a sophisticated categorization application. The application discerns each story's meaning using true conceptual technologies, which understand that "Java" can refer to coffee, an island in Indonesia or a computing language from Sun, and then routes the story to the appropriate editorial desk. What was once a manual process is now almost completely automated, thereby ensuring that the company can keep up with the explosive growth of information in the years to come.
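
The "Java" example can be illustrated with a toy router that picks a desk from the words surrounding an ambiguous term. This is only a sketch: the desk names, the context vocabularies and the `route` function are assumptions made for illustration, and the article does not describe how the actual application disambiguates meaning.

```python
import re
from collections import Counter

# Hypothetical context vocabularies for three editorial desks.
DESK_CONTEXT = {
    "food": {"coffee", "roast", "beans", "espresso", "cafe"},
    "travel": {"island", "indonesia", "jakarta", "volcano", "tourism"},
    "technology": {"software", "programming", "sun", "code", "developers"},
}

def route(story):
    """Pick the desk whose context words occur most often in the story."""
    words = Counter(re.findall(r"[a-z]+", story.lower()))
    scores = {
        desk: sum(words[word] for word in context)
        for desk, context in DESK_CONTEXT.items()
    }
    best = max(scores, key=scores.get)
    # With no supporting context at all, leave the story for a human editor.
    return best if scores[best] > 0 else "unrouted"

if __name__ == "__main__":
    print(route("Sun releases new Java tools for software developers"))
    print(route("Java coffee beans fetch record prices"))
    print(route("Earthquake strikes Java near Jakarta"))
```

The word "Java" itself carries no weight in this sketch; the neighboring words decide the desk. A production system would need far richer signals than keyword counts, but the principle of resolving ambiguity from context is the same.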

The digital revolution has been a tremendous boon for consumers and businesses alike, but what has excited and energized individuals has also become a sizable and costly headache for many enterprises, which often struggle to meet their compliance, records management and e-discovery needs in the face of this tidal wave of information. Many forward-looking companies have deployed sophisticated search and categorization technology to great effect. By using such cutting-edge tools to automate the organization and retrieval of information, these companies have gained a built-in competitive advantage that should help them for years to come.

