
Looking Ahead: Unstructured Data

  • August 01 2004, 1:00am EDT

Data warehouses have come a long way. Not so long ago, database theoreticians derided data warehousing as setting the industry back 25 years. Today, data warehousing is a standard part of the corporate information infrastructure.

However, the past is only a prelude to the future. Looking ahead, one sees very large data warehouses, exploration processing, enterprise resource planning (ERP) vendor support, analytic applications and the like. Perhaps the most intriguing and most promising advance in data warehousing will be bridging unstructured data with structured data.

There are two basic forms of unstructured systems - external and internal. External unstructured systems are those that embrace the data found outside the corporation. The Internet is easily the most vibrant example of external unstructured systems. However, within organizations' walls, there is an internal world of unstructured data that holds a wealth of information.

For years, corporations have had formal and informal systems. The formal systems have been dominated by databases and transaction processors. Indeed, many in banking, finance and manufacturing make their day-to-day decisions based on transactional processing systems. Yet, there is another very important part of the information infrastructure that is not formal - the unstructured informal systems of the corporation. When people think of the unstructured informal systems, the first thought is usually of e-mail. Indeed, e-mail constitutes a tremendous part of the informal systems environment. However, the informal decision-making environment also includes spreadsheets, reports, documents and more.

Internal unstructured data comes in two basic flavors - documents and records. Unstructured documents hold voluminous amounts of text and are notorious for having no form. One unstructured document can be as different from another as is possible. Unstructured records are a different story. Even though unstructured records follow no rigid format, there is a certain similarity among them. Typical unstructured records are contracts, insurance policies, warranties, medical records and financial records. In addition, e-mails can be considered a form of unstructured record.

Unstructured documents do not normally have what can be called a "key" or "primary identifier" value. Trying to match the content of unstructured documents against similar or related data in the structured environment is strictly a hit-or-miss affair. However, trying to match content between the unstructured record environment and the structured environment is a fairly straightforward process, given the repeating nature of data found in the unstructured record environment.
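To make the contrast concrete, here is a minimal sketch of how the repeating data in unstructured records might be matched against structured data. Everything specific in it is an assumption made for illustration: the policy-number format (`POL-` plus six digits), the sample e-mails, the `STRUCTURED_POLICIES` table and the `match_records` helper are hypothetical and do not come from any particular product.

```python
import re

# Hypothetical structured data: warehouse rows keyed by policy number.
STRUCTURED_POLICIES = {
    "POL-100234": {"customer": "A. Jones", "premium": 1200.00},
    "POL-100987": {"customer": "B. Smith", "premium": 845.50},
}

# Assumed key format for this illustration: "POL-" followed by six digits.
POLICY_KEY = re.compile(r"\bPOL-\d{6}\b")


def match_records(texts, structured):
    """Return (text index, key, structured row) for each key found in both worlds."""
    matches = []
    for index, text in enumerate(texts):
        # De-duplicate keys within one text and keep a stable order.
        for key in sorted(set(POLICY_KEY.findall(text))):
            if key in structured:
                matches.append((index, key, structured[key]))
    return matches


# Hypothetical unstructured records (e-mails).
EMAILS = [
    "Please update the address on policy POL-100234 before renewal.",
    "General question about your office hours.",
    "Claim filed under POL-100987; see also POL-555555.",
]

RESULT = match_records(EMAILS, STRUCTURED_POLICIES)
for index, key, row in RESULT:
    print(index, key, row["customer"])
```

The second e-mail carries no identifier and is simply skipped, which is the hit-or-miss situation that free-form documents present. The key `POL-555555` is found in the text but has no counterpart in the structured table, so it is not reported as a match.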

Trying to bridge the gap between the unstructured environment and the structured data warehouse environment is reminiscent of the early days of extract, transform and load (ETL), when people were not sure they needed a data warehouse and were even less sure that they needed an ETL tool. What exists to bridge the gap between unstructured data and structured data is crude and unfocused. The best that can be said is that some vendors have some capabilities, and those capabilities appear to be an afterthought. The focus of vendor products for the unstructured environment has mainly been on external unstructured data.

There are some problems facing organizations that wish to create a bridge between the two worlds:

Access to data. The technology used to support and manipulate unstructured data is quite different from the technology used to support and manipulate structured data.

Cross-environment content pollination. Unstructured data simply does not have the discipline and integrity of structured data. There is some question as to whether a value found in the unstructured world is the same as a value found in the structured world.

Synchronization. How do you track changes in one environment and keep them synchronized with changes in the other environment?

The world of data warehousing and data marts has been almost exclusively a world of numbers - rollups, summaries and drill-downs. From an analytical standpoint, 99 percent of the analysis is number-based. The introduction of unstructured data into data warehousing brings entirely new and unexplored possibilities.

Consider CRM, for example. In today's world, there is much talk about the 360-degree view of the customer. This is a wonderful concept, but where are the communications that have transpired between the customer and the corporation? Is there any value in knowing wonderful demographics about a customer when the customer has written an acerbic e-mail the previous week?

Unstructured data enhances a data warehouse in a number of ways. It provides a dimension that is not possible through the standard quantitative analytic tools available today.
