Data warehousing has come a long way, baby. Not so long ago, database theoreticians derided the data warehouse as setting the industry back 25 years. Today, data warehousing is conventional wisdom and a standard part of the corporate information infrastructure.
The past is only a prelude to the future. Looking into the crystal ball, one sees many things - very, very large data warehouses, exploration processing, enterprise resource planning (ERP) vendor support and analytical applications. Perhaps the most intriguing and most promising advances in data warehousing are the possibilities of bridging unstructured data with structured data.
There are two basic forms of unstructured systems - external and internal unstructured systems. External unstructured systems are those that embrace the data found outside the corporation. The Internet is easily the most vibrant example of external unstructured systems. There is also an internal world of unstructured data within the organization's walls, and it holds a wealth of information.
For years, corporations have had two types of systems - formal systems and informal systems. The formal systems have been dominated by databases and transaction processors. Indeed, the worlds of banking, finance and manufacturing make their day-to-day decisions based on transactional processing systems. There is another very important part of the information infrastructure that is not formal - the unstructured informal systems of the corporation. When people think of the unstructured informal systems, their first thought is usually of e-mail. Indeed, e-mail makes up a tremendous part of the informal systems environment, but there is much more to the informal decision-making environment. There are many, many different kinds of unstructured information including spreadsheets, reports and documents.
Internal unstructured data comes in two basic flavors - documents and records. Unstructured documents hold voluminous amounts of text and are notorious for having no form. One unstructured document can differ greatly from another unstructured document. There is no uniformity whatsoever to the unstructured documents.
Unstructured records are a different story. Even though there is no rigid format among unstructured records, there is a marked similarity between the records. Typical unstructured records are contracts, insurance policies, warranties, medical records, financial records and so forth. In addition, e-mails can be considered a form of unstructured records. With unstructured records, there is no fixed or even well-defined format, but the same kinds of data recur from one record to the next.
Another major difference between unstructured records and unstructured documents is that unstructured documents do not normally have what can be called a "key" or "primary identifier" value. Trying to match the content of unstructured documents to similar or related data in the structured environment is strictly a hit-and-miss affair. However, trying to match content between the unstructured record environment and the structured environment is a fairly straightforward process, given the repeating nature of data found in the unstructured record environment.
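To make the contrast concrete, here is a minimal sketch of how a repeating identifier in an unstructured record can be matched to structured data. Everything in it is an assumption made for illustration: the policy-number format ("POL-" followed by six digits), the sample rows and the helper name `link_record` are hypothetical and do not come from any particular product.

```python
import re

# Hypothetical key format: "POL-" followed by six digits (an assumption for illustration).
POLICY_KEY = re.compile(r"\bPOL-\d{6}\b")

# Stand-in for a structured warehouse table, keyed by policy number (sample data).
structured_policies = {
    "POL-104233": {"customer": "A. Rivera", "premium": 1200},
    "POL-220917": {"customer": "J. Chen", "premium": 860},
}

def link_record(text, table):
    """Return (key, row) pairs for every key found in the text that exists in the table."""
    keys = dict.fromkeys(POLICY_KEY.findall(text))  # de-duplicate, keep first-seen order
    return [(key, table[key]) for key in keys if key in table]

# An unstructured record repeats an identifier, so it can be linked.
record = "Renewal notice for policy POL-104233. Policy POL-104233 renews on the stated date."
# An unstructured document carries no such identifier, so nothing links.
document = "Notes from the quarterly planning meeting about renewals."

record_links = link_record(record, structured_policies)
document_links = link_record(document, structured_policies)
```

The record yields one link to the structured row, while the document yields none, which is the hit-and-miss situation described above.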
Trying to bridge the gap between the unstructured environment and the structured data warehouse environment is reminiscent of the early days of extract, transform and load (ETL), when people were not sure they needed a data warehouse and were even less sure that they needed an ETL tool. What exists to bridge the gap between unstructured data and structured data is crude and unfocused. The best that can be said is that some products have some capabilities, and those capabilities appear to be an afterthought. The focus of the vendor-based products for the unstructured environment has been on external unstructured data, not internal unstructured data.
There are some basic problems facing the organization that wishes to create a bridge between the two worlds:
- Access of data. The technology used to support and manipulate unstructured data is quite different from the technology used to support and manipulate the structured world. For the most part, the unstructured vendors have been content to remain in the unstructured world and the structured vendors in the structured world.
- Cross-pollination of environment content. Unstructured data simply does not have the discipline and integrity surrounding it that structured data has. When a value is found in the unstructured world, it is questionable whether the same value found in the structured world is actually the same. When "bill inmon" is found in the structured world, is it the same as "bill inmon" in the unstructured world? Consider if an e-mail said, "It is high time that we bill inmon floral services."
- Synchronization. How do you keep track of changes in one environment and keep them synchronized with changes in the other environment?
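The cross-pollination problem in the list above can be illustrated with a small sketch. It compares a naive text match with a deliberately crude context check. The cue-word list and the function names are assumptions chosen for this example, and the heuristic is not a real entity-resolution method; it only shows why a raw match cannot be trusted.

```python
import re

NAME = "bill inmon"
# Hypothetical cue words suggesting that "bill" is being used as a verb (an assumption).
VERB_CUES = {"we", "will", "must", "should", "please"}

def naive_match(text):
    """Report a match whenever the name string appears anywhere in the text."""
    return NAME in text.lower()

def likely_person(text):
    """Crude check: accept the match only if the preceding word is not a verb cue."""
    words = re.findall(r"[a-z']+", text.lower())
    first, last = NAME.split()
    for i in range(len(words) - 1):
        if words[i] == first and words[i + 1] == last:
            previous = words[i - 1] if i > 0 else ""
            if previous not in VERB_CUES:
                return True
    return False

email = "It is high time that we bill inmon floral services."
memo = "The contract was reviewed by Bill Inmon last week."

email_naive, email_person = naive_match(email), likely_person(email)
memo_naive, memo_person = naive_match(memo), likely_person(memo)
```

The naive match accepts both texts, while the context check rejects the e-mail about billing and keeps the memo. A handful of cue words will misfire on many real sentences, which is exactly why matching across the two environments needs more discipline than a string comparison.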
What are the implications of opening the world of data warehousing to unstructured data? Quite frankly, a whole new world opens up. The world of data warehouses and data marts has been almost exclusively a world of numbers - roll-ups, summaries and drill-downs. From an analytical standpoint, 99 percent of the analysis is numerically based. The advent of unstructured data into the world of the data warehouse means that there are entirely new and unexplored possibilities.
In today's world, there is much talk about the 360-degree view of the customer. The 360-degree view is a wonderful concept, but where are the communications that have transpired between the customer and the corporation? How much good does it do the corporation to know wonderful demographics about a customer when the customer wrote an acerbic e-mail the previous week?
The truth is, there is an almost limitless number of ways that unstructured data enhances a data warehouse. It provides a dimension that is not possible through the standard quantitative analytical tools that are available today.