A Retrospective Look at Data Warehousing

February 1, 2001, 1:00am EST

In the beginning were applications, and applications ran transactions and were designed to solve specific business problems. The corporation thought it was building a solid foundation for its future information processing needs by building applications. However, history would show that those legacy applications were a tar pit that snared even the most powerful dinosaur that dared to cross it. Despite the problems with legacy applications, the dinosaur did not die easily.

The most serious problems with the legacy environment included:

  • The legacy applications solved yesterday's business problems and could not be changed or replaced.
  • The legacy applications looked at current data, not historical data; and current data represents only one aspect of information.
  • Collectively, the legacy applications were terribly unintegrated. Unfortunately, modern corporations needed integrated information in order to look at the entire corporation.
  • The legacy applications were built on older technology, and moving away from that technology was difficult.

The legacy applications were designed to collect and store information efficiently. They were not designed to allow data to be easily and elegantly accessed.

In the late 1980s and early 1990s, something had to be done to address the growing level of discomfort with legacy applications. Thus, the concept of the data warehouse was born. The data warehouse was a database that stored detailed, integrated, current and historical data.

The data warehouse was built from the detailed transaction data that was generated by and aggregated in the legacy applications. One of the major tasks of the data warehouse developer was going back into the legacy systems environment to find, convert and reformat legacy data. The task of integrating legacy data that was never designed for integration was daunting and dirty. Yet, this task was one of the cornerstones of the success of data warehousing.

In the early days of data warehousing, there were no tools for integrating and converting data. The work had to be done by hand; and it was thankless, toilsome work. Soon, a subindustry evolved called the integration/transformation (i/t) or the extract, transform and load (ETL) industry. Software products were created that allowed the legacy environments to be integrated in an automated manner. There were two types of ETL tools: code generators that could handle any conversion scenario and run-time generators that were easy to use but allowed for only limited complexity in integration.
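
As a rough illustration of the kind of work the i/t layer automated, here is a minimal sketch of an extract-transform-load pass in Python. The legacy file layout, field names and code mappings below are hypothetical; they stand in for whatever the real source systems used.

    import csv
    from datetime import date

    # Hypothetical mapping that reconciles inconsistent legacy gender codes
    # into a single warehouse standard.
    GENDER_MAP = {"M": "MALE", "F": "FEMALE", "1": "MALE", "0": "FEMALE"}

    def extract(path):
        """Read raw rows from a legacy extract file (illustrative CSV layout)."""
        with open(path, newline="") as source:
            yield from csv.DictReader(source)

    def transform(row):
        """Integrate and reformat one legacy row into the warehouse layout."""
        return {
            "customer_id": row["CUSTNO"].strip().zfill(10),           # unify key formats
            "gender": GENDER_MAP.get(row["SEX"].strip(), "UNKNOWN"),  # reconcile encodings
            "balance_usd": round(int(row["BAL"]) / 100, 2),           # stored cents to dollars
            "load_date": date.today().isoformat(),                    # warehouse load stamp
        }

    def load(rows, path):
        """Write integrated rows to a warehouse staging file."""
        rows = list(rows)
        with open(path, "w", newline="") as target:
            writer = csv.DictWriter(target, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)

    if __name__ == "__main__":
        # Demonstrate the transformation step on one hand-built legacy record.
        print(transform({"CUSTNO": "  4711", "SEX": "1", "BAL": "105099"}))

Roughly speaking, code generators emitted standalone conversion programs of this shape for each source, while run-time engines interpreted mappings of this sort at execution time.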

The task of integration was so complex and intimidating that almost immediately there arose the notion that it was not necessary. The conventional wisdom was that if you could build a data mart from the legacy environment, you could bypass the nasty work of integration. Soon, multidimensional OLAP vendors proclaimed that with data marts you did not need a data warehouse and you did not need to do integration and transformation. Data mart vendors sold the story that you only needed to throw data directly from the legacy environment into their tool, and you would have an instant "mini" data warehouse.

The data mart approach was quite attractive until the fourth or fifth data mart. Upon building the fourth or fifth data mart, the architects discovered that the same detailed data was being used for every data mart, that there was no reconciliation of conflicting values from one data mart to the next, that each new data mart had to be built from scratch and that the interface from the legacy application environment to the data mart environment was very complex. The amount of resources required to pull legacy data from the source databases became a significant drain with each new data mart that was added.

In short, after the world had tried the data mart approach, it became obvious that the foundation of a data warehouse was the proper structure to build. The data mart would eventually have to be rebuilt properly with the foundation of a data warehouse.

Data marts, however, did find wide usage as departmental processing environments where the data from the data warehouse was fed directly into the data mart. The departments that enjoyed the multidimensional OLAP technology were marketing, sales, finance, accounting and engineering. These "dependent" data marts fit comfortably with the data warehouse. It was only where "independent" data marts were built (the data marts built directly from the legacy applications) that there was a problem.

Another related structure, which is sometimes classified as a type of a data warehouse on its own, is the operational data store (ODS). The ODS allowed the users to have direct online access to data while still having integrated data. The ODS was fed from the i/t layer of processing after the detailed data had been integrated. The ODS served many clerical, operational needs of the corporation. The ODS occasionally fed data to the data warehouse.

Data warehouses soon grew to be very large. Multiterabyte databases became commonplace. The increase in data was caused by the explosive growth of detailed data coupled with historical data. As warehouses passed the terabyte range, developers explored alternative design approaches. One of the design approaches they developed was near-line storage. Near-line storage is an alternative form of storage where data is stored in siloed tape units. With near-line storage, the cost of data warehousing drops dramatically and the volume of data that can be stored rises exponentially. Without an alternative storage medium, the future of data warehousing was limited.
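
As a sketch of the kind of policy that pushed data to the near-line tier, assume a simple age-based rule over warehouse partitions; the cutoff and the partition names below are purely illustrative.

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class Partition:
        name: str
        newest_row: date  # date of the most recent row held in the partition

    def storage_tier(partition: Partition, today: date, online_days: int = 365) -> str:
        """Keep recent detail on online disk; move older history to near-line tape."""
        age_in_days = (today - partition.newest_row).days
        return "online disk" if age_in_days <= online_days else "near-line tape"

    if __name__ == "__main__":
        partitions = [
            Partition("sales_detail_2000_q4", date(2000, 12, 31)),
            Partition("sales_detail_1997_q1", date(1997, 3, 31)),
        ]
        for p in partitions:
            print(p.name, "->", storage_tier(p, today=date(2001, 2, 1)))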

Another form of data warehousing appeared called the exploration warehouse. The exploration warehouse was a place where the heavy statistical analysis of data occurred. The two types of statistical processing were exploration processing, where hypotheses and assumptions were formed based on observations of data, and data mining, where those assumptions and hypotheses were tested. The exploration warehouse grew when it became apparent that only a limited amount of statistical analysis could be done on the data warehouse itself.
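
A toy sketch of the two modes, using synthetic numbers to keep it self-contained: the exploration step browses one slice of the data to form a hypothesis, and the mining step tests that hypothesis against data held back from exploration.

    import random
    import statistics

    # Synthetic (age, annual_spend) pairs standing in for warehouse detail data.
    random.seed(1)
    data = [(age, 200 + 4 * age + random.gauss(0, 40))
            for age in range(20, 70) for _ in range(5)]
    random.shuffle(data)
    explore_set, test_set = data[:125], data[125:]

    def older_spend_more(rows):
        """True if customers 40 and over outspend younger customers in these rows."""
        young = [spend for age, spend in rows if age < 40]
        older = [spend for age, spend in rows if age >= 40]
        return statistics.mean(older) > statistics.mean(young)

    # Exploration processing: observe a slice of the data and form a hypothesis.
    print("Hypothesis formed from exploration:", older_spend_more(explore_set))

    # Data mining: test the same hypothesis on data held back from exploration.
    print("Hypothesis confirmed on held-out data:", older_spend_more(test_set))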

Coinciding with the building of the data warehouse was the advent of the Internet and Web sites. At first, Web sites were built independent of the data warehouse. However, several problems with Web site processing arose including the enormous volume of data that was generated, the need to integrate Web site data with other corporate data and the need to be responsive to changing market conditions.

The companies that wished to succeed with e-business quickly discovered that they had to have a data warehouse. The Web site passed clickstream data to the data warehouse through a granularity manager and received data warehouse data from an ODS.
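
A sketch of what a granularity manager might do to clickstream data before it reaches the warehouse, assuming raw click events and per-session page-view counts as the target grain; the field names here are hypothetical.

    from collections import defaultdict

    def condense_clickstream(events):
        """Aggregate raw click events to one row per (session, page).

        The warehouse cannot absorb every individual click, so the granularity
        manager raises the data to a coarser, manageable level of detail.
        """
        counts = defaultdict(int)
        for event in events:
            counts[(event["session_id"], event["page"])] += 1
        return [{"session_id": s, "page": p, "page_views": n}
                for (s, p), n in sorted(counts.items())]

    if __name__ == "__main__":
        raw_clicks = [
            {"session_id": "s1", "page": "/home"},
            {"session_id": "s1", "page": "/home"},
            {"session_id": "s1", "page": "/checkout"},
            {"session_id": "s2", "page": "/home"},
        ]
        for summary_row in condense_clickstream(raw_clicks):
            print(summary_row)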

The different forms of processing that surrounded the data warehouse came to be known as the corporate information factory (CIF).

An essential component of the CIF is meta data. Meta data is needed in order for each of the components to communicate and coordinate with each other. The meta data model that is needed for the CIF is called the distributed meta data model. In the distributed meta data model, different units of meta data are established, stored and managed in the various components of the CIF. There is meta data that belongs to the data warehouse, data marts, near-line storage and the ODS. Meta data is stored and owned locally but can be distributed across the enterprise without losing the integrity of ownership.
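
One way to picture the distributed meta data model in code: each CIF component keeps and manages its own meta data locally but hands out read-only views to the rest of the enterprise. The class and attribute names below are illustrative, not drawn from any product.

    from dataclasses import dataclass, field
    from types import MappingProxyType

    @dataclass
    class LocalMetaData:
        """Meta data owned, stored and managed by a single CIF component."""
        owner: str                                   # e.g. "data warehouse", "ODS"
        entries: dict = field(default_factory=dict)

        def define(self, name: str, description: str) -> None:
            """Only the owning component creates or changes its own meta data."""
            self.entries[name] = description

        def publish(self):
            """Other components receive a read-only view, so ownership is preserved."""
            return MappingProxyType(self.entries)

    if __name__ == "__main__":
        warehouse_md = LocalMetaData(owner="data warehouse")
        warehouse_md.define("customer", "integrated customer record with full history")

        # A data mart elsewhere in the CIF can read, but not alter, this meta data.
        shared_view = warehouse_md.publish()
        print(shared_view["customer"])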

The CIF grew in different stages in different organizations; however, the general order of the stages was:

  1. The i/t or ETL layer was built.
  2. The data warehouse was designed and populated.
  3. The ODS was built.
  4. The data marts were built.
  5. Overflow data was placed in the near-line component.
  6. The exploration warehouse was built.
  7. The Web site was connected to the data warehouse and the ODS.

Along the way, the distributed meta data was built in each of the components of the CIF.

The CIF is able to accommodate many different functions and requirements. When built properly, the CIF is able to:

  • Handle an unlimited amount of data.
  • Provide a single point of integration.
  • Provide a foundation of detailed data that can be reshaped in many ways.
  • Provide a rich historical foundation of information.
  • Allow the power of multidimensional processing to be unleashed.
  • Allow data to be explored at the lowest level of detail.
  • Allow key performance indicators to be tracked.
  • Allow different parts of the CIF to be used in very different ways.
