Editor’s Note: This article is the second in a three-part series focusing on the data warehouse evolution.
DSS Part 1 Enterprise Data Warehouse (1991)
In 1991, Bill Inmon published his first book on data warehousing.1 Rather than focus exclusively on why you need one or what it is supposed to do for you, Inmon provided the first widely available how-to guide on building a data warehouse. This book contains the most widely published definition of a data warehouse.
Inmon’s definition helped immensely by providing concrete guidelines. The core components of his definition are as relevant today as when they were published eight years ago. (Figure 1)
|Defining the data warehouse|
|A Data Warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management decisions. |
Subject- orientation mandated a cross-functional slice of data drawn from multiple sources to support a diversity of needs. This was a radical departure from serving only the vertical application views of data (supply-side) or the overlapping departmental needs for data (demand side).
The integration goal was taken from the realm of enterprise rhetoric down to something attainable. Integration is not the act of wiring applications together. Nor is it simply commingling data from a variety of sources. Integration is the process of mapping dissimilar codes to a common base, developing consistent data element presentations and delivering this standardized data as broadly as possible.
Time variance is the most confusing Inmon concept but also a most pivotal one. At its essence, it calls for storage of multiple copies of the underlying detail in aggregations of differing periodicity and/or time frames. You might have detail for seven years along with weekly, monthly and quarterly aggregates of differing duration. The time variant strategy is essential, not only for performance but also for maintaining the consistency of reported summaries across departments and over time.2
Non-volatile design is essential. It is also the principle most often violated or poorly implemented. Non-volatility literally means that once a row is written, it is never modified. This is necessary to preserve incremental net change history. This, in turn, is required to represent data as of any point in time. When you update a data row, you destroy information. You can never recreate a fact or total that included the unmodified data. Maintaining “institutional memory” is one of the higher goals of data warehousing.
Inmon lays out several other principles that are not a component of his definition. Some of these principles were initially controversial but are commonly accepted now. Others are still in dispute or have fallen into disfavor. Modification of one principle is the basis for the next leap forward.
Enterprise Model: For Inmon, the foundation of his subject-oriented design is an enterprise data model. His roots are in data administration, and he has long favored a top-down approach. As the pace of change accelerated in this decade, information technology organizations become more tactical. The data warehouse became the last bastion for enterprise modeling advocates as companies switched from build to buy. Even here, top-down enterprise modeling has been supplanted by more focused subject area designs.
Enterprise Scope: The biggest casualty in this arena has been the goal of building an enterprise-spanning data warehouse. There are too many sources, too many uses, too much volatility and so little time. Moreover, it has proven impossible to get and hold the support of management long enough to build such a broad-based asset. At least it has with the methods used so far.
Decision Support: A basic assumption of Inmon is that a data warehouse is exclusively a store of data for decision support. In his definition, this precludes the use of a data warehouse for what he calls operational reporting. Early on, his writings were ambiguous regarding whether a data warehouse could be used to feed data back to the operational world or to downstream analytic systems.
This intentionally restricted DSS-only perspective initiated a debate that goes to the very heart of what data warehousing is all about. What defines operational reporting? Is it transaction detail? Near real-time visibility? A current-only view? Is it okay to use a data warehouse as a data delivery hub or does this conflict with the DSS mission? How do we balance the multiple purposes? More on this later.
Atomic Detail: The need to collect atomic detail to build a historic base of reusable data is a component of the early architectures and is supported by the Inmon approach. However, many early decision support advocates questioned the value of retaining the most granular detail. After all, they argued, management analysis is always down at an intermediate summary level. Database administrators and operations managers colluded with these misguided DSS folks, claiming performance impacts and capacity constraints.
Early champions stressed that data warehouses had to be built for future needs not just current requirements. For every request such as, “Show me product category totals by month and region,” there lingers, just over the horizon, the follow- on requirement to see products within category or stores within regions or a daily trend. It is far cheaper, even in the short run, to collect and keep the atomic detail than it is to write an expensive program to summarize the data only to have to go back to the source again.
This battle has been won. The multiterabyte databases of today are a result of these voracious needs being satisfied.
Snap Shot Capture: In support of his view of decision support, Inmon defined the need to capture a data “snapshot” representing a consistent state of data as a specific point in time. There is no arguing with the basic principle of consistency management. However, this concept has led us down several blind alleys and set us on a course we are only now about to reverse.
The most simplistic interpretation is that you copy the source (“snapshot”) and load it as is into a data warehouse table. During each cycle, you essentially empty the warehouse table and repeat the process. Inmon never supported this form of full refresh processing, but those who failed to understand the fundamental essence of nonvolatility and historical retention used his definition as justification.
The correct interpretation of snapshot is the capture of net incremental changes. This is now commonly understood. However, debate rages over how to detect and store these changes.
The snapshot concept led to the ETL (extraction/transformation/loading) approach to filling up a data warehouse that is still in vogue today. It is based on the assumption that you build a data warehouse as a DSS add-on independently of the source systems that already exist. The early architectural treatments assumed that the operational and analytic solutions would be co-designed. At a minimum, the source systems could be, and should be, modified to feed data optimally to the data warehouse. The ETL method has you look back at the source database periodically to select, by some filtering mechanism, only those instances added, changed or deleted. Early design theorists considered this time-consuming and error-prone task of finding what changed after the fact a distance third choice.
When data warehousing got started, many implementations acquired data only once a month. Most high impact data warehouses today have ramped up to daily acquisition for most data sources. This is likely to be the theoretic limit of ETL processing.
What we are now seeing is the return of co-engineered solutions. This will change data warehousing fundamentally. Real-time acquisition is coming and will expand our range of opportunities enormously.
Relational versus Star versus Cube: One of the most contentious debates has been around the ONE CORRECT data design philosophy. This debate has many facets but I believe it has the greatest effect on how you store incremental net change details. Inmon is the champion of third normal form storage of historical detail. He offered a start and end date scheme for recording changed data but several others methods have also been proposed. Ralph Kimball later offered dimensional modeling and his star schema structure. He alone defined five different methods for recording changing data.
In the end, Inmon’s definition of data warehousing was the spark that ignited widespread interest while carrying the seeds of controversy that would cause increasing fragmentation in approach.
- W.H. Inmon, Building the Data Warehouse, QED/Wiley, 1991
- The need for a time variant design is often misunderstood, overlooked or outright refuted by those who are most partial to Inmon’s approach. Inmon is sited as the champion for the normalized storage model as opposed to star schema or cube methods. Those who advocate exclusively normalized design are also those that reject the need for physical storage of multiple aggregate levels. They abhor redundancy and erroneously consider a time variant design to be a redundant design.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access