True Mission
This week we continue our discussion of architectural overhaul. We will build on ideas introduced in previous sections. If you have not read them yet, I recommend that you do so before proceeding. (DM Direct beginning May 10, 2002.)
One ultimate goal is to move toward facilitating the flow of information. Before we can begin, we must first master the selection of optimal structures for our data warehouse architecture. The incorrect match between data structure and its intended role in our architecture can be a barrier, rather than an enabler, to the flow of information.
Confusion of Methods
The history of data warehousing is dominated by the war of competing design frameworks. I call them the Rationalists, the Stargazers and the Cubists. The Rationalists believe in the purity of the relational model as expressed by the rules of normalization. The Stargazers are religiously devoted to the star schema and the practice of dimensional modeling. The Cubists proclaim the supremacy of multidimensional database structures.
The Cubists have always been a niche faction in the data warehouse pantheon. They are actually the oldest cult since multidimensional databases (MDDB) pre- date the relational DBMS that supports the other two factions. The MDDB has a strong following in select functional areas, such as finance, but is not suitable for core data warehouse functionality.
The Cubists have no single spokesperson today. However, their passion was given credence when the original father of relational theory, E.F. Codd, defined the rules for online analytic processing (OLAP) in the mid-1990s. This manifesto supported the arrival of a new, more capable generation of multidimensional products, such as Essbase, and later Microsoft's Analysis Services for SQL Server.
The Rationalists were originally the dominant faction. Relational design unified what was previously a chaos of proprietary databases, each with a unique methodology. For most of the early DW architects this was not only all they knew, it was a proven approach for traditional application design. The pure Rationalist line is that the whole of the data warehouse must be implemented in at least third normal, non-redundant form.
Bill Inmon is often touted as the champion of the Rationalist faction. This is a fundamental misreading of his much more sophisticated framework. From the very beginning, he advocated layers of lightly and heavily summarized data sets in a multiplicity of time variant forms to serve as the access layer for a data warehouse. These are what we refer to as summary/aggregates. They feature various degrees of denormalization and definitely violate the pristine notions of non-redundancy. Summary/aggregates are a necessary component of a functional data warehouse.
The Stargazers are the newest and fastest growing faction. The concept of a star scheme is about as old as relational theory, and variants of this method were used in early mainframe-based information centers on tabular databases such as Ramis, Focus and Nomad. The Metaphor marketing analysis workstation, introduced in 1983, was based on a proprietary star schema database.
Ralph Kimball is responsible for introducing data warehouse practitioners to this design approach. It is the dominant model for data marts and a growing component of more cross-functional data warehouses. Dimensional modeling offers a more business-intuitive approach than entity- relationship techniques and result in a more access- optimized design. He and his advocates are strong opponents of purely normalized data warehouse implementations, with good reason. The success rates are much higher for a pure star schema solution.
Let's explore the benefits and deficits of each structure:
Normal (3rd- 5th)
The traditional justifications for normalization either don't apply to data warehouse design or are down right dysfunctional.
- Eliminate redundancy - Eliminating duplicate rows is good. Eliminating "redundant" attributes in a star schema dimension table will actually destroy its high- access efficiency. When this concept is applied to reducing the total number of tables, it runs counter to the data warehouse principle of optimization via specialization. A summary table is not redundant if it supports more efficient access by multiple users. A maintained summary can be used to accurately reproduce historical results and improve consistency. The desire to eliminate redundancy is the most pernicious fallacy carried over from application design.
- Save space - This corollary to eliminating redundancy is a holdover from another era. Saving cost by saving space is low on the totem pole of optimization techniques. The relative impact of storage on cost is way down. The loss of access efficiency has far greater cost impact.
- Support efficient update - Does not apply at all since update is not allowed. The load methods for relational tables in a star schema design can actually be more efficient than a load of normalized transaction and snow- flaked reference data.
- Avoid corruption - The use of DBMS controlled referential integrity is actually counter-productive. Referential integrity is designed to support the original capture of data in the source system. A data warehouse needs to detect, flag and possibly correct data anomalies BUT rejecting incoming records can only reduce the integrity of results produced from DW data.
- Representation advantages - Entity/relational modeling evolved to ensure completeness of the design for the intended purpose and to offer a more intuitive representation of the real world. In the ambiguous world of data warehouse information needs, no technique can ensure completeness of design. In fact, the standard methods dictate that you only include attributes and relationships for which there is a current expressed need. This is suboptimal for data warehouse design. As for offering a more intuitive representation, it seems clear that dimensional design wins hands down.









