A data model is one of the building blocks of the data warehouse. Starting with the logical data model, the data warehouse data model is developed; and the viability of the data warehouse is dependent on the design of this model. A question that has occasionally been asked of me is, "How do I know if my data warehouse data model is good?" Detecting a poor design is much easier than confirming a good design. A poor design can often be determined through a violation of data warehouse data modeling rules. A good design cannot be confirmed without an examination of the content. This column addresses two important characteristics of a good data warehouse data model--consistency with the logical model and relevance. Other characteristics will be addressed in future columns. A data warehouse data model is a particular type of data model. As such, it must observe sound data modeling practices pertaining to entity and attribute names, entity and attribute definitions, diagram conventions, etc. This column focuses on the characteristics that are essential in good data warehouse data models.
Consistency with the Logical Model
The first important characteristic of the data warehouse data model is that it must be semantically consistent with the logical model.1 The logical model describes the business entities and the relationships among them, independent of any system implementation. The consistency requirement applies at the entity level, the attribute level and the relationship level.
Consistency at the entity and attribute level does not mean that every entity in the data warehouse model must be present in the logical model. There are legitimate reasons for differences at both levels. The logical model is typically created in pure third-normal form. The data warehouse model design considers the role of the data warehouse, and reflects some degree of denormalization and design for performance. One of the steps in creating the data warehouse model is the addition of entities for summary or derived data. 2 These entities can be related to the logical model, but they need not exist in that model. Similarly, an entity that represents the result of a calculation (net sales amount, for example) may exist in the data warehouse model even if it is not present in the logical model.
The business relationship between entities is described in the logical model. One such relationship may be that an employee must always be assigned to a department, and cannot be assigned to more than one department at any point in time. Remaining consistent with this relationship in the data warehouse data model dictates the use of a foreign key relationship between the department and the employee. Further, this relationship is non-identifying and mandatory. The retention of history in the data warehouse will track the information over time.
One of the greatest causes of data warehouse failures is the tendency to include too much in the data warehouse--either initially or eventually. The data warehouse data model needs to be constrained to the data needed to support the objectives of the data warehouse. 3 The end users of the data warehouse cannot fully define their requirements, but this does not justify wholesale inclusion of every data element. The answer to an open-ended question, "What do you want in the data warehouse?" is predictable--"Everything!" Instead of asking this question, the analysts need to understand the information issues being addressed and the business questions being asked. With this information, the business and data analysts can work together to identify the data elements that are needed in each iteration of the data warehouse.
The classic data warehouse is designed to support the strategic decision-making process of the company. This orientation dictates that data needed solely for operational activities should be excluded. Some data is obviously operational, and some data is obviously needed to support the strategic decisions. Many elements, however, cannot be quickly classified either way. Names are typically not strategic, but they are often required as labels on reports and are, therefore, frequently included in the data warehouse. Full address information is typically not needed to support strategic decisions. A ZIP code (or city, state, country) is often appropriate, but inclusion of the full street address can often be avoided.
Including too much data in the data model can significantly add to the project duration and cost. Each data element must be defined and placed within an appropriate entity. In addition, analysis to determine the source system of record needs to be performed, history retention requirements need to be determined, and transformation and cleansing processes need to be developed. The activities required to absorb each element can require hours--or even days--of effort. If the element is controversial, the resolution of the differences of opinion can even take months!
In addition to the time required to define each data element, there is a significant downstream impact on the data warehouse. The street address of a customer provides a good illustration. Including the full street address can add 100 characters of information to each customer record. For a company with 1,000,000 customers, this increases the data warehouse size by 100 megabytes. While disks are cheap, the impact on the load process can be significant if the data is volatile or if the source systems do not easily detect changes. While 100 megabytes may not seem like much, this is the impact of including just one additional set of elements.
The data warehouse data model must observe basic data modeling conventions. The absence of certain characteristics is often indicative of a poor data warehouse model. A mechanically correct model may still be deficient, but analysis to uncover that requires an understanding of the data and its planned use. This article reviewed two important characteristics of a good data warehouse data model. The model must be semantically consistent with the logical model to ensure that the business relationships are preserved. Additionally, data included in the model should be relevant to the purpose of the data warehouse. Violating the first characteristic can lead to a warehouse that does not accurately represent the business; violating the second can have a significant impact on the project duration, cost and the resultant warehouse's operating characteristics.
1 Geiger, Jonathan G. Why Five Types of Data Models? DM Review. May 1997.
2 Inmon, W. H. Summary Data: The New Frontier. DM Review. May 1996.
3 Inmon, W. H. Using the Data Model in the Data Warehouse Design. DM Review. May 1997.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access