It seems as though every dimensional modeler in the world loves to take a pot shot at the entity-relationship (ER) model. It also seems that the same dimensional modelers would benefit from a chat with Yogi Berra, who said: "In theory there is not much difference between theory and practice. In practice, there is." There is another saying, "Don't knock it 'til you've tried it."
This article is about analytical modeling. It will reexamine the concept of data design for analytical systems such as the data warehouse. It will take a close look at dimensional modeling and define its proper role and context. It will position ER modeling, dimensional modeling and other forms of modeling into a general framework. It will openly ask some questions. Is dimensional modeling the be-all and end-all of data warehousing? Or, is dimensional modeling one of the great con jobs in data management history?
Properly addressing the matter of analytical modeling first requires a number of definitions. Without these, we would be talking in circles. Next, we will present and briefly discuss some special and difficult issues concerning analytical modeling.
Levels of Data
For years, data management people believed that there was only one real, persistent level of data: the operational level. All other data, while accepted, was derivable from this level. This is not true. There are several levels of data within an organization. The reason stems not from information technology (IT), but from business.
Classically, there are three major levels of management and decision making within an organization: operational, tactical and strategic (see Figure 1). While these levels feed one another, they are essentially distinct. Operational data deals with day-to-day operations. Tactical data deals with medium-term decisions. Strategic data deals with long-term decisions. Decision making changes as one goes from level to level. At the operational level, decisions are structured. This means they are based on rules. (A credit card charge may not exceed the customer's credit limit.) At the tactical level, decisions are semi-structured. (Did we meet our branch quota for new loans this week?) Strategic decisions are unstructured. (Should a bank lower its minimum balances to retain more customers and acquire more new customers?)
Figure 1: Levels of Analysis
Corresponding to each of these decision levels are three levels of data. These levels of data are also separate and distinct, again with one feeding the other. Not all strategic data can be derived from operational data. In an organization, there are at least four different kinds of data: internally owned, externally acquired, self-reported and modeled. External data, such as competitive data, is obviously acquired from outside agencies. Modeled data is data that is mathematically created (e.g., data created by analyzing customer characteristics and market demographics, and producing such measures as market segments). External and modeled data do not exist in the operational environment. Strategic data is usually comprised of internal and external data, roll-up hierarchies and modeled data.
Levels of Analysis
There are many levels of reporting and analysis that range from fairly structured to quite unstructured. These levels are exemplified in Figure 2.
Figure 2: Levels of Reporting and Analysis
Characteristics of Analytical Data
Analytical data has its own characteristics. It is management-oriented, historical, query-oriented and integrated.
Management-oriented. Analytical data focuses on management measures. To do this, it often uses different grains of data, such as transactions, periodic snapshots and summaries. Management data is often rolled-up to different levels. Management also requires some cross-functional information. External data is often used to supplement internal data.
Historical. Management needs several years of historical data to reveal trends. This allows year-to-year comparison and reveals patterns over time. Therefore, changes in facts and dimensions need to be kept over time. A common historical requirement is to allow restated and non-restated versions of the data. This allows management to pose changes and test the effect of the changes on the business. It allows them to restate the past in terms of the present or the present in terms of the past.
Query-oriented. Analytical data is not used for transaction processing and maintenance but for reporting and different forms of analysis, such as mining. It is mostly read-only. It is not necessarily totally read-only, because some data could be changed as new information is discovered. The analytical environment needs to be able to support a wide range of query types, such as ad hoc, prescribed ad hoc and standardized. The warehouse can be queried directly or used to supply extracts.
Integrated. Proper analysis and reporting requires data from multiple relevant sources. These can be internal, external, self-reported and even modeled data sources. The data must be reconciled and cleansed to ensure its quality. The integration of disparate data is one of the main challenges in the analytical environment.
Data Modeling Definitions
Before proceeding, it is essential to establish a number of basic data management definitions because there is so much confusion and conflict around them. Bear with this discussion; it is important.
An ER model is a logical and graphical representation of the information needs of an organization. There are three main processes involved in ER modeling: classifying, characterizing and interrelating. The objects of interest to the organization are grouped into mutually exclusive classes called entities. These classes are assigned characteristics that describe them, called attributes. An important attribute is the identifier or key. Finally, one class or entity can be associated with another via connections called relationships.
A logical model is a representation of a business problem, without regard to implementation, technology and organizational structure. The purpose of a logical model is to represent the business requirement completely, correctly and concisely. A constraint of this model is that all redundancy is removed in order to focus purely on the business requirements and rules. A logical model does not presuppose the granularity of the data; this means that a logical model does not require that the data be only at a very detailed operational level. A logical model is not implemented; instead, it is converted to a physical model against which optimizations are performed, and this is implemented.
Which of the models in Figure 3 is an ER model? Both are. The model on the left is obviously an operational model. The model on the right is obviously an analytical model. It is relatively clear that the model on the left is atomic. However, we don't know about the one on the right. It could be the lowest level we decide to keep, in which case it is atomic to us. Here's an additional bit of business information: we buy this data "as is" from an outside agency. In this case, it is an atomic ER model of analytical data and cannot be decomposed further.
Either of these models could be implemented as is; however, usually ER models undergo transformation to physical models before being implemented.
A physical model is the specification of what is implemented. Physical models should be optimized, efficient, buildable and robust. Figure 4 is an example of a possible physical model of the model on the left in Figure 3.
Logical to Physical Transformation
The conversion of a logical model to a physical model depends on many factors, especially the size and complexity of the data, the complexity of the queries and the number of users. Other factors are shown in Figure 5.
Figure 5: Factors for Conversion to a Logical Model
The conversion from logical to physical models can be simple or complex, depending on the requirements. As is shown in Figure 6, a logical model undergoes several transformations as it progresses from a purely logical model to a physically implemented model. Following is a brief review of some of the possible transformations. We cover three forms of optimizations or trade-offs: safe, aggressive and technical. A trade-off is the emphasis of one feature, which becomes an advantage, against another feature, which then becomes a disadvantage. Trade-offs reflect the simple principle of TANSTAAFL: there ain't no such thing as a free lunch.
Figure 6: Progression from Logical Model to Physically Implemented Model
Safe trade-offs do not introduce redundancy or any integrity compromises. They merely combine or split entities. For example, one could: combine two entities into one, or collapse one entity into another; split entities into multiple tables based on usage; or violate 1NF by including arrays of data or repeating groups of data.
Aggressive trade-offs do compromise integrity and/or non-redundancy. For example, one could: store derived data, including individual summaries and aggregate tables; add redundant data and relationships; or replace natural keys with surrogate keys.
Technical trade-offs are DBMS options, structures or parameters. Most of these choices involve subtle compromises. For example, one could: add indices, which can improve query performance but degrade load performance; adjust buffer pool and heap sizes; or replicate data across multiple processors.
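The trade-offs above can be sketched concretely. The following is a minimal illustration using SQLite from Python; the table and column names are invented for the purpose. It shows an aggressive trade-off (storing derived data as a summary table) and a technical trade-off (adding an index that helps queries but slows loads):

```python
import sqlite3

# Hypothetical order data; all names here are illustrative only.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        order_date  TEXT,
        amount      REAL
    );
    INSERT INTO orders VALUES
        (1, 100, '2003-01-05', 250.0),
        (2, 100, '2003-01-20', 100.0),
        (3, 200, '2003-01-11',  75.0);
""")

# Aggressive trade-off: store derived data as a summary (aggregate) table.
# Queries get faster, but the data is redundant and must be refreshed.
con.execute("""
    CREATE TABLE monthly_sales AS
    SELECT substr(order_date, 1, 7) AS month,
           SUM(amount)              AS total_amount,
           COUNT(*)                 AS order_count
    FROM orders
    GROUP BY substr(order_date, 1, 7)
""")

# Technical trade-off: an index improves query performance
# at the cost of load performance.
con.execute("CREATE INDEX ix_orders_customer ON orders (customer_id)")

row = con.execute("SELECT total_amount, order_count FROM monthly_sales").fetchone()
print(row)  # (425.0, 3)
```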
An important observation here is that all models, whether they be operational or analytical, can take some path through this process.
The factors for both models are the same. It is the value of those factors that changes. For example, the ratio of Customer to Orders tables in online transactional processing (OLTP) is relatively low (e.g., a customer places fifty orders on average in a year). In the analytical environment, the ratio can be much higher because over time, a customer could place hundreds of orders. The number of occurrences of most tables, such as Orders tables, in OLTP is also relatively low. In the analytical environment, it is not uncommon for an Orders table to reach several billion rows. Tables with these volumes and ratios need to be carefully designed for large-scale query performance.
Dimensional Modeling Definitions
A dimensional model, such as the one in Figure 7, is a form of analytical design (or physical model) in which data is pre-classified as a fact or dimension. The purpose of a dimensional model is to improve performance by matching the data structure to the queries. People use the data by writing queries such as, "Give me this period's total sales volume and revenue by product, business unit and package." Access can be inside-out or outside-in. When access occurs from dimension to fact, it is outside-in. "Give me total sales in volume and revenue for product XYZ in the NE region for the last period, compared to the same period last year" is outside-in. Access can also be inside-out, in which case the query analyzes the facts and then retrieves the appropriate dimensions. For example, "Give me the characteristics of term life policies for which the insured amount is less than $10,000" is inside-out.
Figure 7: Dimensional Model Example
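Outside-in and inside-out access can be sketched in SQL. The following is an illustrative example using SQLite from Python; the star schema and its data are invented for the purpose:

```python
import sqlite3

# A minimal star schema: one fact table and one dimension.
# Table names, columns and data are hypothetical.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE sales_fact  (product_id INTEGER, volume INTEGER, revenue REAL);
    INSERT INTO product_dim VALUES (1, 'XYZ', 'NE'), (2, 'ABC', 'SW');
    INSERT INTO sales_fact  VALUES (1, 10, 500.0), (1, 5, 250.0), (2, 8, 400.0);
""")

# Outside-in: start from the dimension (product XYZ in the NE region),
# then aggregate the matching facts.
outside_in = con.execute("""
    SELECT SUM(f.volume), SUM(f.revenue)
    FROM product_dim d JOIN sales_fact f ON f.product_id = d.product_id
    WHERE d.name = 'XYZ' AND d.region = 'NE'
""").fetchone()
print(outside_in)  # (15, 750.0)

# Inside-out: start from a condition on the facts (revenue under 300),
# then retrieve the describing dimension rows.
inside_out = con.execute("""
    SELECT DISTINCT d.name, d.region
    FROM sales_fact f JOIN product_dim d ON d.product_id = f.product_id
    WHERE f.revenue < 300
""").fetchall()
print(inside_out)  # [('XYZ', 'NE')]
```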
A particular form of a dimensional model is the star schema, which consists of a central fact table containing measures, surrounded by one perimeter of descriptors, called dimensions. In a star schema, if a dimension is complex or leveled, it is compressed or flattened into a single dimension. For example, if Product consists of Product, Brand and Category, Brand and Category are compressed into Product. This compression causes some redundancy, but can sometimes improve performance. In a star schema, related complex dimensions are denormalized to the extent that they are flattened into a single dimension.
Another version of the dimensional model is the snowflake. In a snowflake model, complex dimensions are re-normalized. A snowflake model is a model in which a given dimension has relationships to other levels of the same dimension. In the snowflake, the different levels or hierarchies of a dimension are kept separate. In Figure 8's model, Product, Brand and Category would be maintained as three separate but related tables.
Figure 8: Snowflake Model
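The star-versus-snowflake distinction can be illustrated with a toy schema. The tables below are hypothetical; the point is that the snowflake's normalized hierarchy recovers, through joins, exactly what the star's flattened dimension stores redundantly:

```python
import sqlite3

# All table names and data are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
    -- Star-style flattened dimension: Brand and Category compressed into Product
    CREATE TABLE product_star (
        product_id INTEGER PRIMARY KEY,
        product TEXT, brand TEXT, category TEXT
    );
    INSERT INTO product_star VALUES
        (1, 'Cola 12oz', 'FizzCo', 'Beverages'),
        (2, 'Cola 24oz', 'FizzCo', 'Beverages');

    -- Snowflake-style normalized hierarchy: three separate, related tables
    CREATE TABLE category (category_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE brand    (brand_id INTEGER PRIMARY KEY, brand TEXT,
                           category_id INTEGER REFERENCES category);
    CREATE TABLE product  (product_id INTEGER PRIMARY KEY, product TEXT,
                           brand_id INTEGER REFERENCES brand);
    INSERT INTO category VALUES (10, 'Beverages');
    INSERT INTO brand    VALUES (20, 'FizzCo', 10);
    INSERT INTO product  VALUES (1, 'Cola 12oz', 20), (2, 'Cola 24oz', 20);
""")

# The snowflake reproduces the flattened view via joins; the star trades
# redundancy (brand/category repeated on every product row) for join-free access.
flattened = con.execute("""
    SELECT p.product, b.brand, c.category
    FROM product p
    JOIN brand b    ON b.brand_id = p.brand_id
    JOIN category c ON c.category_id = b.category_id
    ORDER BY p.product_id
""").fetchall()
star = con.execute(
    "SELECT product, brand, category FROM product_star ORDER BY product_id"
).fetchall()
print(flattened == star)  # True
```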
Denormalization and the Dimensional Model
The fact that both the star and snowflake are physical schemas can be illustrated by examining a sample model. A dimensional model typically (not always) uses denormalization, as shown in Figure 9:
Figure 9: Dimensional Modeling Using Denormalization
- It violates 3NF in dimensions by collapsing higher-level dimensions into the lowest level as in Brand and Category.
- It violates 2NF in facts by collapsing common fact data from Order Header into the transaction, such as Order Date.
- It often violates Boyce-Codd Normal Form (BCNF) by recording redundant relationships, such as the relationships both from Customer and Customer Demographics to Booked Order.
- However, it supports changing dimensions by preserving 1NF in Customer and Customer Demographics.
This systematic degree and pattern of denormalization is reserved for physical models. The important conclusion from this (and other discussions in this article) is that the star schema and snowflake schema are not logical models but physical models.
Is a Fact Always a Fact?
A problem with the dimensional model is that data must be preclassified as a fact or a dimension. In addition, only dimensions are permitted to have indices. While this sometimes works, it just as often doesn't. This author prefers the view that data is data and that it is a fact or a dimension relative to the query within which it is used, not in and of itself. For example, we have Delivery Items and Order Items in our data warehouse, each as its own star schema. In many ways, this works fine until we need to relate Delivery Items and Order Items. We do this when we conduct exception reporting. (What Orders did not result in Deliveries?) For this, we need to relate Delivery Items and Order Items. Order Items, which was a fact table, is now effectively a dimension in the exception query. We are told we should not do this; this is a fact-to-fact join. However, in most accounts, we have not had any difficulty doing fact-to-fact joins.1
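The exception report described above amounts to an outer join between the two fact tables. A minimal sketch using SQLite from Python, with invented table names and data:

```python
import sqlite3

# Hypothetical order-item and delivery-item fact tables.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE order_items    (order_id INTEGER, item_id INTEGER, qty INTEGER);
    CREATE TABLE delivery_items (order_id INTEGER, item_id INTEGER, qty INTEGER);
    INSERT INTO order_items    VALUES (1, 10, 5), (1, 11, 2), (2, 10, 3);
    INSERT INTO delivery_items VALUES (1, 10, 5), (1, 11, 2);
""")

# Exception report: which ordered items never resulted in a delivery?
# This is a fact-to-fact join; Order Items effectively acts as a
# dimension for Delivery Items in this query.
undelivered = con.execute("""
    SELECT o.order_id, o.item_id
    FROM order_items o
    LEFT JOIN delivery_items d
           ON d.order_id = o.order_id AND d.item_id = o.item_id
    WHERE d.order_id IS NULL
""").fetchall()
print(undelivered)  # [(2, 10)]
```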
Dimensional modeling has strengths and weaknesses. There are three ways to improve performance: use better hardware, use better software and optimize the data. Dimensional modeling uses the third method. The primary justification for dimensional modeling is to improve performance by compromising the data to compensate for the inefficiency of technology. A secondary purpose is to provide a consistent base for analysis. Dimensional modeling comes with a price and with restrictions. There are times and places where dimensional modeling is appropriate and will work, and other times and places where it is inappropriate and will actually interfere with the goals of a warehouse.
Placement of the Dimensional Model
Figure 10 summarizes the placement of logical, physical and dimensional data models in the overall spectrum of development.
Figure 10: Placement of the Dimensional, Logical and Physical Models
As you can see, the logical (or ER) model and dimensional model do not cover the same problem domain. The ER model is a logical model and has as its purpose to represent the business. The dimensional model is a physical model and has as its purpose to be an efficient design. A judgmental comparison of the two is impertinent. In fact, it is invalid. People should stop doing it. Writers should stop writing about it. Vendors should stop selling it.
Applicability of the Dimensional Model
The dimensional model is very appropriate to certain components of the data warehouse environment. Specifically, it is appropriate for most forms of data marts. Embedded data marts, which are usually stored aggregates, are inherently dimensional in structure. Certain structures within the enterprise data warehouse (EDW) are also amenable to the dimensional model. These are driven by query needs. For example, if querying would regularly require joining data across many tables and rolling that up as well, a prejoined dimensional structure would be useful.
Where do the various models fit into the overall data warehouse environment? The dimensional model is appropriate: for most forms of data marts, except where data mining, case reasoning and the like are involved; when the platform, even for the central data warehouse, cannot support a more normalized model; and for aggregates and summaries, which are inherently dimensional in structure.
The normalized model is more appropriate: for the central data warehouse; when the platform is capable of sustaining it; when the data is of general purpose; and where pure ad hoc must be supported.
Technology has a significant effect on the choice of data model. One must admit that a dimensional model is a conscious compromise to improve query performance. Suppose that technology were not an issue. Suppose that the technology was so fast it could do a 27-table-deep join and still give satisfactory performance. This is a real example as related by Al Messerli, formerly CIO of 3M, who performed it successfully on the Teradata platform. However, if one does not have such technology, then care must be taken. While I still believe the central DW model need not be a fully dimensional model, neither should it be a fully normalized model. All models should be optimized based on the requirements of the business.
To make matters worse, the data warehouse industry suffers from a host of double meanings that make it difficult to communicate meaningfully. It is not uncommon for two gurus to disagree about something without realizing that they are not talking about the same thing. Because of this, it is actually necessary to start over and define some terms.
In my opinion, neither Bill Inmon nor Ralph Kimball uses the term "ER model" correctly.2 Kimball often uses the term "ER Model" as synonymous with a normalized, operational model. Additionally, Kimball presumes that ER models are built as is; however, they are not. ER models are converted to physical models and the physical models are built. An ER model can be used to represent any business problem domain, whether it is analytical or operational. In addition, an ER model can be logical or physical. Further, any logical model can be denormalized when it is implemented. Therefore, comparing an ER model to a dimensional model is not comparing apples to apples. Such a comparison is inappropriate because it assumes that the logical ER model is also the physical model. It implies that a purely logical model undergoes no transformation when being implemented. Nothing could be further from the truth. In fact, a dimensional model is a physical ER model that has been optimized to a specific pattern, namely, the pre-classification of data as facts or dimensions and in some cases, the flattening of related dimensions.
In addition, it is my opinion that the term logical model is often used incorrectly to mean an operational logical model. Some people contrast the warehouse model with the logical model. This is incorrect, in my opinion. Again, even an analytical problem can be represented with a logical model, consisting only of analytical data. In reality, a logical model is distinct from a physical model. The logical model represents the business information needs of the organization, independent of implementation. A physical model represents that which will be or is implemented, and is optimized.
The World Is Not A Star
The world makes up its own patterns, driven by chaos theory.
The dimensional model is based on the principle that knowledge consists of structurally associating dimensions with facts. The premise for this, like it or not, is that the questions are reasonably predetermined. It is also based on the belief that compressing data yields data that performs better and is more user-friendly. Frankly, this is derived from an old database design strategy: make the data look like the report. In this author's direct experience with many data warehouses, the exclusive reliance on star schemas will unduly hamper application scalability, cross-functionality and long-term expandability. In addition, using a strict dimensional model in some environments, such as MPP, would completely squander the power of that resource. Many queries do not fit into the category of predefined facts and dimensions, such as: Give me the most significant characteristics of buyers placing complete orders of $100 or more.
Two Different Psyches
There is an old proverb: Give a man a fish and you feed him for a day. Teach a man to fish and you feed him for life. This proverb has direct applicability to dimensional modeling versus data modeling. The dimensional modeler and data modeler have two different mind-sets.
Dimensional modelers have been taught to rely on patterns such as stars and snowflakes. They make design decisions based on those patterns. If something does not fit the pattern, they either disallow it or invent another pattern to cover it. For example, data is classified as a fact or dimension. When queries join across two facts, this is a fact-to-fact join. When a fact has no counts or amounts, it is called a factless fact table. Dimensional models are stars or snowflakes. When multivalued dimensions violate the star pattern, "helper" tables are added to rationalize the star. Real-life business, even reporting and analysis, does not fit into set patterns. It creates its own character and needs. The world is not a star.
The (ER) normalized modelers have been taught to rely on principles. The data of an organization, even the analytical data, has inherent relationships based on the rules and requirements of an organization. They make decisions by applying these principles. They solve the problem by analyzing the need and applying the principles. For example, customer data is identified by Customer ID. If customer data changes over time, then it cannot be dependent solely on Customer ID. It must be dependent on Customer ID and Date. The principle of 1NF is applied. If products come in different colors, then color should be dependent on Product, not Sales Fact. It should be an attribute of Product. Functional dependency tells us this. When the relationship between two entities is many-to-many, an associative entity is created to resolve it.
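The Customer ID and Date principle can be shown concretely. In this hypothetical sketch (SQLite from Python; names and data are invented), customer attributes that change over time are keyed on both the identifier and an effective date:

```python
import sqlite3

# If customer attributes change over time, they cannot depend on
# Customer ID alone; the key must include a date as well.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE customer_history (
        customer_id    INTEGER,
        effective_date TEXT,
        credit_rating  TEXT,
        PRIMARY KEY (customer_id, effective_date)
    )
""")
con.executemany(
    "INSERT INTO customer_history VALUES (?, ?, ?)",
    [(100, '2001-01-01', 'A'), (100, '2002-06-01', 'B')],
)

# The rating in force on a given date is the latest effective row
# on or before that date.
rating = con.execute("""
    SELECT credit_rating FROM customer_history
    WHERE customer_id = 100 AND effective_date <= '2002-01-15'
    ORDER BY effective_date DESC LIMIT 1
""").fetchone()
print(rating)  # ('A',)
```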
As might be expected, both data modeling and dimensional modeling have their extremes. The dimensional modeler passionately rejects things that do not fit into his patterns. By not accepting (or even trying to understand) basic data management principles, he is throwing the baby out with the bath water. On the other hand, the classical data modeler has often forced things into needless abstraction or needless decomposition. The data modeler has enforced the rule of atomic data with excessive rigidity and with blinders on. This has obstructed the whole progress of data management from recognizing the separate and legitimate existence of analytical data for more than a decade.
Next month, in part two of this article, we will deal with some key issues in analytical modeling including: surrogate versus natural keys, granularity, functional dependency, changing dimensions and history, wide versus skinny dimension tables, separating data based on data usage and helper tables.
1. The platform used was IBM RS6000 with S80s and UDB EEE.
2. There is a common misconception among data warehousing authors such as Ralph Kimball and Claudia Imhoff that Dr. Codd and Chris Date invented ER modeling. They did not. In fact, they did not even like it because it did not adequately address integrity rules. ER modeling was formalized by Peter Chen, extended by Charlie Bachman and popularized by Clive Finkelstein and others.