In my November, December and January columns and the data quality article published in the October issue of DM Review, I laid a solid foundation for building a data warehouse. We have done a careful job of gathering all the overlapping design constraints; we have established a good set of boundaries with all the groups we interact with; we have captured a perfect subset of changed data to feed our data extraction; and we have built a powerful architecture for cleansing the data once it is under our control.
Our next big task is to divide the data into dimensions and facts. We call these designs dimensional models. Dimensions are the basic stable entities in our environment, such as customers, products, locations, marketing promotions and calendars. Facts are the numeric measurements or observations gathered by all of our transaction processing systems and other systems. End users instinctively understand the difference between dimensions and facts. When we deliver data to our business intelligence (BI) tools, we take great care to make dimensions and facts visible at the user-interface level in order to exploit the users understanding and familiarity with these concepts. Perhaps another way to say this is the dimensional data warehouse is the platform for BI.
Dimensions and facts drive the user-interface experience in the BI tools. Dimensions are overwhelmingly the target for constraints and the source of row headers in the BI tool results. Facts are overwhelmingly the grist for computations. Separating the dimensions and facts structurally in the data is very helpful because it encourages consistency in application development and the BI tool user interfaces.
Dividing the world of data into dimensions and facts is a fundamental and powerful idea. Ninety-eight percent of all data items are immediately and obviously categorized as one or the other. Discrete textual data items that describe the attributes of our stable entities belong to dimensions. Repeated numeric measurements whose values are not fully predictable are facts. Thus, if we sell a red magic marker for $1.79, then red is an attribute in the magic marker row in the product dimension, and $1.79 is an observed fact.
The foundation of the data warehouse is the measurement event that produces a fact record. This is a very physical, tangible result. A fact record exists if and only if a measurement event takes place. This physical result is used by the data warehouse designer to make sure that the design sits on solid rock. When we describe the measurement in physical, real-world terms, we call this the grain of the fact table. If you are quite sure of the grain, you will have a relatively easy time designing the fact table. That is why I keep telling students to keep to the grain.
When a measurement event creates a fact record, we scramble to attach contemporary versions of all the relevant dimensional entities to this fact record. When we sell the red magic marker for $1.79, the flash bulb goes off, and from this snapshot we assemble an impressive set of dimensional entities, including customer, product, store location, employee (cashier), employee (store manager), marketing promotion, calendar and maybe even the weather. We are careful to use up-to-date versions of the dimensions so that we are describing this sales measurement event correctly. Notice that the grain of this measurement is the cash register beep when the item is scanned. Later in the design process, we implement this grain with various foreign keys connecting to the dimensions, but we dont start the design process with the keys. We start with the physical event.
Once we have the grain of the fact table firmly in mind, we make sure that the only facts introduced into our fact records are defined by the scope of the measurement event. In our cash register example, the instantaneous price of the product and the number of units sold are good facts that are true to the grain. But total sales for the month or the sales on the same day last year are not true to the grain and must not be included in the physical fact record. Sometimes it is hard to resist adding facts that are not true to the grain because they provide a shortcut for a specific query, but these rogue facts always introduce complexities, asymmetries and confusion for the application developer and the end user. Once again: keep to the grain.
Whenever possible, we strive to make facts additive. In other words, it makes sense to add the fact across records. In our retail sales example, although the price is true to the grain, it is not additive. But if we instead store the extended price and the quantity sold, then both these facts are additive. We can instantaneously recover the unit price with a simple division. Forcing facts to be additive whenever possible seems like a small point, but it is one of the many ways we make our BI platform simple. Like the famous Japanese car example of quality, a thousand little improvements eventually becomes a sustainable strategic advantage. Conversely, a thousand little gadgets shoehorned into a database to make certain queries simpler will produce an unworkable, unmaintainable design.
In a similar vein, we resist taking normalized data models all the way into the BI environment. Normalized data models are essential for efficient transaction processing, and they are helpful for storing data after the data has been cleansed. But normalized models are not understandable by end users. Before you lapse into religious wars with your colleagues, please recognize that when correctly designed, normalized models and dimensional models contain exactly the same data and reflect exactly the same business rules. Any and all data relationships can be accurately represented using either methodology. Thus, the reason for using dimensional models is that they form a proven, workable basis for BI.
Earlier in this column, I stated that 98 percent of all data items are immediately and obviously categorized as either a fact or a dimension attribute. What about the remaining 2 percent? Perhaps you have been thinking that in our retail sales example the price of the product should actually be in the product dimension, not in the fact table. In my opinion, upon a little reflection, this is an easy choice. Because the price of a product often varies over time and over location, it becomes very cumbersome to model the price as a dimension attribute. It should be a fact. But it is normal to recognize this rather late in the design process.
A more ambiguous example is the limit on coverage within an automobile insurance policy. The limit is a numerical data item, perhaps $300,000 for collision liability. The limit may not change over the life of the policy, or it changes very infrequently. Furthermore, many queries would group or constrain on this limit data item. This sounds like a slam dunk for the limit being an attribute of the coverage dimension. But, the limit is a numeric observation, and it can change over time, albeit slowly. One could pose some important queries summing or averaging all the limits on many policies and coverages. This sounds like a slam dunk for the limit being a numeric fact in a fact table.
Rather than agonizing over the dimension versus fact choice, simply model it both ways! Include the limit in the coverage dimension so that it participates in the usual way as a target for constraints and the content for row headers, but also put the limit in the fact table so it can participate in the usual way within complex computations.
This example allows me to summarize this months column with an important principle: Your design goal is ease of use, not elegance. In the final step of building dimensional models that are intended for consumption by end users, we should be willing to stand on our heads to make our BI systems understandable and fast. That often means transferring work into the extract, transform and load back room and tolerating more storage overhead in order to simplify the final data presentation.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access