In Part 1 of this article (see the March issue of DM Review), we established basic and essential definitions for analytical modeling. In this second part, we will address a number of issues critical to analytical modeling of real applications. Along the way, we will discuss how they are handled in logical modeling, physical design and dimensional design. (Figures 1-10 are in Part 1 of the article.)
Surrogate Keys versus Natural Keys
Surrogate keys are often touted to be a means to database (DB) performance. A surrogate key is a single immutable, non-intelligent identifier used to substitute for the natural key, especially (but not only) a compound natural key.
Surrogate keys should not be confused with artificial logical keys. It is very difficult to find natural keys for primary entities such as Customer, Employee and Vendor. Therefore, we create artificial keys to make the entity unique, such as Customer ID, Employee ID and Vendor ID. These IDs are not surrogate keys because they have a business usage. "What is your customer ID?" "Please enter you Employee ID followed by the # sign." "Where is the payment for Vendor ID 12345?" In addition, these artificial keys may not be immutable over the lifetime of the entity. Surrogate keys are totally artificial and hidden, and are for system use.
For primary entities such as Customer, surrogate keys allow us to retain history. The Customer data would include Customer SK, Customer ID, Customer Name, etc.
Let us look at Order Line Item as an associative entity example. The natural compound key for Order Item could be Order Number, Product ID. A surrogate key for Order Line Item could be Order Key (integer) used to substitute for the compound key of Order Number, Product ID (see Figure 11). Surrogate keys can be used in any physical model. Database management system (DBMS) optimizers generally treat a surrogate key more efficiently than a compound natural key. Retrieving data and joining tables is more efficient with a surrogate key than with a compound natural key. However, a simple numeric natural key will perform equally to a surrogate key.
Figure 11: Natural Order Item Key and Surrogate Order Item Key
Surrogate keys introduce several complexities not found in natural keys. For example, whereas a natural key will ensure the integrity of the data, a surrogate key requires that integrity be separately enforced. In our example, the natural key of Order Number, Product ID will inherently ensure that a product appears in an order only once. If we use Order Key as the primary key, we have to use an integrity routine to enforce that rule. Other questions arise: Should the affected table include both the surrogate key and the natural key, thereby increasing storage requirements? If you remove the natural keys, do you add a mapping table? How much more complex is it if you query by the natural key but join via the surrogate key?
In a data warehouse (DW) environment, the use of surrogate keys is not a simple decision. It affects the full spectrum of DW processes, namely, the gather, store and deliver processes. The gather process is affected because the keys must be generated and controlled. The store process is affected because the DB must store the surrogate keys and then ensure the integrity of the data via integrity modules. The deliver process is affected because the reporting and analysis must recognize the occurrence of histories and know how to use them.
In summary, logical data models use natural compound keys to help reflect the identity of the data and to ensure integrity. The model itself and its keys inherently ensure this. In a physical data model, this is often replaced with a surrogate key for various reasons.
The main reasons to use surrogate keys are to:
- Save storage by replacing a compound natural key,
- Insulate the table from operational vagaries and
- Allow infinite granularity.
Pros and Cons of Surrogate Keys
- Reduced space requirements by replacing a large combined primary key either in parent or child entities.
- Isolation of the data from operational changes.
- Ability to have multiple instances against a given entity.
- Speed of access when the optimizer uses a simple, numeric index.
- Possible extra storage requirements due to an extra column (if both natural and surrogate keys are retained).
- Integrity must be enforced some other way.
- Still need access to the data using natural keys.
- May have to supplement surrogate keys with a mapping table correlating surrogate key and its equivalent natural key.
In the analytical environment, it is quite common that the natural keys of facts and of historical dimensions become very large. Surrogate keys can be used to address this. The surrogate key allows the natural key, consisting of many parts, to be replaced by a single identifier. This identifier could be used as a foreign key in other relationships. For example, say we need a cost of goods sold (COGS), and that COGS is determined by Product ID, Location ID and Manufacture Method, and that it varies over time. The natural key is a combination of these attributes: Product ID, Location ID, Manufacture Method Code, Date. We keep these attributes in the table as non-primary-key-foreign keys, but use the surrogate key as the table key, and thereby as a foreign key in other relationships. In Figure 12, the model on the left shows the extra space consumed by natural keys; the model on the right shows how less space is consumed by the surrogate key.
Figure 12: Natural Key versus Surrogate Key
Granularity is very important in any model, but especially the analytical model. Granularity is defined as the level of detail of the data. Proper understanding of granularity is essential to understanding several of the topics to be discussed later. The more granular the data, the more you can do with it. Excessive granularity brings needless space consumption and increases complexity. You can always aggregate details; you cannot decompose aggregates.
Granularity is not absolute or universal across industries. This has two implications. First, grains are not predetermined; second, what is granular to one business may be summarized for another.
For example, we cannot say that a given grain (e.g., the transaction) is necessary in all data warehouses. What is granular from one perspective may not be granular from another perspective. However, most businesses will benefit by some general base grains. The two primary grains in analytical modeling are the transaction and the periodic snapshot. The transaction will serve most businesses such as selling, direct marketing, brokerage and credit card. The periodic snapshot will serve periodic balance businesses, such as insurance and banking. (This does not imply that insurance and banking would not benefit by also keeping the transaction.)
Granularity is determined by the number of parts to a key and the granularity of those parts. Adding elements to an existing key always increases the granularity of the data; removing any part of an existing key decreases its granularity. Using Customer Sales to illustrate this, a key of Customer ID and Period ID is less granular than Customer ID, Product ID and Period ID. Say we need to keep Vehicles Produced as a measure for an auto manufacturer. A key of Vehicle Make, Factory ID, Engine Size and Date, is less granular than a key of Vehicle Make, Factory ID, Engine Size, Emission Type and Date.
Granularity is also determined by the inherent cardinality of the entities participating in the primary key. Take Organization, Customer and Order as examples. There are more Orders than Customers, and more Customers than Organizations. Data at the Order level is more granular than data at the Customer level, and data at the Customer level is more granular than data at the Organization level.
Changing dimensions are a fact of life. They are a business characteristic and requirement. Over time, Product attributes change, as do Customer characteristics. Changes, whether rapid or slow, hot or cold, fat or skinny, are well modeled in the entity-relationship (ER) model. The concept of granularity and normalization allow this to happen. One should also assess the value of surrogate keys in allowing you to support changing dimensions.
In my opinion, changes to dimensions are not the big deal dimensional modelers often make of them. For example, Ralph Kimball has said: "The slowly changing dimension approach remains completely consistent with the basic structure of dimensional modeling, in which the fact table of measurements is surrounded by a set of dimension tables, all connected by a foreign key: primary key relationships. The entity-relation modeling approach does not specifically address slowly changing entities."
This last statement is genuinely entertaining. Logical modeling does not need to specifically spell out how to handle changing dimensions because it has established the fundamental principles to deal with them. As an example, an ER modeler encounters a situation where, for a given entity, the attributes vary over time and he needs to keep track of the changes. Customer, for example, is identified by Customer ID with Dominant Vehicle Lifestyle Indicator and Networth as attributes. Dominant Vehicle Lifestyle Indicator classifies a household according to its most expensive vehicle; Networth defines the total liquid assets of the household. Dominant Vehicle Lifestyle Indicator and Networth vary over time. Given these rules, no ER modeler on the planet would knowingly fail to move these into a separate entity and to choose an appropriate natural identifier, such as Customer ID and Date. The reason this can be said is that this concept of slowly changing dimensions is merely an application of first normal form (1NF), which is based on functional dependency.
The rule for 1NF is simple: when an attribute (or group of attributes) occurs a variable number of times within an entity, remove the repeating data to a separate entity, bring along the key of the entity you took it from and combine this with some additional attribute to provide uniqueness and granularity -- thereby the identifier of Customer ID, Date.
Returning to our example, given a Customer, identified by Customer ID and a set of attributes apparently dependent on Customer ID, if we discover that these attributes change over time, then they are not functionally dependent on Customer ID. Instead, they depend on Customer ID and something else, such as Date.
Data modelers know how to do this because they know the modeling rules. In fact, every practicing ER modeler probably applied 1NF on the first models they ever built.
Whether a surrogate key or natural key is used is a design question, not a logical modeling question. In summary, a data modeler would resolve this matter the same way a dimensional modeler does it -- just with a little less fanfare.
Dimensions Changing At Different Rates
Whenever groups of data change at different rates, it is advisable to ask the question: Should I separate them? If the data forms into groups that tend to change at different times from another group and the groups tend to be used together, then there is a benefit in splitting them into different tables. If some customer data is immutable, other data changes infrequently and still other data changes very frequently, it is advisable to consider separating the data into three tables. In effect, from one table we are creating several tables that have a one-to-one relationship (see Figure 13). Combining and splitting tables is a form of a safe trade-off, covered in Part 1 of this article.
Figure 13: Splitting Data According to Stability
In a similar fashion, a modeler should be as much concerned about whether a dimension is fat or skinny as he/she is about whether it changes slowly or rapidly. This is implicit in data design principles.
A table that is fat or wide will have different performance than the same data split into several tables. Data must be split based on data usage. Data that is used together is kept together.
The reason for this is simple. A direct-access storage device (DASD) is divided into sectors for physical storage. DBMSs group rows into pages that fit into these sectors. The skinnier a record is, the more records the DBMS can fit in a page. DBMSs try to read one or more pages at a time. The more records there are in a page, the more data the DBMS can bring in with a single I/O.
ER modeling does not tell you to do this because it is not a logical modeling issue, it is a design issue. Good database design, based on the principles described earlier, has no trouble applying the principles in dealing with matters such as this.
To illustrate this, a recent test revealed that splitting a fat dimension into several smaller dimensions improved performance, and the snowflake schema with its separated dimensions performed slightly better than the star schema with its wide flattened dimensions.
Rapidly Changing Dimensions
Dimensional modelers often distinguish slowly from rapidly changing dimensions. Speaking of these, Kimball has said, "If the behavior metrics [of rapidly changing dimensions] are not textual and discrete, but are numeric and continuous, then you should place them in a fact table."
Let us think about this remark while examining a real-life example. In this example, our customers are mostly wholesale customers, such as stores. When a customer's sales exceed a defined threshold, they can be assigned a national account status (NAS). NAS implies that a customer will receive special discounted pricing and special merchandising assistance. This change in status can happen often. We need to be able to deal with this in our model. Therefore, we bring in our favorite dimensional consultant, who says, "This is a rapidly changing dimension. Put it in the fact table." Our consultant also tells us not to include a code table for status (which is a dimension) because the table would have two values -- national and non-national status.
We do this. We have approximately 150 NAS customers at any one time; we have approximately one billion rows in our sales fact table. We use Redbrick. One key characteristic of Redbrick is that it will not allow indices on non-dimension columns. NAS is a non-dimensional column, as our consultant advised, so we cannot index it. There are two things our consultant forgot to consider. First, we do not look for customer NAS only when querying sales facts. We often do so in and of itself, or as a subselect within a query. This is a big problem. Not being indexed, the DBMS must do a table scan to get the result. In effect, we scan 1 billion records to get a 150-row result set! Second, our consultant did not consider functional dependency, which would have immediately indicated that NAS was logically dependent on Customer ID + Date, and not the sales fact at all.
Using functional dependency, the logical model for Customer would be very simple as a changing dimension (see Figure 14). This yields a simple rule of thumb: Think twice about putting anything extraneous in a fact table because major fact tables are large. Consider the importance of table and column size and data usage. There can be dire consequences when tampering with a fact table.
Figure 14: Logical Model for Customer
The fundamental basis for classifying data is functional dependency. Functional dependency means that the attributes within a given entity are fully dependent on the entire primary key of the entity -- no more, no less. An attribute should depend on the key, the whole key and nothing but the key ...
The implication is that each attribute has a one-to-one relationship with its primary key. For each instance of a primary key, there is one set of applicable values; for each attribute, there is only one primary key that properly gives it its identity. For example, in the customer entity, Customer Name, Date of Birth and Social Security Number all belong to the primary key of Customer ID. Conversely, if a customer can have different addresses, each with a different role (but only one role), then the customer address depends on Customer ID + Role.
One implication of functional dependency is non-redundancy. If the columns within each table contain attributes with proper functional dependency, then the simplest way to apply a change to any column is to rely on the functional dependency. Each change will be applied to one attribute in one place. Therefore, if the customer name changes, we go to the customer row and change the name. If new customer address is used as the mailing address, we go to Customer Address Role and change it there. Functional dependency is not irrelevant in analytical models.
This yields a new rule of thumb: Honor functional dependencies first, then sensibly compromise if necessary. In compromising, consider the full context of data usage. The consultant in the previous example did not consider the full usage of the data; he only considered one usage -- namely, if you are looking at sales facts, you can simultaneously look at status. Consider also the full gather-store-deliver cycle in the DW environment. Don't look just at a limited set of queries. What effect will the change have on our batch window? If we enforce functional dependency in our example, we need to update only one row in one table. If we take our consultant's advice, we are adding a column to the load of a very large fact table.
Dimensional modelers have developed the concept of helper tables. helper tables usually take one of two forms: helper tables for multivalued dimensions or helper tables for complex hierarchies. It is easiest to explain these with two simple examples.
Consider the example where a household can own many insurance policies, yet any policy could be owned by multiple households. The simple approach to this is the traditional resolution of the many-to-many relationship, called an associative entity. Traditional data modeling has never had any difficulty in dealing with this, and that is all this is. The traditional way to resolve a many-to-many relationship is to create an associative entity whose key is formed from the keys of each participating parent (see Figure 15). Here, a new entity is formed, called Household-Policy, with a key of Household ID, Policy ID. To complete the picture, there would probably be additional identifying attributes, such as Role Code because a given household could play different roles for different policies. A household could own one policy, but be a beneficiary of another, a trustee of a third, and so forth. It doesn't matter whether or not dimensional modelers approve of this; the business rules require it. This helper table is merely an associative entity, and there is absolutely nothing special about it. That does not mean it is not important; it is. However, it does confirm that the world is not a star.
Figure 15: Multivalued Dimensions Helper Table
A hierarchy is a tree structure, such as an organization chart. Hierarchies can involve some form of recursive relationship. Recursive relationships come in two forms: self-relationships (one-to-many) and bill of materials (BOM) relationships (many-to-many). A self-relationship involves one table; a bill of materials relationship involves two (See Figure 16).
Figure 16: Complex Hierarchies Helper Tables
These structures have been handled in ER modeling from its inception. They are generally supported as a self-relating entity or as an associative entity in which an associate entity (containing parent-child relationships) has dual relationships to a primary entity.
Physically, a self-relationship is called a self-join. As examples, an employee can manage other employees, a customer can roll up into different customer groups or an organization can be broken into different organization units. These are actually one-to-many relationships of one instance of an entity with other instances of the same entity.
The second of these is generally called a bill of materials structure because that is where it was initially applied. A typical application of these concepts in analytical modeling is to support complex hierarchies. Hierarchies can be used for summing data to different levels.
A part can consist of other parts; a part can go into other parts. The two methods of traversing these hierarchies are explosion and implosion. Explosion navigates from a parent to all its children, such as: What parts make up this Product? Implosion navigates from any child to all of its parents, such as: In what products is this part used?
The problem with recursive structures is not how to model them; data modeling has always done a fine job of that. The problem is obtaining good performance from them. Let us examine each in turn.
The self-join is actually a very competent way to handle this hierarchy requirement even in a DW environment. To allow rapid retrieval, the self-join is usually supplemented with indices on either the child key or the parent key, or both, depending on usage. Recent tests conducted on IBM's UDB and Teradata have indicated that the self-join has acceptable performance. (Please note: physical tests are dependent on the platform.)
The BOM or many-to-many recursive structure is another matter. Other tests have revealed that this does not perform as well - at least using the recursive table function of IBM's UDB EEE. It performs somewhat slower than a self-join. However, there are ways to work around this. One way is to make the BOM into a self-join, keyed on child (or parent) and redundantly carry the ancestor (or descendent) data.
Kimball describes another way as shown in Figure 17. This approach has been called Descendent Tables by the data management industry for some time now. In his article, "Help for Complex Hierarchies" (DBMS -- September 1998), Kimball has an example that will work efficiently. Like all descendent tables, it requires populating all paths from every parent to every child. The traditional hierarchy populates only from every parent to its immediate children. There is one potential problem with the descendent table solution. Because we store every path from every parent to every child, it is possible that in attempting to navigate the hierarchy, some BI tools will redundantly count the data. Some other tools will not be able to use it because they require the data in a different format (such as flattened). This possibility is entirely dependent on the BI tool.
Figure 17: Kimball Example - Descendent Tables Source: DBMS - September 1998
In structured query language (SQL), descendent tables will give the required roll-up results but will not provide an orderly presentation of the entire dimension in correct order by parent and child, as in a kind of outline (top to bottom, left to right).
You may use a universal sequence number that runs throughout the entire dimension hierarchy, from beginning to end, to keep all the parent-child relationships in a given order. In relational technology, tables are not ordered. With a universal sequence number, the structure would be very efficient and could be navigated top to bottom, left to right (or bottom to top), and still provide acceptable performance. It is important to observe that both of these concepts are applications of data modeling principles that have been around for decades.
It is worth looking at Kimball's example for another reason: it shows a many-to-many relationship between the customer and sales entity. It points the customer number in the helper table with the purchasing customer number in the fact. In fact, the model he shows is not a data model at all, but a partial access path map. There really is no many-to-many relationship between Customer-Customer (the helper table) and Sales in any true data modeling sense. The many-to-many relationshiop is between Customer and Customer-Customer (the helper table) and back to Customer. A couple of things can clarify this. In data modeling, when a one-to-many relationship is drawn, the key from the one-end serves as a foreign key at the many-end. It is a deviation from data modeling rules to draw relationships only where columns are joined. A transaction map in an access path can do this, but not a data model. This is another example of what I mean when I say that too often, dimensional modelers do not understand basic data modeling principles. The proper model for this structure, without changing any data or requirements, is shown in Figure 18.
Figure 18 Proper Model
Symmetrical and Asymmetrical Hierarchies
Hierarchies come in two forms: symmetrical and asymmetrical (or ragged). Ragged hierarchies are common in most businesses, especially in Product and Organizational structures. A ragged hierarchy is a hierarchy in which not all child entries have parents at the same level. Ragged hierarchies are asymmetrical. For example, one Product could roll up into Product Sub-Group and then Product Group, and another could roll up into Product Group and Product Family. Ragged hierarchies are quite easy to support in a recursive relational structure, such as a BOM-like structure. The trouble is that recursive structures do not perform well in a relational environment.
Not all levels in a dimensional hierarchy need to be balanced. An account executive, for example, might report to a Market Unit or a Business Unit. In a star schema, a dimension table usually requires fixed columns for each level. Because of this, the star schema has considerable difficulty handling unbalanced or ragged hierarchies. BI tools circumvent this with code.
On the other hand, a pure dimensional model -- especially a star schema -- has great difficulty in supporting ragged hierarchies because the star schema flattens out the hierarchy and it is difficult to store and interpret the levels. Online analytical processing (OLAP) tools circumvent this, but it is the tool that circumvents it with rules and code, not the dimensional structure at all.
Because the star schema creates specific columns to handle each level, it requires that unbalanced hierarchies effectively be balanced so that the account executive, for example, is always at a specific level.
The data management industry should be neither for nor against dimensional modeling. In solving problems, we should not be asking, "Should we be using an ER or dimensional model?" Each has its place, but they are not competing for the same space. Remember, the dimensional model is a physical model applicable to some situations; it is not the only way to solve an analytical problem.
The important thing is to know data management and database design principles and how to apply them. Do not rely solely on patterns and metaphors.
One should be against the judgmental comparison of ER modeling and dimensional modeling because they address different problem domains. One should be against those dimensional modelers who show little knowledge of fundamental data management principles, pontificating to the world that it is acceptable to violate those principles. One should also be against traditional data modelers who always want to force the modeling into a given level of decomposition or abstraction without practical regard to the problem at hand. Data modeling is not an end in itself. The data modeler must learn to recognize the different requirements of analytical data.
In Part 1 of this article, we quoted an old expression: "Don't knock it 'til you've tried it." Here is a slight revision to this slogan: "Don't knock knocking it until you have tried trying it!"
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access