The dimensional model has been widely accepted as a means of modeling certain types of query and reporting requirements, especially in a data mart environment. A dimensional model is a data model in which data is preclassified as a fact or dimension. Its most notable structure is the star schema.
The dimensional model distinguishes facts from dimensions. A fact is a measure of the business, such as sales volume or sales revenue. A dimension is a reference or master data table that describes, identifies or interrelates facts. The typical relationship between dimension levels or between facts and dimensions is many-to-one (M:1). However, some requirements can not be reduced to a M:1, because they inherently involve a many-to-many (M:M) structure. Dimensional modeling has defined two overall patterns for structuring informational data, the star and the snowflake schema.
This article will take a closer look at the M:M in a dimensional model. The relationship can be across a multileveled dimension or between a dimension and the fact it describes. There are essentially four ways to support such a M:M relationship. Each of the solutions is workable but each has its own set of pros and cons. Each involves its own unique trade-offs. Trade-off is the name of the game at this stage, because no matter what solution you choose, you will be compromising one feature in favor of another. One feature will become an advantage, the other a disadvantage. The four solutions are: arrays, groups, bridges and fixed structures. Before delving further into them a discussion of the basic dimensional model structures is required.
The basis for the dimensional model is the phenomenon that many queries can be subdivided into fact and dimension. For example, provide the breakdown of revenue and volume by customer, product and week. A host of other queries do not benefit from this distinction, however. For example, what two products are sold together, most often and when? What are the most important characteristics of customers that are about to lapse? What top 10 customers should I call this week?
A star schema is a structure with a fact table surrounded by a single perimeter of dimensions, wherein any dimension that is multileveled is flattened into a single dimension. There are cases where the star schema is just not appropriate, however, because it is not suitable for the business requirements or for query performance.
Dimensions in a Star
The typical relationship between dimensions and facts in a dimensional model is one-to-many (1:M) - a dimension that applies to many facts. A product can appear in multiple purchase orders, or a customer can place multiple orders. Dimensions often consist of multiple summary levels. When a multileveled dimension is flattened, there is an implied internal structure to the dimension. A common relationship is for the dimension grain to be at the lowest level with the higher levels in order above it. The relationship between lower and higher levels is M:1. Figure 2 illustrates an example. A product group consists of multiple product sub-groups and a sub-group consists of multiple products.
One exception to the star schema occurs when the relationship across dimensions is more complex and it is useful or necessary to open the structure up. Usually this happens if the dimensions do not have a strict 1:M relationship. It can also happen if the dimension in question is very wide in that it has a large number of attributes. It could be best to split them out. This is often called vertical partitioning. The dimensional variation for this is called the snowflake schema.
A snowflake is a dimensional structure in which a central fact is surrounded by a perimeter of dimensions and at least one of its dimensions keeps its dimension levels separate. The snowflake structure is more flexible than the star. One reason to snowflake is to change a complex or large dimension into a more manageable set of smaller dimensions. The snowflake structure will reduce batch updates to dimensions. Though always said to be slower than a star, some tests have revealed no difference in performance between flattened and snowflaked dimensions. In fact in some cases, the snowflake provides superior performance, such as when a wide dimension (i.e., customer) is segmented into a snowflake. Consider Figure 4 for example. There are 300 attributes in a customer dimension and the attributes naturally fall into three classes: base customer, customer demographic and customer financial attributes. Though some people, such as Nick Gallemo, have objected to calling this a snowflake, Ralph Kimball has identified it as one of three allowable cases for a snowflake.
More Complex Dimension Structures
Another example occurs where there is a more complex relationship across dimensions or between facts and their dimensions. Consider the following situations as examples: In banking, a customer can have different accounts, and an account can belong to different customers. In insurance, a customer (or household) can have different policies, but a policy can support multiple customers (or households). In many businesses, a parent customer can have subsidiaries. Such business situations do not fit into the mold of a star schema, because the relationship across the data can not readily be flattened out into a 1: M without some loss of meaning. Dimensional modelers have come up with a term to justify the support for these requirements - bridge tables. A bridge table is a table that supports a M:M relationship across dimension levels or between a fact and its dimensions.
Dimensional modelers usually restrict the term snowflake for structures where the relationship of a dimension member is M:1, such as the Product, Product sub-Group or Product Group. However, there is no inherent reason to restrict the term. In this article, I consider a snowflake to include any dimensional structure in which the relationship across dimensions members is M:M or the relationship between fact and dimension is M:M. Dimensional modelers often describe these structures in terms such as helper tables, bridge tables or factless fact tables to describe different varieties of this. In reality, these are simply complex snowflakes that dimensional modelers have chosen to name to allow them to fit in the dimensional pattern.
In resolving any M:M relationship, a new entity is formed, called an associative entity. The key to the new entity is the key of all participating parents. For the associative entity, all parents must exist. The cardinality of the new entity is determined by the original cardinalities in the M:M, as illustrated in Figure 5.
The most fundamental solution in most of these cases is to snowflake the model with an associative entity. This is an acceptable and natural data modeling solution. In fact, where the relationship across dimension members or the relationship between a fact and its dimension is M:M, snowflaking the logical model is not only an acceptable solution but the required solution. However, because logical models are converted to physical models, there are different ways to implement a design for this situation.
For the following discussion to make sense, a few terms have to be put in place. This article will address four terms in particular: hierarchy, network, regular hierarchy and ragged hierarchy.
A regular hierarchy is one of fixed depth. In it, a parent at any level has the same number of child levels below it or parent levels above it. An example of this is a calendar. The ragged hierarchy is an important reporting concept. A ragged hierarchy is a structure in which an entry at any level can have different levels of children below it or parents above it. One entry could have seven levels below it, another only five. In any hierarchy, the levels could be named or not. In a named hierarchy, all levels have a designated level number and description. Say for example that the levels are Product, Product Sub Group and Product Group. Each entry is assigned either a Product level name, a Subgroup name or a Group name in addition to the name of the entry at each level.
Technically, this ragged structure does not have to be a hierarchy. A hierarchy is a graded or ranked series. An example is a government hierarchy. The relationship of members in a hierarchy is often called parent-to-child. The typical parent-child relationship in a hierarchy is 1:M - a parent can have many children, but a child can have only one parent. However, in many business requirements the relationship is a network. In a ragged network the relationship across members is M:M and the hierarchy is of indeterminate depth. A parent can have many children and a child can have many parents. A bill-of-materials is usually a ragged (operational) hierarchy.
In reporting systems, such as a data warehouse, the two most difficult ragged hierarchies are Product and Organization. Usually, they are ragged networks. Because the common term for these is ragged hierarchy, however, we will use the term ragged hierarchy even to refer to this M:M relationship, whether it is a pure hierarchy or a network. This article is about using bridges for reporting structures not operational or master data structures such as a product bill-of-materials. Product and organization usually are a complex reporting structure and are among the most difficult to model.
There are four design solutions to M:M relationship across dimensions or between fact and dimensions that are possible:
- Array or series,
- Bridge table,
- Groupings or
- Fixed levels.
An array, as used in this context, is a fixed series of data elements. It is fixed because it has a specific number of entries. The array represents essentially a (conscious) violation of first normal form. Once data is organized into an array within the dimension, the relationship between dimension and fact can be 1:M. The multivalued dimension capability is preserved in the array. The array could be a series of data columns, indicators (sometimes called Booleans), strings or foreign keys. Usually the data elements in an array are arranged in a series and a group number is assigned to the entire array. The different data elements are supported as separate columns. In the bridge tables, they are supported as different rows. Figure 8 shows examples with indicators and data columns. (This example is modified from http://www.pythian.com/blogs/364.)
Pros and Cons
This solution works best when the number of entries in the array is few and fixed. It is possible (but inadvisable) that some entries in the array be empty, but there needs to be a maximum number. It is better if the number of entries in the array is fixed. It is also best if the array is both retrieved and used as a group. This structure is not flexible. It is much harder to query this structure when searching for a particular dimension value because you have to traverse the array to find it. This is particularly true if the array is one long comma-delimited string. The example in Figure 8 example involves program selections for cable TV. However, if the number of selections could be infinite (or at least indefinite), this solution is impractical. An array would not work for the policies within a customer. A customer can have different policies and a policy can relate to different customers. One auto policy could have only one customer, another health policy could have hundreds of participants.
Bridge tables represent the other extreme of flexibility. Bridging involves creating an association between the participating entities that allows you to navigate from one instance to another. The associations are kept as rows. If two separate dimensions are being bridged, then the bridging takes the form of a conventional associative entity. If the association is between one instance of a given dimension and other instances of the same dimension, then this is usually achieved by means of a recursion. Figure 9 is an example involving separate entities.
In a recursion, there are two essential entities: one is the primary entity, which represents the entries out of the context of the bridge. The second is the associative entity, the bridge table itself, which contains the relationships among primary members. The creation of the associative entity or bridge table is the first choice that traditional data modelers would make. It is a viable solution. However, it too has pros and cons and should not necessarily be the primary or only solution evaluated. Figure 10 is an example relating parent and child or subsidiary customers with a twist.
The common way decendent tables are populated is with a parent and its immediate children. An alternate way to populate is with a parent and all of its children - immediate, intermediate and ultimate. This solution is sometimes called a speed table, but a better name is descendent tables. Ralph Kimball has used the above example (but not the term descendent tables) to illustrate this.
Two Types of Bridges
As Joe Celko has pointed out, there are two types of recursion (or self-references), which he calls the Adjacency List and Nested Sets. An adjacency list is the classical (network) recursion, described above. Nested sets, illustrated in Figure 11, take a different approach. Nested sets also consist of a set of pointers, called the left and right pointer, added to each instance in the hierarchy. Each pointer points to the next entry in the hierarchy. The left point points to the next forward entry; the right pointer points to the next backward entry. Using these pointers one can quickly (using only vanilla SQL) navigate forward or backward through the hierarchy. Figure 11 is from Gijs Van Tilder.
The adjacency list or typical recursion has one major advantage and several disadvantages. Its advantage is its flexibility in supporting a ragged hierarchy or network. It is a very elegant and simple solution that works with many applications. Its two major disadvantages are performance and ease of coding. Recent database management system (DBMS) improvements have addressed both issues, however. This structure can now be coded in a single SQL statement in most DBMSs, significantly improving performance. Nested sets have the advantage of high performance and ease of coding. They only support a pure hierarchy (i.e., M:1); and maintenance of large hierarchies can be burdensome if not horrendous, because the entire hierarchy will have to be rebuilt when a change is introduced. In a data warehouse with a populous hierarchy, frequent changes and a short batch window, nested sets would be impractical. This is also true for a manufacturing application with a large number of parts and frequent additions or changes to the product structures. The sports channel example in Figure 12 illustrates this.
Bridging tables are formed as an associative entity (table). They represent a very flexible structure that is easy to extend. It is also inherently dense, meaning rows need exist only where intersections or instances exist. It is easy to maintain and extend. It is also a more natural data modeling structure, because it involves no particular data modeling constraints or restrictions such as the fixed and few numbers of an array. Associative tables are very populous. They have one row for every association. For some, it may be harder to understand, though this is somewhat subjective. Coding associative tables is more difficult than the other methods, because it involves retrieving many rows and navigating those instances. Because of all the rows, performance can be an issue as well.
Ensuring valid entries is another matter, however. To ensure valid entries, a separate set of validation tables are necessary. These too should be contained in a separate associative entity (table). None of the solutions necessarily identifies the allowable combinations. The grouping method is the one that comes closest, if it creates predetermined combinations as groups. The associative entity does not provide validation of the correct instances. Neither does the dynamic grouping method. Each of these methods records the instances that have been grouped, not the ones that are valid to group together. To define what is allowable, you can create another associative table consisting of allowable combinations.
When there is a M:M relationship from fact to dimensions, the issue becomes how to allocate the measure from the fact to the dimensions. The fact is at a higher granularity than the association of dimensions. This is true for arrays, bridge and descendent tables. In most cases, a weighting factor will have to be used. The weighting factor is a number that is used to determine how to allocate the revenue or volumes involved in the hierarchy. The issue is where to place the weighting factor is the challenge. There is no one answer that is the best in any case. Each has its pros and cons.
So, where does one place the weighting factor? Any time there is a question about where to put a given attribute in any model, the first choice is always place it based in its functional dependency. Another way is to ask what does the weighting factor describe? Practically speaking, the weighting factor is put in the array if that solution is used. If bridging or grouping are used, it is usually stored redundantly in every instance of the bridge or group.
A group is a set of dimensions that are formed together to achieve a common purpose. A group can also be viewed as a collection of valid combinations. Each specific collection is given a group number, and the group number is assigned to the other entity (the table) as a foreign key. These can be preplanned groupings or dynamically assembled groupings. The disadvantage is that the code to support these groupings is not simple. In Figure 13 it is possible to flatten Sports Group, Sports Group-Sports and Sports Dimension. The relationship is still supported as a M:M. See Jeff Prevosts article in this issue of DM Directfor a full description of this method.
Reduction to Fixed Hierarchy
In a fixed solution, the M:M association is reduced to a 1:M. This is clearly a restriction of the business requirement. For example, say your product hierarchy could potentially include Product, Product Sub Group and Product Group, in which a Product could appear in multiple Sub Groups and a Sub Group in many Groups. If management decides that for the short or medium term, it is sufficient to manage this as a M:1, a Product belongs to one Sub Group and a Sub Group to one Group. Figure 14 is a simplified example of a recursive (or M:M) product reporting hierarchy.
Figure 15 is the same recursive hierarchy reduced to a fixed hierarchy.
One issue is that this solution artificially reduces requirement to fixed levels. Another is that when snowflaking ragged hierarchies, the skipped levels have to appear as null levels or the higher level may have to be repeated. The methods of flagging skipped levels or duplicating the higher level are also used when the hierarchy is flattened.
This article covered four ways to support the many-to-many relationship between dimensions or between a fact and its dimensions, namely, an array, a bridging table, groups and as a fixed hierarchy. They are not all equal, or as is sometimes said, some are more equal than others. Some are easier to maintain than others. Some are easier to query than others. Some are more flexible. You have to understand your requirement to make the right choice.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access