I have a small problem relating to the relationships that exist between a dimension table and a fact table. There are two kinds of relationships: identifying and non- identifying. I do understand the meaning of both the relationships but am facing some difficulty in applying it. I would like to know which kind of relationship can best handle the slowly changing dimensions.
I am developing a model for a telecommunications billing data mart. In this there are various attributes in the customer dimension that can change over a period of time and change in different manner. I am handling them with Type 2 Slowly Changing Dimension. In the customer dimension if there is a change at the address level, a new record will be inserted with a new key. How do we handle this change in the invoice fact because invoice related details are not changing for that customer? Kindly suggest the kind of relationship that should exist between the customer dimension and the invoice fact.
Les Barbusinski’s Answer: The relationship between a fact and dimension is almost always an identifying relationship unless you’re dealing with an FK from a "snowflake" dimension. But I don’t think that’s your problem.
Usually, when you have a slowly changing dimension, it’s a good idea to separate the static data from the dynamic data. Hence, in the case of a Customer, you’d have a Customer Master dimension and a Customer History dimension...with a parent-child relationship between the two. The former would contain information about the customer that does not change over time (i.e., name, gender, birth date, SSN, Customer ID, etc.), while the latter would contain slowly changing information about the customer (e.g., address, occupation, marital status, income level, etc.). All related fact tables would then have an "identifying" FK relationship to the Customer History dimension. Derived metrics that measure Customer activity over time (such as invoice metrics) can then simply join the applicable Customer History row with its parent row in the Customer Master dimension, and aggregate as needed. You can even create a view that joins the two dimension tables to fool the BI tool’s semantic layer into thinking that a particular fact is linked to the parent Customer Master dimension. Hope this helps.
Scott Howard’s Answer: In most cases you will have many identifying relationships in the star schema because you’ll need to resolve the dimension keys contained in the fact table to fully identify the dimensional detail related to the fact’s measures. It is a given that you will seldom use non- identifying relationships in the star or snowflake schema between the fact and its dimensions. The only place that non-identifying relationship may appear is within the dimensional hierarchy or in dimension to dimension relationships. So the answer your first question about which is better for slow changing dimensions would be, you don’t have a choice.
Now your problem appears to be in handling slow varying dimensions: as the dimension’s data changes, how do we maintain synchronization with related facts? The first thing we need to determine is if it’s even appropriate to maintain those relationships. A customer change should have no effect on prior facts as those facts should always relate to the temporal view of the customer at the time of the fact. Ask the users, do you need to analyze customer purchases based on the characteristics of the customer and the time of purchase or based solely on current customer characteristics? We would deduce that it must always be based on the view at time of purchase because any other view would be an anachronism or an invalid temporal reference. The interesting problem here is many users want both views allowing them to analyze across alternate realities. This is very abstract as you can see and takes several hours and exercises to fully develop a solution in my workshops. Let me try to do some justice to the subject without over stressing DM Review’s storage system.
You have two basic choices both still involving identifying relationships. You can model the relationships to join solely on the natural key, but this would require complex temporal joins at analysis time to prevent anachronisms. In other words, the query tool or user must be smart enough to include time references for both the fact and associated dimensions in the query an not rely exclusively on the keys. This is the most flexible approach because it allows the sophisticated user to create almost any alternate reality. It’s also the most dangerous because it’s easy for the non-sophisticated user to base business decisions on things that never happened anachronisms.
The second method of identifying these temporally complex relationships is to use historical keys. As the dimensions change, the new facts must ensure that they always inherit the most current dimensional key of the slow varying dimension. This makes it simple to ensure that facts always identify the temporally appropriate dimension occurrence. However, this approach also makes is more difficult to model alternate realities. There are many more complexities to discuss and develop including path lookups and intentional breaks in the dimensional hierarchy but difficult in the forum.
Mike Jennings’ Answer: This is typically handled through an identifying relationship between the customer dimension and the invoice fact table with a one-to-many parent-to-child cardinality. The customer dimension table uses an integer surrogate key field as its primary key. This customer surrogate key field becomes a foreign key in the invoice fact table. Using slowly changing dimensions type 2, during any subsequent load cycles to the data warehouse where a column value changes for the production/natural key (Mary Smith in Figure 1) a new record is added to the customer dimension table to capture this information (see Figure 1). If no new invoice related measures are available for this customer during this load cycle, the change in the customer dimension can be skipped and picked up in subsequent load cycles (maintaining 1:M).
Figure 1: Example
Chuck Kelley’s Answer: There are two (at least) questions that I would have for you. First, how important is the question that where someone lives affects the manner of the use of telecommunication services that you provided? Second, do you really want to change the fact that an invoice was sent to a certain address or are you merely interested in the current address that the invoice should (or might have) been sent to. Since you are using Type 2 slowly changing dimensions (SCD), I assume that you want to deal with history. That being the case, if there are no invoices shipped to the new address, then the new address will not have any facts (yet) associated to it. Remember that since the data warehouse is historical (and many times quite hysterical), the Type 2 SCD is used to track the changes. It might be possible that there would be zero invoices for a specific address, but that would be OK, since you want to track the history of the customers address.
As for identifying or non-identifying relationships, I believe that 99 percent (including this one) should be an identifying relationship if you are using a star schema, as you seem to be.
Joe Oates’ Answer: The important factor here is that you must have a unique customer number that does not change over time. The primary key in a slowly changing dimension such as customer changes every time that you create a new record. Therefore, you must have an additional identifier that does not change when you create a new record. See the following link for an in-depth discussion:
My personal preference is to remove things that have a high probability of changing over time, such as demographics and address, from the customer dimension and put these into one or more dimension fact tables and treat them as helper or bridge tables. I then insert the primary key of these into fact tables where appropriate. See Kimball’s Data Warehouse Toolkit Second Edition page 164 for an example. I can then treat the customer dimension as a Type 1 dimension. For things like name or other customer life cycle events, I also create a separate fact table.
This approach adds a little complexity to the ETL for the customer dimension, but saves a lot of messy navigation issues. Of course, everything in data warehousing is a trade off and there is no single solution that will be the least effort and provide the best performance for all situations, but this has worked well for me for many years. An online discussion at http://aln.hha.dk/IFI/Hdi/2000/Download/Lektion24.doc alludes to such an approach.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access