Published
• October 01 2004, 1:00am EDT
More in

In the July 2004 issue of DM Review, I discussed the concept of dimensional placeholders. I mentioned that you should not allow a null in any part of the fact table key. This month, I would like to provide an example demonstrating how this can cause problems that are rather difficult to debug. To illustrate the problem, let's look at two dimensions on a fact table. In reality, you will have more than two dimensions, and, likewise, the problem can become much more complex. Consider the simple sales example in Figure 1.

Figure 1: Simple Sales Example

Let's also look at data content for each of the tables as shown in Figures 2-4.

Figure 2: The Date Dimension Rows

Figure 3: The Product Dimension Rows

Figure 4: The Sales Fact Table Rows

There are 21 rows in the fact table. Let's walk through several queries to understand how these tables work together.

Query 1: Get the grand total of all rows in the fact table.
SELECT Sum(SALES_FACTS.Dollars_Sold) as [Dollars_Sold]
FROM SALES_FACTS;

Figure 5:Query 1 Results

Query 2: Include the date dimension in the query.
SELECT Sum(SALES_FACTS.Dollars_Sold) as [Dollars_Sold]
FROM DATE_DIM,
SALES_FACTS
WHERE DATE_DIM.DATE_KEY = SALES_FACTS.DATE_KEY;

Figure 6: Query 2 Results

Query 3: Get all sales grouped by quarter.
SELECT DATE_DIM.Quarter,
Sum(SALES_FACTS.Dollars_Sold) as [Dollars_Sold]
FROM DATE_DIM,
SALES_FACTS
WHERE DATE_DIM.DATE_KEY=SALES_FACTS.DATE_KEY
GROUP BY DATE_DIM.Quarter;

Figure 7: Query 3 Results

Note that the grand total from query 3 is still 4,200.

Query 4: Constrain query 3 to provide only data from the third quarter and group the results by category (from the product dimension).
SELECT DATE_DIM.Quarter,
PROD_DIM.Category,
Sum(SALES_FACTS.Dollars_Sold) as [Dollars_Sold]
FROM DATE_DIM,
PROD_DIM,
SALES_FACTS
WHERE DATE_DIM.DATE_KEY=SALES_FACTS.DATE_KEY AND
PROD_DIM.PROD_KEY=SALES_FACTS.PROD_KEY AND
DATE_DIM.Quarter="2004 Q3"
GROUP BY DATE_DIM.Quarter,
PROD_DIM.Category;

Figure 8: Query 4 Results

Note that the total results for the third quarter are 700, which is consistent with the results from query 3.

Query 5: Now let's remove the constraint to see only the third quarter in order to see the results for all sales grouped by quarter and category.
SELECT DATE_DIM.Quarter,
PROD_DIM.Category,
Sum(SALES_FACTS.Dollars_Sold) as [Dollar_Sales]
FROM DATE_DIM,
PROD_DIM,
SALES_FACTS
WHERE DATE_DIM.DATE_KEY = SALES_FACTS.DATE_KEY AND
PROD_DIM.PROD_KEY = SALES_FACTS.PROD_KEY
GROUP BY DATE_DIM.Quarter,
PROD_DIM.Category;

Figure 9: Query 5 Results

We have a problem! The grand total from this query is 3,900! There are no constraints; therefore, you would expect that this should represent all rows, but we are missing \$300 from the result of 4,200 that we got from both query 1 and query 2! Why?

Query 6: To figure out the differences, let's see which rows are being included in query 5. To do this, eliminate the sum and the group by clause from query 5. Because I have also included the DIM_KEY and the PROD_KEY in the results, we know which exact fact table rows are included.
SELECT DATE_DIM.Quarter,
PROD_DIM.Category,
SALES_FACTS.DATE_KEY,
SALES_FACTS.PROD_KEY,
SALES_FACTS.Dollars_Sold
FROM DATE_DIM,
PROD_DIM,
SALES_FACTS
WHERE DATE_DIM.DATE_KEY = SALES_FACTS.DATE_KEY AND
PROD_DIM.PROD_KEY = SALES_FACTS.PROD_KEY;

Figure 10: Query 6 Results

Only 20 rows are returned, and we know that the fact table has 21 rows. The row that is missing is the row with the DATE_KEY of 3 and the PROD_KEY is null. Because the query is performing an inner join to the product dimension, the row in the fact table that contains the null in the PROD_KEY field is not included in the result set. When you look at what is happening, the database is working properly.

The next thought you may have is to include a row in the product dimension table that has a null. You might think that this would prevent the problem from happening. However, nulls are not included in join processing. The same results will occur.

Thus, you need to create a row in the product dimension that has a surrogate key value to indicate the "Unknown" product. Use this surrogate key in the fact table as well. Now you have created a placeholder, as I recommended in my July column.