In a simple star schema design for a data mart, does it matter if the relationships from the dimension tables to the fact tables are indirect relationships rather than direct relationships? What are the impacts pro and con?


Scott Howard’s Answer: This is more of a physical design rather than a logical design issue. Some RDBMSs require a direct relationship in support of their internal star join optimization techniques, while other vendors work fine with indirect relationships. That's a physical design issue. When designing the logical model, I seldom give a hoot about the physical design because I'm trying to design a model more aligned with the business need rather than the vendor RDBMS' limitations. If my logical design, that best supports the business requirements, needs to be compromised in order to support some vendor's RDBMS requirements or limitations, it's time to look for another vendor. After all, are we trying to appease the business customer/user or the RDBMS?

Joe Oates’ Answer: This is very similar to a question from last month. I will repeat the parts of that answer that pertain to your specific question, as I understand it.

One of the main reasons that dimension foreign keys are inserted into the original transaction record to produce a data warehouse fact table that has more information than the original transaction (called "enriching the transaction") is to provide acceptable performance. Most of my experience has been with very large databases (VLDBs). One of the principles to get acceptable performance in a VLDB environment is to avoid any join possible. The scenario that is suggested by the question would increase the number of joins required to give an answer.

I will relate an experience with just such a situation. A company had invested millions of dollars in a data warehouse that used designed principles like that suggested in the question. There were several fact tables with more than four billion rows. There were 40 million customers, many with more than one address.

The company was not happy with the data warehouse because even simple queries ran for hours. Several reports took days to run. When our team was brought in to redesign the data warehouse, we came up with a multidimensional design that imbedded the dimension foreign keys directly into the fact tables. We also replaced natural keys with a single surrogate key column. The result was that the reports that had formerly taken days to run, ran in just a few minutes. I knew that performance would be much better than the previous design, but I was really surprised at how much better the performance actually turned out to be.

In a nutshell, unless you have only a trivial amount of data, taking the dimension foreign keys out of the fact table and putting them into an associative table between the fact table and the dimension tables, e.g., an indirect relationship is not a good idea.

Clay Rehm’s Answer: For performance reasons, I would use direct relationships. The identifiers must be indexed, and those indexes must be utilized in the access path.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access