Continue in 2 seconds

I've recently hired on to administer and manage a previously designed data warehouse.

By
  • Chuck Kelley, Clay Rehm, Les Barbusinski
Published
  • May 07 2004, 1:00am EDT

Q:  

I've recently hired on to administer and manage a previously designed data warehouse. To my surprise, the active physical warehouse tables have no primary or foreign key constraints. I'm told the data is prevalidated during upstream ETL processing, and constraints merely cause performance issues during loading of the star schema. Is this a valid point? In my experience, database constraints are always part of a warehouse design and form the crux of fact-dimension relationships.

A:  

Les Barbusinski's Answer: Actually, this is fairly common practice ... especially in large data warehouses. PK and FK constraints require a lot of overhead and can cause a serious "drag" on the performance of most ETL scripts. The danger, of course, is that one or more ETL scripts fail to meticulously enforce PK and FK rules (especially after a modification has been applied), causing the tables to become hopelessly out of sync within a matter of days. You can mitigate this risk by developing scripts that periodically test the PK and FK constraint rules on crucial tables and send alerts via e-mail when discrepancies are detected. In smaller data warehouses or marts, however, it's safer and easier to just let the RDBMS manage the constraints. Hope this helps.

Chuck Kelley's Answer: Welcome to the world of data warehousing. In the early days, we didn't like the constraints because they were so slow in getting the data loaded within the time window. We made sure that everything was built properly. During development, we would have all the constraints, but during production, they came off. This is really no different than what we did/do with transaction systems. So, yes, that is valid.

However, in some database systems, in order to get star schema optimization, you need to put them on. For example, Oracle has a optimization strategy that understands how to accurately join a star schema. Without the primary key/foreign keys (PK/FK) defined, it will not work. So, a validation that the star schema optimization is more useful versus the costs of PK/FK constraints needs to be done for your environment. In the early days, the biggest design decision was based on load times. Now, that is not so much the case. You just need to test and find out which is best for you.

Clay Rehm's Answer: In my experience, database constraints do tend to cause performance issues in specific cases. You indicated that you were told that the data is prevalidated during the ETL processing. Can you get your team to show you physical proof of this? What kind of data quality reporting is currently being done to provide the proof that you need?

As the new manager, you will want to have trust and mutual respect between you and your team. Explain to them that even that you are asking for proof, it is to help you understand the workings of the data warehouse so you can be a better manager.

Larissa Moss'Answer:There are two schools of thought on the issue of enforcing referential integrity (RI). One insists on RI enforcement through ETL programs; the other insists on using the RI feature of the DBMS. In my own experience, I have seen too many databases become corrupt within a few months, which is why I do not prescribe to the "ETL-only" model. However, keeping RI turned on during database loads does indeed affect load performance, which is why it is commonly not practiced on data warehouses. As an alternative, consider dropping RI for the loads and turning RI on again after the loads. If there are any foreign keys pointing into "outer space" (table goes into a check pending condition), at least you will find out right away that you have a problem, and not months or years later when recovery is all but impossible. If that is still not an acceptable solution because of the time impact, then run SQL scripts after each load to find Foreign Key without parents.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access