Murphy’s Law tells us that whatever can go wrong will go wrong. This axiom applies to all aspects of life, including data warehousing. The following corollaries to Murphy’s Law relate this inevitability of something going “bump” in the night to data warehousing.


Your data warehouse has been extracting data from a source system. The extract, transform and load (ETL) process from the source system to the data warehouse is stable. Everything seems to be working correctly, and then:

  1. A rose by any other name - The source system decides to change the meaning of a data element that you are extracting on a daily basis. The structure and form of the data are the same, but the meaning has changed. If the ETL application enforces referential integrity, you may know about this source data change before a data warehouse customer tells you about it.
  2. Decoy - The source system decides to switch to a new operational application with its new sets of data. For back-out and backward compatibility purposes, the old operational application and its data are allowed to continue their existence. Your only clue is that data volumes from the source system decrease. If the ETL application monitors its data volumes, you may be aware of the change before a data warehouse customer tells you about it.
  3. New caboose - The source system adds new processes to the end of the job stream. The new processes mean the data you are extracting, which had previously been in its final state, is now in an intermediate state. The caboose of the train is no longer the caboose, or at least, not the last car. If the ETL application monitors the quality of its data, you might observe the change before a data warehouse customer tells you about it.
  4. Data quality - The source system blows a gasket, and the quality of the data goes through the floor. If the ETL application is monitoring the quality of its data, you can see that the source system has a problem (a simple batch audit of this kind is sketched after this list). In this situation, the data warehouse is the least of the system’s problems. Even though you’ve exercised the due diligence to identify the presence of poor-quality data, the source system is not able to provide high-quality data. In this conundrum, the data warehouse may opt to exclude erroneous data from a source system until that source system can clean up its own data.
  5. Data volume - The source system dramatically increases its data volume. The cause of the sudden increase can be anything from genuine expansion to a corporate reorganization. Regardless of the cause, the effect is a significantly higher number of records passing through the ETL application, causing data files to exceed their allocations and the data warehouse to miss its service level agreement (SLA). If the ETL application monitors its data volumes (see the volume-check sketch after this list), you may notice the change before a data warehouse customer asks you why the data warehouse failed to meet its SLA.
  6. Higher power - Upper management decides the data warehouse should include a subject area (or areas) specifically in support of a new high-profile, high-ROI and politically charged application. The data will be extracted, stored and reported to meet the specific requirements of the new operational application. This is when a data warehouse is a victim of its own success: upper management makes such a decision because the data in the data warehouse is expected to increase the value and ROI of the new operational application. The best defenses against such attempts to operationalize a data warehouse are:
    • A complete and comprehensive enterprise data model, so that all subject areas of the enterprise have a home;
    • An understanding of and commitment to the data warehousing principles established by Ralph Kimball and Bill Inmon; and
    • Knowledge of available data warehousing architectures that simultaneously follow data warehousing principles and provide data to operational applications in order to meet specific requirements.
  7. Wrong question - Data warehouse customers try to make business decisions by asking a question of the data warehouse and then finding the answer using the data in the data warehouse. But if they don’t understand how to formulate their question or how to find the answer, they often blame the data warehouse. The best remedy for such misunderstanding is metadata. Metadata provides descriptions and definitions of the data in a data warehouse. Delivered side-by-side with the data, metadata gives data warehouse customers the information they need to correctly formulate questions and answers.
  8. Attractive nuisance - You find out that a table or view, which was descoped by a project a year ago, has been erroneously allowed to continue its existence in the data warehouse. To make matters worse, data warehouse customers and operational applications found this table or view and are actively using it. As a result, stale and obsolete data is in active use. The best defense against such mishaps is dynamic metadata that records each table’s activity (e.g., rows loaded and load date). Metadata providing statistics about individual tables would lead you to wonder why a table is queried but never loaded (the sketch after this list shows one way to cross-reference that activity).
  9. Locksmith - The key of the source data changes. Usually when this occurs, a new data field has been added to the key. Suddenly, the changed data capture process of the ETL application misinterprets source data. If the ETL application monitors its data volumes, you may notice the change in the number of records the ETL application applies to the data warehouse.
  10. Didn’t I Tell You? - You find out that a logic gap exists in an ETL application. This logic gap is causing the ETL application to misstate the data it is loading into the data warehouse. Such a logic gap can be the result of new logic recently added to an operational source system, or it may have existed from the beginning of the ETL application and only recently been revealed by a new set of conditions. The number of rows has not changed. The keys of the data have not changed. It could be as simple as the application of a new sales tax or profit margin. The best defense against such logic gaps is a good working relationship with a data steward who knows the subject area and can recognize such errors.
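
Several of the scenarios above (decoy, data volume and locksmith in particular) share one safeguard: the ETL application should record the number of rows it moves on every run and compare each new run against recent history. The following is a minimal sketch of such a check, assuming the run log already holds recent row counts; the function name, the trailing three-run baseline and the 50 percent tolerance are illustrative assumptions, not features of any particular product.

```python
# Minimal volume check (illustrative): compare the current run's row count
# against a trailing average kept in the ETL run log.
from __future__ import annotations
import statistics

def check_volume(current_rows: int, recent_rows: list[int],
                 tolerance: float = 0.5) -> str | None:
    """Return a warning if current_rows deviates from the trailing average
    by more than the tolerance (50 percent by default)."""
    if len(recent_rows) < 3:
        return None                      # not enough history to judge
    baseline = statistics.mean(recent_rows)
    if baseline == 0:
        return None
    deviation = abs(current_rows - baseline) / baseline
    if deviation > tolerance:
        direction = "dropped" if current_rows < baseline else "jumped"
        return (f"Row count {direction} from ~{baseline:.0f} to {current_rows} "
                f"({deviation:.0%} change); check the source system.")
    return None

# Example: a sudden drop of the kind the decoy scenario produces.
print(check_volume(1_200, [48_500, 50_100, 49_800]))
```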
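
The new caboose and data quality scenarios likewise assume the ETL application audits the rows it receives before loading them. A sketch of one simple batch audit follows; the field names (order_id, customer_id, amount), the 2 percent error threshold and the pass/fail decision are assumptions made purely for illustration.

```python
# Minimal batch audit (illustrative): tally basic defects and decide whether
# the batch is fit to load. Field names and thresholds are assumptions.
def audit_batch(rows: list[dict], valid_customers: set[str],
                max_error_rate: float = 0.02) -> bool:
    """Return True if the defect rate is at or below max_error_rate."""
    defects = 0
    for row in rows:
        if not row.get("order_id"):                          # missing natural key
            defects += 1
        elif row.get("customer_id") not in valid_customers:  # broken reference
            defects += 1
        else:
            try:
                float(row["amount"])                         # non-numeric measure
            except (KeyError, TypeError, ValueError):
                defects += 1
    error_rate = defects / len(rows) if rows else 0.0
    print(f"{defects} defective rows of {len(rows)} ({error_rate:.1%})")
    return error_rate <= max_error_rate

# Example: one orphaned customer reference in a three-row batch.
batch = [
    {"order_id": "A1", "customer_id": "C10", "amount": "19.95"},
    {"order_id": "A2", "customer_id": "C99", "amount": "5.00"},
    {"order_id": "A3", "customer_id": "C11", "amount": "7.50"},
]
print(audit_batch(batch, valid_customers={"C10", "C11"}))
```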
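
And the attractive nuisance is easiest to catch when the warehouse’s metadata captures both load activity and query activity per table. The sketch below assumes such metadata is available as simple lookups and merely cross-references the two; the table names, dates and 90-day staleness threshold are invented for the example.

```python
# Illustrative cross-reference of load activity and query activity: report
# tables that are still being queried but have not been loaded recently.
from datetime import date

last_loaded = {                       # table -> most recent load date (made up)
    "sales_fact": date(2024, 6, 1),
    "old_orders": date(2023, 5, 14),  # descoped long ago, never dropped
}
query_counts = {                      # table -> queries observed this month (made up)
    "sales_fact": 1_840,
    "old_orders": 212,
}

STALE_AFTER_DAYS = 90
today = date(2024, 6, 2)

for table, queries in query_counts.items():
    days_since_load = (today - last_loaded[table]).days
    if queries > 0 and days_since_load > STALE_AFTER_DAYS:
        print(f"{table}: queried {queries} times but not loaded in "
              f"{days_since_load} days - an attractive nuisance")
```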

It is rather easy to create an ETL application that trusts the world to be a perfect place wherein nothing goes wrong. A robust ETL application, however, includes metadata, monitors its own data volumes and assesses data quality. A robust data warehouse is built on the broad foundation provided by a comprehensive enterprise data model and data warehousing principles. These methods enhance the quality and viability of a data warehouse, but they can be a tough sell when allocating personnel, funds and other resources to the creation of a data warehouse. That is, until you understand the world is not a perfect place and anything that can go wrong will go wrong - even in a data warehouse.
