Today businesses are investing many resources in building data warehouses and data marts to obtain timely and actionable information that will give them better business insight. This will enable them to achieve, among other things, sustainable competitive advantage, increased revenues and a better bottom line.
In the early '90s, data warehousing applications were either strategic or tactical in nature. Trending and detecting patterns was the typical focus of many solutions. Now, companies are implementing data warehouses or operational data stores which meet both strategic and operational needs. The business need for these solutions usually comes from the desire to make near real-time actions in a constantly changing environment while receiving information from both internal as well as external source systems.
Dealing with missing or unknown data is critical in these types of environments. Unknowns skew metrics and results to produce incorrect decisions. Knowledge of the unknown allows at least for further examination of any conclusions drawn from incomplete data. Furthermore, in a well-designed business intelligence environment, these unknowns are often resolved later as data that is more complete is entered into the operational systems. Irrespective of the nature of the applications, missing information has always been a problem for data warehouses. As business intelligence environments become more mature, real time and mission critical, the increased number of operational applications accentuates this problem.
It is important to keep in mind the time-specific value of information. As General George S. Patton once said, "A good plan, violently executed now, is better than a perfect plan executed next week." To form a good plan, one needs adequate information now, rather than perfect and correct information later. However, depending on the organizational value, businesses will likely want both the available incomplete information now and the correct information later. Data warehouse applications must be designed to accept incomplete data and have the capability to make changes with time. Where it is not possible to fix the data, consider assigning values intelligently.
Missing Dimension Records
While loading a fact record, we typically look up the appropriate dimension row associated with that fact based on the business key to obtain the surrogate key. We then populate the fact table with appropriate dimension foreign keys. If the dimension surrogate key is not found, we have a case of a missing dimension record. There are several options to deal with missing dimension records.
Insert Bare- Minimum Dimension
One option is to insert the missing dimension record as a new dimension row using just the business key and assigning default values that mean "unknown" to other attributes in the dimension. The best-case scenario would be receiving an updated dimension record that would later be matched to the business key. The updated record would have all the appropriate values to make the dimension row meaningful. This approach allows the fact to be loaded with the correct foreign key by marking the dimension row as incomplete and allowing future updates to complete the dimensional row. This should be done as a separate pre-process before the fact-table load so that the dimension is present at the time of the fact-table load. Insertion of the dimension record in the dimension table may be difficult if the business key itself is missing. In this case, it is best to assign the fact to a dimension row that represents a dummy value.
Reprocess Fact Records
A second option would be to insert the fact record with the missing dimension foreign keys referring to a dummy dimension record. This will allow the fact to be loaded with missing information. The metrics will be correct at the gross level but they will be off at the dimension level due to the missing information. In this scenario, the incoming fact records are stored for reprocessing and flagged as incomplete. These facts are then cycled back as input the next time the fact table is built. One can try reprocessing the fact record until there is no more missing information.
When correcting the missing information, consider the following questions.
- Does the fact record only come from the source system once?
- Is there a possibility that a row will become associated with a dummy dimension record because of edits?
- Can changes or fixes to related dimensions result in the reassignment of dummy foreign keys to correct values?
If the answer to all three of these questions is yes, then reprocessing may be required. The final test is to balance the cost of development versus the benefits of implementation. Is this type of error likely to occur, and is ultimately obtaining the correct information of value to the organization? If yes, then reprocess the fact record.
If you have determined that reprocessing is required, you may either update the fact record or back out the original fact record with an offset entry followed by the insertion of the correct fact record.
Intelligent Assignment of Dummies
Occasionally, no amount of reprocessing will help. The information will never be correct and it will always be missing some dimension values. From a user-analysis perspective, it is important that we try to look at the context of the missing information and assign different dummy values. This extra level of information gives the users better confidence in the data. Keep in mind that sometimes it is possible to look at the context of the data and guess the correct value.
Resolving missing dimensional values either through the use of unknowns and subsequent updates or through the use of a dummy dimension entry is preferable to the alternative which is dropping the data due to referential integrity constraints. In the best case, the missing data will be resolved over time; in the worst case, the fact data is flagged as incomplete, and this fact may be taken into consideration when drawing conclusions based on the data as a whole.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access