The previous article addressed the question How is real time achieved? Beginning with the architecture, it provided a general design for a real-time application. This third and final article of the series addresses the questions What causes the biggest headaches? and What can you do about it?
Following the implementation of an application that delivers real-time data can be frustrating when the source system refuses to cooperate. This article addresses some of the more common lapses in data quality that can shoot a real-time application in the foot.
Incomplete Within Itself
Problem: Data extracted from an operational source system may lack internal elements that are required to consider the data complete. For example, a record may lack:
- Person's name,
- Part description and/or
- Creation date.
The business meaning of that record may be compromised as a result.
Solution: This is a data quality problem, and it requires a data quality solution. Typically, if the meaning of a record is significantly compromised or rendered unusable, the occurrence of poor data quality is reported and the record rejected (hard reject). However, if some of the record's business meaning can be salvaged by filling in the gaps with default values, then the record is allowed to pass to the data warehouse with default values inserted into the gaps. In this case, the occurrence of poor data quality is reported, including the data quality exception encountered and the action that was taken (soft reject).
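The hard/soft reject decision above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation; the field names, the choice of which fields are critical, and the default values are all assumptions for the example.

```python
# Sketch of hard vs. soft reject handling for internally incomplete records.
# REQUIRED, CRITICAL and DEFAULTS are illustrative assumptions.

REQUIRED = {"person_name", "part_description", "creation_date"}
CRITICAL = {"creation_date"}          # missing these renders the record unusable
DEFAULTS = {"person_name": "UNKNOWN", "part_description": "NOT PROVIDED"}

def screen_record(record):
    """Return (action, record, exceptions), where action is
    'pass', 'soft_reject', or 'hard_reject'."""
    missing = {f for f in REQUIRED if not record.get(f)}
    if not missing:
        return "pass", record, []
    if missing & CRITICAL:
        # Meaning cannot be salvaged: report the exception and reject outright.
        return "hard_reject", record, sorted(missing)
    # Salvageable: fill the gaps with defaults, report exception and action taken.
    patched = {**record, **{f: DEFAULTS[f] for f in missing}}
    return "soft_reject", patched, sorted(missing)
```

Either way, the exception list is what gets reported, so the occurrence of poor data quality is never silent.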
Incomplete Beyond Itself
Problem: Data extracted from an operational source system may have business meaning only in the context of other data records. If those other records do not arrive, the business meaning can be compromised or lost entirely. For example, a record may lack:
- A header record to provide the context for a detail line-item record,
- An original transaction that is updated by the data in a subsequent record and/or
- A dimensional data value that will join to the fact data in a record.
As a result, the business context of that record may be compromised.
Solution: This is also a data quality problem, and it requires a data quality solution. Again, if the context of a record is compromised or rendered unusable, the occurrence is reported and the record is rejected (hard reject). However, if the context of the record can be salvaged by filling in the gaps with default values, the record is allowed to pass to the data warehouse with those default values. The occurrence of poor data quality and the soft reject are reported in this circumstance as well.
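For the header/detail case, the same pattern applies: look up the context a record depends on, and substitute a default key only when the load can tolerate it. A minimal sketch, assuming a hypothetical order header lookup and a placeholder header key:

```python
# Sketch: checking external completeness before loading a detail record.
# The header set, field name, and default key are illustrative assumptions.

known_headers = {"ORD-100", "ORD-101"}   # keys already staged or in the warehouse
DEFAULT_HEADER = "ORD-UNKNOWN"           # placeholder row the warehouse carries

def resolve_context(detail, salvageable=True):
    """Hard-reject a detail whose header never arrived, or soft-reject it
    by pointing it at a default header so the load can proceed."""
    if detail["order_id"] in known_headers:
        return "pass", detail
    if not salvageable:
        return "hard_reject", detail
    patched = {**detail, "order_id": DEFAULT_HEADER}
    return "soft_reject", patched
```

The same check covers late-arriving dimension values: the fact row joins to a default dimension member until the real one shows up.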
Repeat Data
Problem: Sometimes a source system will repeat its data. A receiving agent or application may request that data be played again, or an interruption in an asynchronous environment may cause data to be repeated when connectivity is restored.
Solution: If a record should never be repeated, then each record must be keyed so that a staging application can recognize it when it repeats. This may not be as straightforward as it sounds. The repeat record may carry a timestamp value that differs from the original record; in that case, the timestamp cannot be part of the key that uniquely identifies the record. Determining the true unique key of each data record is the key to catching repeated data.
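The point about the timestamp can be made concrete with a small de-duplication sketch. The key fields here are assumptions for illustration; the essential choice is that the volatile transmission timestamp is deliberately excluded from the key:

```python
# Sketch: catching replayed records by business key, ignoring volatile
# fields such as the transmission timestamp. KEY_FIELDS is an assumption.

KEY_FIELDS = ("account_id", "txn_seq")   # uniquely identify a record

def dedupe(stream, seen=None):
    """Yield only first occurrences; replays (even those stamped with a
    new timestamp) are dropped because the timestamp is not in the key."""
    seen = set() if seen is None else seen
    for rec in stream:
        key = tuple(rec[f] for f in KEY_FIELDS)
        if key in seen:
            continue                     # repeated data: skip it
        seen.add(key)
        yield rec
```

Had the timestamp been included in KEY_FIELDS, the replay would look like a brand-new record and slip through.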
Repeat Data with Modifications
Problem: Sometimes a source system will realize it has made a mistake and try to resolve the error by restating a data record. This can be caused by internal completeness or external relational integrity issues that have been resolved. Now that the data is complete, the source system replaces the incomplete record with a complete record.
Solution: Each record must be keyed to allow a staging application to recognize it when repeated, even when it is repeated with modifications. The key that uniquely identifies each data record is the solution to catching modified data. If one of the values in the key of the repeated, modified record has itself been modified, the subsequent record is a different record by key definition, which renders the extract, transform and load (ETL) application blind to the repeat. If, however, it was an attribute that was modified, then the ETL application must be able to recognize the repeated record as an update to a previous record.
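The distinction between a modified attribute and a modified key value can be shown with an upsert sketch. Field names are illustrative assumptions; the point is that only a matching key lets the application see the repeat as an update:

```python
# Sketch: applying repeats-with-modifications as updates. If an attribute
# changed, the key still matches and the record is updated in place; if a
# key field changed, the record looks brand new and the ETL application is
# blind to the repeat. KEY_FIELDS is an illustrative assumption.

KEY_FIELDS = ("account_id", "txn_seq")

def apply_record(store, rec):
    """Insert or update `store` (a dict keyed by business key) and
    report which action was taken."""
    key = tuple(rec[f] for f in KEY_FIELDS)
    action = "update" if key in store else "insert"
    store[key] = rec
    return action
    # "insert" means either a genuinely new record, or a repeat whose
    # key value was itself modified -- the two are indistinguishable here.
```

This is why choosing a stable, truly unique business key is the heart of the solution: every other safeguard depends on it.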
A popular design consideration is a recycle wheel. Data records with a data quality problem are held inactively in a recycle wheel until they can be resolved and corrected via the source system. The problem with a recycle wheel is that data quality problems can quickly be forgotten and ignored, an inattention driven by sheer data volume. For example, if 20 million data records arrive in a day and 10 of them have a data quality problem, a subject matter expert from the source is not likely to be concerned with those 10 records. When this situation is repeated daily for years, the recycle wheel will most likely exceed its storage allocation; meanwhile, it is not delivering data or value to the data warehouse.
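One way to keep a recycle wheel from becoming a silent landfill is to put an age limit on each spin and escalate anything that outlives it. This is a minimal sketch of that idea, with the age threshold and the escalation path as assumptions, not a standard design:

```python
# Sketch of a recycle wheel that escalates instead of silently growing.
# max_age and the escalation bucket are illustrative assumptions.

def spin(wheel, fixer, max_age=3):
    """Retry each rejected record once per spin. Records that cannot be
    fixed within max_age spins are escalated for human attention rather
    than held forever, which also caps the wheel's storage."""
    still_broken, resolved, escalated = [], [], []
    for rec, age in wheel:
        fixed = fixer(rec)               # e.g., retry the source lookup
        if fixed is not None:
            resolved.append(fixed)
        elif age + 1 >= max_age:
            escalated.append(rec)        # force attention; cap the wheel
        else:
            still_broken.append((rec, age + 1))
    return still_broken, resolved, escalated
```

Escalation guarantees that those 10 records out of 20 million land on someone's desk instead of rotating unnoticed for years.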
In this series, I have addressed questions that must be answered as the decision-makers in an organization or enterprise consider the possibility of real-time data warehousing:
- What is real time?
- When and where is real time valuable?
- How is real time achieved?
- How do you troubleshoot in real time?
This information is, of course, generic. No two real-time applications will implement these concepts and methods in exactly the same way. But this series provides an approach and a set of concepts that will enhance a real-time data warehouse.