An Architecture for Data Quality

October 1, 2007, 1:00am EDT

In this article, I propose a comprehensive architecture for capturing data quality events as well as measuring and ultimately controlling data quality in the data warehouse. This scalable architecture can be added to existing data warehouse and data integration environments with minimal impact and relatively little upfront investment. Using this architecture, it is even possible to progress systematically toward a Six Sigma level of quality management. This design is in response to the current lack of a published, coherent architecture for addressing data quality issues.

Three powerful forces have converged to put data quality concerns near the top of the list for organization executives. First, the long-term cultural trend that says, "If only I could see the data, then I could manage my business better" continues to grow. Most knowledge workers believe instinctively that data is a crucial requirement for them to function in their jobs. Second, most organizations understand that they are profoundly distributed, typically around the world, and that effectively integrating myriad disparate data sources is required. And third, the sharply increased demands for compliance mean that careless handling of data is not going to be overlooked or excused.

These powerful converging forces illuminate data quality problems in a harsh light. Fortunately, the big pressures are coming from the business users, not just from IT. The business users have become aware that data quality is a serious and expensive problem. Thus, the organization is more likely to support initiatives to improve data quality. But most business users probably have no idea where data quality problems originate or what an organization can do to improve data quality. They may think that data quality is a simple execution problem in IT. In this environment, IT needs to be agile and proactive, but data quality cannot be improved by IT alone. An even more extreme view says that data quality has almost nothing to do with IT.

It is tempting to blame the original source of data for any and all errors that appear downstream. If only the data entry clerk were more careful and really cared! We are only slightly more forgiving of typing-challenged salespeople who enter customer and product information into their order forms. Perhaps we can fix data quality problems by imposing better constraints on the data entry user interfaces. This approach provides a hint of how to think about fixing data quality, but we must take a much larger view before pouncing on technical solutions. At a large retail bank I worked with, the Social Security number fields for customers were often blank or filled with garbage. Someone came up with the brilliant idea to require input in the 999-99-9999 format, and to cleverly disallow nonsensical entries such as all 9s. What happened? The data entry clerks were forced to supply valid Social Security numbers in order to progress to the next screen, so when they didn't have the customer's number, they typed in their own!

Michael Hammer, in his revolutionary book Reengineering the Corporation published in the early 1990s, struck at the heart of the data quality problem with a brilliant insight that I have carried with me throughout my career. Paraphrasing Hammer, seemingly small data quality issues are, in reality, important indications of broken business processes. Not only does this insight correctly focus our attention on the source of data quality problems, but it also shows us the way to the solution.

Establish a Quality Culture and Re-Engineer the Processes

Technical attempts to address data quality will not function unless they are part of an overall quality culture that must come from the very top of an organization. The famous Japanese car manufacturing quality attitude permeates every level of those organizations, and quality is embraced enthusiastically by all levels, from the CEO down to the assembly line worker. To cast this in a data context, imagine a company like a large drugstore chain where a team of buyers contracts with thousands of suppliers to provide the drugstore inventory. The buyers have assistants whose job it is to enter the detailed descriptions of everything purchased by the buyers. These descriptions contain dozens of attributes. But the problem is that the assistants have a deadly job. They are judged on how many items they enter per hour. The assistants have almost no visibility about who uses their data. Occasionally the assistants are scolded for obvious errors. But more insidiously, the data given to the assistants is itself incomplete and unreliable. For example, there are no formal standards for toxicity ratings, so there is significant variation over time and over product categories for this attribute. How does the drugstore improve data quality? Here is a nine-step template, not only for the drugstore, but for any organization addressing data quality:

  1. Declare a high-level commitment to a data quality culture.
  2. Drive process reengineering at the executive level.
  3. Spend money to improve the data entry environment.
  4. Spend money to improve application integration.
  5. Spend money to change how processes work.
  6. Promote end-to-end team awareness.
  7. Promote interdepartmental cooperation.
  8. Publicly celebrate data quality excellence.
  9. Continuously measure and improve data quality.

At the drugstore, money needs to be spent to improve the data entry system so that it provides the content and choices needed by the buyers' assistants. The company's executives need to assure the buyers' assistants that their work is very important and that their efforts affect many decision-makers in a positive way. Diligent efforts by the assistants should be publicly praised and rewarded. End-to-end team awareness of and appreciation for the value of data quality is the final goal.

Once the executive support and the organizational framework are in place, specific technical solutions become appropriate. The rest of this article describes how to marshal technology to support data quality. Goals for the technology include:

  • Early diagnosis and triage of data quality issues,
  • Specific demands on source systems and integration efforts to supply better data,
  • Specific descriptions of data errors expected to be encountered in extract, transform and load (ETL),
  • Framework for capturing all data quality errors,
  • Framework for precisely measuring data quality metrics over time, and
  • Quality confidence metrics attached to final data.

The Role of Data Profiling

Data profiling is the technical analysis of data to describe its content, consistency and structure. In some sense, any time you perform a SELECT DISTINCT investigative query on a database field, you are doing data profiling. Today there are a variety of tools purpose-built to do powerful data profiling. It probably pays to invest in a tool rather than build your own, because the tools allow many data relationships to be explored easily with simple user interface gestures.
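
To make this concrete, here is a minimal profiling sketch in Python, using only the standard library's sqlite3 module; the customer table, its columns and its contents are hypothetical stand-ins for a candidate data source.

```python
import sqlite3

# Hypothetical candidate data source, loaded into an in-memory database
# purely so the example runs end to end.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER, state TEXT, ssn TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "CA", "123-45-6789"), (2, "ca", None), (3, "XX", "999-99-9999")],
)

def profile_column(conn, table, column):
    """The simplest form of data profiling: distinct values, null counts
    and row counts for a single column."""
    distinct = [row[0] for row in
                conn.execute(f"SELECT DISTINCT {column} FROM {table}")]
    nulls = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL").fetchone()[0]
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    print(f"{table}.{column}: {total} rows, {nulls} nulls, "
          f"{len(distinct)} distinct values: {distinct}")

profile_column(conn, "customer", "state")  # exposes 'CA' vs 'ca' vs 'XX'
profile_column(conn, "customer", "ssn")    # exposes nulls and all-9s garbage
```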

Data profiling plays distinct strategic and tactical roles. At the beginning of a data warehouse project, as soon as a candidate data source is identified, a quick data profiling assessment should be made to provide a go/no-go decision about proceeding with the project. Ideally this strategic assessment should occur within a day or two of identifying a candidate data source. Early disqualification of a data source is a responsible step that will earn you respect from the rest of the team, even if it is bad news. A very late revelation that the data source cannot support the mission can have a fatal career outcome if this revelation occurs months into a project.

Once the basic strategic decision is made to include a data source in the project, a lengthy tactical data profiling effort should be made to identify as many data problems as possible. Issues that show up in this phase result in detailed specifications that are either 1) sent back to the originator of the data source as requests for improvement, or 2) the grist for processing in the data warehouse ETL pipeline every time data is extracted from the source. I firmly believe that most problems can only be effectively addressed at the source.

Quality Screens

The heart of the data warehouse ETL architecture is a set of quality screens that act as diagnostic filters in the data flow pipelines. A quality screen is simply a test implemented at any point in the ETL or data migration processes. If the test is successful, nothing happens and the screen has no side effects. But if the test fails, then every screen must do two things (a minimal code sketch follows the list):

  • Drop an error event record into the error event schema, and
  • Choose to halt the process, send the offending data to a suspense file or merely tag the data.
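
A minimal sketch of this screen contract in Python follows. The class and field names are illustrative assumptions rather than a prescribed API; the point is only that every screen couples a test with the two responsibilities above.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Callable, List, Optional


class Action(Enum):
    """What a screen does when its test fails."""
    HALT = "halt the process"
    SUSPEND = "send the offending data to a suspense file"
    TAG = "tag the data and pass it through"


@dataclass
class ErrorEvent:
    """One row destined for the error event fact table."""
    screen_name: str
    batch_id: str
    occurred_at: datetime
    offending_key: str


@dataclass
class QualityScreen:
    name: str
    test: Callable[[dict], bool]   # returns True when the record passes
    on_failure: Action

    def run(self, record: dict, batch_id: str,
            error_log: List[ErrorEvent]) -> Optional[Action]:
        if self.test(record):
            return None            # success: no side effects
        # Failure: drop an error event record, then report the chosen action.
        error_log.append(ErrorEvent(self.name, batch_id,
                                    datetime.now(), str(record.get("id"))))
        return self.on_failure
```

In a real implementation the in-memory error_log would be replaced by an insert into the error event fact table described later in this article.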

Although all quality screens are architecturally similar, it is convenient to divide them into three types, in ascending order of scope. Here we follow the categorizations of data quality as defined by Jack Olson in his seminal book Data Quality: The Accuracy Dimension: column screens, structure screens and business rule screens.

Column screens test the data within a single column. These are usually simple, somewhat obvious tests, such as testing if a column contains unexpected null values, or if a value falls outside of a prescribed range, or if a value fails to adhere to a required format.

Structure screens test the relationship of data across columns. Two or more fields may be tested to verify that they implement a hierarchy (e.g., a series of many-to-one relationships). Structure screens include testing foreign key/primary key relationships between fields in two tables and also include testing whole blocks of fields to verify that they implement postally valid addresses.

Business rule screens implement more complex tests that do not fit the simpler column or structure screen categories. For example, a customer profile may be tested for a complex time-dependent business rule, such as requiring that a lifetime platinum frequent flyer has been a member for at least five years and has more than 2 million frequent flyer miles. Business rule screens also include aggregate threshold data quality checks, such as checking to see if a statistically improbable number of MRI examinations have been ordered for minor diagnoses. In this case, the screen only throws an error after a threshold of such MRI exams is reached.
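
As an illustration, here are hypothetical test predicates for each of the three screen types, written so they could plug into a framework like the sketch above; the column names, reference sets and threshold values are all assumptions.

```python
import re

# Column screen: single-column format test. The SSN must match 999-99-9999
# and must not be the all-9s garbage from the retail bank story.
def valid_ssn(record: dict) -> bool:
    ssn = record.get("ssn") or ""
    return bool(re.fullmatch(r"\d{3}-\d{2}-\d{4}", ssn)) and ssn != "999-99-9999"

# Structure screen: foreign key test across tables. Every fact record must
# reference a product key that exists in the product dimension.
def product_key_exists(record: dict, product_keys: set) -> bool:
    return record.get("product_key") in product_keys

# Business rule screen (aggregate threshold): only throw an error once a
# statistically improbable count accumulates across the batch.
def mri_threshold_exceeded(mri_count_for_minor_diagnoses: int,
                           threshold: int = 50) -> bool:
    return mri_count_for_minor_diagnoses > threshold
```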

Error Event Schema

The error event schema is a centralized dimensional schema whose purpose is to record every error event thrown by a quality screen anywhere. This approach obviously can be used in a general data integration application, where data is being transferred between legacy applications. The error event schema is shown in Figure 1.

Figure 1: The Error Event Schema

The main table is the error event fact table. Its grain is every error thrown by a quality screen anywhere in the ETL or data migration system. Remember that the grain of a fact table is the physical description of why a fact table record exists. Thus, every quality screen error produces exactly one record in this table, and every record in the table corresponds to an observed error.

The dimensions of the error event fact table include the calendar date of the error, the batch job in which the error occurred and the screen which produced the error. The calendar date is not a minute and second timestamp of the error; rather, it provides a way to constrain and summarize error events by the usual attributes of the calendar, such as weekday or last day of a fiscal period. The time-of-day fact is a full relational date/timestamp that specifies precisely when the error occurred. This format is useful for calculating the time interval between error events because one can take the difference between two date/time stamps to get the number of seconds separating events.

The batch dimension can be generalized to be a processing step in cases where data is streamed, rather than batched. The screen dimension identifies precisely what the screen criterion is and where the code for the screen resides. It also defines what to do when the screen throws an error (halt the process, send the record to a suspense file or merely tag the data).

The error event fact table also has a single column primary key, shown as the error event key. This, like the primary key of the dimension tables, is a surrogate key consisting of a simple integer assigned sequentially as records are added to the fact table. This key field is necessary in those situations where an enormous burst of error records is added to the error event fact table all at once. Hopefully, this won't happen to you.

The error event schema includes a second error event detail fact table at a lower grain. Each record in this table identifies an individual field in a specific data record that participated in an error. Thus, a complex structure or business rule error that triggers a single error event record in the higher-level error event fact table may generate many records in this error event detail fact table. The two tables are tied together by the error event key, which is a foreign key in this lower grain table. The error event detail table identifies the table, the record, the field and the precise error condition and likewise could inherit the date, screen and batch dimensions from the higher-grain error event fact table. Thus, a complete description of complex multifield, multirecord errors is preserved by these tables.

The error event detail table could also contain a precise date/timestamp to provide a full description of aggregate threshold error events where many records generate an error condition over a period of time.
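
The two fact tables and their shared keys can be sketched as plain Python dataclasses; the field names below are assumptions chosen to mirror the description of Figure 1 rather than an exact published schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ErrorEventFact:
    """Grain: one row per error thrown by any quality screen."""
    error_event_key: int      # surrogate primary key, assigned sequentially
    calendar_date_key: int    # FK to the calendar date dimension
    batch_key: int            # FK to the batch (or processing step) dimension
    screen_key: int           # FK to the screen dimension
    occurred_at: datetime     # full date/timestamp fact (time of day)


@dataclass
class ErrorEventDetailFact:
    """Grain: one row per individual field participating in an error."""
    error_event_key: int      # FK to the parent error event record
    table_name: str
    record_identifier: str
    field_name: str
    error_condition: str
    occurred_at: datetime     # optional finer timestamp for threshold errors
```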

We now appreciate that each quality screen has the responsibility for populating these tables at the time of an error.

Responding to Quality Events

I have already remarked that each quality screen has to decide what happens when an error is thrown. The choices are: 1) halting the process, 2) sending the offending record(s) to a suspense file for later processing, and 3) merely tagging the data and passing it through to the next step in the pipeline. The third choice is by far the best, whenever possible. Halting the process is obviously a pain because it requires manual intervention to diagnose the problem, restart or resume the job, or abort completely. Sending records to a suspense file is often a poor solution because it is not clear when or if these records will be fixed and reintroduced to the pipeline. Until the records are restored to the data flow, the overall integrity of the database is questionable because records are missing. I recommend not using the suspense file for minor data transgressions. The third option of tagging the data with the error condition often works well. Bad fact table data can be tagged with the audit dimension described in the next section. Bad dimension data can also be tagged using the audit dimension or, in the case of missing or garbage data, can be tagged with unique error values in the field itself.
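
Continuing the screen sketch from earlier, the configured response might be applied like this; the suspense list, the record shape and the tag value are illustrative assumptions, and a real ETL tool would supply its own halt, suspense and tagging mechanisms.

```python
def apply_action(action: str, record: dict,
                 suspense_file: list, audit_tags: dict):
    """Apply a screen's configured response ('halt', 'suspend' or 'tag')
    to a record that failed the screen's test."""
    if action == "halt":
        raise RuntimeError("Quality screen failure: halting the load")
    if action == "suspend":
        suspense_file.append(record)   # park the record for later repair
        return None                    # the record leaves the main data flow
    if action == "tag":
        # Tag and pass through: point the record at an "out-of-bounds" audit
        # row (or write a unique error value into the offending field itself).
        audit_tags[record["id"]] = "out-of-bounds"
        return record
    return record                      # unknown action: pass through untouched
```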

The Audit Dimension

The audit dimension is a normal dimension that is assembled in the back room by the ETL process for each fact table. A sample audit dimension attached to a shipments invoice fact table is shown in Figure 2.

Figure 2: A Sample Audit Dimension Attached to a Shipments Invoice Fact Table

In Figure 2, the shipments fact table contains a typically long list of dimensional foreign keys each labeled with FK, three degenerate dimensions labeled with DD and six additive numeric facts. This style of fact table with its keys and other fields has been described many times in Kimball Group articles and books.

The audit dimension in Figure 2 contains typical metadata context recorded at the moment when a specific fact table record is created. One might say that we have elevated metadata to real data! The designer of the data quality system can include as much or as little metadata as is convenient to record at the time of an error. To visualize how audit dimension records are created, imagine that this shipments fact table is updated once per day from a batch file. Suppose that today we have a perfect run with no errors flagged. In this case, we would generate only one audit dimension record and it would be attached to every fact record loaded today. All of the error conditions and version numbers would be the same for every record in this morning's load. Hence, only one dimension record would be generated. The foreign key in the audit dimension pointing to the error event group table would have a single value corresponding to an error event group showing "normal" fact quality ratings for each of the six fact fields.

Now let us relax the strong assumption of a perfect run. If we had some fact records whose discount dollars triggered an out-of-bounds error, then one more audit dimension record would be needed to handle this condition. The error conditions and version numbers would have appropriate values, and the foreign key to the error event group table would point to the specific records in the error event fact table pertaining to this record. More details on the building of audit dimensions and the error event group table can be found in my book The Data Warehouse ETL Toolkit (Wiley, 2004).
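
A sketch of how a daily batch load might create and attach audit dimension rows follows; the attribute names, the version number and the error event group keys are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditDimensionRow:
    """Metadata context captured at the moment fact rows are created."""
    audit_key: int
    etl_version: str
    error_event_group_key: int   # FK to the error event group table
    quality_indicator: str       # e.g. "normal" or "out-of-bounds"


def load_fact_batch(fact_rows, out_of_bounds_ids, next_audit_key=1):
    """Attach an audit key to every fact row in today's load.

    A perfect run yields a single "normal" audit row shared by every fact
    record; rows that tripped a screen share a second, flagged audit row.
    """
    normal = AuditDimensionRow(next_audit_key, "7.4", 0, "normal")
    flagged = AuditDimensionRow(next_audit_key + 1, "7.4", 1, "out-of-bounds")
    audit_rows = {normal.audit_key: normal}
    for row in fact_rows:
        if row["id"] in out_of_bounds_ids:
            audit_rows[flagged.audit_key] = flagged
            row["audit_key"] = flagged.audit_key
        else:
            row["audit_key"] = normal.audit_key
    return audit_rows, fact_rows
```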

The power of the audit dimension becomes most apparent when the dimension attributes are used in an end-user report as shown in Figure 3.

Figure 3: Normal and Instrumented Reports Using the Audit Dimension

The top report is a normal report showing sales of the Axon product in two geographic regions. The lower report is the same report with the out-of-bounds indicator added to the set of row headers with a simple user interface command. This produces an instant data quality assessment of the original report, and shows that the Axon West sales are suspect.

Six Sigma Data Quality

The data warehouse community can borrow some useful experience from the manufacturing community by adopting parts of their quality culture. In the manufacturing world, the Six Sigma level of quality is achieved when the number of defects falls below 3.4 defects per million opportunities. The error event fact table is the perfect foundation for making the same Six Sigma measurement for data quality. The defects are recorded in the error event schema, and the opportunities are recorded in the organization's workflow monitor tool, which records the total number of records processed in each job stream.
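
The measurement itself is simple arithmetic: defects per million opportunities (DPMO) is the error count divided by the record count, scaled to one million. The daily totals below are made up for illustration.

```python
def defects_per_million(defects: int, opportunities: int) -> float:
    """DPMO = (defects / opportunities) * 1,000,000."""
    return defects / opportunities * 1_000_000


# Hypothetical daily totals: errors logged in the error event fact table
# versus total records processed according to the workflow monitor.
defects = 17
opportunities = 4_250_000

dpmo = defects_per_million(defects, opportunities)
print(f"{dpmo:.1f} defects per million opportunities")          # 4.0
print("Six Sigma achieved" if dpmo <= 3.4 else "Not yet at Six Sigma")
```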

The data quality architecture described in this article can be incrementally added to an existing data warehouse or data integration environment with very little disruption. Once the error event schema is established, the library of quality screens can grow indefinitely from a modest start. The screens need only obey the two simple requirements: logging each error into the error event schema and determining the system response to the error condition. Error screens can be implemented in multiple technologies throughout the ETL pipeline, including standalone batch jobs as well as data flow modules embedded in professional ETL tools.

Of course, the error event schema provides a quantitative basis for managing data quality initiatives over time because it is a time series by definition. The dimensionality of the error event data allows studying the evolution of data quality by source, by software module, by key performance indicator and by type of error.

The industry has talked about data quality endlessly, but there have been few unifying architectural principles. This article describes an easily implemented, nondisruptive, scalable, and comprehensive foundation for capturing data quality events as well as measuring and ultimately controlling data quality in the data warehouse.
