Over the past several years, the importance of the data warehouse, and now the data mart, in making sound business decisions has become well-established. Throughout this process, customers and vendors alike have talked about the need for changed data propagation since it is not feasible to completely rebuild/refresh a large data warehouse on a regular basis; and, generally speaking, only a small fraction of the records in an operational database is actually changed during any given period.
Ideally, a changed data capture mechanism that automatically identifies where changes have occurred would eliminate the need to read and move data that has not changed. There are several strategies that can be used for capturing changed data. Most database management systems provide support for at least one of these strategies. However, a number of complexities arise when trying to use these capabilities to create efficient mechanisms for updating data warehouses.
The following discussion is divided into two primary parts. Each addresses a component of the hidden complexities in changed data capture. The first section identifies and discusses some of the issues surrounding the various changed data capture techniques. The second identifies and discusses some of the more prevalent issues surrounding the actual propagation of the changed data.
Changed Data Capture
If all records of the data sources included a date and time stamp, creating programs to update a warehouse would be relatively easy. However, this feature is rarely found in all of the source systems. Other options for capturing the changes in operational systems include the use of data replication, trigger processing, database log file extraction, application modifications and file dump comparisons. Following is a discussion of the difficulties with each.
Data Replication
The term data replication refers to reproducing the current state of the data in one system in another environment (e.g., the data warehouse). With this approach, there is usually little or no attempt to reorganize, merge or transform the data. Replication products typically support both event-driven and/or scheduled updates. Some offer transformation capabilities within the bounds of SQL-type logic. However, these capabilities are frequently insufficient for the creation of a data warehouse.
As an example, consider a data mart that represents units of products sold by a sales force over time and territory. In the source file, the sales representative name is kept in an alphanumeric field. Over time, many data entry clerks have updated this file; and, as a result, inconsistencies have developed in how the names of the sales rep data are represented, and the data is not clean in this source. A data cleansing or data quality capability would be required to determine if sales representative John Smyth is the same as J Smith or J Smyth or even John Smyth Jr. This type of data cleansing can be referred to as householding and is frequently used whenever name and address information is concerned. While the replication product could determine when a sales representative had sold additional products and ensure that the information is represented in the data mart, it could not handle the complex logic required to perform the householding task. Without the data cleansing, the data mart would contain inaccurate data that could dramatically skew results.
Trigger Processing
Most relational database management systems provide exits that are triggered by various kinds of changes to a table. Service routines can be attached to these exits to create tables of changes to the source table. The DBA can also modify the table definitions to add a date and time stamp field and event triggers that automatically update the date and time stamp fields whenever a row is added, changed or deleted. While these techniques do work, like any user exits, they add to the processing overhead. Even more importantly, without a corporate meta data and/or warehouse methodology, different groups could create multiple, incompatible trigger routines for the same value.
Database Log File Extraction
Most database logging facilities automatically capture the before and after image of any change to the database in order to support rollback for transaction processing. Typically, the database system creating the log can be configured to identify which transactions or types of changes are to be logged and what data is recorded. In addition, logs are likely already in use for backup/recovery purposes, so no new resources are required to create or maintain this option. Logs, however, are also not a perfect solution.
Some database vendors change the format of their logs with every release of their product. Also, if more than one value is used to create the target value for the warehouse, what's captured in the log may not be sufficient to create the value needed to update the warehouse/mart. If the keypath for the updated record is captured in the log and the database is "quiet" between the time the update is recorded in the log and the time the update is carried out, it is possible to use the log data to retrieve the other data values used to compute the target value. Otherwise, it would be necessary to write a custom user exit for the field in question. Finally, log tapes can be voluminous and may include aborted transactions, requiring a sophisticated program to sort out the different event types to ensure that only the proper records are captured for inclusion in the data warehouse.
Application Modification
Another option, which can be necessary in some cases, is to actually change an application to capture the data necessary for changed data capture. For example, in a case where several values from a source database are used to create a single target value for a warehouse, the native logging capability is frequently not sufficient. In addition, legacy data access systems such as sequential file systems like VSAM may not provide logging facilities. In these cases, the application may have to be modified to capture the changed data. Application modification requires programmers to modify the legacy application to generate a record that contains all the values used to compute the target value every time an event (insert, change, delete) occurs. The changed records are stored either in the log, a file or a table depending on the database management system in use. This approach works, but it incurs a performance overhead, is labor-intensive and costly to maintain. Moreover, inconsistencies can develop if different people are responsible for modifying each application.
File Dump Comparison
In the event that replication, log files, triggers or application modification are not available, one other alternative-file dump comparison-remains to identify records that have changed. When the data mart or warehouse subject area is initially loaded, a full source file dump is usually produced. The extraction file is either created programmatically or via a dump utility and is typically stored as a temporary file. Once the warehouse is successfully loaded, the temporary file is erased. In the case of file dump comparison, however, the file would be retained. During the warehouse update run, another full file dump of the source file would be created. A step is added to the update job process in which the new file is compared with the file from the previous run and a change file is created to identify any changes that have occurred.
On the surface, this seems like a simple solution to the problem. However, there are several drawbacks to this technique. First, this technique requires a two-way comparison to detect deleted records. A warehouse dump must be compared to the later file to determine which records in the warehouse no longer have corresponding records in the source system.
A second drawback is the processing time required to read the entire source file, perhaps multiple times, for each periodic update. For companies with a tight batch window, this may not be practical. In addition, this technique shares a problem with the other change capture options which results from the need to use either fields from several sources or multiple source values from a single file to calculate a particular target field.
Scratching the Surface
As a result of the above complexities-and the fact that more than one of these approaches may be required to update any particular warehouse-a customer should be wary when a vendor simply claims, ·We support changed data capture,º since capturing the incremental data that has changed only scratches the surface of the problem. In reality, capturing changes pales in comparison to synchronizing all of the data required and then propagating those changes to a warehouse table.
The next section of this article investigates some of the challenges that exist in actually propagating the changes that have been captured to keep the data warehouse updated.
Changed Data Propagation
When changes are propagated to the warehouse, one or more data warehouse tables can be affected. In order to simplify the discussion that follows, we will assume that only a single table is affected by each set of captured changes.
A data warehouse table derives its values from one or more sources. The data can be derived from a single source, multiple sources of the same database type or multiple sources of different database types. The number of sources does not stand alone in creating a changed data capture and propagation process. Each sourcing option shares the same propagation issues of how to handle deletes, inserts and updates. The complexity of these issues builds successively as the data warehouse values are derived from a single source through the multiple source options. And, regardless of the number of sources that provide data for a warehouse table, several application issues involving database recovery and transaction processing also affect changed data propagation. The success of the changed data propagation process depends on understanding these issues.
Single Source
The simplest changed data propagation case is a single source (e.g., a single segment within a hierarchical or network database such as IMS or IDMS, a single table within an RDBMS database or a single record type within a flat or VSAM file) for each target warehouse table. If the change record (whether it was created by date/time stamp, log, trigger, event, application modification or file dump) contains all of the fields necessary to create the target warehouse table, then change processing can proceed. However, if the change record does not log all of the fields required to populate the target table, then it is important that the automated procedure initiate a separate process to match the change record file with the original source to capture the missing information.
Deletes
One issue to consider when applying changes, even in this simple scenario, is how to handle deleted records. If a record is deleted from the legacy system, should the corresponding warehouse row be deleted as well? Not if management might want to look at something such as ·customers lost.º In this case, a new record might actually be added to the warehouse or an existing record might be updated to reflect the effective date of the ·customer.º Another variation of the deleted record is a source record that is changed and then deleted prior to the data warehouse update. Should the record be ignored, or is it important to reflect all changes that occur to the data?
Multiple Sources-Same Database Type
Unfortunately, a single source is not the predominant case when building a data warehouse. This generates another level of complexity to our change propagation scenario.
Multiple sources can occur in two ways. The data values can come from multiple locations within a single operational source. Examples include a flat file with multiple record types or a non-relational database where several segments supply data. Data values can also come from two or more physical sources of the same database file structure. Examples include multiple relational tables, flat files or databases. In each case, the two issues to consider are inserts of new records and changes to existing records.
Inserts
If a record is added to one of the sources, it is possible that no inserts to the remaining sources occur at the same time; consequently, there are no corresponding values from the other source records. As an example, consider an application that tracks customer and shipment information. A shipment request has been made, but no customer information, other than the key, has been entered into the database yet.
The question is how quickly the warehouse needs to be updated. One option would be to skip the creation of a new row and to hold the partial record for later processing when all data is present. This raises its own issues of how to determine that all of the data is actually present. It might be possible to time the reading of the logs to coincide with a "quiet time" for each source table or database so that there is a greater likelihood of data consistency within each system. In our earlier example, the people who manage the source databases know when updates to customers are added and should insure that the data warehouse update is not run ahead of this activity.
A second option would be to add the row in the warehouse table with nulls or defaults for all of the missing data. When the customer data is actually completed, will it be necessary to run a process that updates the target row to reflect this? If the answer is yes, then a process must be implemented to identify when the source data is complete. This information can then be used to update the existing row in the warehouse.
Updates
A similar set of dilemmas arises when source records are modified. If all sources have been updated, then a log record should exist for each. It is important to gather the log records at an appropriate time, as indicated for inserts, to ensure that the most current data is available before beginning to process.
If all related sources are not updated, then a decision must be made about what attempt will be made to locate the missing data. If an assumption can be made that since the values were not recorded as ·changed,º they would be the same as those that are already in the warehouse. The propagation process would then be to read the current data warehouse table to capture these values and use them to complete the new row. However, if all related sources are not updated due to application issues, it may not be practical to assume that current warehouse data is correct. In this case, one option would be to add the new row to the warehouse table using missing values or defaults as indicated with inserts. A second option would be to write or generate a program that uses the keys of the available source data to attempt to find corresponding source data records so that the data for the warehouse row can be completed.
Another related issue to consider is aggregate data. Certain types of aggregation, such as averages and some types of ranks, are not possible to propagate under any circumstances. In order to accommodate this type of data in the warehouse, it is essential that the warehouse schema be designed so that these values can be calculated during the load process.
Multiple Sources-Different Database Types
The most complex case of multiple sources occurs when the data comes from different database types (e.g., IMS and DB2, flat and relational). A subject area within a data warehouse often utilizes data from more than one application. This means that it is highly likely that data for a single warehouse table will be derived from heterogeneous data sources.
In addition to evaluating how to handle deletes, inserts and updates, working with multiple sources of different database types raises two other issues: change capture technique and synchronization.
Change Capture Technique
The technique used to capture changes for each database type may vary. A database or file system may have no log files at all. VSAM or flat files fall into this category. Other database systems may format their logs differently or allow the user to select a formatting option. Some logs capture only concatenated keys, while others capture the entire record that has changed, with a before-image as well. These different log styles require that planning take place to find some commonality. A set of changed data capture steps must be designed and prepared to ensure that all of the appropriate available data is gathered from the disparate logs, missing data is located and retrieved wherever possible, and all of the data is then merged together so that the target table can be populated.
Synchronization
A bigger issue with multiple heterogeneous sources is synchronization. As each operational system is updated during normal business, the database management system creates a "log" in the background. If data is needed from multiple logs, the timing of this job is important to ensure that the log for database "A" contains records for the same time period as the log for database "B." An example will clarify this.
Perhaps an insurance company wishes to track policies issued by state and county within state. A master file exists with each customer's ID and the policy held. However, the address information for each customer comes from a billing system using a different type of source database. A customer calls an agent to request a new homeowner's policy and to report a change in address which reflects a move to a different county. The change in policy is added immediately to the master file so that coverage can begin. However, the address change is held in the office until week's end so that all address changes can be entered together. The data warehouse update is run prior to the week's end. The logs reflect new policy information but no changed address information. A process is then required to locate the missing data from the original source database and to merge it with the captured data from the log. How can you assure that when you go to get missing data that the state of the acquired data is correct for the time that the capture is made? In our case, the billing application would not yet reflect the address change. Therefore, the data is out of sync. As a result, the data warehouse would contain the wrong totals for each county. Tracking when or how updates are made to operational systems that use different types of database management systems and then coordinating this with the timing of warehouse update runs to ensure the data is synchronized becomes a manual process requiring additional planning.
Application Issues
Each of the above scenarios leads to the conclusion that automating changed data capture and propagation for the update of a data warehouse may not be as straightforward as it seems. In addition to the complexity introduced by various combinations of source database types and the type of change that has occurred, there are four application issues that require resolution: database recovery and restart, consolidating multiple transactions, missing data and incomplete unit of work.
Database Recovery and Restart
Whenever a database management system is used for an application, it is important to plan for database recovery and application restart. If a database is recovered and the application starts over, then the log may contain duplicate entries for changed records. It will be necessary to plan how to handle these duplicate transactions. It is possible that the duplicates do not cause problems. In this case, processing can proceed without any change.
If duplicate transactions do cause a problem, they must be addressed. Duplicates can be troublesome whenever counts, sums or aggregations are involved in determining target data. One method might be to skip records on the log by moving forward in the log to a specific time. If the data warehouse has multiple sources, this will add to the complexity of recovery.
Consolidating Transactions
Any changed data capture method that records transactions against a data source can result in capturing multiple transactions for the same source records. To carry our insurance example forward, consider a policyholder who updates an auto insurance policy by adding a new driver. At the same time, the policyholder might add/change/delete a coverage and modify a deductible as well. Each of these changes can appear as a separate transaction in the log file. Under most circumstances, it will be desirable to consolidate these transactions into a single transaction that represents the net, accumulated result of the entire set. In other circumstances, it may be desirable to have the data warehouse reflect each individual change. The automated process needs to be able to reconcile records through time sequencing in order to provide the net result.
It is also possible for inconsistencies to appear in the log when multiple transactions exist for a given record. Examples could include a delete before an update or an update before an insert or an attempt to insert a duplicate record. A database management system, such as DB2, would reject these transactions as they occur and issue a rollback. However, the transactions would still be entered in the log. When consolidating transactions, it is important for the automated process to have the intelligence required to skip over invalid transactions.
Missing Data
It is possible that a target warehouse table requires data that is not contained in any of the sources. This can occur when table lookups perform data replacement and data expansion. It can also occur when default values or business rules are applied to determine the actual value of a target warehouse table field. It is important to ensure that the automated change process can accommodate these missing data requirements to prevent the need for manual programming. One way to solve this type of problem is to design the entire warehouse schema and processes so that transformations and calculations take place whenever possible on the target side before the data is loaded. An extension of this is to ensure that the automated propagation process is capable of handling target-side data manipulation.
Unit of Work
The last application issue to be discussed involves an incomplete unit of work. A unit of work is one or more related transactions that are logically processed together as a whole. If any single component of the unit fails, all components fail. The unit does not make permanent changes to the database until all components are complete. The most common example of unit of work processing is an ATM transaction. A bank customer uses an ATM to transfer money from a savings account into a checking account. After processing the withdrawal from the savings account, but before the money has been added to the checking account, the computing machine experiences an outage. When the machine comes back up, the savings account should reflect the same balance it had prior to the start of the original transaction. Let's look at how a unit of work can affect log processing.
Typically, when a log file runs out of space, it is archived and a new log file is started. It is possible for a log to become full before a unit of work has been completed. In this case, part of a transaction set would be recorded on one log and the remainder on another. When the automated capture process runs, it is optimal to read only archived logs so the performance of the production system is not affected. Because of the possibility of an incomplete unit of work being reflected in the log file, it is important to verify that the automated process captures these incomplete sets and holds them in some type of recycle file for future processing rather than applying them. If the incomplete transaction sets are not held, then the warehouse will reflect incomplete and/or inaccurate information.
Conclusion
Maintaining a data warehouse can be more complex than creating and loading one. It is essential to the value of the warehouse that an automated procedure be established to capture changes to the source data to optimize processing efficiency. However, different source environments and their relationship to the data values contained in the warehouse increase the complexity of capturing changed data. Designing an approach that provides both currency and accuracy of updates can be extremely challenging and needs careful analysis.
Meryl Surgan of Evoke Software, formerly of ETI, has over 20 years of data processing experience. For the last four years, she has concentrated on data warehousing and understanding the issues surrounding their implementation. Surgan has been a speaker on these topics at various DAMA, DAUG and other similar group meetings/conferences.
Gil Comeaux of ETI has 35 years of experience in OLTP and distributed processing.
Prior to joining ETI, George Fisher worked for 20 years as a consultant in the mainframe industry. He is currently working on the data systems library for SAP R/3.









