With the increasing popularity of data warehouses and data marts, the ability to refresh data in a timely fashion is more important than ever. Strategies of the past included periodically rebuilding the data warehouse from scratch to ensure that information used for reporting was current. As the data warehouse/data mart grows in complexity and the demand for up-to-the-minute data increases, maintaining the data warehouse in this fashion becomes intractable. One solution to this problem is to monitor the operational data sources, detect changes of interest and populate the data warehouse with the deltas. This article focuses on the approaches for capturing changed data that are available today.

The majority of the data in operational systems is transient, a manifestation of the needs of the business. At any one point in time, the operational system contains only the latest version of each record in the database, which is the view of interest to the operational side of the business. In some cases, an operational system may implement a form of semi-periodic history (e.g., the last five changes are preserved for each record). The problem with this technique is that it cannot guarantee that all changes will be preserved. For example, if a record changes seven times, two of those changes will be lost.

For operational systems where history is critical to the business operation (e.g., in the insurance industry or for database marketing programs), all changes to records are preserved for a set period of time. Since there is no guarantee that the operational system(s) will capture changes on our behalf, we must employ a technique to capture the transient and semi-periodic data. Once changes of interest are identified, they must be stored in a persistent area to ensure that these records are available for updating the data warehouse. The information stored should be rich enough to allow the transformation application to make update decisions accurately; typically, this means an indication of the type of event along with the full record before and/or after the change.

Our main challenge is to identify changes within the operational system(s) that are of interest to the data warehouse at the appropriate granularity. Once these changes have been identified, the problem becomes one of determining the actions to take based on them. In a data warehouse environment, different actions may be taken based on the type of change identified (insert, update or delete). For example, if we have created a retail data warehouse and over the course of time one of our products is retired, we would capture that change (a delete) and update the information in our warehouse accordingly. In this particular case, we would most likely hold the product information in a dimension table, and the subsequent action would be to update the product dimension record with a termination date.
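As a minimal sketch of this kind of dimension maintenance, the following Python fragment (using the standard sqlite3 module; the product_dim table and its columns are illustrative assumptions, not drawn from any particular system) applies a captured delete by stamping a termination date rather than removing the row:

```python
import sqlite3
from datetime import date

def apply_product_delete(warehouse: sqlite3.Connection, product_key: str) -> None:
    """Soft-close a retired product in the dimension table.

    Instead of deleting the dimension row, record the date on which the
    product was retired so that historical facts still join to it.
    """
    warehouse.execute(
        "UPDATE product_dim SET termination_date = ? "
        "WHERE product_key = ? AND termination_date IS NULL",
        (date.today().isoformat(), product_key),
    )
    warehouse.commit()
```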

Summary of Techniques

There are a number of techniques available for incrementally updating a data warehouse, a data mart or another operational system. The following sections discuss the most popular techniques, recognizing that variations of these approaches exist in practice. Changed data capture techniques fall into two broad categories: static and incremental. In all of these approaches, it is assumed that a detected change will yield a complete changed record to the requesting target. Based on the requirements of the solution and the implementation of the incremental capture technique, the before and after images of the changed record may be provided.

Static data capture is usually associated with taking a snapshot of the data at a particular point in time. In some cases, the complete data set may be retrieved, but more likely only a subset will be used. Static capture, timestamp capture and file comparison capture are the static capture methods discussed in this article. Except in the case of file comparison capture, the static capture approaches are relatively simple.

Incremental data capture is a time-dependent model for capturing changes to operational systems. This technique is best applied in circumstances where the volume of changes is significantly smaller than the size of the data set for a specific period of time (i.e., the time between captures). These techniques are more complex than static capture because they are closely tied to the DBMS or to the operational software that updates the DBMS. Three techniques in this category are application-assisted capture, trigger-based capture and transaction log capture. In circumstances where DBMSs are used to store operational data, transaction log capture is the most powerful and efficient approach to incremental capture.

Static Capture Methods

Static Capture: This technique is the simplest of the data capture techniques. The fundamental principle here is to periodically take a snapshot of the operational system and then load the information into the data warehouse. The load operation can occur in two modes, either complete reload or append. Complete reload assumes that the tables being populated in the data warehouse are cleared by dropping and recreating each table or by deleting all the data in the tables prior to reload. This technique does not provide a mechanism for capturing historical data unless the operational system maintains periodic data. Append mode assumes that data exists in the tables to be loaded and places information into those tables based on predefined rules. For example, if a record exists, overwrite the entire record.
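The two load modes might look like the following sketch in Python with the standard sqlite3 module; the product table, its columns and the overwrite-on-key-match rule are assumptions for illustration:

```python
import sqlite3

def complete_reload(warehouse: sqlite3.Connection, snapshot_rows) -> None:
    """Complete-reload mode: clear the target table, then load the snapshot."""
    warehouse.execute("DELETE FROM product")  # or drop and recreate the table
    warehouse.executemany(
        "INSERT INTO product (product_key, name, price) VALUES (?, ?, ?)",
        snapshot_rows,
    )
    warehouse.commit()

def append_load(warehouse: sqlite3.Connection, snapshot_rows) -> None:
    """Append mode: apply a predefined rule -- here, overwrite the entire
    record if one already exists with the same key."""
    warehouse.executemany(
        "INSERT OR REPLACE INTO product (product_key, name, price) "
        "VALUES (?, ?, ?)",
        snapshot_rows,
    )
    warehouse.commit()
```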

Timestamp (Database Scan) Capture: This method of changed data capture is similar, in concept, to the static data capture method. The important distinction in the timestamp approach is that all records contain information about the time at which they were last updated. These temporal tags, known as timestamps, provide the selection criteria for capturing changed records. For example, the application responsible for retrieving changed records will know that, for the tables of interest, all records with a timestamp later than the last time the program ran and up to the current time must be captured. This application must maintain a persistent record of the point at which it "left off" so that there are no gaps or overlaps in the incremental capture process.
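A minimal sketch of such a capture application follows, again in Python with sqlite3; the customer table, its last_updated column and the single-row capture_checkpoint table are illustrative assumptions:

```python
import sqlite3

def capture_changes_since_last_run(source: sqlite3.Connection):
    """Select records stamped after the point where the last run left off.

    The capture_checkpoint table (assumed to be seeded with one row)
    persists the high-water mark so successive runs neither overlap
    nor leave gaps.
    """
    last_capture = source.execute(
        "SELECT last_capture FROM capture_checkpoint"
    ).fetchone()[0]

    # Pin 'now' once, so rows updated while this run executes fall
    # into the next capture window rather than being missed.
    now = source.execute("SELECT datetime('now')").fetchone()[0]

    changed = source.execute(
        "SELECT * FROM customer WHERE last_updated > ? AND last_updated <= ?",
        (last_capture, now),
    ).fetchall()

    source.execute("UPDATE capture_checkpoint SET last_capture = ?", (now,))
    source.commit()
    return changed
```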

One advantage of this approach is that it is independent of database type. A second distinct advantage over static capture is that the total volume of data moved is smaller, since only changed records are selected. If the business does not require the capture of every change in state for records of interest, this technique works well.

The timestamp method suffers from the same problem as the static capture technique: it is difficult to capture intermediate states of the data unless the operational system has been architected to do so. For example, if a record changed state five times since its last capture and the operational system did not maintain periodic history for the record, then the next time the capture application ran, it would retrieve only the current state of the record in question.

An additional problem with this technique centers on the handling of deletes. Typically, in an operational system, records that are no longer valid or needed are deleted. In order for the timestamp capture application to detect deletions, the record must be marked inactive until it has been captured; only then can it be removed from the operational system.
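A hedged sketch of this mark-then-purge pattern, reusing the illustrative customer schema from above with an assumed active flag:

```python
import sqlite3

def logical_delete(source: sqlite3.Connection, customer_key: str) -> None:
    """Mark the record inactive (and re-stamp it) instead of deleting it,
    so a timestamp-based capture run can still detect the deletion."""
    source.execute(
        "UPDATE customer SET active = 0, last_updated = datetime('now') "
        "WHERE customer_key = ?",
        (customer_key,),
    )
    source.commit()

def purge_captured_deletes(source: sqlite3.Connection, last_capture: str) -> None:
    """Physically remove inactive records only after they have been captured."""
    source.execute(
        "DELETE FROM customer WHERE active = 0 AND last_updated <= ?",
        (last_capture,),
    )
    source.commit()
```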

File Comparison Capture: Also known as the snapshot differential method, this technique works by keeping before and after images of the files that are of concern to the data warehouse. Records are compared to find changes, and record keys are compared to find inserts and deletes. The technique is most appropriate for legacy systems, where triggers typically do not exist and transaction logs are either non-existent or in a proprietary format. Since most legacy databases have some mechanism for dumping data into files, this technique creates periodic snapshots and then compares the results to produce change records. All of the problems of static capture are present here, and added complexity is introduced by the challenge of comparing entire rows of information and by key identification and matching. The technique is complex and typically not desirable but, in some cases, may be the only solution.
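A minimal sketch of the snapshot differential algorithm, assuming the legacy system can dump each snapshot to a delimited file with a key field (here named "id" for illustration):

```python
import csv

def snapshot_diff(before_path: str, after_path: str, key_field: str = "id"):
    """Compare before/after snapshot files and emit change records.

    Keys present only in the new snapshot are inserts, keys present only
    in the old snapshot are deletes, and rows whose contents differ
    between the two snapshots are updates.
    """
    def load(path):
        with open(path, newline="") as f:
            return {row[key_field]: row for row in csv.DictReader(f)}

    before, after = load(before_path), load(after_path)
    inserts = [after[k] for k in after.keys() - before.keys()]
    deletes = [before[k] for k in before.keys() - after.keys()]
    updates = [after[k] for k in before.keys() & after.keys()
               if before[k] != after[k]]
    return inserts, updates, deletes
```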

Incremental Data Capture Methods

Application-Assisted Capture: This mechanism for data capture does not require a separate application/process to perform the change identification and collection. Instead, this technique requires the change detection logic to be implemented as part of the operational application. The underlying principle here is that when changes are written to the operational database, changes are also written to a persistent area for later retrieval. The latency in this technique is obviously minimal, meaning that the changed records are available for immediate use. Although this technique can be powerful when designed and developed correctly, it suffers from a few problems. Since the application is responsible for making changes to the database, it more than likely maintains only key information for records that require updates. Therefore, in order to write complete records to the change queue, the application would: 1) have to go to the database and retrieve the entire record; 2) apply the change; and 3) write it out to the change queue.
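A sketch of those three steps in Python with sqlite3; the account table, the change_queue table and the choice to store before/after images as JSON are illustrative assumptions:

```python
import json
import sqlite3

def update_with_capture(db: sqlite3.Connection,
                        account_key: str, new_balance: float) -> None:
    """Application-assisted capture: the operational write also records
    the complete changed record in a change queue for later retrieval."""
    # 1) Retrieve the entire record, not just the key being updated.
    row = db.execute(
        "SELECT account_key, balance, branch FROM account WHERE account_key = ?",
        (account_key,),
    ).fetchone()
    before = {"account_key": row[0], "balance": row[1], "branch": row[2]}

    # 2) Apply the change to the operational record.
    db.execute("UPDATE account SET balance = ? WHERE account_key = ?",
               (new_balance, account_key))
    after = dict(before, balance=new_balance)

    # 3) Write the before/after images to the change queue, in the same
    #    transaction as the operational update.
    db.execute(
        "INSERT INTO change_queue (op, before_image, after_image) "
        "VALUES (?, ?, ?)",
        ("UPDATE", json.dumps(before), json.dumps(after)),
    )
    db.commit()
```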

This can hurt performance and increase overall complexity. In addition, this technique requires that the application perform all of the computation necessary to accurately update specific records. For example, if the database is expected to add the current date to a record, then the application either has to perform that same function or go back and read the record after it has been written. One of the biggest drawbacks to the application-assisted capture approach is that adding this level of functionality to legacy systems can be difficult, especially since the originators of the code are often no longer within the organization.

Trigger-Based Capture: Triggers are stored procedures that are invoked when certain conditions or events occur. The granularity and flexibility of the conditions and events that cause a trigger to fire are database-specific. This technique assumes that the DBMS supports triggers; therefore, file-based storage cannot use trigger-based capture.

This approach is very similar to the application-assisted technique in that triggers, once activated, can perform specialized tasks. In this case, the invocation of the trigger can be used to save changed records of interest to a persistent storage area for retrieval. One drawback to this approach is that it should only be used in cases where the number of events captured is within the system's performance limits. For example, if all updates to a table are captured, this method creates twice the workload for the database. An additional problem is that it requires the granularity of the information in the target database to match that of the source database. For example, if we are interested in capturing bank transactions in the target database, and the source database only stores account summary information, then we may not be able to derive the transaction (event) information. One benefit to this approach is that data capture occurs at the source of the data and, therefore, both the before and after images are available.
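A minimal, runnable illustration of trigger-based capture using SQLite (chosen only because it ships with Python; the table and trigger names are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (account_key TEXT PRIMARY KEY, balance REAL);
CREATE TABLE change_log (op TEXT, account_key TEXT,
                         old_balance REAL, new_balance REAL,
                         changed_at TEXT DEFAULT (datetime('now')));

-- Fires on every update and saves the before/after image of the row
-- to a persistent storage area for later retrieval.
CREATE TRIGGER account_update_capture AFTER UPDATE ON account
BEGIN
    INSERT INTO change_log (op, account_key, old_balance, new_balance)
    VALUES ('UPDATE', OLD.account_key, OLD.balance, NEW.balance);
END;
""")

conn.execute("INSERT INTO account VALUES ('A1', 100.0)")
conn.execute("UPDATE account SET balance = 250.0 WHERE account_key = 'A1'")
print(conn.execute("SELECT op, account_key, old_balance, new_balance "
                   "FROM change_log").fetchall())
# [('UPDATE', 'A1', 100.0, 250.0)]
```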

Transaction Log Capture: Transaction log capture takes a slightly different approach to capturing changed information. This technique leverages the logging and recovery capabilities of a DBMS; therefore, sequential and indexed files cannot take advantage of this method. Since the transaction logs are utilized by the DBMS as a place to store transaction information for logging and recovery, they are the ideal location from which to capture changed information without directly impeding the performance of the database. When reading the transaction logs, particular care must be taken to obtain only information that has been committed by the database.
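Real transaction log formats are proprietary and DBMS-specific, so the sketch below invents a simplified, already-decoded log stream purely to illustrate the committed-only rule: changes are buffered per transaction and released only when a COMMIT record is seen, so rolled-back work never reaches the data warehouse.

```python
from collections import defaultdict

def committed_changes(log_records):
    """Replay a decoded transaction log, yielding only committed changes.

    Each record is a dict such as {'txid': 7, 'kind': 'CHANGE', ...} or
    {'txid': 7, 'kind': 'COMMIT'} / {'txid': 7, 'kind': 'ROLLBACK'}.
    """
    pending = defaultdict(list)
    for rec in log_records:
        if rec["kind"] == "CHANGE":
            pending[rec["txid"]].append(rec)   # buffer until outcome known
        elif rec["kind"] == "COMMIT":
            yield from pending.pop(rec["txid"], [])
        elif rec["kind"] == "ROLLBACK":
            pending.pop(rec["txid"], None)     # discard uncommitted work

log = [
    {"txid": 1, "kind": "CHANGE", "table": "account", "after": {"balance": 250.0}},
    {"txid": 2, "kind": "CHANGE", "table": "account", "after": {"balance": -10.0}},
    {"txid": 2, "kind": "ROLLBACK"},
    {"txid": 1, "kind": "COMMIT"},
]
print(list(committed_changes(log)))  # only transaction 1's change is emitted
```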

This technique limits the amount of work that the DBMS must perform, but an application must be written to monitor the log files and capture the data of interest. This approach underlies some of the database replication products available today.

One of the drawbacks to this approach is that it depends on the transaction logs remaining available until the changes of interest have been captured. For example, if a DBA trims the transaction logs prior to the capture of the changed records, information may be lost. Even so, the log capture method is probably the most efficient approach to incremental capture, because log writing is already well optimized and available on most DBMS platforms today. In addition, log capture can run on a separate processor to reduce the impact on the performance of the operational system.

Identify Needs and Goals

While there are a number of techniques available for the incremental update of a data warehouse, there is no one silver bullet. Each of the techniques discussed has pros and cons, depending upon both the defined business objectives and the technical operating environment (see Figure 1). A business must carefully identify its needs and goals with respect to an incremental update strategy and select the one that is most appropriate.

FIGURE 1: Incremental Update Techniques

Technique            | Impact on Source Database | Impact on Operational Applications | Complexity of Implementation | Inherent Ability to Capture Historical Information | Operational System Performance Impact
---------------------|---------------------------|------------------------------------|------------------------------|----------------------------------------------------|---------------------------------------
Static               | None                      | None                               | Low                          | None (1)                                           | Low (2)
Timestamp            | Low                       | Low (3)                            | Low                          | None (4)                                           | Low
File Comparison      | None                      | None                               | High                         | None (4)                                           | Low (2)
Application-Assisted | None                      | High                               | High                         | High                                               | High
Trigger              | High                      | None                               | Med (5)                      | High                                               | High
Log Capture          | None (6)                  | None                               | Med                          | High                                               | Low

For example, transaction log capture allows the detection and processing of changes to occur without directly impacting the operational system, which makes it very attractive in situations where the operational system is already over-utilized. But if you are using a source database that does not produce a usable transaction log, an alternative solution must be selected.

As we proceed into the realm of terabyte data warehouses, the ability to rebuild the data warehouse from scratch on a nightly basis will go the way of the dinosaur. The logical and efficient approach to updating the data warehouse involves some form of incremental update strategy. The key to success is the ability to identify the changes of interest and to capture them in a timely manner that does not overburden the source operational systems, while still meeting the business needs of the organization.

FIGURE 1 NOTES
(1) Limited to the frequency of snapshots; between snapshots, no history is available (unless history is captured in the operational system).
(2) Provided that snapshots of the data are taken during off-peak hours.
(3) Low only if the operational application must insert the timestamp; none if another mechanism is used.
(4) History is not available unless the operational system provides a mechanism to capture it.
(5) Writing the trigger code can be difficult. Taking advantage of replication products is one way to ease this burden.
(6) Assumes that recovery/transaction logs already exist.
