On the first day, a data file was created. On the second day, the data file needed to be shared. From this simple beginning, data sharing continues to be a major issue for many companies. Business professionals need corporate data in different subsets, groupings and time frames. To compound the problem, the same data is needed in a variety of different DBMSs, operating systems, hardware platforms and geographical locations.

Business intelligence systems such as data warehouses, data marts and operational data stores are becoming more important to business success. Keeping the information in these systems current requires moving large volumes of data from operational systems such as IMS, VSAM or DB2 to the business intelligence system on a regular (weekly or nightly) basis. Typically, the data movement must be performed within a brief time period (the batch window) without impacting the performance or availability of the operational application or the availability of the data warehouse. As the volume of data to be moved increases, so do the challenge and complexity of the data movement task. Fortunately, new data movement solutions are available that can:
- Move bulk data more quickly.
- Identify and move only the data that has changed since the last data warehouse update.
A new data propagation requirement is emerging in enterprises where data warehouses and/or operational data stores are evolving into mission-critical, real-time decision support systems. For these business intelligence systems, it is critical to refresh information within hours, minutes or even seconds of a change to the operational database. To address this requirement new, highly efficient legacy change data propagation solutions are available.
Here is a brief review of the most common solutions for data sharing via data propagation.
Bulk Extract: Unload utilities or copy management tools are used to extract all or a subset of the operational database. The extracted data is typically transported to the target using FTP or similar processes. The extracted data may be transformed to the target format on the host or target server. DBMS load products are used to refresh the target database. Since this approach does not distinguish changed versus unchanged records, it is most efficient for small source files or files with a high percentage of changes. It is least efficient for large files where only a few records have changed.
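As a concrete sketch of this unload-transport-load cycle, the toy Python script below uses SQLite databases as stand-ins for the operational source and the warehouse target, and a CSV file as the extract; all table and file names are illustrative, not from the article.

```python
import csv
import sqlite3

# Stand-in for the operational source database.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

# 1. Unload: extract the full table (or a subset) to a flat file.
with open("extract.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for row in source.execute("SELECT id, amount FROM orders"):
        writer.writerow(row)

# 2. Transport: in production the file would move via FTP or similar;
#    here source and target share a filesystem, so nothing to do.

# 3. Load: refresh the target by replacing its contents with the extract.
#    Note the full refresh reloads unchanged rows too, which is why this
#    approach is least efficient for large files with few changes.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
target.execute("DELETE FROM orders")
with open("extract.csv", newline="") as f:
    target.executemany("INSERT INTO orders VALUES (?, ?)", list(csv.reader(f)))
target.commit()
```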
File Compare: A variation of the bulk move approach compares the newly extracted operational data to the previous version and creates a set of incremental change records. The incremental changes are processed using the same techniques as bulk move except that the incremental changes are applied as updates to the target server in a batch or scheduled process. This approach is most efficient on small files where only a few records have changed.
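The compare step can be sketched as follows; the two snapshots are illustrative in-memory dictionaries keyed by primary key, not a real extract format.

```python
def diff_snapshots(old, new):
    """Diff two {key: row} snapshots into incremental change records."""
    changes = []
    for key, row in new.items():
        if key not in old:
            changes.append(("insert", key, row))
        elif old[key] != row:
            changes.append(("update", key, row))
    for key in old:
        if key not in new:
            changes.append(("delete", key, None))
    return changes

# Previous extract vs. newly extracted operational data (illustrative).
previous = {1: ("widget", 9.5), 2: ("gadget", 20.0)}
current = {1: ("widget", 9.5), 2: ("gadget", 21.0), 3: ("gizmo", 5.0)}

# Only the incremental changes are applied to the target, not the full file.
changes = diff_snapshots(previous, current)
print(changes)  # → [('update', 2, ('gadget', 21.0)), ('insert', 3, ('gizmo', 5.0))]
```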
Change Data Propagation: This technique captures and records changes to the file or database as part of the application change process. A variety of techniques (triggers, log exits, log post processing or DBMS extensions) can be used to capture changes. The captured changes result in a file of incremental changes. Immediately after the source transaction completes (that is, asynchronously from it), the change records may be transformed, moved and applied as transactions to the target DBMS. This type of data propagation is referred to as near real-time or continuous propagation. It is the only data propagation technique that can keep a target DBMS synchronized within seconds or minutes of the source system. A variation of this approach queues the captured changes and applies them later in a batch operation similar to the file compare processing described above. Capturing changes as they occur is usually more efficient than comparing two versions of a large file.
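A minimal sketch of the apply side, assuming an illustrative change-record format and an SQLite table standing in for the target DBMS: each captured change replays as an ordinary SQL insert, update or delete in a single target transaction.

```python
import sqlite3

# Stand-in for the target DBMS.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, qty INTEGER)")

# Captured incremental changes, in source-commit order (format is illustrative).
changes = [("insert", 1, 5), ("insert", 2, 3), ("update", 2, 4), ("delete", 1, None)]

# Replay the changes as SQL transactions against the target.
for op, key, qty in changes:
    if op == "insert":
        target.execute("INSERT INTO item VALUES (?, ?)", (key, qty))
    elif op == "update":
        target.execute("UPDATE item SET qty = ? WHERE id = ?", (qty, key))
    else:
        target.execute("DELETE FROM item WHERE id = ?", (key,))
target.commit()

print(target.execute("SELECT * FROM item").fetchall())  # → [(2, 4)]
```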
Bulk Data Propagation
In a simple "roll-your-own" bulk data movement scenario, a DBMS utility is used to unload the source database and create a file with the desired subset of records. An FTP utility is used to move the extracted file to the target system, and a DBMS load utility is used to load the target database. It is interesting to note that the combined time for the unload and transport tasks is usually less than the time required to load the target DBMS. Using this approach, throughput rates for unload, move and load are about 1GB per hour.
A performance boost can be obtained by eliminating the intermediate work files using data piping; extracted records are piped directly from process to process without creating intermediate files. The entire process can often be accomplished in the time needed to load the target database.
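The piping idea can be sketched with operating-system pipes: below, a Python "unload" process streams records directly into a "load" process, so no intermediate extract file is ever written. The Unix sort command stands in for the target load step; a Unix-like system is assumed.

```python
import subprocess
import sys

# "Unload": a child process that emits records to stdout (illustrative data).
unload = subprocess.Popen(
    [sys.executable, "-c", "print('3'); print('1'); print('2')"],
    stdout=subprocess.PIPE,
)

# "Load": reads directly from the unload process's stdout via a pipe;
# sort stands in for the target DBMS load utility.
load = subprocess.Popen(
    ["sort"], stdin=unload.stdout, stdout=subprocess.PIPE, text=True
)
unload.stdout.close()  # let the loader see EOF when the unloader finishes

out, _ = load.communicate()
print(out)  # records arrive at the loader without touching disk in between
```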
Additional improvement can be achieved if the bulk data propagation tool supports parallel unload and parallel load capabilities along with comprehensive data piping from the source unload to the target load. These techniques enable throughput rates up to 4GB per hour for the same data file described earlier.
Change Data Propagation
Performance of change data propagation must be looked at from two perspectives:
- The impact capturing changes has on the operational applications and environment. (This includes impact to the response time of production transactions, to the elapsed time of batch jobs and to critical system resources such as logging.)
- The throughput capabilities of the data propagation process to capture, transform, move and apply captured changes to a target database.
The impact that capturing changes has on the operational environment depends on the technique used to capture changes.
Trigger-based propagators invoke a procedure or trigger that writes the relevant update information to a separate staging table as transaction updates occur in the source database. Invoking the triggers, updating the staging tables and logging the staging table updates can cost more than the original transaction itself. Since this work is performed as part of the original transaction, end-user response time can be impacted. Triggers are not supported in legacy environments such as IMS and VSAM. Trigger-based solutions are generally used in low-volume applications capturing just a few transactions per second.
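SQLite, which supports triggers, can illustrate the mechanism: the trigger below writes a change record to a staging table inside the same transaction as the update, which is exactly where the extra cost lands. Table and column names are illustrative.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL);
CREATE TABLE staging (op TEXT, id INTEGER, old_balance REAL, new_balance REAL);

-- The trigger fires as part of each update transaction, so the staging
-- insert (and its logging) is charged to the original transaction.
CREATE TRIGGER capture_update AFTER UPDATE ON account
BEGIN
    INSERT INTO staging VALUES ('U', NEW.id, OLD.balance, NEW.balance);
END;
""")

db.execute("INSERT INTO account VALUES (1, 100.0)")
db.execute("UPDATE account SET balance = 75.0 WHERE id = 1")
db.commit()

# The staging table now holds the captured change for the propagator to read.
print(db.execute("SELECT * FROM staging").fetchall())  # → [('U', 1, 100.0, 75.0)]
```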
Log exit-based propagators capture changed records while they are en route to the database log and write relevant update information to a staging table or file. Log exit-based propagators require that the DBMS log full before and after images of updated records in the databases they are propagating. Log exit-based propagators also introduce additional logging for their staging table updates. This additional logging may be unacceptable in high-volume transaction environments. Log exit processing can also have some impact on transaction response time. Additionally, log exit solutions cannot be used in batch environments when logging is not turned on.
Log post processing involves reading the database logs after the fact to collect changes to the database. Log post-processing propagators require logging of full before and after images for the databases being propagated. While the increase in log volume needed for the full images might seem trivial, it can be unacceptable in some high-volume transaction environments. Log post processing is normally not suitable for near real-time propagation because of the latency in accessing the log. As with log exit-based propagation, this technique cannot be used in batch environments when logging is not turned on.
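A toy sketch of the post-processing step, assuming an illustrative log format of full before/after images: the scan reduces the log to one net image per key, the form a post-processor would hand to the apply step.

```python
# Hypothetical database log of full before/after images: (op, key, before, after).
log = [
    ("UPDATE", "k1", {"qty": 5}, {"qty": 7}),
    ("INSERT", "k2", None, {"qty": 1}),
    ("UPDATE", "k1", {"qty": 7}, {"qty": 9}),
    ("DELETE", "k2", {"qty": 1}, None),
]

# Post-process: keep only the latest after-image per key; None marks a delete.
net = {}
for op, key, before, after in log:
    net[key] = after

print(net)  # one net change per key, ready for the apply step
```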
DBMS extension propagation techniques add changed-data capture function to the native DBMS code path via vendor-supplied functions. Since the added function executes as part of the DBMS code path, it participates in all database updates, transaction commits and DBMS restarts without the overhead normally associated with triggers. Captured changes are recorded in an external file, not in the DBMS. This technique has the advantage that it does not require or impose any additional logging for the DBMS, not even for the records being updated. Thus, it can support high-volume, high-performance applications. Since it does not require logging, it can be used even in batch applications when no logging is taking place. This technique is especially suitable for continuous or near real-time propagation.
Because transactions vary in terms of the number and size of the records they update, the number of transactions captured in a time interval is not useful as a throughput measurement. The total number of bytes of all changed and captured records is a more meaningful throughput measure. In addition, it is easy to compare this number to the bulk move numbers described earlier. As with the bulk move process, the most intensive part of the change propagation process involves transforming legacy data to relational and actually updating the target tables. Factors such as the complexity of transformations, user exits, handling repeating groups and the number of target tables can influence transformation throughput. Since SQL updates (inserts, updates and deletes) are used to update the target tables, the number of indices on the target table and the frequency of commit processing have a significant influence on throughput. If multiple processors are available on the target system, it is important that the data propagation system provides parallel processing for data transformations and for SQL updates. Change data propagation solutions with parallel processing capabilities can achieve throughput rates up to 300MB per hour.
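The suggested throughput measure is simple to compute; the figures below are illustrative, not measurements from the article.

```python
# Bytes-based throughput: size in bytes of every changed record captured
# during the measurement window (illustrative numbers).
changed_record_sizes = [2_048] * 150_000  # 150,000 records of 2KB each
window_seconds = 3_600                    # one-hour measurement interval

total_bytes = sum(changed_record_sizes)
mb_per_hour = total_bytes / (1024 * 1024) / (window_seconds / 3_600)

# This is directly comparable to the bulk-move GB-per-hour figures above,
# unlike a transactions-per-second count, which hides record count and size.
print(round(mb_per_hour, 1))
```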
The bulk and change propagation solutions must work together. In many cases, bulk propagation is used to initially create a target table, and change propagation is then used to keep the table current. For data consistency, it is important that both tools perform the same data transformations. If a target table is recovered back to a point in time that removes applied changes, the change data propagator must be able to redo the changes lost in the recovery. The same is true for point-in-time recoveries of source files: the propagation system must be able to detect the recovery and re-synchronize the affected target tables.
Most large enterprises need both bulk and change propagation solutions. It is critical to select the data propagation solution that best fits an application.