Recently BMC Corporation announced its entry into the data warehouse systems management environment. BMC announced a number of systems management products for data warehousing, but the two most notable were ChangeDataMove and DataReach. ChangeDataMove is designed to allow incremental updates and changes that occur in the operational transaction processing environment to be reflected in the data warehouse. The generic name for the technology that BMC has built is "changed data capture." Changed data capture has long been one of the essential missing ingredients of the data warehouse environment. Technically speaking, there were changed data capture products in the marketplace prior to BMC's entry. But these early products have had some notable limitations, and it is fair to say that the BMC entry is one of the first industrial-strength changed data capture products on the market and is certainly the most technologically advanced.
Earlier changed data capture products trapped the incremental changes in the log or journal tape that is created in the operational environment as a byproduct of operational transaction processing. The log tape provided a convenient location for operational changes to be read and used as a source of data to be updated/refreshed into the data warehouse.
BMC's ChangeDataMove operates at a fundamentally different level than the log-based products. ChangeDataMove selects the changed records during the actual I/O operation in the operational environment, as the data is being altered. In IMS and VSAM, the change is captured in the same unit of work as the operational transaction. In DB2, IBM's instrumentation facility is used to capture data changes. This allows the data changes to be captured almost instantaneously, much faster than if a log tape were used. That speed opens the door to supporting either a data warehouse or an operational data store (ODS). ChangeDataMove services the movement of data from IMS, DB2 and VSAM to Oracle, DB2 and SQL Server. It is especially applicable when only a small portion of a database has been changed.
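What "captured in the same unit of work" implies can be illustrated with a small analogy. The sketch below is not how ChangeDataMove works internally; it uses a SQLite trigger purely as a stand-in, and the table names (account, change_log) and the trigger name are hypothetical. The point it shows is that when the capture happens inside the transaction, a committed change is recorded and a rolled-back change leaves no trace.

```python
import sqlite3

# In-memory database standing in for the operational system (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
CREATE TABLE change_log (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    account_id INTEGER,
    old_balance INTEGER,
    new_balance INTEGER
);
-- The trigger fires inside the same transaction as the UPDATE,
-- so the captured change commits or rolls back with it.
CREATE TRIGGER capture_update AFTER UPDATE ON account
BEGIN
    INSERT INTO change_log (account_id, old_balance, new_balance)
    VALUES (OLD.id, OLD.balance, NEW.balance);
END;
INSERT INTO account VALUES (1, 100), (2, 200);
""")
conn.commit()

# A committed operational transaction: the change is captured.
conn.execute("UPDATE account SET balance = 150 WHERE id = 1")
conn.commit()

# A rolled-back operational transaction: nothing is captured.
conn.execute("UPDATE account SET balance = 0 WHERE id = 2")
conn.rollback()

captured = conn.execute(
    "SELECT account_id, old_balance, new_balance FROM change_log ORDER BY seq"
).fetchall()
balance_2 = conn.execute("SELECT balance FROM account WHERE id = 2").fetchone()[0]
```

Only the committed update appears in the captured changes, which is the consistency property that capture at I/O time is meant to provide.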
There are many advantages to industrial-strength changed data capture:
- No full table scan of the online transaction DBMS is required,
- Only data that has changed needs to be transformed and transported to the target, and
- Targets can be kept current and consistent with the source databases. (This is particularly suitable for supporting class I ODSs as well as classical data warehouses.)
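The advantages listed above can be made concrete with a minimal sketch. It assumes a hypothetical change record format (operation, key, row) and uses plain dictionaries in place of the source and target databases; none of these names come from a BMC product. Only the three changed rows are applied, and the target ends up consistent with the source without rescanning it.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Change:
    """Hypothetical change record: one captured operational change."""
    op: str                      # "insert", "update", or "delete"
    key: int
    row: Optional[dict] = None   # new row image; None for deletes


def apply_changes(target: Dict[int, dict], changes: List[Change]) -> int:
    """Apply captured changes to the target in order; return how many were applied."""
    applied = 0
    for change in changes:
        if change.op == "delete":
            target.pop(change.key, None)
        elif change.op in ("insert", "update"):
            target[change.key] = dict(change.row)
        else:
            raise ValueError(f"unknown operation: {change.op}")
        applied += 1
    return applied


# Source (operational) and target (warehouse) start out identical.
source = {
    1: {"name": "Ann", "balance": 100},
    2: {"name": "Bo", "balance": 50},
    3: {"name": "Cy", "balance": 10},
}
target = {key: dict(row) for key, row in source.items()}

# Operational activity changes a small part of the source.
source[2] = {"name": "Bo", "balance": 75}
del source[3]
source[4] = {"name": "Di", "balance": 20}

# Only the captured changes travel to the target.
changes = [
    Change("update", 2, source[2]),
    Change("delete", 3),
    Change("insert", 4, source[4]),
]
touched = apply_changes(target, changes)
```

The work done is proportional to the number of changes, not to the size of the source table, which is why the approach pays off when only a small portion of a database has changed.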
For a shop that is getting ready to go into data warehousing and has a significant volume of data and/or transactions to be processed, changed data capture makes eminent sense.
The second product announced by BMC is DataReach. DataReach resides on a UNIX host and works with the EMC Symmetrix Storage System. This combination allows the UNIX host to directly extract mainframe DB2 database information (bulk data snapshots and full tables) which resides on the Symmetrix storage. The extract is then converted on the open systems server to the specified open systems RDBMS format by the DataReach software. Then, using a native database loading utility, the data is loaded into any UNIX storage specified for the open systems server.
The DataReach solution is unique because, for the first time, software works directly with storage to move database information between MVS and UNIX environments. In fact, DataReach accomplishes the extraction and movement of DB2 databases without utilizing any mainframe CPU cycles or any network resources. A key benefit of the DataReach product's high-speed performance is that it enables frequent and extremely fast loading or refreshing of the distributed systems data warehouses.
DataReach is a joint venture of EMC and BMC. The future direction of the venture is the ability to allow data to be moved from one environment to another entirely inside the disk controller. In the future, there will be no need to use CPU cycles or even go to the processing unit itself. Transaction-based data will be able to be moved from the operational environment to the data warehouse environment without ever consuming precious transaction processing cycles. Speeds up to seven times as fast as movement across the network are claimed for DataReach. The movement of data for DataReach is today limited to DB2 to Oracle, to Sybase or to Informix. But it is likely that in the future BMC/EMC's movement capability will include other DBMS technology as well. For corporations that want to get into data warehousing and are already running short of precious resources, both BMC announcements promise very important relief.
The mature use of technology to accomplish important objectives is the first significant milepost. But there is another significance to the BMC announcements. The entry of BMC into the ETL and data warehouse systems management space is a sure sign that data warehousing is reaching maturity.
In order to understand the impact of the BMC announcements, it is necessary to understand a little bit about the state of data extraction in the data warehouse environment and how data extraction has evolved. In the beginning, there were simple ETL software packages that allowed data extraction and transformation to be done in the operational host, usually a mainframe server. Figure 1 shows this normal scenario.
Figure 1: In the classical stance of ETL, extraction and transformation are done in the application processing server, usually a mainframe. The load is done in the data warehouse server, usually a UNIX-based server.
In Figure 1 extraction and transformation are done on the server where transaction processing is also being done. One of the advantages of this style of ETL processing is that complex transformations can be accommodated. For example, if a transformation needs to verify data values with tables outside the immediate stream of extraction, there usually is no problem in doing so. One of the disadvantages of this approach is that machine cycles are used which may be very precious. For example, if the transaction processing is being done in a mainframe, there may not be many spare mainframe machine cycles available for extract and transformation processing.
The next approach to ETL processing is that of merely extracting the data on the transaction processing server and then moving the data to the data warehouse server for transformation processing. Figure 2 illustrates this approach.
Figure 2: One of the variations is the movement of data to the UNIX environment where transformation and loading are done.
The advantage of this approach is that cheaper machine cycles are used for the transformation processing. Since transformation processing is done in the data warehouse server, usually a UNIX server, the cost of processing is considerably lower than that in Figure 1. One of the disadvantages of this approach is that only very simple transformations can be accomplished. For example, if a transformation requires more data than that which is immediately available as part of the extraction stream, then the transformation usually cannot be accomplished.
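The trade-off between the two approaches can be sketched as follows. The record fields, the branch-to-region reference table and both function names are hypothetical, chosen only for illustration. A simple transformation needs nothing beyond the extracted record, while a lookup transformation depends on reference data that sits on the operational host and does not travel with the extraction stream.

```python
from decimal import Decimal

# Reference table that lives on the operational host (hypothetical data).
BRANCH_REGION = {"B01": "EAST", "B02": "WEST"}


def simple_transform(record: dict) -> dict:
    """Needs only the extracted record: convert a dollar string to integer cents."""
    return {
        "customer_id": record["customer_id"],
        "amount_cents": int(Decimal(record["amount"]) * 100),
    }


def lookup_transform(record: dict, reference: dict) -> dict:
    """Needs data outside the extraction stream: resolve branch code to region."""
    out = simple_transform(record)
    out["region"] = reference[record["branch"]]
    return out


extracted = {"customer_id": 7, "amount": "12.50", "branch": "B02"}

# Figure 1 style: the host has the reference table, so the lookup succeeds.
on_host = lookup_transform(extracted, BRANCH_REGION)

# Figure 2 style: only the stream arrived, so the simple transform works...
on_warehouse = simple_transform(extracted)

# ...but the lookup cannot be resolved without the reference data.
try:
    lookup_transform(extracted, {})
    lookup_failed = False
except KeyError:
    lookup_failed = True
```

In practice a shop using the second approach would have to ship the reference tables along with the extract, which is exactly the extra data the extraction stream does not normally carry.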
Figure 3: The log tape from transaction processing can be used for data warehouse refreshment.
The third approach to ETL is depicted in Figure 3, which shows that transaction processing throws off log and journal tapes as part of regular transaction processing. These tapes can be used as input (i.e., changed data capture) into the refreshment process for the data warehouse. The problem in times past has been that reading and interpreting the log tape has been a hassle. Log tapes are designed by systems programmers for the purpose of backup and recovery. The fact that they can be used for anything else is a miracle. Complexity does not begin to describe the internal structure of the log tape. Therefore, a vendor-supplied utility is required to pull data from the tape and to format the data into a usable structure.
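The job of such a utility can be illustrated with a deliberately simplified sketch. Real log and journal formats are far more complex and vendor specific; the fixed 13-byte record layout assumed here (a one-byte operation code, a four-byte key and an eight-byte balance, big-endian) is entirely hypothetical.

```python
import struct

# Hypothetical log record layout: op code (1 byte), key (uint32), balance (int64).
RECORD = struct.Struct(">cIq")
OPS = {b"I": "insert", b"U": "update", b"D": "delete"}


def decode_log(raw: bytes) -> list:
    """Turn a raw byte stream of fixed-width log records into usable dictionaries."""
    if len(raw) % RECORD.size != 0:
        raise ValueError("log stream is not a whole number of records")
    return [
        {"op": OPS[op], "key": key, "balance": balance}
        for op, key, balance in RECORD.iter_unpack(raw)
    ]


# Two records as they might sit on the log: an update and a delete.
raw_log = RECORD.pack(b"U", 2, 75) + RECORD.pack(b"D", 3, 0)
decoded = decode_log(raw_log)
```

The decoded records are in the kind of usable structure that a refresh process for the data warehouse can consume directly.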
The fourth way that ETL processing can be done is to move the data from one environment to another inside the disk controller itself. Figure 4 shows that transaction processing data is passed to a disk as a part of standard transaction processing. As the transactions are executed, the data is moved to a technological environment suitable to the data warehouse. Once the data is passed to the data warehouse environment, it still needs to be transformed. The transformation can be accomplished by one of the standard ETL tools and can be done in a less expensive environment. This option makes sense for shops that are facing the prospect of transforming a lot of data and a lot of transactions.
Figure 4: With the BMC/EMC option, the extraction of data can be done inside the disk itself.
One of the interesting uses of the disk controller approach to the movement of data is the support not just of the data warehouse environment but support of the class I ODS environment as well. For quite a while now, those organizations wishing to build and operate a class I ODS have been faced with a paucity of technology. Now ChangeDataMove from BMC offers support.
These announcements from BMC signal that data warehousing has passed from the "interesting" phase to the "industrial-strength" phase.