
What strategy do you suggest for updating dimensions from flat-file sources that do not have a time stamp?

By
  • Michael Jennings, Clay Rehm, Scott Howard, Les Barbusinski, Chuck Kelley
Published
  • November 05 2002, 1:00am EST

Q:

What strategy do you suggest for updating dimensions when the source is flat files that do not carry a time stamp? I've tried comparing the source record to the dimension and updating the dimension if the two differ. Performance drops as more fields are compared, so I have moved to an "update all records" strategy regardless of changes. Is there an alternative?

A:

Scott Howard’s Answer: The best approach is generally to compare the source file to the previous generation's copy to determine the differences. This file-compare method of change detection is also supported by several file comparison utilities. IBM's Data Difference Utility, which works in conjunction with its data replication solution, can be downloaded from ftp://ftp.software.ibm.com/ps/products/datapropagator/fixes/. Here's a description from IBM:

There is a sample program, provided on an as-is basis without support or charge to customers, called the Data Difference Utility (DDU), which can be used to approximate replication for VSAM or other flat or legacy files. It requires that the customer have a way to produce DB2 load-ready files; many utilities exist in the marketplace to create DB2 load-ready files from VSAM or other files. The method to capture the changes that occurred between 8 a.m. and 10 a.m. on the same day would be as follows: use the DDU to compare a sequential copy of the file as it existed at 8 a.m. with one from 10 a.m. The DDU will write out the changes (essentially transactions) that occurred between these times in a format ready to be used to update a DB2 staging table, the consistent change data (CCD) table, used by IBM's replication solution.

If you are not using IBM's data replication, the resulting CCD is simply a staging table that you could use manually to propagate these recognized, and now timestamped, changes on to your dimensions. However, I do recommend you look at the IBM replication solution, because it is one of the most ubiquitous in the industry, allowing any-to-any replication. Though it was originally DB2-based, it now allows replication from non-IBM data sources to non-IBM data targets, including multiple joined sources originating from different RDBMSs.
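
If you want to prototype the snapshot-compare idea yourself before adopting a utility, a minimal sketch along these lines may help. The file names, pipe delimiter and key column below are illustrative assumptions, not part of the DDU:

import csv
from datetime import datetime

def load_snapshot(path, key_field):
    # Read a pipe-delimited flat file into {key: full row} for comparison.
    with open(path, newline="") as f:
        return {row[key_field]: row for row in csv.DictReader(f, delimiter="|")}

old = load_snapshot("customer_0800.txt", "cust_id")   # 8 a.m. copy of the file
new = load_snapshot("customer_1000.txt", "cust_id")   # 10 a.m. copy of the file
stamp = datetime.now().isoformat()                    # the timestamp the source lacks

changes = []                                          # plays the role of a CCD-style staging set
for key, row in new.items():
    if key not in old:
        changes.append(("I", stamp, row))             # inserted since the last snapshot
    elif row != old[key]:
        changes.append(("U", stamp, row))             # updated since the last snapshot
for key, row in old.items():
    if key not in new:
        changes.append(("D", stamp, row))             # deleted since the last snapshot

# 'changes' can now be loaded into a staging table and applied to the dimensions.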

Michael Jennings’ Answer: Fingerprinting the relevant source data as it is loaded can provide an alternative means of identifying changes for the data warehouse. These fingerprints, or encoded values, can be used to compare relevant source data to the data warehouse efficiently, thus avoiding multiple column comparisons, which can consume limited resources.

The first fingerprint method is the cyclic redundancy check (CRC). The use of CRCs does carry a risk on high-volume tables: data collisions, where different data is encoded to the same CRC value, resulting in incorrect ETL processing. This risk can be reduced by using a larger CRC, 32-bit or higher. Many programming languages and DBMSs have CRC functions; in addition, many implementations are available through the Internet.

The second fingerprint method is hashing, which processes a string of any length through a mathematical algorithm that computes a fixed-length hash value or message digest. Hash values differ from CRCs in that collisions are, for practical purposes, so improbable that they can be treated as collision-free. Some of the leading hash functions are Message Digest 5 (MD5) and Secure Hash Algorithm 1 (SHA-1). These and other hash functions are widely available on the Internet.
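
Either fingerprint is straightforward to compute. As a minimal sketch (the pipe delimiter and column list are assumptions for illustration), a row fingerprint could be produced like this in Python:

import hashlib
import zlib

def row_fingerprints(row, columns, sep="|"):
    # Concatenate the change-tracked columns and return (CRC-32, MD5) fingerprints.
    data = sep.join(str(row[c]) for c in columns).encode("utf-8")
    crc = zlib.crc32(data)                    # 32-bit CRC; small but real collision risk
    md5 = hashlib.md5(data).hexdigest()       # 128-bit digest; collisions are negligible
    return crc, md5

print(row_fingerprints({"cust_id": 42, "name": "Acme", "city": "Oslo"},
                       ["name", "city"]))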

During ETL processing for a table, all of the relevant columns needed to determine a change of content from the source systems are combined and encoded through a fingerprint algorithm. The fingerprint value is stored in a column on the dimension table as operational metadata. During subsequent ETL processing cycles, new source system records have their relevant data content values combined and encoded into fingerprint values. The source system fingerprint values are compared against the fingerprint values already computed for the same production/natural key on the dimension table. If the production/natural key of an incoming source record is the same but the fingerprint values are different, the record is processed as a new SCD record on the table. Fingerprint values should be smaller and easier to compare during ETL processing, resulting in increased performance.
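
A minimal sketch of that comparison, assuming an MD5 fingerprint, an in-memory stand-in for the dimension's stored fingerprints and an illustrative natural key column, might look like this:

import hashlib

TRACKED = ["name", "city", "segment"]          # illustrative change-tracked columns

def fingerprint(row):
    data = "|".join(str(row[c]) for c in TRACKED).encode("utf-8")
    return hashlib.md5(data).hexdigest()

def classify(source_row, dim_fingerprints):
    # Decide what to do with an incoming row, given the fingerprints already
    # stored on the dimension keyed by production/natural key.
    key = source_row["cust_id"]                # illustrative natural key
    stored = dim_fingerprints.get(key)
    if stored is None:
        return "new"                           # natural key never seen before
    if stored != fingerprint(source_row):
        return "changed"                       # same key, new content: write a new SCD record
    return "unchanged"                         # nothing to do

dim = {"42": fingerprint({"name": "Acme", "city": "Oslo", "segment": "B2B"})}
print(classify({"cust_id": "42", "name": "Acme", "city": "Bergen", "segment": "B2B"}, dim))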

Clay Rehm’s Answer: What are the chances of getting a timestamp on each record of the source flat file(s)? If those chances remain slim to none, then you are probably already doing the best you can in this circumstance.

Les Barbusinski’s Answer: Most of the ETL tools on the market today provide the capability of extracting the "delta" between two versions of the same flat file. If you’re not using one of these tools, there are really only two alternatives: 1) hand-code a script to perform a record-level compare of the two flat files, or 2) load both flat files into relational tables and perform a minus operation to extract the delta into a third relational table.

In order to utilize option 1, the flat file in question must have a unique key. Your hand-coded script can then define the record as the key plus a single variable-length attribute, sort both files (old and new) by the key and create a delta file by matching on the key and comparing the single variable-length attribute of each file. Note that the delta file should have a transaction code appended to it to indicate an add, update or delete. The trick here is to compare the files at the record level rather than attribute by attribute. This method works fairly well and can be repeated many times once a template script has been developed.
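
As a rough sketch of such a template script (the fixed key length and file names are assumptions for illustration), the logic might look like this in Python:

KEY_LEN = 10                                   # assumed fixed-width key at the start of each record

def load(path):
    # Treat each record as the key plus one long variable-length attribute.
    with open(path) as f:
        return {line[:KEY_LEN]: line[KEY_LEN:].rstrip("\n") for line in f}

old = load("dim_old.dat")                      # previous generation of the file
new = load("dim_new.dat")                      # current generation of the file

with open("dim_delta.dat", "w") as out:
    for key in sorted(old.keys() | new.keys()):
        if key not in old:
            out.write(f"A{key}{new[key]}\n")   # add
        elif key not in new:
            out.write(f"D{key}{old[key]}\n")   # delete
        elif new[key] != old[key]:
            out.write(f"U{key}{new[key]}\n")   # update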

Option 2 requires that your RDBMS support the minus function (I know Oracle does, but I’m not sure about the others). Here you need to load both versions of the flat file (i.e., old and new) into identical tables, then insert delta rows into a third table by performing the following SQL call:

INSERT INTO …
SELECT * FROM TABLE_NEW
MINUS
SELECT * FROM TABLE_OLD

This effectively gives you all of the rows from TABLE_NEW that do not have an exact match (at the row level) in TABLE_OLD. You then need to perform a second SQL call to capture all of the rows that exist in TABLE_OLD but not in TABLE_NEW (i.e., to identify those rows that were physically deleted in the source system since the last ETL cycle). Note that the delta table needs an additional transaction code column appended to it to distinguish the add/update rows from the delete rows. I’ve seen this method generate deltas for six to eight gigabytes of data in as little as 20 minutes, so it’s really fast.
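
As an end-to-end illustration of option 2, here is a small script driving both calls through a DB-API connection. The table and column names are assumptions, and SQLite's EXCEPT is used as a stand-in for Oracle's MINUS:

import sqlite3                                 # stand-in for any DB-API driver

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE table_old  (cust_id TEXT, name TEXT);
    CREATE TABLE table_new  (cust_id TEXT, name TEXT);
    CREATE TABLE table_delta(trans_code TEXT, cust_id TEXT, name TEXT);
    INSERT INTO table_old VALUES ('1', 'Acme'), ('2', 'Globex');
    INSERT INTO table_new VALUES ('1', 'Acme'), ('3', 'Initech');
""")

# Adds/updates: rows in the new extract with no exact row-level match in the old one.
cur.execute("""
    INSERT INTO table_delta
    SELECT 'U', * FROM table_new EXCEPT SELECT 'U', * FROM table_old
""")

# Deletes: rows that existed in the old extract but are gone from the new one.
cur.execute("""
    INSERT INTO table_delta
    SELECT 'D', * FROM table_old EXCEPT SELECT 'D', * FROM table_new
""")

conn.commit()
print(cur.execute("SELECT * FROM table_delta ORDER BY cust_id").fetchall())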

Chuck Kelley’s Answer: First off, I would not support (except under very strong controls and accountability) the updating of dimensions already in the data warehouse, because you would be changing history. If by update you mean adding new records, then that is fine.

As for processing, I have added new rows in a couple of ways. The first is to have the ETL process determine what is new and output only the new rows to be added. This works very well. Since this is the solution you described, I am surprised that you are having performance problems. I would think that either you are not taking advantage of a feature in the tool you are using, or perhaps the database is not using an index on the dimension table. The second way is to create a table with all of the records and write an SQL procedure that compares the two: if a row is a duplicate, ignore it; if it is not a duplicate, add it. Again, my expectation is that this would run fast as well.
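
As a rough illustration of that second comparison (the row layout here is an assumption, and the same filtering could just as well be expressed with NOT EXISTS in SQL), the duplicate check amounts to a set difference:

def rows_to_add(incoming, existing):
    # Keep only incoming rows that are not exact duplicates of rows already present.
    existing_set = {tuple(sorted(row.items())) for row in existing}
    return [row for row in incoming
            if tuple(sorted(row.items())) not in existing_set]

dimension = [{"cust_id": "1", "name": "Acme"}]
staged    = [{"cust_id": "1", "name": "Acme"}, {"cust_id": "2", "name": "Globex"}]
print(rows_to_add(staged, dimension))          # only the Globex row would be added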

As for your suggestion of wholesale replacement, I would offer a couple of pieces of advice: 1) If you are using surrogate keys (and if you are not, you should be!), you have quite a mess on your hands, because you have to match the incoming records to their surrogate keys before you can process them. I would expect that to take a while, so you aren't saving anything. 2) You must make sure that all of the previous data is present in the current data and that no data is ever changed or deleted.
