Refreshing data warehouses can be a challenge. Truncating and reloading tables is time-consuming and wastes I/O, but common incremental refresh techniques have their problems, too. Using date-time stamps to capture row changes, for example, can turn into a major software project and adds processing load to source systems. Log-based replication is another possibility, but it can be tricky to set up and monitor. And while third-party tools eliminate the need for custom code, they also cost money.

 

Some firms might still benefit from a batch-based, incremental refresh approach that doesn’t impact source systems, can be coded using only SQL, and requires no third-party software. Data fingerprinting, which detects row changes by comparing old versus new cyclic redundancy check (CRC) or hash function values, may be such a solution. That is the focus of this article.

 

Data Refresh in General

 

The goal of any refresh method is simple: to resynchronize a target table with its source system counterpart. A brute force way to accomplish this is to truncate all the rows in the target table and then reload them from scratch. This method is simple and forgiving, but it’s usually too slow for larger tables. It’s also wasteful of I/O, at least for normal data warehouse processing, where tables tend to grow and change slowly.
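In concrete terms, a brute force refresh amounts to something like the following sketch. The table and column names are hypothetical, and the cross-system transport (in practice an extract file, linked server or ETL step) is ignored here.

    -- A minimal brute force sketch: empty the warehouse target, then reload it.
    -- EmployeeTarget and SourceEmployee are hypothetical names.
    TRUNCATE TABLE EmployeeTarget;

    INSERT INTO EmployeeTarget (EmployeeKey, Name, Dept, Salary)
    SELECT EmployeeKey, Name, Dept, Salary
    FROM SourceEmployee;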

 

Incremental refresh methods try to take a more economical approach by identifying which source rows have been added, dropped or changed since the last refresh. At a row level, this means identifying exactly four conditions.

 

[Figure 1: Matching source and target rows by key to identify the four refresh conditions]

Incremental solutions require rows in source and target tables to share a common identifier, or key. During the refresh process, source and target keys are matched up, then each row of data is compared (Figure 1). In Condition 1, the source has a row the target does not, so the new row must be inserted into the target. In Condition 2, the target has a row no longer in the source, so this row must be deleted from the target. In Condition 3, the row exists on both systems, but one or more columns have changed, so it must be updated on the target. Finally, in Condition 4, the row exists on both the source and target but is unchanged, so it does not need to be processed any further.
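As a sketch only (the table and column names are hypothetical, and NULL handling in the comparison is omitted for brevity), the four conditions can be derived with a full outer join on the shared key:

    -- Classify every key into one of the four conditions by joining source to
    -- target on the shared key.
    SELECT
        COALESCE(s.EmployeeKey, t.EmployeeKey) AS EmployeeKey,
        CASE
            WHEN t.EmployeeKey IS NULL THEN 'condition 1: insert'   -- source only
            WHEN s.EmployeeKey IS NULL THEN 'condition 2: delete'   -- target only
            WHEN s.Name <> t.Name
              OR s.Dept <> t.Dept
              OR s.Salary <> t.Salary THEN 'condition 3: update'    -- key matches, data differs
            ELSE 'condition 4: unchanged'                           -- key matches, data identical
        END AS RefreshAction
    FROM SourceEmployee AS s
    FULL OUTER JOIN EmployeeTarget AS t
        ON t.EmployeeKey = s.EmployeeKey;

Running such a comparison directly against the source still reads every row and every column; the techniques that follow differ mainly in how they detect changes more cheaply.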

 

Not surprisingly, in a typical data warehouse, the fourth condition is much more common than the others. If a warehouse has three years of data and is refreshed daily, each day adds only about 1/1000 to its total storage. Of course, a few rows may get updated or deleted, but the vast majority of the rows arriving from the source are simply inserts. Nevertheless, if we want to avoid using a brute force truncate and reload, all four conditions have to be identified accurately and quickly, and that can be tricky. We look next at two common approaches.

 

DateTimeStamp: One method to detect row changes is to add the system date and time to each row as it is inserted or updated at the source. Then, during a batch refresh, only rows added or changed since the last refresh are retrieved for processing in the data warehouse. Deletes, if allowed, have to be managed by a separate query pass. From a data refresh perspective, this technique sounds promising because it should only use a fraction of the I/O required by a brute force method. Viewed from its impact on the source system, however, the DateTimeStamp solution is daunting. Not only does it require adding date/time columns to every source table, it also means writing triggers or other custom code to update them.
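A rough sketch of that custom code, with hypothetical names: a LastModified column on the source table, a trigger to maintain it, and a refresh query that pulls only rows touched since the last run.

    -- On the source system: a date/time column plus a trigger to maintain it.
    -- This is the extra storage and custom code the DateTimeStamp method requires.
    ALTER TABLE SourceEmployee ADD LastModified datetime NOT NULL DEFAULT GETDATE();
    GO
    CREATE TRIGGER trg_SourceEmployee_Touch ON SourceEmployee
    AFTER UPDATE
    AS
        UPDATE e
        SET LastModified = GETDATE()
        FROM SourceEmployee AS e
        JOIN inserted AS i ON i.EmployeeKey = e.EmployeeKey;
    GO

    -- On the warehouse side: fetch only rows added or changed since the last refresh.
    DECLARE @LastRefreshTime datetime;
    SET @LastRefreshTime = '2007-01-01T00:00:00';  -- in practice, read from a refresh control table

    SELECT EmployeeKey, Name, Dept, Salary, LastModified
    FROM SourceEmployee
    WHERE LastModified > @LastRefreshTime;
    -- Deletes, as noted above, still need a separate pass.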

 

Replication: A second method for detecting row changes is log-based replication. Here, a dedicated process polls the source system log files for changes, then ships them over to the data warehouse for processing. This solution has been around for some time and is the method of choice for real-time data warehouse refresh. But it also takes expert setup and monitoring, and can require third-party software tools.

 

But if a company doesn’t really need continuous updates, is there a good batch solution? It should be significantly faster than a brute force truncate and reload, but simple to implement and maintain. And, it should reside mainly on the data warehouse. This will avoid adding code, storage or processing to company source systems. That is precisely the spirit of data fingerprinting, and we explore that solution next.

 

Data Fingerprinting in Particular

 

The basic idea behind data fingerprinting is to use a formula-generated number or string to act as a surrogate for the data contained in each row. This may be an unfamiliar concept to some readers, so it deserves an immediate example. Imagine an Employee table with four data columns, where the first column is the primary key. We take a snapshot of its rows on Monday and generate a calculated fifth column to hold a “fingerprint” (See Figure 2).

 

[Figure 2: Monday's snapshot of the Employee table, with a calculated fingerprint column]

On Tuesday, we perform the same process (See Figure 3).

 

[Figure 3: Tuesday's snapshot of the Employee table, with recalculated fingerprints]

Next, imagine that we save these changes not on the source system, but in a data warehouse table created expressly for storing just keys and fingerprints. After Monday's refresh, this data warehouse table, named Employee Fingerprint, looks like this (see Figure 4):

 

[Figure 4: The Employee Fingerprint table after Monday's refresh]

On Tuesday, after the refresh process has read the employee source table, the Employee Fingerprint table looks like this (see Figure 5):

 

[Figure 5: The Employee Fingerprint table after Tuesday's refresh]

From a data refresh perspective, the Employee Fingerprint table shown in Figure 5 has everything we need. Tom has received a raise, so his NewFingerprint value is now different from his OldFingerprint value; his row needs to be updated. Dick's row is precisely the same, so his fingerprint values are, too; hence, we can disregard his row in the rest of Tuesday's processing. Meanwhile, Harry has been removed from the table and Alice has been added, so we will have to accommodate these events. It is worth pointing out that we cannot recreate rows from fingerprints; we will have to read the source table again to retrieve inserts and updates. For additional background on data fingerprinting, see Michael Jennings' prior DM Review articles listed in the references section (references 1 and 2).
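Here is a sketch of how Figure 5 might translate into SQL on the warehouse side. The table layout and names are assumptions for illustration, not the article's actual test code.

    -- Warehouse-side table holding just keys and fingerprints, as in Figures 4 and 5.
    CREATE TABLE EmployeeFingerprint (
        EmployeeKey    int PRIMARY KEY,
        OldFingerprint int NULL,   -- value captured during Monday's refresh
        NewFingerprint int NULL    -- value captured during Tuesday's refresh
    );

    -- After Tuesday's snapshot has been loaded into NewFingerprint, each key
    -- falls into one of the four conditions described earlier.
    SELECT
        EmployeeKey,
        CASE
            WHEN OldFingerprint IS NULL THEN 'insert'            -- Alice: new on the source
            WHEN NewFingerprint IS NULL THEN 'delete'            -- Harry: gone from the source
            WHEN OldFingerprint <> NewFingerprint THEN 'update'  -- Tom: row has changed
            ELSE 'unchanged'                                     -- Dick: no further processing
        END AS RefreshAction
    FROM EmployeeFingerprint;

Keys flagged as inserts or updates are then read back from the source to fetch their full rows, exactly as noted above.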

 

Fingerprinting has advantages. Unlike the DateTimeStamp or replication methods, the value of a fingerprint can be calculated at any time, just by reading the row, so there is no need to store date-time stamps in the source table or to monitor source system logs for changes. Fingerprinting also shifts most of the storage and processing burden off the source system and lets the data warehouse decide which rows to process.
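For example, each refresh can simply recompute the fingerprints in one pass over the source table. Column names here are assumptions; checksum() is the function used in the tests described below.

    -- Recompute fingerprints on demand: one read of the source, no stored
    -- timestamps, no triggers, no log reading.
    SELECT
        EmployeeKey,
        CHECKSUM(Name, Dept, Salary) AS NewFingerprint
    FROM SourceEmployee;
    -- These key/fingerprint pairs are what gets loaded into the warehouse-side
    -- fingerprint table sketched earlier.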

 

Head-to-Head Tests

 

To see how load times for data fingerprinting compared to a brute force method, I ran a number of tests in a controlled environment. They included two representative tables: a "narrow" table with only five columns and a "wide" table with 50 columns. Both had a mix of common data types: integer, datetime, float and character. To mimic real-world conditions, the tables were made large. The narrow table had 100 million rows and the wide table had 10 million rows, which gave each table and its associated indices a storage footprint of about four gigabytes. Separate tests were run for targets with one, three and six indices, and for percentages of changed rows ranging from 0 to 10 percent across five data points. Figures 6 and 7 show the results.

 

[Figure 6: Test results comparing data fingerprinting and brute force load times]

[Figure 7: Test results comparing data fingerprinting and brute force load times, continued]

Under normal data warehouse conditions, where tables change slowly between refreshes, data fingerprinting ran several times faster than brute force, especially for the wider table. When you think about it, this makes sense. Fingerprinting gets its advantage from storing a string or integer that is (usually) much smaller than the row itself. But, as table configurations go from wide to narrow, fingerprinting loses its leverage. Using it on a table with only one column is silly; at the other extreme, using it on a table with 300 columns is very efficient because it avoids a lot of unnecessary I/O.

 

The code tested uses a stage table to capture changes, then updates the warehouse target from the stage. This makes the solution more forgiving, both in terms of missed runs and process failures. If a scheduled refresh gets missed for a week or a month, the target still accurately realigns with its source. On the other hand, a refresh can be re-run successfully 10 minutes later, should that be desirable. And if there is a failure at any step, a refresh can be re-run without risk.
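A sketch of the stage-to-target step follows. The names and the RefreshAction flag are hypothetical; the tested code itself is not reproduced here.

    -- Apply staged changes to the warehouse target. The stage holds the full rows
    -- fetched for keys flagged as inserts or updates, plus the keys flagged as deletes.
    BEGIN TRANSACTION;

    UPDATE t
    SET t.Name = s.Name, t.Dept = s.Dept, t.Salary = s.Salary
    FROM EmployeeTarget AS t
    JOIN EmployeeStage  AS s ON s.EmployeeKey = t.EmployeeKey
    WHERE s.RefreshAction = 'update';

    INSERT INTO EmployeeTarget (EmployeeKey, Name, Dept, Salary)
    SELECT s.EmployeeKey, s.Name, s.Dept, s.Salary
    FROM EmployeeStage AS s
    WHERE s.RefreshAction = 'insert';

    DELETE t
    FROM EmployeeTarget AS t
    JOIN EmployeeStage  AS s ON s.EmployeeKey = t.EmployeeKey
    WHERE s.RefreshAction = 'delete';

    COMMIT TRANSACTION;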

 

[Note: Tests were performed on a Dell Pentium D 2.80 GHz computer with 3.5 GB of RAM. The operating system used was Windows XP Professional, version 2002, running service pack 2. The database platform used was Microsoft SQL Server. Fingerprint values were generated using SQL Server’s checksum() function.]

 

Concerns about Collisions

 

When a row changes, its fingerprint should too. We don’t want this to happen (see Figure 8).

 

[Figure 8: A collision: Tom's row has changed, but its fingerprint has not]

[Figure 9: A second collision example involving different versions of the same key]

Tom’s row has changed, but the CRC function has produced the same fingerprint. This is called a collision, and it's bad: we will fail to detect the row change. When fingerprinting uses hash functions, collisions are very rare, but the tests above used SQL Server's checksum(), a simple CRC function with an output range of only 2^32 values, or four bytes. At the table level, collisions become an even bet when the number of inputs reaches roughly the square root of the function's output range (reference 3). This means that checksum() has an even chance of producing a collision after only 2^16, or 65,536, rows. The test tables had millions of rows, so naturally they contained hundreds of duplicate fingerprints.

 

Does this mean functions like SQL Server's checksum() are worthless for data fingerprinting? Not at all: each key is in its own collision space. It is one thing to say that a 10 million row table contains duplicate fingerprints; it is quite another to say that two successive batches produced the same fingerprint for different before and after versions of the same key, as in Figures 8 and 9. This is the only collision that matters for the purpose of data fingerprinting, and the chance that it will happen is much smaller than the chance that multiple rows will share a fingerprint.

 

Let’s do the math. Assume a data warehouse has 1 billion rows captured over a three-year period, with updates occurring to about 1 percent of those rows. That is, it has processed 10 million updates in three years. The chance we will be unlucky enough to have any intrakey collisions is 1 - [(2^32 - 1) / 2^32]^10,000,000, or about 0.0023, roughly one-quarter of 1 percent. Still, that's big enough for fussbudgets to worry about. And what if 10 percent of the rows in the data warehouse get updated? Most database administrators won't be able to sleep at night until the odds of a collision are close to zero. That would take a more robust CRC function; an output range of 2^64, like DB2's XOR() function, would do nicely.
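For readers who want the general form behind that arithmetic, the approximation it rests on, for a fingerprint with a b-bit output range and n updates against already-captured keys, is:

    P(intrakey collision) = 1 - (1 - 1/2^b)^n, which is approximately n / 2^b

With b = 32 and n = 10,000,000, that is roughly 10,000,000 / 2^32, or about 0.0023, matching the figure above. With b = 64, the same workload gives roughly 5 x 10^-13, which is why a 64-bit output range effectively puts the risk to bed.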

 

While these simple tests are not conclusive, the results are interesting enough to invite further exploration of data fingerprinting. True, on platforms lacking a robust CRC or hash function, the danger of collisions may remain a barrier to acceptance. It’s too bad a low-collision hash function is not shipped standard in commercial databases. If your RDBMS has one that’s collision resistant enough to suit your needs, you may find that data fingerprinting is a convenient and effective data change management tool.

 

But before we leave the job site today, why not propose an even better solution? If commercial databases offered some sort of row version number, there would be no need for hash functions to support data refresh. Row versioning could be implemented as a two-byte integer stored with each table row. It could get ticked forward automatically at each update, like a tiny odometer. This simple idea would obviate the need to read source rows for data refresh. Instead, batch solutions could simply store the source key and latest version number for each row. When a new batch refresh began, the old number could be compared to the current one. If the current version was greater, the row would be fetched for update in the data warehouse. This solution would produce load times faster than the methods tested here, and would be extremely simple to implement and maintain.
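As a sketch of the idea (all names here are hypothetical), the warehouse would keep a small key/version table and compare it against the source's version column on each run. It is worth noting that SQL Server's rowversion (formerly timestamp) data type already behaves roughly along these lines, although it is an eight-byte, database-wide counter rather than the two-byte, per-row odometer proposed here.

    -- Hypothetical sketch of a refresh driven by an engine-maintained row version.
    -- SourceEmployee.RowVersionNo is assumed to tick forward on every update;
    -- EmployeeVersion is a warehouse-side table of each key and the last version seen.
    SELECT s.EmployeeKey, s.Name, s.Dept, s.Salary, s.RowVersionNo
    FROM SourceEmployee AS s
    LEFT JOIN EmployeeVersion AS v
        ON v.EmployeeKey = s.EmployeeKey
    WHERE v.EmployeeKey IS NULL                  -- new row: insert into the warehouse
       OR s.RowVersionNo > v.LastVersionNo;      -- changed row: update in the warehouse
    -- Deletes would still need to be detected by comparing key lists, and the
    -- saved version numbers refreshed after each successful run.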

 

References:

  1. Michael Jennings. “Improved ETL Change Detection through Operational Meta Data.” DM Review, December 14, 2001.
  2. Michael Jennings. “Enterprise Architecture View: Fingerprinting Data Warehouse Data.” DM Review, January 2003.
  3. PGP Editorial Staff. “Much Ado About Hash Functions.” www.pgp.com, September 2004.
