Can you quickly and efficiently identify changes in the source data coming to the data warehouse?

When data changes cannot be easily identified due to limitations in the source system (no time stamps or change logging), updates to the data warehouse take longer, pushing the limits of your processing window. This can have a detrimental effect on the service level agreements you have with your users, eroding confidence in the data warehouse.

This data identification dilemma often results from extract, transform and load (ETL) strategies that take a brute-force approach to identifying updated source data, increasing the time and cost companies spend maintaining the data warehouse. Fingerprinting source data as it is loaded provides an alternative means of identifying changes to the data warehouse. These fingerprints, or encodings, can be used to efficiently compare source data to the data warehouse, avoiding multiple and/or long variable-length column comparisons that consume limited resources.

The first fingerprint method is the cyclic redundancy check (CRC), a non-cryptographic data-encoding method originally developed for detecting errors or corruption in data transmitted over a communications line. The sender encodes the data into a code that is transmitted with the data; the receiver recalculates the CRC from the data received and compares it to the CRC originally transmitted. If the two values differ, errors occurred during transmission. The CRC is redundant in that it adds no information about the data. Use of CRC does carry a risk of collisions: different data can encode to the same CRC value, resulting in incorrect ETL processing. This risk can be reduced by using a large CRC, 32-bit or higher (32 bits equates to 2^32, or more than 4 billion, unique values). The International Telecommunication Union (ITU), formerly CCITT, publishes standard algorithms for calculating CRCs (see also ISO standard 3309). Many programming languages (e.g., Java, Perl) and DBMSs (e.g., SQL Server's CHECKSUM and BINARY_CHECKSUM) provide checksum functions for encoding data strings, and many implementations are freely available on the Internet.
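As an illustration, the CRC approach can be sketched with the Python standard library's zlib.crc32 function; the row values and delimiter below are hypothetical, not drawn from any particular source system:

```python
import zlib

# Hypothetical source row: relevant columns joined with a delimiter.
# Column values are illustrative, not from any particular schema.
row = "|".join(["10042", "Jane Smith", "Accounting", "Chicago"])

crc = zlib.crc32(row.encode("utf-8"))  # 32-bit CRC (ISO 3309 polynomial)
print(f"{crc:08x}")  # fixed-width hex fingerprint of the row

# Changing a single column yields a different CRC, flagging the row as changed.
changed = "|".join(["10042", "Jane Smith", "Finance", "Chicago"])
assert zlib.crc32(changed.encode("utf-8")) != crc
```

The delimiter guards against two rows with different column boundaries producing the same concatenated string.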

The second fingerprint method is hashing, which runs a string of any length through a mathematical algorithm that computes a fixed-length hash value, or message digest. Hash values differ from CRCs in being effectively collision free: it is computationally infeasible, though not impossible, to produce the same hash value from two different strings (2^128 to 2^160 possible hash values versus 2^32 to 2^64 for CRCs). The hash value acts as an identity tag because it succinctly denotes the string from which it was derived. Hashing is also commonly used to index and retrieve items in a database, because searching on the shorter hashed key is faster. Among the leading hash functions are message digest (MD5) and the secure hash algorithm (SHA-1). These functions were originally designed for digital signature applications, where a large message must be "compressed" in a secure manner. MD5 is faster to compute, with its 128-bit hash value; SHA-1 is more secure, with its 160-bit hash value. These hash functions and others are available on the Internet.
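Both hash functions are available in Python's standard hashlib module; a minimal sketch, reusing the same invented row as above:

```python
import hashlib

# Hypothetical source row fingerprinted with MD5 (128-bit) and SHA-1 (160-bit).
row = "|".join(["10042", "Jane Smith", "Accounting", "Chicago"])

md5_digest = hashlib.md5(row.encode("utf-8")).hexdigest()
sha1_digest = hashlib.sha1(row.encode("utf-8")).hexdigest()

# Hex digests are fixed length regardless of the input string's length.
print(len(md5_digest) * 4, len(sha1_digest) * 4)  # 128 160 (bits)
```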

Regardless of the identity method chosen, incorporation into the data warehouse environment is the same. For ETL processing, the fingerprint is used to direct slowly changing dimension (SCD) processing for a dimension table where many columns need to be compared to determine if data from the source systems should be processed as an insert or update to the dimension table. This technique is also very efficient when used in cases where a relevant column in the dimension table is very long and variable in length (such as a URL in clickstream or Web log data).

During ETL processing for the dimension table, all relevant columns needed to detect a change of content from the source systems are combined and encoded through a fingerprint algorithm. The fingerprint value is stored in a column on the dimension table as operational metadata. During subsequent ETL cycles, new source system records have their relevant data content combined and encoded into fingerprint values in the same way. These source system fingerprint values are then compared against the fingerprint values already stored for the same production/natural key on the dimension table (see Figure 1). If the production/natural key of an incoming source record matches but the fingerprint values differ, the record is processed as a new SCD record on the dimension table. Because fingerprint values are smaller and easier to compare than the underlying columns, ETL performance increases.
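The comparison logic just described might be sketched as follows; the in-memory dictionary stands in for a lookup against the dimension table, and all keys and column values are hypothetical:

```python
import hashlib

def fingerprint(columns):
    """Combine the relevant columns and encode them into a fixed-length hash."""
    return hashlib.md5("|".join(columns).encode("utf-8")).hexdigest()

# Fingerprints already stored on the dimension table, keyed by natural key
# (an illustrative in-memory stand-in for a database lookup).
dimension = {"10042": fingerprint(["Jane Smith", "Accounting", "Chicago"])}

def classify(natural_key, source_columns):
    """Decide how an incoming source record should be processed."""
    fp = fingerprint(source_columns)
    stored = dimension.get(natural_key)
    if stored is None:
        return "insert"        # new natural key: brand-new dimension row
    if stored != fp:
        return "scd_change"    # same key, different fingerprint: new SCD record
    return "unchanged"         # fingerprints match: skip the row

print(classify("10042", ["Jane Smith", "Finance", "Chicago"]))  # prints "scd_change"
```

A single fingerprint comparison replaces a column-by-column comparison of every relevant attribute.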

Figure 1: Example Dimension Table

Candidates for fingerprinting include dimension tables that contain numerous data columns (more than shown in Figure 1) or large variable-length columns, where it provides quicker, more efficient change detection. How the fingerprint value is generated depends on your ETL processing tool and the encoding algorithm used.
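For the large variable-length case, a short sketch of why the fingerprint helps (the URL below is invented for illustration): comparing fixed 16-byte digests replaces comparing arbitrarily long strings.

```python
import hashlib

# A long, variable-length column value, e.g., a URL from clickstream data.
url = ("https://www.example.com/catalog/search?q=widgets&sort=price"
       "&session=abc123&ref=homepage&page=4")

fp = hashlib.md5(url.encode("utf-8")).digest()

# The URL's length varies row to row; the fingerprint is always 16 bytes,
# so equality checks cost the same no matter how long the source value is.
print(len(url), len(fp))
```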

In the example provided in Figure 1, the relevant columns from an employee dimension table for one particular employee are used to calculate hash values for the row. The resulting hash values depend on the function chosen (see Figure 2).

Figure 2: Example Corresponding Hash Values

The benefits of fingerprinting data warehouse data include faster change detection of source data and more efficient processing in cases where change is not easily discernible from the source system. Fingerprinting lets warehouse administrators readily scan source data feeds against the warehouse without comparing every relevant column or any large ones. Hashing can offer significant advantages over CRC for fingerprinting data in the warehouse, especially when the number of rows in a dimension is high enough to make data collisions a concern.
