In a corporate information factory model, how does the 3NF enterprise data warehouse capture slowly changing dimensions if in the source system the dimensional attributes get updated (Type 1)?
Steve Hoberman’s Answer: There are two general approaches that the data warehousing processing can take to capture changes when the source system does not keep track of these changes. The easiest approach would be for the source system to send an indicator along with each source system record explaining whether the record is an update, insert or delete. Then the data warehousing processing needs to "turn off" the most recent record and add the new record in the case of an update or delete and just insert a record in the case of an insert. The more difficult (and much more common approach unfortunately!) is to completely compare the source file with the data warehouse dimension and use this comparison to determine what changed. Both of these solutions take on an added complexity when there is not a clean one-to-one relationship between source file and data warehouse table (e.g., one flattened customer file compared to a 3NF set of customer tables). To simplify the logic, many organizations introduce a layer in their staging area that mimics the source files and therefore allows and easy comparison to identify changes. Good luck!
Chuck Kelley’s Answer: I think there is some confusion about how all this fits together. The source systems would probably not have type 1 slowly changing dimensions, since they are part of the production environment. They certainly have things that change but not in the same manner as talked about by slowly changing dimensions.
Now, I think that your question is about how does the enterprise DW (3NF) capture slowly changing dimension since the source system updates the data. What we do is extract the information from the production environment (either by reading the database, the logs, etc.) and then compare it to something. This something is either in the staging area or the enterprise DW. If there has been a change, then it processes as normal. If it has not changed, then it is skipped.
If your source system changes rapidly and does not keep history, then the granularity of your fact table will need to understand that. If you do daily extracts and a particular dimension value changed twice in one day, you will not get the first of the changes. Is that bad? Maybe. Maybe not. That will depend on your requirements and source data analysis and your extraction method.
Nancy Williams’ Answer: You are correct in your observation that it is not always easy to detect changes in the source system that result from slowly changing dimensions. Before determining how you will handle the change record in the enterprise data warehouse, the first step is to detect that a change has occurred in the source system. If you are lucky, the source system will have date/timestamp and/or codes that provide source system meta data for inserts and updates. If not, you may be able to rely on log files that provide an audit trail of change records. Finally, if there is no way to identify change records in the system, you may need to resort to comparing records from the source to the records already in the data warehouse. This technique does not work well when there is a high volume of records involved. Once you are able to detect changes, you can then determine if you are going to update the data warehouse using Type 1, Type 2 or Type 3 techniques for slowly changing dimensions.
Scott Howard’s Answer: First you must examine your source system application’s update and delete characteristics and combine those with your warehouse’s history requirements and, of course, your budget. Only in completely modeling and understanding these attributes will you succeed. This will be better understood by the time you finish this answer. In general, the more demanding the history requirement and concentrated the update activity, the more expensive and complex the implemented solution. Quite often the most demanding history requirements are best satisfied with tools from your database provider.
If you discover that your warehouse applications require complete histories with the highest level of integrity, you will need to capture all changes at the source. You will then have to consider your many options of accomplishing such because some options will disrupt your source systems more than others with the least disruptive solutions tending to be the most costly, monetarily speaking. The solution least disruptive to the source applications requiring no change to those applications and providing minimal to no operational compromise is to capture the changes by reading recovery or transaction logs. You are already logging changes for integrity and recover purposes, why not use those same logs to also capture changes for your warehouse? That’s the good news, the bad is that these logs are difficult to read and reconcile and that they are controlled by your database vendors. These same vendors do not guarantee these structures stability from product release to release and often change them for operational efficiencies. Therefore, a sophisticated system that you develop and implement today could be destroyed by your vendor implementing a more efficient logging structure in their next fixpack. Don’t look for sympathy or relief from these vendors should you attempt your own log read application because the same vendor most likely sells replication technology that reads the recovery or transaction logs. This is your best option because as the logging internals change so must the vendor’s log capture tools freeing you from code maintenance other than ensuring that your versions and fixpacks are current. IBM’s Data Propagator Capture and Oracle’s Asynchronous Replication are example of such products. Several other vendors also offer log read utilities and facilities as part of their end-to- end replication or ETL suites. If you have any of these tools, check their capabilities.
If you are using messaging queuing to update your tables, you could easily manipulate the queues and pick up the changes in route. This is perhaps the lowest cost and a low disruptive approach, but you must be using message queuing or similar technology.
An alternate, cheaper but more disruptive approach to log capture that can also provide complete capture capabilities is the trigger-based approach. You can develop one yourself and they require no change to the source applications. Just implement an Insert, Update and Delete trigger to "copy" the changed data to your own staging table and use that staging table as your replication source. However, the required three triggers could slow your source system’s transactional performance, how much depends on the database vendor’s trigger efficiency which is beyond your control. Also be sure to include any previous copy data in your trigger definition should your histories require such. As yet another alternative, you could also modify your source applications into write twice applications, noting their own changes in a second staging table.
Less demanding snapshots or "good enough" histories can easily be implemented by modifying the source tables adding a timestamping column. Just add a column of timestamp or date datatype (depending on your RDBMS) setting the options to NOT NULL WITH DEFALT CURRENT TIMESTAMP (DB2 semantics. Check for correct syntax for your system). Your applications will write to your tables as usual with this new column automatically keeping track of the last write timestamp. You now read the source tables at convenient times recording all rows whose timestamp is greater than the last time you scanned the table. The exposure of this approach is that it is inappropriate for applications that constantly update a limited set of keys. These "hot spots" could result in incomplete or missing histories.
Let’s say that a single key’s row values are updated several times between your source table scans. This would result in capturing only the last update. Did you also consider how you’ll recognize the deleted rows where there are no timestamped records to read? Some type of exception or outer join operation would be required. It should be evident that the log read or triggered capture approaches are worth their additional costs.
Joe Oates’ Answer: I am not sure exactly what your question is. A 3NF enterprise data warehouse (EDW) does not have dimensions, per se. At least not in the sense of a Kimball product dimension, which is usually a flat hierarchy or a highly denormalized table, such as the CRM Customer example in the 2 nd edition of The Data Warehouse Toolkit.
I think that your question alludes to the fact that most data modeling courses say that one should not model history in 3NF data models. However, in a 3NF EDW one of the primary objectives is to store time-variant data values. Therefore, associative tables, resolving many-to-many relationships between a normalized set of tables representing a person or a thing (Product) and some characteristic or value range that can change over time, are the usual way to solve this situation. These associative tables, in addition to containing a foreign key for the person or thing as well as a foreign key for the characteristic or value range, must also contain information indicating when the information in the row is applicable. Though I have seen this done with a single date and a "current" flag, a more flexible way is to use a start date and an end date. Among reasons for doing this is that it makes it easier to handle many kinds of queries that would otherwise require complex logic or database date functions. A convenient convention to use is to have the end date column contain zeros to indicate the current row.
In most transaction processing systems that I have seen, the current value is overwritten by a new value. There are several ways that the fact that there has been a change can be made known to the ETL. If the source system utilizes a relational database manager and database logging is turned on, the ETL processing can read the database log and determine whether an attribute is to be updated. This is a very desirable situation. Another desirable situation is to have a transaction log, sometimes called an audit log. A not desirable situation is to have neither of these and no way to easily tell if something has been changed in the source system. In this case, you may have to compare field by field today’s contents to yesterday’s contents. In any case, you would have to update the end date in the appropriate associative table existing row with a valid date later than the start date and add a new row to the associative table to reflect the new information. ETL isn’t easy.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access