We are extracting data from a legacy system that has summary data in it that we will need in the DW, assuming that the legacy data has some degree of integrity, should we re-derive these counts as functions of ETL to ensure accuracy?


Larissa Moss' Answer: That depends. If you are convinced that the algorithms in the legacy systems produce accurate counts and if your users trust those counts and could tolerate any potential discrepancies between the atomic data and the counts (in case they try to recalculate the counts in the DW and discover they're not correct), you can save yourself a lot of coding, testing and processing time in the ETL. On the other hand, if you always want to ensure that the counts match to the atomic data in the DW or if you want to validate those legacy counts to guarantee to your DW users that they are correct, you should re-derive them in your ETL. Being a DQ bigot, I would suggest the latter. But you really should discuss the pros and cons with your users. I suggest you don't just negotiate this decision with the user for whom you are building this application, but with all other users who will also be using and relying on this data. Your immediate user may be OK with moving the legacy counts as-is, but other users may not.

Sid Adelman's Answer: My question to you is do you know the level of integrity of the data? How bad is the data? If it's not at the level the users need, you need to clean up those counts. The users may opt for speed over quality but when you give them data that's wrong, they will reject your efforts and forget they told you that quality wasn't important. Take the time and effort to make sure the data is correct. Re-derive the counts.

Chuck Kelley's Answer: I would tend to derive them myself but also put the ones from the legacy systems into the data warehouse. That way, I can see what the differences (if any) are and can find out what is happening. Then choose which one is the best in the long run. It will be easier to take out something built in the ETL process then to write a new one. I would hate for someone to do a count and compare to my summary and have different results.

Clay Rehm's Answer: A best practice is to never assume anything especially that the source systems are maintaining integrity. You should re-derive these counts as well as other processes to identify where integrity is comprised or where it may be comprised.

Les Barbusinski's Answer: Generally speaking, it's always best to go to the original source of the information. Importing preprocessed data into a data warehouse always increases the threat of contamination because of misunderstandings over how that data was selected, filtered, transformed and/or aggregated. Also, importing summary data without importing the underlying detail data may eventually prevent you from providing drill-down functionality on your operational and/or BI reports or, worse, cause your drill-down metrics to be at odds with your summary metrics.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access