
Should data be changed in the data warehouse and/or data mart making it different from the source system it came from?

By Larissa Moss, Chuck Kelley, Clay Rehm
Published December 4, 2000, 1:00 a.m. EST

Q:  

Should data be changed in the data warehouse and/or data mart making it different from the source system it came from? What are the pros and cons of making the change in the data warehouse/mart vs. the source system?

A:  

Clay Rehm’s Answer: This depends on your definition of "changed." You will always transform and convert data to make it easier to use and understand within the data warehouse (DW). For example, if the source system stores gender codes such as "M" and "F", the DW would use "Male" and "Female" instead. If your definition of "changed" means the data is now accurate, then two things need to happen. First, you would store the accurate information in the DW; second, you would create a process to fix the problem in the source system. Often, data quality issues in the source systems are not discovered until that data is populated into the DW and queried against. It is important that the DW support team maintain a high-quality DW and keep communication lines open with the support personnel of the source systems.
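The kind of code-expansion transformation Clay Rehm describes can be sketched as a small load-time step. This is a minimal illustration, not from the article: the field name `gender` and the "Unknown" fallback for unrecognized codes are assumptions made for the example.

```python
# Hedged sketch: expand terse source-system codes into readable values
# during the load into the warehouse. Mapping and field names are
# illustrative assumptions.
GENDER_CODES = {"M": "Male", "F": "Female"}

def transform_row(source_row):
    """Return a warehouse-ready copy of a source record."""
    row = dict(source_row)
    # Expand the code; map unrecognized values to "Unknown" so the
    # load never silently drops a record.
    code = row.get("gender", "").strip()
    row["gender"] = GENDER_CODES.get(code, "Unknown")
    return row

print(transform_row({"customer_id": 42, "gender": "M"}))
```

In practice a step like this also surfaces data quality problems: every row that lands in the "Unknown" bucket is a candidate to report back to the source-system support team, as the answer suggests.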

Sid Adelman’s Answer: If I understand your question, you want to know whether your data warehouse/mart can be different from your source systems. Yes, it will be. The codes may change, and the data will be cleansed, transformed, aggregated, integrated, summarized, derived and otherwise significantly changed.

Larissa Moss’s Answer: A data warehouse and data marts always have two objectives: to improve the quality of information (i.e., to clean up our source data mess and integrate the data into one logical business view as it exists in the real world), and to deliver data in a usable and timely fashion for strategic decision making. Just putting a fancy tool on a new database full of (still) dirty data only delivers the current data mess fancier and faster. There are myriad problems with our source files, especially if they are flat (e.g., VSAM files) and old. We reused fields, we misused fields, we redefined fields, we ignored fields (missing values), we embedded logic in our fields, we used intelligent keys, we sometimes put the "kitchen sink" into one field (A means type of customer, L means status of sale, X means account closed, etc.), and the list goes on. To suck and plunk this type of data from the source files/databases into the data warehouse or data mart target databases would not address the first objective of "cleaning up the mess." As to whether the cleansing should be done in the DW or in the source systems, we would all very much like to see it done in the source systems. However, in many cases it is not as simple as adding an edit check to the data entry program. In many cases it would require major changes to the file structures and the programs and, in some cases, entire rewrites of systems. It took companies 20+ years to bite the bullet on Y2K (because it wasn’t cost effective to address until businesses were threatened with going out of business). What are the chances that operational systems would get rewritten because of a "kitchen sink" field?
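Untangling the "kitchen sink" field from Larissa Moss's answer is the sort of cleansing that would happen during the load into the warehouse. Below is a hedged sketch under assumed names: the target attributes (`customer_type`, `sale_status`, `account_status`) are invented for illustration; a real target model would come from business analysis of what each code actually means.

```python
# Hedged sketch: split one overloaded source field, where unrelated
# meanings share a single code (A means type of customer, L means
# status of sale, X means account closed), into separate, well-named
# target attributes. Attribute names are illustrative assumptions.
OVERLOADED_CODES = {
    # source code -> (target attribute, decoded value)
    "A": ("customer_type", "A"),
    "L": ("sale_status", "L"),
    "X": ("account_status", "closed"),
}

def decode_kitchen_sink(value):
    """Map one overloaded code into distinct target attributes."""
    attrs = {"customer_type": None, "sale_status": None,
             "account_status": "open"}
    if value in OVERLOADED_CODES:
        attr, decoded = OVERLOADED_CODES[value]
        attrs[attr] = decoded
    return attrs

print(decode_kitchen_sink("X"))
```

The point of the sketch is structural: the warehouse model gives each business fact its own attribute, even though the source crammed them into one field, which is exactly the cleansing the source system itself is unlikely ever to be rewritten to do.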

Chuck Kelley’s Answer: By "changed in the data warehouse," I am going to assume that you are talking about changes during the transformation process, not after the data is loaded. The answer lies in the use of the data. I believe that the data warehouse should always have the most correct data in it. Others say that if users cannot query the operational system and the data warehouse and get the same results, confidence in the data warehouse is compromised. Let’s say you have an operational system with a measure that has always (since the beginning) been 7% off (the real value should be the measure + 7%). Everyone knows it. Everyone takes it into account when perusing the number. Now, when we move the data into the data warehouse, do we add the 7%? I say yes. Others say no. Therefore, the use of the data comes into play. The question centers on whether there has to be an exact correlation between the source and the target. If so, then you have to carry the source data as is. If not, then add the 7%, since the data warehouse should always have the most accurate data. You could always carry both. If you are using the data for trend analysis, it does not really matter; the trend would be the same, just on a different number scale (3 – 10 versus 5 – 12). I know, the real answer is to correct the source system and let the data flow into the data warehouse correctly, but the reality is that rarely happens. We need to have a plan for dealing with known issues such as these.
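Chuck Kelley's 7% example, including his "you could always carry both" option, can be sketched as a tiny load step. The column names here are assumptions for illustration; the only fact taken from the answer is that the real value is the source measure plus 7%.

```python
# Hedged sketch of the 7% example: the source measure is known to run
# 7% low, so the warehouse stores the corrected figure alongside the
# as-reported one. Column names are illustrative assumptions.
CORRECTION = 0.07  # known systematic error in the source measure

def load_measure(source_value):
    """Carry both the as-reported figure and the corrected figure."""
    return {
        "measure_as_reported": source_value,
        "measure_corrected": round(source_value * (1 + CORRECTION), 2),
    }

print(load_measure(100.0))
```

Carrying both columns sidesteps the dispute in the answer: analysts who need to tie back to the operational system query the as-reported column, while strategic analysis uses the corrected one.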
