Chuck Kelley's Answer: I would check the DM Review Web site (www.dmreview.com) and Larry English's Web site (www.infoimpact.com). You should easily find what you are looking for.
Sid Adelman's Answer: The following is excerpted from Data Warehouse: Practical Advice from the Experts by Joyce Bischoff and Ted Alexander. There are a number of indicators of quality data.
- The data is accurate - This means a customer's name is spelled correctly and the address is correct. If the marketing department doesn't have the correct profile for a customer, it will attempt to sell that customer the wrong products and present a disorganized image of the organization. When data on a company vehicle is entered into the system, it may be valid (a vehicle number that is in the database) but inaccurate (the wrong vehicle number).
- The data is stored according to data types - If a field is defined as packed decimal, all the instances of this field will be stored as packed decimal.
- The data has integrity - The data will not be accidentally destroyed or altered. Updates will not be lost due to conflicts among concurrent users. Much of this is the responsibility of the DBMS, but proper implementation of the DBMS should not be assumed. Robust backup and recovery procedures as implemented by the installation are needed to maintain integrity. In addition, operational procedures that restrict a batch update from being run twice are also necessary.
- The data is consistent - The form and content of the data should be consistent. This allows for data to be integrated and to be shared by multiple departments across multiple applications and multiple platforms.
- The databases are well designed - A well-designed database performs satisfactorily for its intended applications, is extensible and exploits the integrity capabilities of its DBMS.
- The data is not redundant - In actual practice, no organization has ever totally eliminated redundant data. In most data warehouse implementations, the data warehouse data is partially redundant with operational data. For certain performance reasons, and in some distributed environments, an organization may correctly choose to maintain data in more than one place and in more than one form. The redundant data to be minimized is data that has been duplicated for none of the reasons stated above but because:
- The creator of the redundant data was unaware of the existence of available data.
- The redundant data was created because the availability or performance characteristics of the primary data were unacceptable to the new system. This may be a legitimate reason or it may also be that the performance problem could have been successfully addressed with a new index or a minor tuning effort and that availability could have been improved by better operating procedures.
- The owner of the primary data would not allow the new developer to view or update the data.
- The lack of control mechanisms for data update indicated the need for a new version of the data.
- The lack of security controls dictated the need for a redundant subset of the primary data.
In these cases, redundant data is only the symptom and not the cause of the problem. Only managerial vision, direction and a robust data architecture would lead to an environment with less redundant data.
- The data follows business rules - As an example, a loan balance may never be a negative number. This rule comes from the business side and IT is required to establish the edits to be sure the rule is not violated.
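The negative-balance rule above can be sketched as a simple edit in Python. This is a minimal illustration; the record values and function name are hypothetical:

```python
# Business rule from the business side: a loan balance may never be negative.
def validate_loan_balance(balance):
    """Return True if the balance satisfies the business rule."""
    return balance >= 0

# Hypothetical loan balances; the edit flags the violation.
records = [120.50, 0.0, -35.10]
violations = [b for b in records if not validate_loan_balance(b)]
print(violations)  # [-35.1]
```

In practice such edits are enforced by IT in the application or the DBMS (for example, as a CHECK constraint), not in reporting code after the fact.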
- The data corresponds to established domains - These domains are specified by the owners or users of the data. The domain would be the set of allowable values or a specified range of values. In a human resource system, the domain of sex is limited to "male" and "female." "Biyearly" may be accurate but still not an allowable value.
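A domain check of this kind can be sketched in Python. The domain set mirrors the human resource example above; the helper name is an assumption:

```python
# Domain specified by the owners of the data: the set of allowable values.
ALLOWED_SEX_VALUES = {"male", "female"}

def in_domain(value, domain):
    """Return True if the value falls within the specified domain."""
    return value in domain

print(in_domain("male", ALLOWED_SEX_VALUES))      # True
print(in_domain("biyearly", ALLOWED_SEX_VALUES))  # False: not an allowable value
```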
- The data is timely - Timeliness is subjective and can only be determined by the users of the data. The users will specify that monthly, weekly, daily or real-time data is required. Real-time data is often a requirement of production systems with online transaction processing (OLTP). If monthly is all that is required and monthly is delivered, the data is timely.
- The data is well understood - It does no good to have accurate and timely data if the users don't know what it means. Naming standards are a necessary (but not sufficient) condition for well-understood data. Data can be documented in the meta data repository, but the creation and validation of the definitions is a time-consuming and tedious process. It is, however, time and effort well spent. Without clear definitions and understanding, the organization will exhaust countless hours trying to determine the meaning of its reports or will draw incorrect conclusions from the data displayed on its screens.
- The data is integrated - An insurance company needs both agent data and policyholder data. These are typically two files, databases or tables that may have no IT connection. If the data is integrated, meaningful business information can be readily generated from a combination of the agent and policyholder data. Database integration generally requires the use of a common DBMS. There is an expectation (often unfulfilled) that all applications using the DBMS will be able to easily access any data residing on it. An integrated database would be accessible from a number of applications: many different programs in multiple systems could access and, in a controlled manner, update the database. Database integration requires knowledge of the characteristics of the data, what the data means and where the data resides. This information would be kept in the meta data repository.
- The data satisfies the needs of the business - The data has value to the enterprise. High-quality data is useless if it's not the data needed to run the business. Marketing needs customer and demographic data; accounts payable needs vendor and product data.
- The user is satisfied with the quality of the data and the information derived from that data - While this is a subjective measure, it is, arguably, the most important indicator of all. If the data is of high quality, but the user is still dissatisfied, you or your boss will be out of a job.
- The data is complete - All the line items for an invoice have been captured so that the bill states the full amount that is owed. All the dependents are listed for an employee so that invoices from medical providers can be properly administered.
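A completeness check along these lines might be sketched in Python. The invoice structure and the rounding tolerance are assumptions for illustration:

```python
# Hypothetical invoice record: the stated total should equal the sum of the
# captured line items; a mismatch suggests a line item was not captured.
invoice = {"total": 150.00, "line_items": [100.00, 30.00]}

def is_complete(inv):
    """Return True if the captured line items account for the full bill."""
    return abs(sum(inv["line_items"]) - inv["total"]) < 0.01

print(is_complete(invoice))  # False: 20.00 of the bill is unaccounted for
```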
- There are no duplicate records - A mailing list would carry a subscriber, potential buyer or charity benefactor only once. You will only receive one letter that gives you the good news that "You may already be a winner!"
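Duplicate detection for a mailing list can be sketched in Python, assuming a hypothetical match key of normalized name plus postal code. Real duplicate matching is usually fuzzier than this exact-key approach:

```python
# Hypothetical subscriber list; the second entry is the same person with
# different formatting.
subscribers = [
    {"name": "Pat Smith", "zip": "10001"},
    {"name": "pat smith ", "zip": "10001"},
    {"name": "Lee Chan", "zip": "94110"},
]

def dedupe(records):
    """Keep the first record for each normalized (name, zip) key."""
    seen, unique = set(), []
    for r in records:
        key = (r["name"].strip().lower(), r["zip"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

print(len(dedupe(subscribers)))  # 2: each subscriber gets one letter
```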
- Data anomalies - From the perspective of IT, this may be the worst type of data contamination. A data anomaly occurs when a data field defined for one purpose is used for another. For example, a currently unused, but defined field is used for some purpose totally unrelated to its original intent. A clever programmer may put a negative value in this field (which is always supposed to be positive) as a switch.
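The anomaly described above can be surfaced with a simple profiling check. A minimal sketch in Python, where the field, its documented rule and the programmer's sentinel value are all hypothetical:

```python
# The field is documented as always positive; a clever programmer has been
# overloading negative values as a switch. Profiling the data exposes this.
values = [42, 17, -1, 88, -1]
anomalies = [v for v in values if v < 0]
print(anomalies)  # [-1, -1]: values that violate the field's documented intent
```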
Larissa Moss' Answer: I can recommend four books, all chock-full of data quality principles:
- Improving Data Warehouse and Business Information Quality, by Larry English, published by John Wiley & Sons, ISBN 0-471-25383-9.
- The Data Warehouse Challenge: Taming Data Chaos, by Michael Brackett, published by John Wiley & Sons, ISBN 0-471-12744-2.
- Data Resource Quality, by Michael Brackett, published by Addison-Wesley, ISBN 0-201-71306-3.
- Quality Information and Knowledge, by Kuan-Tsae Huang, et al., published by Prentice Hall, ISBN 0-13-010141-9.