A common information quality problem exists when the meaning of data stored in historical databases (such as a data warehouse) changes over time. Correct handling of changed data definition is critical for archived data collections or for data warehouses and data marts that maintain historical data that may be compared over time. In this column, I identify the types of data definition change and describe guidelines for determining the proper method for handling those changes.
There are three basic types of definition changes to an attribute or fact and four basic ways to implement them. The three types of changes are: the attribute has a new meaning, the domain or set of values is changed and the business rules for determining the value for the attribute have changed.
The first guideline for changed data definition is that you have a defined process for handling changes to data definition, a basic "configuration management" process. The information resource manager (sometimes called data administrator or data resource manager) is the process owner for this and other information management processes. Approvals required for changes include those of the information resource manager, subject-oriented information architect or information manager and the business information steward(s) (business subject-matter experts) who oversees the data whose definition is changed. The business steward's responsibility is to assure that the definition change is correct from a business meaning to meet the needs of all stakeholders of that data. The information resource manager's responsibility is to assure the changes conform to sound data design and definition guidelines.
The four basic ways of implementing changed data definition are:
- Create a new attribute and field and maintain both old and new fields.
- Replace the attribute and field, clearing out the historical data prior to the change.
- Use the same field with the new meaning, documenting the changes and effective dates.
- Convert or recalculate the historical data to meet the new data definition.
The correct handling varies based on the type of change and the requirements of the processes that operate on the data.
General guidelines for all data definition change management include:
Conduct an "impact" analysis of all business processes and applications to assure no negative impacts or side effects. This includes processes that may not be automated. (It is important to have documented all information stakeholders and processes that require information groups to support this impact analysis.)
Analyze and verify the changed definition meets all existing and potential knowledge workers' expectations before the change is approved. Data must be managed in a way to meet all information customers' needs, not just one business area or a small group of information customers.
Create a new version of the changed object (entity type, attribute, relationship, domain value or business rule) in the repository or dictionary except for minor changes to definition wording to improve communication. Some even document minor changes as sub- versions to assure complete control of the change process. Maintain historical versions as long as data that was created under that prevailing definition is maintained or archived.
Maintain effective dates and end dates of the changes. The ability to analyze patterns and trends over time demands this. A retail company loaded data archived from seven years earlier to answer a legal query. The data was rendered totally useless, merely strings of unintelligible 1's and 0's, because they failed to maintain historical data definition and schema design.
Document those who requested the change and those who approved it.
Communicate in advance to all information stakeholders what change is being made and why, if necessary and when the change will be implemented so all will be able to prepare any changes in their work procedures and/or will not be surprised by what may appear as anomalies in the data they receive.
Specific guidelines for each type of definition change include:
The attribute has a new meaning. If the attribute has a new meaning or definition, it represents a new fact or the previous definition was incorrect. Each attribute, implemented as a data field or column, should represent a single characteristic about a real-world object and event. When data is modeled correctly, the definition of that fact should remain relatively stable. The definition of "birth date" of a "person," or "medical diagnosis [code]" of a "medical encounter" will not change meanings. A definition may be refined, but not changed unless it was incorrect in the first place.
If the definition was incorrect and you find frequent data definition meaning changes, you have a defective data definition process. Analyze the root cause and address improvements to eliminate faulty definitions during requirements and data-modeling activities. Document and keep a summary of all instances of corrective changes along with the time, costs and problems created in correcting the definition, as this represents part of the cost of nonquality data definition.
If the definition change actually represents a different meaning, you have a new attribute. An "order placed date" attribute definition may be changed from "the date the order is entered into the system" to "the date the customer places their order with us." This definition "change" actually indicates two different facts. If some knowledge workers need to know the "old" attribute while others need the new, you should create a separate attribute and field. The names of the attributes should be different, as they represent two different facts. Name the attributes with intuitive names that are clear to all stakeholders. Maintain both attributes and fields. If the old definition is obsolete and is not required by any processes and is not likely to be required in the future and the historical data is not needed, then it may be deleted. If the old definition is not needed by current or future processes but the historical data is required for processes that analyze data during that time period, maintain the historical data with null values for that attribute from the time of discontinuance.
The domain or set of values is changed. Domains change over time as new codes (e.g., new medical procedures are developed or new countries such as Slovenia and the Czech Republic emerge) or when ranges change due to inflation or growth (e.g., product prices or census figures increase).
For volatile code attributes whose domain values may change over time, such as "medical procedure code," maintain effective dates and end dates for code values. Develop reference tables containing the code values and effective dates and end dates so that data integrity tests validate the codes according to the effective dates. This assures an insurance claim will not be approved for a procedure that has been discontinued or for a procedure that is no longer eligible for coverage for a given policy type.
Define all domain code values to document and assure understanding of all meanings.
Confirm that domain values are consistent with the definition of the attribute. Many times new values are proposed to make one field mean more than one type of fact to prevent having to change the data structure. One company added a code value of "C" to "gender code" to identify dependent children of their customers. They saw this as a convenient way to record a different type of customer without having to modify the structure of a database not designed to support relationships among customers. As a solution, this creates more problems than it solves. It forces applications and simple queries to contain logic to prevent mishandling or getting the wrong answer, and it causes transformation processes to have more complex conversion logic to migrate the data to decision-support databases.
For domains of code values, the rule of thumb is to maintain the most atomic level codes the enterprise must know (such as for "standard industry code"). You can always aggregate into more generalized, summarized groups; but once you lose the most atomic- level facts or aggregations, you can no longer reconstruct them.
For externally imposed code values such as those of regulatory bodies, the rule of thumb is to maintain the most atomic level of detail that is required to be known by your enterprise or that is likely to be required in the future by the regulatory body. Again, you can always aggregate into more generalized summaries if you have the base data. This is much less expensive than having to go back to reconstruct the finer levels of detail after the data is initially captured. For example, regulatory bodies constantly change the categories of the "ethnic origin" attribute.
If there are multiple views with different levels of aggregation of coded domains, create reference tables to map more atomic-level codes into aggregates for each of the views. For example, marketing may need only generalized industry codes to market specific insurance products; but underwriters need detailed industry codes to rate the risk of a specific company.
The business rules for determining the value for the attribute has changed. Because business rule changes occur frequently in the world of business, flexible and easily accessible documentation is required. Examples include new criteria to determine premium customers or new formulas for calculating the daily average balance of a loan or daily stock price for an equity traded on a stock exchange. It is most critical to have flexible implementation of volatile business rules to minimize cost of application maintenance. Business rule changes tend to impact data warehouses significantly as they affect trend analysis and comparisons of data over time.
Reuse an existing field using the new business rules from a point in time forward and maintain the historical data with the former business rules if the meaning of that field is semantically the same. You can also reuse an existing field using the new business rules if both "average daily balance" calculations are average daily balances but with different formulas and if no knowledge workers require the old business rule or calculation. Document the different business rules with effective dates and provide the equivalent of an asterisk on a label to a footnote reference where the different business rules or calculations are defined. This assures that knowledge workers are aware of what may appear as anomalies in the data.
Reuse an existing field, clearing out the previous historical values if the meaning of that field is semantically the same and if the data computed under the old business rules no longer needs to be known by any knowledge worker.
Recalculate the data and update and replace the historical data if comparison of data will be distorted, if it is possible to recalculate the historical data, if the base data is retained for original facts that must continue to be known and if future recalculations may be required. (For example, when stocks split, their original price must be computed in order to properly reflect actual share price performance. A stock that had a two-for-one stock split would show as having a 50 percent decrease in price per share without having a recalculation.)
Create a new field, maintaining the current field, if the business rule change actually defines a new fact and both facts need to be known.
Create a new record or row if the meaning of that field is semantically the same, if the business rule represents one "view" of the information (for example, the enterprise has its business rule for calculating "average daily balance" for a loan, but a regulatory agency has determined it needs to see "average daily balance" for outstanding loans calculated according to another formula) and if it is possible that other business rules or calculations of the same data will be needed by other groups of knowledge workers (for example, a state regulatory body). This is also true when different business divisions may have loans that are regulated by different bodies, but the enterprise wants to "normalize" them with its own business rules. There could be several ways of calculating one "attribute" by the different groups. A qualifier, or one of the primary key attributes, would identify the view in which a specific business rule applies. In the example of "average daily balance," the qualifying attribute might be named "balance calculation method type" if the formula is one of different formal methods or "calculation method organization ID" if the formula is defined and controlled by a specific organization.
There are "correct" ways for handling changes to data definition from a quality viewpoint. They are based on three components. The first is the requirement the customers have of the information. The second is the requirement for data to accurately reflect the characteristics that need to be known about the real-world objects and events, and the third is the requirement to know and document changes so that processes and applications can properly validate the data when capturing it and interpret the data when using it.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access