When the master data management (MDM) show rolls into an organization, a data quality (DQ) program, targeted at the master data, is usually part of the effort. Typically, one of the first tasks of a DQ program is to establish a baseline measurement of the master data quality. Baseline measurement of quality involves profiling the data, which, in turn, requires the definition of business rules that specify the expected characteristics of the data.
Unfortunately, enterprises new to MDM and DQ can have difficulty defining business rules on their data, especially in cases where MDM is centralized at the enterprise level. This article will discuss some of the reasons business rules are difficult to define, then explore some basic profiling that can raise enterprise understanding of problems in the data and start the conversation about business rules.
Business Rule Difficulty
There are many reasons why those first sessions on business rules contain blank stares and uncomfortable silences. A few typical ones include:
Software package implementations. Many organizations have harrowing stories to tell about major software package implementations. Often the conversion of legacy data into these packages is a significant problem for the project. The vendors of these packages know this and may offer configurations of the programs that allow soft validation of data entry. Usually there is an intention to tighten up the data entry after conversion, but other priorities for resources intervene and the soft validation remains. It takes time to define business rules; it takes time to implement them in the data structures of the package. Data controls, if they exist at all, may become procedural only and undocumented.
Domain conflation. Organizations with multiple product divisions often share the same software but use it in different ways. For example, a manufacturing enterprise may use the same inventory package for two similar but differently configured products. Some of the shared code domains that classify product may actually contain two different classification schemes. Bringing a shared domain under a unified set of controls requires coordination and communication between two divisions that usually don't work together. The result is that there isn't a single set of rules for the domain, so it's hard to define a wrong result. In fact, the rules for one division may contradict the rules of the second division.
Temporal conflation. Organizations reorganize the responsibilities for data control over time. Each time control changes, individuals start new patterns in the data and forget what the prior data owners have done. Sometimes the new owners want to improve the data quality, so going forward there are new rules but old data is not converted. After several successive ownership changes, no one in the organization can explain the patterns of the historical data. Stewardship quality can also deteriorate over time; by the time the MDM effort arrives, the assigned staff may have only a rote understanding of their data.
Domain distribution. Some enterprises distribute data across copies of systems. For example, an international manufacturing company might keep regional copies of its inventory system. This effectively distributes a code domain within the enterprise. Stewardship quality may vary across regions. In addition, there may be variations in the conduct of business. These differences can introduce variance in the classification of the same part entity until it is difficult to establish a unified enterprise business rule.
Interaction of Conflation, Distribution and Implementation
It's not unusual for these conditions to reinforce each other across time. First, the software package goes in with soft data entry. Domain conflation can occur at implementation time or in the early post-implementation period. As the software is used over time (in some cases, over decades), temporal conflation occurs. The quality of a distributed domain may begin clean but degrade through temporal conflation.
So, in these situations, how does an enterprise get the business-rules ball rolling? It turns out that the cart (profiling) can sometimes go before the horse (business rules).
Profiling for Business Rules
If an enterprise cannot define its business rules, data profiling can offer a place to start using basic statistical analysis techniques and very high-level expectations about the data. Almost all data profiling tools have the ability to report measurements of completeness (nulls), uniqueness and consistency (correlations between values in two or more domains). Analysis of the distributions of these qualities on data domains can help staff understand their data quality problems and introduce them to the effort of drilling down into their data for rules.
Completeness
It is a truism that identifiers should never be null, but code domains can often be a different story. Often null code values conflate multiple meanings, e.g., not applicable, not ready or an entry error. Analyzing the set of records with null values in greater detail can separate these meanings, so that explicit values for them can be added to the domain. Here is a simple profiling protocol:
- Run a profile and get the percentage of records with null values in the domain,
- Extract the null value record set for analysis and
- Subset the record set by additional codes or shared characteristics.
At this point, set up work sessions to examine the subsets, starting with the largest and working your way through them. At this level of scope, it is often possible for staff to see a common theme in the data, articulate specific questions and follow up with subject matter experts. Down the road, these subsets can be targeted over time as part of the data quality plan.
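The three-step protocol above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the output of any particular profiling tool: the part records, the field names (`class_code`, `status`, `region`) and the helper `null_profile` are all hypothetical, and blank strings are assumed to count as nulls.

```python
from collections import Counter

# Hypothetical part records; the field names are illustrative assumptions.
records = [
    {"part_id": "P1", "class_code": "A", "status": "ACTIVE", "region": "EU"},
    {"part_id": "P2", "class_code": None, "status": "DRAFT", "region": "EU"},
    {"part_id": "P3", "class_code": "", "status": "DRAFT", "region": "US"},
    {"part_id": "P4", "class_code": None, "status": "OBSOLETE", "region": "US"},
    {"part_id": "P5", "class_code": "B", "status": "ACTIVE", "region": "US"},
    {"part_id": "P6", "class_code": None, "status": "DRAFT", "region": "EU"},
]


def is_null(value):
    """Assumption: None and blank strings both count as null."""
    return value is None or str(value).strip() == ""


def null_profile(rows, domain, subset_by):
    """Return (percent null, null record set, subset sizes largest first)."""
    # Step 2: extract the records whose domain value is null.
    null_rows = [r for r in rows if is_null(r.get(domain))]
    # Step 1: percentage of records with a null value in the domain.
    pct_null = 100.0 * len(null_rows) / len(rows) if rows else 0.0
    # Step 3: subset the null record set by additional codes.
    subsets = Counter(tuple(r.get(k) for k in subset_by) for r in null_rows)
    return pct_null, null_rows, subsets.most_common()


pct, null_rows, subsets = null_profile(records, "class_code", ["status"])
print(f"{pct:.1f}% of records have a null class_code")
for key, count in subsets:
    print(key, count)
```

Because the subsets come back largest first, the list doubles as a rough agenda for the work sessions. Passing more fields in `subset_by` (for example `["status", "region"]`) narrows each subset further.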
Non-Uniqueness
Patterns of nonunique values in code domains can be quite revealing. Data profiling tools can typically tell you the number of distinct values in a code domain and give you the number of records that contain each value. This simple statistic can be the basis for several interesting explorations of the data.
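As a minimal sketch of that statistic, the snippet below counts distinct values in a code domain and the number of records carrying each one. The sample codes and the helper `value_frequencies` are hypothetical; a profiling tool would normally produce this report directly.

```python
from collections import Counter

# Hypothetical values of one code domain, one entry per record.
codes = ["A", "A", "B", "A", "C", "B", "A", "a"]


def value_frequencies(values):
    """Return (value, record count, percent of records), most frequent first."""
    counts = Counter(values)
    total = len(values)
    return [(v, n, 100.0 * n / total) for v, n in counts.most_common()]


freq = value_frequencies(codes)
print(f"{len(freq)} distinct values in {len(codes)} records")
for value, count, pct in freq:
    print(f"{value!r}: {count} records ({pct:.1f}%)")
```

Values that appear on only a handful of records, such as the lowercase `a` in this made-up sample, are natural candidates for a closer look.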









