Data Profiling: How to Get Started Without Business Rules

  • April 02 2009, 1:25pm EDT

When the master data management show rolls into an organization, a data quality program, targeted at the master data, is usually part of the effort. Typically, one of the first tasks of a DQ program is to establish a baseline measurement of the Master Data quality. Baseline measurement of quality involves profiling the data, which, in turn, requires the definition of business rules that specify the expected characteristics of the data.

Unfortunately, enterprises new to MDM and DQ can have difficulty defining business rules on their data, especially in cases where MDM is centralized at the enterprise level. This article will discuss some of the reasons business rules are difficult to define, then explore some basic profiling that can raise enterprise understanding of problems in the data and ‘start the conversation’ about business rules.

Business Rule Difficulty

There are many reasons why those first sessions on business rules contain blank stares and uncomfortable silences. Few typical ones include:

Software package implementations. Many organizations have harrowing stories to tell about major software package implementations. Often the conversion of legacy data into these packages is a significant problem for the project. The vendors of these packages know this and may offer configurations of the programs that allow “soft” validation of data entry. Usually there is an intention to “tighten up” the data entry after conversion, but other priorities for resources intervene and the soft validation remains. It takes time to define business rules; it takes time to implement them in the data structures of the package. Data controls, if they exist at all, may become procedural only and undocumented.

Domain conflation. Organizations with multiple product divisions often share the same software but use it in different ways. For example, a manufacturing enterprise may use the same inventory package for two similar but differently configured products. Some of the shared code domains that classify product may actually contain two different classification schemes. Bringing a shared domain under a unified set of controls requires coordination and communication between two divisions that usually don’t work together. The resulting situation is that there isn’t a single set of rules for the domain – it’s hard to define a wrong result. In fact, the rules for one division may contradict the rules of the second division.

Temporal conflation. Organizations reorganize the responsibilities for data control over time. Each time control changes, individuals start new patterns in the data and forget what the prior data owners have done. Sometimes the new owners want to improve the data quality, so going forward there are new rules but old data is not converted. After several successive ownership changes, no one in the organization can explain the patterns of the historical data. Stewardship quality can also deteriorate over time; by the time the MDM effort arrives, the assigned staff may have only a rote understanding of their data.

Domain distribution. Some enterprises distribute data across copies of systems. For example, an international manufacturing company might keep regional copies of its inventory system. This effectively distributes a code domain within the enterprise. Stewardship quality may vary across regions. In addition, there may be variations in the conduct of business. These differences can introduce variance in the classification of the same part entity until it is difficult to establish a unified enterprise business rule. 

Interaction of Conflation, Distribution and Implementation

It’s not unusual for these conditions to reinforce each other across time. First, the software package goes in with soft data entry. Domain conflation can occur at implementation time or in the early post-implementation period. As the software is used over time (in some cases, over decades), temporal conflation occurs. Domain distribution quality may begin clean but degrade through temporal conflation.

So, in these situations, how does an enterprise get the business-rules ball rolling? It turns out that the cart (profiling) can sometimes go before the horse (business rules).

Profiling for Business Rules

If an enterprise cannot define its business rules, data profiling can offer a place to start using basic statistical analysis techniques and very high-level expectations about the data. Almost all data profiling tools have the ability to report measurements of completeness (nulls), uniqueness and consistency (correlations between values in two or more domains). Analysis of the distributions of these qualities on data domains can help staff understand their data quality problems and introduce them to the effort of drilling down into their data for rules.


It is a truism that identifiers should never be null, but code domains can often be a different story. Often null code values conflate multiple meanings, e.g., not applicable, not ready or an entry error. These values can be added to the domain by analyzing the set of records with null values in greater detail. Here is a simple profiling protocol:

  • Run a profile and get the percentage of records with null values in the domain,
  • Extract the null value record set for analysis and
  • Subset the record set by additional codes or shared characteristics.

At this point, set up work sessions to examine the subsets, starting with the largest and working your way through them. At this level of scope, it is often possible for staff to see a common theme in the data, articulate specific questions and follow up with subject matter experts. Down the road, these subsets can be targeted over time as part of the data quality plan.


Patterns of nonunique values in code domains can be quite revealing. Data profiling tools can typically tell you the number of distinct values in a code domain and give you the number of records that contain each value. This simple statistic can be the basis for several interesting explorations of the data.

Small Sets

Let’s say you have a 500,000 record set that has a code domain with 500 distinct values. If the code were evenly distributed, we would expect each value to code 1,000 records. Of course, in real data, there is a distribution curve of these record subsets, and we can examine this distribution for anomalies.

For example, there may be code value with only one record. Typically, codes are intended to group data, so groups of one are unusual. We don’t, however, have to stop at groups of one. In a 500,000 record set, we might establish a threshold size for an anomalous group; it could be .1 percent of the set (500) or even .01 percent of the record set (50). Typically, the distribution of set size will run from 1 up to the threshold (1 to 500 or 1 to 50). 

Large Sets

Conversely, we can also have anomalies at the other end of the distribution. In a 500,000 record set with 500 code values, it would be unusual to find a single value coded to 25 percent or 50 percent of the records. Here again, one can set a threshold for an anomalous group, e.g., any group greater than 50,000. The larger the group, the more likely it is not homogenous for the coded characteristic.

As before, we can now set up work sessions to examine the subsets, selecting the groups above or below the threshold, as appropriate. Common patterns typically emerge that lend themselves to articulation of business rules.


Many data profiling tools will tabulate correspondence between two (or more) domains. The domains can be resident on a single record or on different records in different data structures. In this case, the records must share a common join value. The measurements typically show the percentage match between a primary domain and the second, matched domain. For example, Domain A, value 123 matches to Domain B, value 234, 100 percent. When the correlation is less than 100 percent, the tool gives the frequency distribution of value matches, e.g., 234, 50 percent, 345, 40 percent, 456, 10 percent.

High Correlation

A variety of correlation patterns are possible, several of which can help staff discover business rules. For example, a code distributed across multiple system copies (domain distribution) should correlate 100 percent on the same entity instance. For example, Part A in Region 1 should have the same code value as Part A in Region 2. If it does not correlate 100 percent, then the frequency distribution will involve two or more values. Typically, there will be an unbalanced distribution of matched values, e.g., 123, 99 percent, 234, 1 percent.

Low Correlation

Conversely, some code domains should have very low correlations. The anomalous cases will show very small frequency distributions of matches. For example, code A might show an average of 10 values of code B matching (a broad frequency distribution). Values of code A that show only one or two matching B code values could be classed as anomalies for investigation.

In some cases, the entity characteristics coded may partially correlate for subsets of the overall entity population. Examination of the small match sets will confirm which characteristics are crossing.

As before, we can now set up work sessions to examine the anomalous groups and generate specific questions that will help discovery of the underlying business rules.

For organizations new to MDM and DQ, data profiling can be a bit threatening to business staff, especially if they are told it can only be done with business rules (which they have never done). Don’t despair – you can use basic high/low, large/small patterns of statistical analysis to get their feet wet. Before you know it, the business staff will be off and running, drilling into those anomalies with a vengeance!

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access