Companies around the world spend billions of dollars implementing enterprise applications to better integrate their corporate data and increase the overall value of existing information. With high quality data, they hope to realize the significant savings that comes from consolidated data, including increased operational efficiencies, enhanced compliance initiatives and improved customer relationships. However, despite good intentions, most of these applications fail due to inefficient, outdated data.

Before undertaking any data-driven effort, it is essential to have a clear understanding of the integrity of your current data. Data profiling, or data discovery, provides a diagnosis of your existing data so you can begin building a successful data improvement and integration effort through consistent, accurate and reliable data.


Getting Started

Before you begin any data quality improvement initiative, you need to address some key questions:

  • Do you trust the quality of the data you are using in this initiative?
  • Does the data for this initiative conform to the business rules monitoring process you expect to set up later?
  • Will the existing data support the needed functionality?
  • Is the data fit for its purpose, and will it meet the needs of the new application?

Engaging in any data initiative without a clear understanding of these issues will lead to large development and cost overruns or potential project failures. The effect can be incredibly costly, resulting in substandard customer relations, wasted expenses, poor decisions, lost sales and ultimately, failed businesses.
Many businesses and IT managers find that their data is often confusing and disorienting. Often, organizations do not - or cannot - make the best decisions because they can’t get access to the right data. Just as often, decisions are made based on data that is faulty or untrustworthy.

Data profiling is a fundamental step that should begin every data-driven initiative, yet it is often taken for granted. In fact, every enterprise resource planning (ERP), customer relationship management (CRM) or data warehouse application project should start with data profiling. By identifying data quality issues at the front end of a data-driven project, you can drastically reduce the risk of project failure later.

Data profiling is the first step in the data quality process to help you diagnose and repair the problem. Some of the data profiling techniques and processes being used today can be grouped into three major categories:

  • Structure discovery - Does your data match the corresponding metadata? Do the patterns of the data match expected patterns? Does the data adhere to appropriate uniqueness and null value rules?
  • Content discovery - Is the data complete? Is it accurate? Does it contain information that is easily understood and unambiguous?
  • Relationship discovery - Are there relationships across columns, tables or databases? Is there redundant data?

Structure Discovery: Understanding Metadata and Data Patterns

Structure discovery, also known as structure analysis, helps you determine whether or not the data in a column or table is consistent and meets expectations. Many techniques can validate the adherence of data to expected formats. Any one of these techniques provides insight about the validity of the data. Traditionally, a good place to start is by examining the actual state of data against the metadata on that data source.

Validation of Metadata

Most data has associated metadata - a description of the characteristics of the data. Metadata contains information that indicates data type, field length, whether the data should be unique and if a field can be missing or null.

Data profiling tools scan the data to infer this same type of information. Metadata analysis determines if the data matches the expectations of the developer when the data files were created. Has the data migrated from its initial intention over time? Has the purpose, meaning and content of the data been intentionally altered since it was first created? By answering these questions, you can make more informed decisions about how to use this data going forward.

For companies with hundreds or thousands of data sources, it may be necessary to analyze the metadata across systems to uncover similar data sets within the IT infrastructure. For example, if a company has inventory information across multiple systems, a metadata discovery exercise can help a data analyst aggregate these sources for further data quality efforts.

Pattern Matching

Pattern matching is used to determine whether the data values in a field are in the expected format. This technique can quickly validate whether the data in a field is consistent across the data source and whether that information is consistent with expectations. For example, pattern matching would analyze if a product identifier field contains a product number or other characters. Pattern matching would also uncover whether a field is all numeric, if a field has consistent lengths and other format-specific information about the data.

Basic Statistics

You can learn a lot about your data just by reviewing some basic statistics, such as minimum/maximum values, mean, median, mode and standard deviation. Basic statistics give you a snapshot of an entire data field. As new data is entered, tracking basic statistics over time will give you insight into the characteristics of new data that enters your systems. Checking basic statistics of new data prior to entering it into the system can alert you to inconsistent information and help prevent adding problematic data to a data source.

Content Discovery: Validating Rules and Assessing Data Completeness

Structure discovery provides a broad sweep across your data and often points to problem areas that require further investigation. After you analyze entire tables or columns of data using these structure discovery techniques, you then need to look more closely at each of the individual elements.

Content discovery techniques use matching technology to uncover nonstandard data, frequency counts and outliers to find data elements that don’t make sense. In addition, data verification based on specific business rules can verify data that may be unique to your organization.


Organizational data often comes from a variety of sources, such as different departments, data entry clerks and partners. This is often the root of an organization’s data quality issues. If multiple permutations of a piece of data exist, then every query or report generated by that data must account for each and every instance of these multiple permutations. Otherwise, you can miss important data points which, ultimately, can impact the output of future processes. Fortunately, data profiling tools can discover these inconsistencies and provide a blueprint for a data quality technology to address and fix the problems at hand.

Frequency Counts and Outliers

When there are hundreds or even thousands of records that need to be profiled, it may be possible for a business analyst to scan the file and look for values that appear to be incorrect. But, as the data grows, this becomes an overwhelming task. Many organizations spend hundreds of thousands of dollars paying for manual validation of data. This is not only expensive and time-consuming, but manual data profiling is inaccurate and susceptible to human error.

Frequency counts and outlier detection give you techniques that can limit the amount of business analyst fault detection required. In essence, these techniques highlight the data values that need further investigation. You can gain insight into the data values themselves, identify data values that may be considered incorrect and drill down to the data to make a more in-depth determination about the data.

Outlier detection also helps you pinpoint problem data. Whereas frequency count looks at how values are related according to data occurrences, outlier detection examines the (hopefully) few data values that are remarkably different from other values.

Business Rule Validation

Every organization has basic business rules. These business rules cover everything from routine rules (“Address must be valid”) to complex, specific formulas (“If a loan-to-value ratio is .8 or above, the policy must have private mortgage insurance”).

You can check many basic business rules at the point of data entry and, potentially, recheck these rules on an ad hoc basis. Problems that arise from lack of validation can be extensive, including overpaying expenses, running out of inventory and undercounting revenue.

Because business rules are often specific to an organization, you will seldom find data profiling technology that will provide these types of checks out of the box. These prebuilt business rules may provide domain checking, range checking, lookup validation or specific formulas. In addition to the canned data profiling validation techniques, a robust data profiling process must be able to build, store and validate against an organization’s unique business rules.

Relationship Discovery: Data Redundancy and Similarity Discovery

The third major type of data profiling is relationship discovery. This aspect of profiling discovers what data is in use and links data in disparate applications based on their relationships to each other or to a new application being developed. Different pieces of relevant data spread across many individual data stores make it difficult to develop a complete understanding of the data.

Organizations today maintain a massive amount of data on customers, products, suppliers, personnel, finances and employees. Additionally, organizations get data from partners, purchase data from list providers and acquire industry-specific data from other sources.

As a result of the different sources of data, companies typically don’t fully understand much of their data and cannot effectively manage this data. To remedy this prevalent problem, you must understand all of these sources and the relationships of data across different applications.

A successful data quality initiative starts with a comprehensive and honest examination of your existing data and sources. Data profiling gives you the thorough diagnosis you need to begin building a credible foundation of high quality data throughout your organization.

However, it is not a “once and done” project. Proper data profiling must be part of a larger data quality methodology that ties these processes together in a cohesive fashion through an integrated and phased approach.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access