Data Profiling Made Easy, Part 1

  • June 01 2007, 1:00am EDT

Companies are still spending tremendous amounts of money to share data across the enterprise. Guess what? The data is still not that good! As technology moves us ever closer to real-time data, anomalies and quality issues spread quickly across the organization. Information still drives strategy development in most organizations: most companies focus on improving customer satisfaction, integrating all customer data, understanding customer churn and running day-to-day operations more effectively. In all of these efforts, data quality is, and has long been, a major concern.

Incorrect or inaccurate data in mission-critical applications can lead to numerous problems, including:

  • Inability to understand all the products owned by a customer,
  • Ineffective marketing campaigns and sales contact lists when contact information contains errors,
  • Inefficient data integration strategies for master data management (MDM) and customer data integration (CDI) initiatives, and
  • Bad decisions based on incomplete, poor-quality data.

The consequences of poor data quality range from lost revenue and increased costs to missed opportunities in sales and customer retention.
The good news is that methodologies and technologies exist to address these issues. The first step in fixing your data quality problems is understanding what is wrong and where it originates. Data profiling can help accomplish both. Some of the benefits of data profiling are:

  • Validation of business rules and processes,
  • Understanding of the data before any integration or ETL programming begins,
  • Faster project delivery, because data anomalies are known in the design phase rather than the development phase, where they cost more to fix,
  • Higher project quality, because the data is better understood before development, and
  • Reduced risk of project failure, because mitigation for issues is considered early in the design phase.

Data profiling (also known as data assessment) is used during source system analysis (part of data acquisition), as well as for validating and testing the relationships discovered during the extract, transform and load (ETL) or integration process and during pre-implementation activities. Source system analysis includes the following steps (a brief staging sketch follows the list):

  • Identifying source data systems and availability of the data (system inventory),
  • Gathering the table or file layouts,
  • Estimating data volumes by table or file,
  • Determining the time range of data (if required for business intelligence),
  • Determining any other selection or exclusion criteria,
  • Obtaining sample data or setting up batch data pulls into a staging or profiling database,
  • Completing data profiling functions on each source or in combinations of data from multiple sources, and
  • Documenting and maintaining the profiling results to share with the ETL and integration processes.
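
As a minimal sketch of the staging and volume-estimation steps, assuming the source extracts have already landed as CSV files (the file and table names here are illustrative, not from any particular system), the data could be pulled into a local SQLite profiling database like this:

```python
import csv
import sqlite3

# Hypothetical source extracts; file and table names are illustrative.
SOURCE_FILES = {"customers": "customers.csv", "orders": "orders.csv"}

conn = sqlite3.connect("profiling.db")  # local staging/profiling database

for table, path in SOURCE_FILES.items():
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # the file layout gathered earlier
        cols = ", ".join(f'"{col}" TEXT' for col in header)
        conn.execute(f'DROP TABLE IF EXISTS "{table}"')
        conn.execute(f'CREATE TABLE "{table}" ({cols})')
        placeholders = ", ".join("?" for _ in header)
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()

    # Estimate data volumes by table once the sample is staged.
    rows = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    print(f"{table}: {rows:,} rows staged")
```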

Data profiling means gaining an understanding of the data relative to a quality specification. Consider executing the following profiling analyses, several of which are sketched in code after the list, as steps in any data quality or profiling program:

  • Structural analysis of the data stores (each table);
  • Pattern analysis for given fields within each data store;
  • Range or threshold analysis for required fields;
  • Analysis of counts like record count, sum, mode, minimum, maximum, percentiles, mean and standard deviation; and
  • Cross-table (or cross-system) relationship identification.
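
Here is a minimal sketch of several of these analyses, using Python with the pandas library on an illustrative customer table (the column names, sample values and thresholds are all assumptions for the example):

```python
import pandas as pd

# Illustrative data; in practice this comes from the staging database.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 3, 5],
    "state": ["CO", "Colorado", "TX", None, "NY"],
    "balance": [100.0, 250.5, -10.0, 75.0, 9999.0],
    "phone": ["303-555-0101", "3035550102", "512-555-0199", None, "212-555-0123"],
})

# Structural analysis: data types and sparsity (nulls) per field.
print(df.dtypes)
print(df.isna().mean())

# Pattern analysis: reduce each phone number to a digit/letter pattern.
def pattern(value):
    if pd.isna(value):
        return "<null>"
    return "".join("9" if ch.isdigit() else "A" if ch.isalpha() else ch
                   for ch in str(value))

print(df["phone"].map(pattern).value_counts())

# Range or threshold analysis: flag balances outside an assumed valid range.
print(df[(df["balance"] < 0) | (df["balance"] > 5000)])

# Counts: record count, mode, min, max, percentiles, mean, standard deviation.
print(len(df))
print(df["balance"].describe())
print(df["state"].mode())

# Cross-table relationship check: orders whose customer_id has no match.
orders = pd.DataFrame({"order_id": [10, 11], "customer_id": [2, 99]})
print(orders[~orders["customer_id"].isin(df["customer_id"])])
```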

Structural analysis checks the format of the data, for instance, data type information and sparsity. It is applied to every field within a table or source of data, and it ensures that numbers are what they should be, within a tolerance or range of values. Other checks performed within structural analysis, sketched in code after the list, include:

  • Nulls or sparsity,
  • Spaces or carriage returns,
  • Frequency or number of distinct values and percentage for each field, and
  • Uniqueness of the primary or alternate key structures.
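
A short sketch of these checks on a single table, again with pandas and an illustrative DataFrame whose primary key is assumed to be customer_id:

```python
import pandas as pd

# Illustrative data with deliberate defects: a blank state, a null state,
# a trailing carriage return and a duplicated key.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "state": ["CO", " ", None, "TX\r\n"],
})

# Nulls or sparsity per field.
print(df.isna().sum())

# Spaces or carriage returns masquerading as real values.
blank = df["state"].fillna("").str.strip(" \r\n") == ""
print(f"blank or whitespace-only states: {blank.sum()}")

# Frequency: distinct values and percentage of the whole for each field.
for col in df.columns:
    print(df[col].value_counts(dropna=False, normalize=True))

# Uniqueness of the primary (or alternate) key.
print(f"duplicate customer_id values: {df['customer_id'].duplicated().sum()}")
```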

Structural analysis shows the business analyst and/or programmer the number of records that are null. If a field such as state is null and the business requirement is to create a mailing list, then some additional programming, such as deriving the state from the ZIP code, will be required to make this data complete and null free.
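
That remediation, sketched below with a deliberately tiny, hypothetical ZIP-to-state lookup, might look like this:

```python
import pandas as pd

# Hypothetical lookup; a real implementation would use a complete ZIP database.
ZIP_TO_STATE = {"80202": "CO", "10001": "NY"}

df = pd.DataFrame({"zip": ["80202", "10001", "80202"],
                   "state": [None, "NY", None]})

# Fill null states from the ZIP code so the mailing list is complete.
df["state"] = df["state"].fillna(df["zip"].map(ZIP_TO_STATE))
print(df)
```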
Structural analysis also shows frequency: the number of records holding a certain value and the percentage of the whole associated with that value. This information is very useful when states (or other values) appear in the data store both spelled out completely and abbreviated. For example, if our objective is to make sure all states are in abbreviated format and we have 122 records where the state of Colorado is spelled out, then, again, some additional programming will be required.
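
That additional programming might look like the sketch below, where a frequency count surfaces the 122 spelled-out records and a mapping (deliberately incomplete here) standardizes them:

```python
import pandas as pd

# Illustrative data matching the example: 122 spelled-out Colorado records.
df = pd.DataFrame({"state": ["CO"] * 878 + ["Colorado"] * 122})

# Frequency analysis: counts and percentage of the whole for each value.
print(df["state"].value_counts())
print(df["state"].value_counts(normalize=True))

# Standardize spelled-out names to abbreviations (mapping is illustrative).
FULL_TO_ABBR = {"Colorado": "CO"}
df["state"] = df["state"].replace(FULL_TO_ABBR)
print(df["state"].value_counts())
```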

Data profiling is just the beginning of a total data quality solution. Look for my future columns that will explain the remaining data profiling analysis steps.  
