Data Profiling in a Compliance World

By Amiya Mansingh and Sreedhar Srikant
Published October 1, 2005, 1:00am EDT

Recent regulatory initiatives such as Basel II, Sarbanes-Oxley and HIPAA have placed increasing responsibility on organizations for managing enterprise data assets. Data management is increasingly seen as a crucial component of the overall enterprise architecture. Because of new and emerging compliance requirements, enterprise data is now scrutinized even more closely than enterprise applications. Compliance regulations specifically highlight the need for sound data policies, procedures and processes to ensure data accuracy.

Before we delve into the details of compliance initiatives, let us examine another regulated industry: the supply of potable water, regulated by the Environmental Protection Agency. The regulations mandate the quality level water must meet to be deemed potable. Town planners must have an effective plan that encompasses the systems, people and processes needed to deliver the quality level the regulations mandate.

The planning process starts with the identification of potential sources of potable water. Once the sources are identified, the final set is selected depending on quality, growth needs and other parameters. To ensure water quality, a filtration system needs to be in place, either at the point of delivery or as a centralized system. A centralized approach is typically chosen for reasons of scale, efficiency and control. For the central filtration system to work effectively, city planners design holding tanks where water can flow through a series of cleansing processes before being delivered. Even though most systems are designed to ensure the highest quality, city planners also institute regular monitoring of water quality. It is safe to say that water quality regulation is a stable and mature process.

In contrast, data quality is far from being an established process. According to a recent estimate by The Data Warehousing Institute (TDWI), poor data quality costs U.S. businesses more than $600 billion a year. Many businesses don't even have a data quality monitoring and management system in place, and the end users are burdened with deciphering the data along with all its impurities. Furthermore, many executives do not even know how adversely poor quality data impacts their enterprise.

Data Profiling Defined

With compliance initiatives focusing on the quality of data, a systematic examination of data is necessary to identify any problems, pinpoint their source and identify the process changes necessary to rectify them. Data profiling is essentially this examination, and it can answer the "what" and "where" questions associated with data quality. Profiling sheds light on the content, context and structure of data, providing information regarding its accuracy and completeness. In this article, we will discuss the process of profiling using the Basel II regulation as a backdrop.

To illustrate the importance of data quality and profiling, let us examine the Basel II compliance initiative. One of the key metrics for estimating regulatory capital under the Basel II guidelines is loss given default (LGD). LGD expresses the magnitude of loss when a default occurs, as a percentage of the exposure. It is used along with the probability of default (PD) and other factors such as exposure at default (EAD, the potential credit loss in dollar terms for a given account or pool of accounts) and maturity to calculate the capital requirement.
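As a simplified point of reference (a standard credit risk identity rather than the accord's full capital formula), these inputs relate as follows: expected loss is EL = PD × LGD × EAD, with PD and LGD expressed as percentages and EAD in dollar terms. The regulatory capital requirement is then sized to cover unexpected losses beyond this expected amount, which is why an error in any one of these inputs flows directly into the capital calculation.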

LGD is typically calculated using historical averages classified by a combination of debt type (loans, bonds, etc.) and debt seniority (secured, subordinate, etc.). Let us suppose that a financial institution is using the Advanced Internal Ratings Based (AIRB) approach, which draws on historical information from previous defaults at the bank, and that an analytical model has identified the following factors as the determinants of LGD (an illustrative calculation follows the list):

  1. Historical averages, which include:
    a. Amount recovered
    b. Time taken to recover
    c. Costs incurred to recover
    d. Intermediate recoveries
    e. Discount rate used 
  2. Debt type (loans, bonds, etc.)
  3. Seniority (secured, unsecured, subordinate, etc.)
  4. Industry indicator (type of industry, usually referred to by Standard Industry Code or North American Industry Classification System)

It is crucial that these factors are accurately represented in the bank's data assets to ensure precise calculation of LGD (and eventually the capital requirement). Capital is the amount that the bank should hold to cover unexpected losses. A miscalculation of capital due to erroneous data can either result in a penalty or tie up unnecessary money in reserves.
Profiling ensures that these factors meet the data quality requirements prescribed by the Basel II accord and by the financial institution itself. We will break these requirements down as they pertain to the form (or structure), content and context of the data.

  • Form refers to the structural properties of the data, such as name, type and format. For example, the format of the data will vary across the enterprise depending on its origination, movements and residency. A date field could be stored as dd/mm/yyyy, yyyy/mm/dd or other variations. Naming standards used for the data, business description, length boundaries, data type and so on constitute form-related requirements. These requirements make it clear that content alone is not the determinant of data quality.
  • Content represents the value stored in the specific data element. The content of data will vary across the enterprise depending on its origination, movement and residency. For example, a Social Security number should be nine digits with not all values being equal to zero. Other content-based issues concern the range of values allowed for the attribute, nullability, data- and value-based rules, referential integrity and so on.
  • Context provides meaning to the data by binding it to its usage environment. The context of data will vary across the enterprise, depending on origination and residency. For example, a credit score of 650 may be average for a credit-lending process, but excellent for the purposes of assigning insurance coverage. Other context-related issues include identifying synonyms used, data- and value-based rules and so on.

Figure 1 illustrates sample validation rules that may be used in defining a profiling activity.


Figure 1: Sample Validation Rules
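As a complement to the sample rules in Figure 1, here is a minimal sketch of how form, content and context checks of this kind might be expressed in code. The accepted date formats, the Social Security number rule and the credit score bands are illustrative assumptions drawn from the examples above, not requirements taken from any regulation.

import re

# Illustrative form, content and context checks; the formats and thresholds
# below are hypothetical examples, not rules prescribed by any regulation.

DATE_FORMATS = [r"\d{2}/\d{2}/\d{4}", r"\d{4}/\d{2}/\d{2}"]  # dd/mm/yyyy, yyyy/mm/dd

def check_form_date(value):
    """Form: the date string matches one of the accepted formats."""
    return any(re.fullmatch(fmt, value) for fmt in DATE_FORMATS)

def check_content_ssn(value):
    """Content: nine digits, not all zeros."""
    return bool(re.fullmatch(r"\d{9}", value)) and value != "0" * 9

def check_context_credit_score(score, usage):
    """Context: the same score is interpreted differently by usage environment."""
    if usage == "credit_lending":
        return "average" if 620 <= score <= 680 else "out of band"
    if usage == "insurance":
        return "excellent" if score >= 640 else "review"
    return "unknown usage"

print(check_form_date("2005/10/01"))                       # True
print(check_content_ssn("000000000"))                      # False
print(check_context_credit_score(650, "credit_lending"))   # "average"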


Data Quality Issues in a Compliance Scenario

Typically, compliance initiatives require data to be pooled together from numerous source systems (systems of record, or SORs) for analysis, and large amounts of data are moved between these SORs. Financial institutions that have grown through mergers and acquisitions carry several generations of technology, which causes data inconsistency. Even though industry data standards have evolved, data impediments still exist in most organizations. Data quality issues typically arise in satisfying compliance regulations for the following reasons:

  • Data required for compliance originates in different data sources, each with its own set of nuances (such as naming conventions, data formats, etc.).
  • Data is defined differently across different systems.
  • Data is moved across various systems for performing different types of analyses.
  • Data is resident across different systems, each with its own way of formatting and storing the data.

Data Profiling: Where to Perform?

Figure 2 illustrates a data environment for a typical financial services institution. The SORs are the various source systems that generate the enterprise data. The enterprise data warehouse or data marts are the locations where both internal (SOR) and external data are integrated to create an enterprise view. Profiling is typically conducted at the three target locations identified in the diagram.

Figure 2: Sample Financial Services Data Environment


  1. SOR profiling: SOR profiling is required to identify the authoritative sources from which quality data can be obtained. Structural relationships, data content and data anomalies of the SOR database (relational or otherwise) can be identified by this process. The outputs of the profiling can be used for designing the integration layer (extract, transform and load [ETL] scripts), as well as for taking any corrective actions on the SOR. Ideally, profiling should occur directly against the SOR database. If that is prohibitive because of performance, availability or other constraints, the data can be extracted to a staging database and profiled there; in that case, you need to account for any changes to the data during the extraction process. Typically, SOR profiling is conducted during the development cycle and then repeated periodically during the production cycle.
  2. Integration Layer (ETL): During the ETL process, it is essential to profile the data after each major step (extraction, transformation and loading) to ensure that it meets the requirements. For example, profiling after the extract step can identify any transforms required for cleansing or correcting data. This type of profiling is conducted on an ongoing basis as part of the production cycle.
  3. Data integrators: Profiling after the data is loaded into the data warehouse or data mart provides a final check against requirements. This profiling can include column property checks, column value checks, referential integrity checks and checks for compliance with business rules (a sketch of such post-load checks follows).
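As a sketch of the post-load checks described in item 3, the fragment below runs a column-value check and a referential-integrity check against two hypothetical warehouse tables; the table and column names are assumptions for illustration only.

import pandas as pd

# Hypothetical post-load checks on warehouse tables; table and column names
# are illustrative assumptions.
loans = pd.DataFrame({
    "loan_id": [1, 2, 3, 4],
    "customer_id": [101, 102, 999, 103],    # 999 has no parent row
    "recovery_amount": [12_000, -500, 30_000, None],
})
customers = pd.DataFrame({"customer_id": [101, 102, 103]})

# Column-value check: recovery_amount should be populated and non-negative.
bad_values = loans[loans["recovery_amount"].isna() | (loans["recovery_amount"] < 0)]

# Referential-integrity check: every customer_id must exist in the parent table.
orphans = loans[~loans["customer_id"].isin(customers["customer_id"])]

print(f"{len(bad_values)} rows fail the column-value check")
print(f"{len(orphans)} rows fail the referential-integrity check")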

Data Profiling Methodology

We will follow a five-step process in laying out an approach for data profiling. (See Figure 3.)

Figure 3: Five-Step Approach to Data Profiling


1. Define: Business requirements such as data standards, business rules and domain definitions are defined and serve as the basis for profiling the data. Often these standards or definitions can be obtained from a meta data repository; if a repository is not available, the meta data has to be obtained from data stewards. Initial identification of candidate source systems is also done in this step, as is the definition of metrics for monitoring data quality.
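Whether the definitions come from a meta data repository or from data stewards, it helps to capture them in a simple, machine-readable form so that the later steps can apply them consistently. The structure below is one possible sketch; the attribute names, checks and thresholds are illustrative assumptions.

# One possible way to capture define-step outputs in a machine-readable form;
# attribute names, checks and thresholds are illustrative assumptions.
profiling_rules = [
    {"attribute": "recovery_amount", "check": "not_null",
     "threshold_pct": 80, "source": "collections_system"},
    {"attribute": "debt_type", "check": "domain",
     "allowed_values": ["loan", "bond"], "threshold_pct": 95},
    {"attribute": "industry_code", "check": "format",
     "pattern": r"\d{4,6}", "threshold_pct": 90},  # SIC/NAICS-style codes
]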

2. Discover: Discovery techniques include statistical methods that verify content and form: histograms, frequency counts, means and standard deviations, maximum and minimum values, data types, lengths and similar characteristics. Context-related discovery reveals the presence of various business rules. Key structures and data affinities (relationships) are also inferred during the discovery phase. Some of these techniques operate at a granular level (e.g., field, column or row) and others at an aggregate level (e.g., table or file). An enterprise may choose to profile all of the data or a sample, based on the type of data being profiled, the tool features supported and space availability.
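The following is a minimal sketch of the column-level discovery statistics mentioned above, assuming the data has already been extracted into a pandas DataFrame; the column and sample values are hypothetical.

import pandas as pd

def profile_column(series: pd.Series) -> dict:
    """Column-level discovery statistics: counts, frequencies, ranges, lengths."""
    stats = {
        "row_count": len(series),
        "null_count": int(series.isna().sum()),
        "distinct_count": int(series.nunique(dropna=True)),
        "inferred_type": str(series.dtype),
        "top_frequencies": series.value_counts(dropna=True).head(5).to_dict(),
    }
    if pd.api.types.is_numeric_dtype(series):
        stats.update(min=series.min(), max=series.max(),
                     mean=series.mean(), std_dev=series.std())
    else:
        lengths = series.dropna().astype(str).str.len()
        stats.update(min_length=int(lengths.min()), max_length=int(lengths.max()))
    return stats

# Hypothetical example: profile the recovery_amount column of an extract.
extract = pd.DataFrame({"recovery_amount": [12_000, 8_500, None, 8_500, -100]})
print(profile_column(extract["recovery_amount"]))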

3. Analyze: This step is used to understand the findings of the discover phase and compare them against the requirements identified in the define phase. Interesting anomalies in the data can be evaluated further to find out where a problem may be arising. For example, if we find a large number of null or negative amounts coming from a specific region, a data entry issue may be the cause. Similarly, if an attribute has a number of outliers, cross tabulations or histograms can be used to investigate the specific cases where the outliers originate and arrive at a conclusion. Typically, if an attribute such as recovery amount is sourced from multiple source systems, issues specific to individual SORs may arise. Referential integrity checks may indicate that an attribute in a child table has a different range of values than the parent table, in which case the data update procedures should be investigated.

The outputs of profiling can be used to identify authoritative source systems from the list of candidate source systems defined in the previous step. This can be done based on quality and accuracy requirements (for example, at least 80 percent of all rows should have valid populated values). If the requirements are not met from a compliance perspective, a financial institution may have to source the data from a third party or take corrective action on the SOR in question.
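Continuing the sketch from the discover step, the fragment below shows one way the 80 percent requirement mentioned above could be evaluated per candidate source system. The source system names and sample rows are hypothetical; only the threshold comes from the example requirement in the text.

import pandas as pd

# Hypothetical extract tagged by source system; the 80 percent threshold follows
# the example requirement in the text, the data and source names do not.
records = pd.DataFrame({
    "source_system": ["loans_a", "loans_a", "loans_b", "loans_b", "loans_b"],
    "recovery_amount": [12_000, None, 8_500, -100, 7_250],
})

def is_valid(amount):
    """A row is valid if the recovery amount is populated and non-negative."""
    return pd.notna(amount) and amount >= 0

records["valid"] = records["recovery_amount"].map(is_valid).astype(bool)
valid_rate = records.groupby("source_system")["valid"].mean() * 100

for source, rate in valid_rate.items():
    verdict = "acceptable" if rate >= 80 else "needs corrective action"
    print(f"{source}: {rate:.0f}% valid rows ({verdict})")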

4. Actions: In this step, an enterprise identifies and executes actions based on the results from the analyze phase. Figure 4 summarizes some of the typical actions.

5. Monitor: Monitoring involves examining the metrics to identify the current state as well as the trends of data quality in the enterprise. Profiling-related metrics can include a percentage defect calculation based on the total rows profiled and the rows that had errors at the table or file level, and a similar calculation at the column level. These defects can be classified by the type of profiling check (length check, data type check, domain value check, etc.). If the enterprise already has a quality management program in place, such as Six Sigma, metrics such as defects per million (DPM) can be employed. Metrics can be aggregated to provide a data quality indicator for an SOR or for a data aggregator (data warehouse).
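A small sketch of the defect metrics described above, with hypothetical counts:

# Hypothetical monitoring metrics: defect percentage and defects per million (DPM).
def defect_metrics(rows_profiled, rows_with_errors):
    defect_pct = 100.0 * rows_with_errors / rows_profiled
    dpm = 1_000_000 * rows_with_errors / rows_profiled
    return defect_pct, dpm

pct, dpm = defect_metrics(rows_profiled=250_000, rows_with_errors=1_750)
print(f"Defect rate: {pct:.2f}%  DPM: {dpm:,.0f}")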

People and Process

Data profiling does not happen in a vacuum. To succeed, we need the right people, processes, technology and environment. Executive buy-in is a must before starting any large profiling initiative, and change management processes should be considered before initiating any corrective measures on the data. Data profiling as part of a compliance exercise is a complex initiative. However, we can draw inspiration from other regulated industries, such as the potable water example provided earlier. With a thorough plan and proper resource commitment, an enterprise can successfully achieve high data quality and compliance with regulatory requirements.

Recommendations

Following are some key recommendations that will help in implementing a profiling system for achieving compliance:

  1. Define an overall data management architecture and set of processes for compliance, and highlight the enterprise fit of data profiling and data quality.
  2. Obtain executive buy-in.
  3. Develop policies and procedures around profiling.
  4. Identify people and processes affected by profiling, and ensure their alignment.
  5. Select a tool only after all requirements are clear.
  6. Develop metrics and a process for communicating the profiling results and resulting data quality improvements.

Compliance requirements can be used as an opportunity to implement new procedures and advance the enterprise's data quality to a new level. Beyond compliance, data profiling will help an enterprise reduce quality-related costs, improve revenues (by making decisions on accurate data) and increase employee morale (through reduced uncertainty over the quality of data).


References:

  1. The Data Warehousing Institute study on data quality, February 2002. Available at http://www.tdwi.org/research/display.aspx?ID=6045.
  2. Olson, Jack E. Data Quality: The Accuracy Dimension. Morgan Kaufmann, 2003.
  3. Basel II: International Convergence of Capital Measurement and Capital Standards: A Revised Framework. Available at http://www.bis.org/publ/bcbs107.htm.
