Continue in 2 seconds

Data Format Standards as an IQ Tool

  • May 01 2003, 1:00am EDT

Thanks to a reader who inspired this month's column by asking about the meaning of the terms "standardization" and "remediation" as applied to the work of creating data format standards for the data element "Organization Name."

The IQ Problem

Two problems exist with an attribute such as Organization Name. (While this column discusses Organization Name, the principles also apply to other formatted text fields such as Product Name, Asset Name, Facility Name and other structured text attributes.)

  1. As a single-value attribute, it does not reflect real-world objects. Organizations in real life tend to have more than one name. Most data models and database designs fail to properly model this. Because an enterprise comes to know an organization through any of its name forms and needs to know multiple names (such as legal name, doing business as name, common name or abbreviated name), the data model and database must support this. Otherwise, you get the problems of enterprises having 87 different representations of the name of the company popularly known as DuPont, legally, EI du Pont de Nemours and Company, with names miscaptured as Dupont, Du Pont, E. I. Du Pont, etc.1 My company, Information Impact International, Inc., is known as Information Impact International and Information Impact, with a short name of InfoImpact from the Web site name – An assessment of my mail over a three-month period a few years ago found forty different name spellings and formats.
  2. Without clear data format standards for capturing organization names (or product names), information producers have no guidance for consistency of names across the various organization (or product) records. Without this guidance, duplicate records are easily created for the same real-world organization, fragmenting an enterprise's view of its customers or business partners.

The problems this creates are horrific. Databases with 20 to 40 percent duplication are not unheard of. The real problem is the organization's inability to understand its customers or partners, the process failure caused and the complexity created by having multiple records that represent a single real-world object.
The problems of matching and consolidating records across multiple source files are exceedingly complex, compounded when source files also have different lengths and formats for the data.

The reader's question revolved around when creating a common name value for the different occurrences really "standardization" and when is it "remediation." To answer this, we must define the following terms:

  • Standardization: From a data perspective, data standardization means data has a common name, definition and value set or format. From a process perspective, standardization means "documentation of the best way to do the job." 2 In Kaizen, a quality management method of continuous process improvement, standardization is one of the three foundations of process improvement. Information quality requires both data and process standardization.
  • Remediation: In general, remediation means "the act or process of correcting a fault or deficiency." 3 In quality management, this is "scrap and rework." In information quality, remediation is "information scrap and rework." The data has been created and maintained in a way that it is defective and must be corrected, so that the knowledge workers who require it can use it to perform their work properly.

Standardization means that we define the data and process in a way to create data that minimizes subsequent scrap and rework. To accomplish this, we may establish "standards" in one area of control, such as a data warehouse, and then improve the upstream processes to comply with the new standards.
To qualify as true data standards, they must apply across the entire enterprise or scope of the data impact. If each business area has different "standards" for creating organization name, does the enterprise have data standards? Let's look at a business analogy. If every business area created its own standard budget accounts for developing and reporting on its budget expenditures, does the enterprise have a standard general ledger chart of accounts? Of course not. Is there an enterprise standard format for defining new budget codes? You bet. Do business area managers follow that when creating codes for new accounts? Absolutely.

The same is true for data standards. Data "standards" are standards if they apply to the full scope of the organization affected by that information type, whether it is organization, product or asset.

Root-Cause Analysis

The root cause of inconsistent organization-name formats is twofold: 1) failure to recognize that organization name is not a single-value fact in order to model and design the database properly, and 2) failure to standardize the create processes and provide standards and guidelines, including training for correct data creation.

Recommended Improvements

  • Always model data based on the characteristics of the real-world objects and events about which the enterprise must know.
  • Do not "under" model (i.e., create one attribute that may have multiple values, such as name). This forces information producers to determine which name they will capture, with inconsistency of different – yet correct – name values for the same object.
  • Assure the data model contains the relationships that support the relationships of the real-world objects and events. The real relationship between organization and organization name is one (organization) to many (names), based on name type (legal, known-as name, doing-business-as-name, abbreviated name or nickname).
  • Develop format standards for text and other structured data element values, such as product name, that provide guidelines to information producers on how to create consistency in data name values.
  • Create a validation table to map multiple synonyms of names to a standard name or to map different name values to the one occurrence of the organization.
  • Access the validation table in the source application as one of the edit tests to identify potential duplicate records to prevent creating duplicate occurrences.
  • Provide phonetic edit and validation tests, such as Soundex, to match common phonetic spelling variations of names.

Questions for Improvements

Some questions to ask:

  • Is there a data standards steering or governance group that operates at an enterprise level?
  • If so, does this standards group include representatives of business areas whose data is affected by the standards?

If the answer to both of these questions is yes, work with this group to create or revise data standards and guidelines, including format structure.
If the answer to either of these questions is no, then create a data standards team to create universal and global data standards and guidelines.

Once data standards have been defined and adopted by representatives of the information stakeholders, document them in an enterprise repository or data dictionary that is accessible to all knowledge-workers, and train information producers and knowledge-workers on the data standards and guidelines.

If legacy processes and applications cannot support the new standards, analyze how to capture data close to the standardized data; and map that data value to the standard data value in migrating data for downstream standardized databases.

When data standards are in place, prioritize any data correction to existing data based upon costs of the nonstandard data (cost of process failure, inability to reconcile data, rework activities required and miscommunication, etc.), the cost of standardizing it and the resources available to convert it.

Data standards and guidelines for creating data values are part of the "information product specification" along with the definition and business rules. They are an important part of data definition for information quality improvement.

What do you think?

1. DuPont is a registered trademark of EI du Pont de Nemours and Company.
2. Imai, Masaaki. Gemba Kaizen. New York: McGraw-Hill, 1997, p. xxviii.
3. American Heritage Dictionary.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access