When IT professionals first embark on a data quality initiative, a common question they ask is, "What industries or market verticals have the best data quality?" Underlying that question is the desire and need to establish a data quality benchmark. All firms want to know how they are doing relative to their peers, and most data managers want to hear encouraging words that their industry is achieving quality goals more easily than others.

The desire for data quality benchmarks is even more acute because of the dearth of data quality success stories published in the media. There are two reasons for this palpable absence. First, organizations do not like to air their dirty laundry. Publicizing a data quality success story in many ways is good public relations, but for some markets, such as health care and financial services, touting improved quality is implicitly admitting you had a previous problem, and no one wants to think their hospital or bank had problems with their patient or investor data. Second, no firm wants to give up a hard-earned competitive advantage. I've worked with a number of clients who told me point blank I could not publicize their successes because it would educate their competition about how they gained advantage.

There is another issue lurking behind the question, "What industries or market verticals are doing the best with data quality?" The questioner is looking for confirmation that his or her company is in a market vertical fertile for the adoption of data quality practices. Fortunately, data quality practices, methods, processes and technology are generic. They span industries and markets and are equally applicable in each. I know this will be disappointing for some people to hear because we all want to think we are special and that the industry we work in is unique and requires custom solutions. The fact is, data quality processes and practices perform equally well in one industry as in another.

That is what enterprise information management (EIM) is all about - enabling an organization to create a comprehensive strategy to ensure it is using trustworthy information. Data quality is a critical part of EIM, and like EIM, data quality is not just a technology. A successful data quality initiative is 80 percent people and process. A firm can create unique data categories, types and objects that may temporarily defy existing cleansing technologies, but data quality is mostly about people and process. Even a unique, proprietary product SKU code is a candidate for cleansing. A proprietary SKU code still follows specified patterns. Those patterns can be represented by business rules, and the rules can be loaded into parsing, standardization and correction routines in the form of program parameters.
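To make that concrete, here is a rough sketch of a pattern rule expressed as a parameter for a parsing routine. The SKU format, field names and regular expression are hypothetical, invented purely for illustration:

import re

# Hypothetical business rule: a proprietary SKU such as "PLT-304-0500" encodes
# product family, material grade and thickness in fixed positions.
SKU_PATTERN = re.compile(r"^(?P<family>[A-Z]{3})-(?P<grade>\d{3})-(?P<thickness>\d{4})$")

def parse_sku(sku: str) -> dict:
    """Parse a SKU into its components according to the business-rule pattern."""
    match = SKU_PATTERN.match(sku.strip().upper())
    if not match:
        return {"sku": sku, "valid": False}
    return {"sku": sku, "valid": True, **match.groupdict()}

print(parse_sku("plt-304-0500"))
# {'sku': 'plt-304-0500', 'valid': True, 'family': 'PLT', 'grade': '304', 'thickness': '0500'}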

Why are data quality practices applicable horizontally, across all markets? In the simplest terms, data is facts about things. Data is made for human consumption, and we humans like our data served up in the same general way, regardless of industry or market. Whether it is financial, telemetry, environmental, product or customer data, we want it broken down to its discrete components, fielded out and grouped into records that create a full picture of all the available business information at hand. We want similar records grouped into tables, and related records linked to one another. Events and other related data should be linked to the records in question; for example, a customer address record is linked to the customer's credit history and to their purchase history. The ultimate purpose behind mining unstructured data is to move the important facts into a structured environment.
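A minimal sketch of that structure, with customer, credit and purchase records invented purely for illustration, might look like this:

# Hypothetical structured layout: discrete fields grouped into records,
# with related records linked by a shared key.
customers = {"C001": {"name": "Acme Corp", "address": "12 Main St"}}
credit_history = [{"customer_id": "C001", "rating": "A", "as_of": "2009-06"}]
purchase_history = [{"customer_id": "C001", "sku": "PLT-304-0500", "qty": 40}]

def full_picture(customer_id: str) -> dict:
    """Join a customer record to its related credit and purchase records."""
    return {
        "customer": customers[customer_id],
        "credit": [c for c in credit_history if c["customer_id"] == customer_id],
        "purchases": [p for p in purchase_history if p["customer_id"] == customer_id],
    }

print(full_picture("C001"))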

The practice of ensuring data accuracy applies to all industries and markets equally. Every firm needs data to manage its operations. Data, and hence its quality, is foundational to every industry. A false perception exists that gaining a competitive advantage depends on specialized treatment of the data. We can dissolve this misperception by simply exposing the standard process everyone uses to ensure high quality data. The first step is measuring and analyzing your data: what are the defects, and what caused them? Data profiling solutions are designed to quantify the defects and provide metadata that helps analyze their cause.
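As a rough sketch of what a profiling pass quantifies, here is a simple example. The column names and validation rules are assumptions made for illustration, not any particular vendor's product:

import re

# Hypothetical profiling rules: which values count as defects per column.
RULES = {
    "product_no": lambda v: bool(re.match(r"^\d{6}$", v or "")),
    "description": lambda v: bool(v and v.strip()),
}

def profile(records: list[dict]) -> dict:
    """Count rule violations per column to quantify defects."""
    defects = {col: 0 for col in RULES}
    for rec in records:
        for col, is_valid in RULES.items():
            if not is_valid(rec.get(col)):
                defects[col] += 1
    return {"rows": len(records), "defects": defects}

sample = [{"product_no": "123456", "description": "Bolt 2,5 x 20 mm Coated Zn"},
          {"product_no": "12A45", "description": ""}]
print(profile(sample))
# {'rows': 2, 'defects': {'product_no': 1, 'description': 1}}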

Second comes the process of parsing the data into its individual components, as in Figure 1. Here the client's data (two different records with just Product No. and Description fields) started out bundled, in the case of the first record, into one unformatted, contiguous string - Bolt 2,5 x 20 mm Coated Zn. Because we can understand it mentally, we can define a set of rules to load into a data quality package that will programmatically parse the data into its requisite components, in this case, product, dimension, type and compound. Once the data is componentized, it can be standardized, which is the third step, such as converting m.m. to MM. The fourth step is to ensure the data is accurate, in other words, correct. To do this you need some form of trusted data source to compare the records against. In the case of master data management (MDM), that would be a master parts or product list. In our example in Figure 1, the elements found in the type column were compared to the truth data and stainl was corrected to stainless. The same thing could have been done if a wrong value were found in the dimension column.

Figure 1: Parsing Data into Components
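Here is a minimal sketch of steps two through four, using the first record from Figure 1. The parsing pattern, standardization table and truth data below are invented for illustration; a commercial data quality package carries far richer dictionaries and reference data:

import re

# Step 2: parse the contiguous description into components with a pattern rule.
PARSE_RULE = re.compile(
    r"^(?P<product>[A-Za-z]+)\s+(?P<dimension>[\d,]+\s*x\s*\d+\s*(?:mm|m\.m\.))\s+"
    r"(?P<type>\w+)\s+(?P<compound>\w+)$", re.IGNORECASE)

# Step 3: standardization rules, such as converting m.m. to MM.
STANDARDIZE = {"m.m.": "MM", "mm": "MM"}

# Step 4: correction against a trusted source, such as a master parts list.
TRUTH = {"stainl": "stainless"}

def cleanse(description: str) -> dict:
    match = PARSE_RULE.match(description)
    if not match:
        return {"raw": description}  # leave unparsable records for review
    parts = match.groupdict()
    for nonstandard, standard in STANDARDIZE.items():
        parts["dimension"] = parts["dimension"].replace(nonstandard, standard)
    parts["type"] = TRUTH.get(parts["type"].lower(), parts["type"])
    return parts

print(cleanse("Bolt 2,5 x 20 mm Coated Zn"))
# {'product': 'Bolt', 'dimension': '2,5 x 20 MM', 'type': 'Coated', 'compound': 'Zn'}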


The fifth step in what we will call the data quality function framework is enhancement. Enhancing the data means adding facts or attributes that increase the value of the records for specific downstream operations. Using our example, we could append a preferred provider code for the parts from an industry association list. Together, these first five steps of the framework prepare the data for matching and consolidation. Everyone has duplicate records, and supply chain management operations are no exception.
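A sketch of that enhancement step, assuming a hypothetical industry association lookup keyed by part number (the part numbers and provider codes are invented):

# Hypothetical reference list from an industry association:
# part number -> preferred provider code.
PREFERRED_PROVIDERS = {"390-2911": "PP-117", "390-2912": "PP-242"}

def enhance(record: dict) -> dict:
    """Append a preferred provider code when the part appears on the list."""
    enriched = dict(record)
    enriched["preferred_provider"] = PREFERRED_PROVIDERS.get(record.get("product_no"), "")
    return enriched

print(enhance({"product_no": "390-2911", "description": "Bolt 2,5 x 20 MM Coated Zn"}))
# {'product_no': '390-2911', 'description': 'Bolt 2,5 x 20 MM Coated Zn', 'preferred_provider': 'PP-117'}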

The data in Figure 2 has been extracted, as part of the MDM activity, from an equipment assets database. The goal of this MDM project is to consolidate all of the equipment parts cataloged across the enterprise into one database so parts procurement can be consolidated. By increasing the accuracy and oversight of the data, the firm can decrease the number of purchases and increase their volume, thereby gaining pricing leverage with its vendors. The problem is the duplicates. To identify the duplicate records, a matching operation, the sixth step, is run. The two left columns of Figure 2 contain match codes posted by the operation. Wherever there is an identical value in the group number column, the matching operation has determined, according to the user's business rules, that those records are duplicates.


Figure 2: An Example of Duplicate Records
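Here is a rough sketch of how a matching operation might post group numbers. The match key used here is deliberately simple (lowercase the description and strip punctuation and spacing); commercial matching engines apply far richer, user-tunable business rules:

import re

def match_key(record: dict) -> str:
    """Build a simplified match key: lowercase the description, drop punctuation and spaces."""
    return re.sub(r"[^a-z0-9]+", "", record["description"].lower())

def assign_group_numbers(records: list[dict]) -> list[dict]:
    """Post a group number; records sharing the same key are flagged as duplicates."""
    groups: dict[str, int] = {}
    out = []
    for rec in records:
        key = match_key(rec)
        if key not in groups:
            groups[key] = len(groups) + 1
        out.append({**rec, "group_no": groups[key]})
    return out

rows = [{"description": "Steel Plate, 10 MM"},
        {"description": "steel plate 10mm"},
        {"description": "Bolt 2,5 x 20 MM Coated Zn"}]
for row in assign_group_numbers(rows):
    print(row["group_no"], row["description"])
# 1 Steel Plate, 10 MM
# 1 steel plate 10mm
# 2 Bolt 2,5 x 20 MM Coated Zn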


Using the match codes, the seventh operation in the data quality framework - consolidation - can either eliminate the dupes or consolidate them into best-of records as seen in Figure 3.


Figure 3: Best-Of Records
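A minimal sketch of that consolidation step collapses each match group into a single best-of record. The survivorship rule here (keep the longest non-empty value per field) is an assumption for illustration; in practice the rules are defined by the business:

from collections import defaultdict

def consolidate(records: list[dict]) -> list[dict]:
    """Collapse each match group into one best-of record, field by field."""
    groups: dict[int, list[dict]] = defaultdict(list)
    for rec in records:
        groups[rec["group_no"]].append(rec)

    survivors = []
    for group in groups.values():
        best: dict = {}
        for rec in group:
            for field, value in rec.items():
                # Survivorship rule: keep the longest non-empty value seen so far.
                if value and len(str(value)) > len(str(best.get(field, ""))):
                    best[field] = value
        survivors.append(best)
    return survivors

dupes = [{"group_no": 1, "description": "Steel Plate 10 MM", "vendor": ""},
         {"group_no": 1, "description": "Steel Plate, 10 MM, Grade 304", "vendor": "Acme"}]
print(consolidate(dupes))
# [{'group_no': 1, 'description': 'Steel Plate, 10 MM, Grade 304', 'vendor': 'Acme'}]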


Here the consolidation function was programmed to ignore extraneous data, such as the usage of the steel plates, and eliminate those extraneous records. Whether the data is supply chain information, addresses, personal names, diagnosis codes or equity descriptions, the process is the same. Business rules can be defined to identify any element regardless of industry, and even special truth data unique to a market can be created to support corrections.

An additional proof point that data quality cuts across market verticals is the fact that so many data quality projects are driven by business intelligence (BI), customer relationship management (CRM), customer data integration (CDI) and data integration (ETL) operations, in addition to MDM. All of these are deployed across industries; none is the exclusive domain of any one market.

Where verticalization (industry specialization) comes into play is in the application of standard data quality functionality against custom vertical data sets, such as ISO country codes, Department of Justice compliance lists or USPS address delivery points. For example, compliance solutions are marketed to firms seeking to identify their customers against any one of dozens of domestic and international watch lists. While the data may be unique to government agencies and needed by firms to comply with identity resolution regulations, the underlying techniques and technologies that match those vertical lists to horizontal customer files are applicable to all industries.

As I said earlier, 80 percent of a total data quality solution is people and process. However, we live in the information age, where we have megabytes, gigabytes and now terabytes worth of data. These data volumes defy cost-efficient manual cleansing and matching. So while technology may be only 20 percent of the solution, it is indeed a critical 20 percent. Why do I say 20 percent? Because when you consider the total effort of a complete information quality solution cycle, from the research in the awareness phase to the installation in the implementation phase, the bulk of the time is invested in researching the problem, educating stakeholders, designing the solution, improving the processes, developing a strategy and planning the project. Only in the implementation phase, at the end of that lengthy process, do we finally deploy technology to manipulate our mountains of data.

At the end of the information quality solution cycle, through the use of EIM technologies (metadata management, data integration, data quality, etc.), that data is delivered to applications and business operations cleanly and efficiently, formatted and standardized as required by those operations, regardless of the market or industry vertical. Consider the example of a high-tech manufacturer that saved 50 percent of its direct marketing budget, almost $12 million, by consolidating duplicate customer records and eliminating redundant and wasted brochure mailings. I still can't identify the market or product, but their story applies to any firm that markets to customers. So the next time you hear someone ask which industry gets the most value from data quality, understand the answer: it's important to all industries. Data quality is a competitive advantage regardless of market or function.
