Data profiling (DP) has always had a special relationship to information quality (IQ) and has often, though not always, been represented by Forrester as a subset of the larger IQ market. The market has now validated the hierarchical relationship again with the announcement of the Harte-Hanks/Trillium acquisition of Avellino in February 2004. Meanwhile, Firstlogic has shipped data profiling capabilities with IQ Insight 2.3; and DataFlux enhanced the data profiling capability of its version 6.1 IQ technology. In a preemptive move, Innovative Systems is promoting Synchronous, its customer data hub, and will continue to support Avellino's Discovery, which it had previously sold as Innovative Discovery.
Best-of-breed data profiling will now be bred in, enriching the software DNA of the assimilating enterprises. Data profiling capabilities such as redundant data identification, parent-child relationship analysis, data validation and sampling have long been part of data standardization. Vendors such as Vality (now a part of Ascential), Trillium, Similarity Systems, Innovative Systems and Firstlogic have featured their data profiling functions in briefings to Forrester industry analysts, including this author, for years. What is happening now that makes this an inflection point? Three causal factors follow:
Reality has now caught up with the rhetoric. Data profiling capabilities - to which lip service has long been paid - are being strengthened by vendors in their latest product shipments, with more powerful functions and a greater diversity of choice. As is often the case, the promises preceded the results. In this case, the capabilities are now delivering on those promises.
Data standardization is being differentiated from information quality. Both data profiling and data standardization are subsets of information quality. Standardization can result in the loss of information unless it is based on an understanding of how the standards interact with what is given by the raw data. Data profiling is being integrated with data standardization so that the one leads naturally to the other in the order of implementation.
Defect inspection is giving way to a design for information quality. There is a world of difference between inspecting the content of every individual data element and designing a process that produces the correct output by design. The latter is pursued as part of an integrated methodology for information quality.
Data profiling determines which values in a population of data are valid when validity is uncertain. Reports on frequency analysis, word counts, patterns and related occurrences of tokens and labels are essential; they are usability and productivity enhancers. Data profiling tools should be able to report concisely on questions such as:
- What are the values contained in the data elements?
- What keys are inferred from what is in the data?
- What is the proposed parsing of the free-form text field?
- What allegedly different data elements are actually aliases (synonyms) for the same data element, even though in different files?
- What are the data dependencies, constraints and rules implied by the data?
- What is the normalized logical and physical data model or relational design that represents the business rules of the existing, analyzed (legacy) data?
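The first two questions on this list - value frequencies and inferred keys - can be sketched in a few lines. The following is a minimal illustration, not any vendor's implementation; the pattern notation (A for letters, 9 for digits) is an assumption of this sketch, though similar conventions are common in profiling tools.

```python
import re
from collections import Counter

def to_pattern(value):
    """Reduce a value to its character pattern: letters -> A, digits -> 9."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"[0-9]", "9", value))

def profile_column(values):
    """Profile one column: counts, distinct values, top values and patterns.

    A column is flagged as a candidate key when every value is distinct,
    a simple form of the key inference profiling tools perform.
    """
    freq = Counter(values)
    patterns = Counter(to_pattern(v) for v in values)
    return {
        "count": len(values),
        "distinct": len(freq),
        "is_candidate_key": len(freq) == len(values),
        "top_values": freq.most_common(3),
        "patterns": patterns.most_common(3),
    }

# Example: a ZIP code column with a truncated outlier and a duplicate.
p = profile_column(["60201", "60202", "6020", "60201"])
```

A real tool would run this across every column and cross-reference the results to detect aliases and parent-child relationships; the point here is only that frequency and pattern reports fall out of very simple counting.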
In the final analysis, semantics is messy because the real world is messy. A dictionary of special words, phrases and patterns remains essential in identifying noise words, suffixes and prefixes, such as "in care of," "Dr." and "Ph.D." It is a semantic problem to know whether the word "church" occurring in a free-form text refers to an individual named Alonzo Church, Church Street in Evanston, Illinois, or the First Church of God. Such dictionaries will continue to be a part of the solution, whether in the form of cartridges for text mining or dictionaries, more narrowly defined, in data standardization.
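The dictionary approach described above can be sketched as a token labeler. The dictionaries here are tiny stand-ins for the cartridges a commercial tool would ship; the phrase lists and the labeling scheme are assumptions of this sketch.

```python
# Hypothetical, deliberately tiny dictionaries; production tools ship
# far larger lists of noise phrases, titles and suffixes.
NOISE_PHRASES = {"in care of", "c/o"}
TITLES = {"dr.", "mr.", "mrs."}
SUFFIXES = {"ph.d.", "jr.", "iii"}

def label_tokens(text):
    """Strip noise phrases, then label each remaining token via dictionary
    lookup. Tokens in no dictionary are labeled 'word' for later semantic
    analysis (e.g., deciding whether 'church' is a person, street or org)."""
    lowered = text.lower()
    for phrase in NOISE_PHRASES:
        lowered = lowered.replace(phrase, " ")
    labels = []
    for token in lowered.split():
        token = token.strip(",")  # drop list punctuation before lookup
        if token in TITLES:
            labels.append((token, "title"))
        elif token in SUFFIXES:
            labels.append((token, "suffix"))
        else:
            labels.append((token, "word"))
    return labels

labels = label_tokens("In care of Dr. Alonzo Church, Ph.D.")
```

Note that the dictionary resolves only what it can: "dr." and "ph.d." are classified, but "church" remains an unresolved word - exactly the residual semantic problem the paragraph describes.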
The Future of Data Profiling
Data profiling is the first step in a readiness assessment for information quality improvement: you need to know what data you have and what you are up against before engaging and transforming it. In the future, the results of the profiling and analysis activity will be incorporated into a metadata repository of values for further inspection and validation, which will serve as a source of inputs to downstream enterprise processes and technologies such as data warehousing, data mining and data standardization. By certifying up front that data inputs conform by design to quality standards, downstream applications will be defended against externalities caused by variations in data quality (though internal system design defects will still be an issue). After the semantics of an opaque data element or text have been profiled and parsed, the logical next step is to standardize the results. Standardization leads directly into the functionality provided by information quality tools.
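The upfront certification idea - using profiled results as a gate on downstream inputs - can be sketched as follows. The function name and the pattern notation (A for letters, 9 for digits) are illustrative assumptions, not any repository's actual API.

```python
import re

def certify(values, accepted_patterns):
    """Partition incoming values by whether their character pattern appears
    in the set of patterns learned during profiling. Conforming values flow
    on to the warehouse; the rest are quarantined for remediation."""
    def to_pattern(v):
        return re.sub(r"[A-Za-z]", "A", re.sub(r"[0-9]", "9", v))
    ok, rejected = [], []
    for v in values:
        (ok if to_pattern(v) in accepted_patterns else rejected).append(v)
    return ok, rejected

# Suppose profiling established that this field is always five digits.
ok, rejected = certify(["60201", "ABC12", "60202"], {"99999"})
```

The design point is that quality is enforced at the boundary, by a rule derived from profiling, rather than by inspecting every record after it has already contaminated downstream systems.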