More vendors are now providing data profiling solutions. Some products provide relatively general capabilities while others are embedded within larger application suites with specific data targets. Apparently, a large part of the data profiling capability is now routinely (and, to some extent, trivially) integrated into other applications. Additionally, most major data standardization and matching tool vendors have deployed a data profiling solution through partnership, development or acquisition. Data profiling is moving into the mainstream and becoming a fixture in numerous data management processes, ranging from system migrations, data warehousing and data quality to operational system improvement.

Data profiling is used to explore anomalies within a collection of data sets and to expose potential problems that are inherent in the data but not explicitly stated. Profiling is also used to review compliance of a data set with its documented metadata, as well as conformance with a data model. Most of the existing vendor products provide some kind of business rule definition, along with auditing and reporting functionality. However, without getting into great detail, data profiling technology consists of three major capabilities. The first, frequently referred to as "column profiling," provides statistics and analysis about the values assigned to the attributes within each column in a table. The second, which has been referred to as "redundancy profiling" or "cross-table profiling," reviews and explores relationships between columns in different tables, with the expectation of discovering foreign key relationships as well as violations of referential integrity constraints. The third, referred to as "dependency profiling," looks for functional dependency relationships that exist across columns within a single table.
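To make the first two capabilities concrete, here is a minimal Python sketch. It is an illustration only, not how any vendor product works: the function names (profile_column, orphaned_values) and the sample customers and orders tables are hypothetical, and tables are assumed to be small in-memory lists of dictionaries.

```python
from collections import Counter

def profile_column(rows, column):
    """Column profiling: basic statistics for one column of a table."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    freq = Counter(non_null)  # frequency distribution of the non-null values
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(freq),
        "most_common": freq.most_common(3),
    }

def orphaned_values(child_rows, child_col, parent_rows, parent_col):
    """Cross-table profiling: child values with no matching parent value.

    An empty result suggests a candidate foreign key relationship; a
    non-empty result lists referential integrity violations.
    """
    parent_values = {row[parent_col] for row in parent_rows}
    return sorted({
        row[child_col]
        for row in child_rows
        if row[child_col] is not None and row[child_col] not in parent_values
    })

# Hypothetical sample data, invented for illustration.
customers = [
    {"id": 1, "state": "NY"},
    {"id": 2, "state": "NY"},
    {"id": 3, "state": None},
    {"id": 4, "state": "CA"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 4},
    {"order_id": 12, "customer_id": 7},  # no customer with id 7
]

state_profile = profile_column(customers, "state")
orphans = orphaned_values(orders, "customer_id", customers, "id")
print(state_profile)
print(orphans)
```

In this sketch the profile of the state column reports one null and two distinct values, and the cross-table check flags customer_id 7 as an order that refers to no known customer.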

There are many options for tool selection. Of course, which product is right for a customer depends on that customer's specific needs. For example, a customer looking to profiling as an aid in data migration may have slightly different needs than one looking to enhance an ETL process or one looking to develop a data quality auditing function. In addition, different companies may have nontechnical constraints that affect their product choices. Yet objectively, what I find to be a major discriminator between the different products is functional dependency analysis.

A functional dependency within a table indicates that the values assigned to one set of attributes are determined by the values assigned to some other set of attributes. It is the basis for discovering embedded relational structure and, consequently, opportunities for normalization within existing tables. In addition, existing functional dependencies, when reviewed within the business context, may reveal business rules that have been buried inadvertently within the data or the application but never clearly documented. And while the ability to profile values within a column relies on relatively simple frequency distribution analyses, the ability to discover exact, as well as approximate, dependencies depends on a much more complex algorithm that is rarely implemented among the vendor offerings.
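The difference between an exact and an approximate dependency can be sketched in a few lines of Python. This is a naive illustration under stated assumptions, not the algorithm any vendor implements: the function names (dependency_strength, find_dependencies), the "fraction of consistent rows" measure, and the sample table are all hypothetical. It also tests only single-column determinants; much of the real complexity comes from searching combinations of columns, whose number grows exponentially with the width of the table.

```python
from collections import Counter, defaultdict

def dependency_strength(rows, lhs, rhs):
    """Fraction of rows consistent with the dependency lhs -> rhs.

    For each distinct combination of lhs values, keep the rows that carry
    the most common rhs value and count the rest as violations.
    1.0 means an exact dependency; values just below 1.0 suggest an
    approximate one.
    """
    if not rows:
        return 1.0
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[col] for col in lhs)
        groups[key][row[rhs]] += 1
    kept = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return kept / len(rows)

def find_dependencies(rows, columns, threshold=1.0):
    """Brute-force test of every single-column candidate lhs -> rhs."""
    found = []
    for lhs in columns:
        for rhs in columns:
            if lhs != rhs:
                strength = dependency_strength(rows, [lhs], rhs)
                if strength >= threshold:
                    found.append((lhs, rhs, strength))
    return found

# Hypothetical sample table, invented for illustration.
addresses = [
    {"zip": "10001", "city": "New York", "state": "NY"},
    {"zip": "10001", "city": "New York", "state": "NY"},
    {"zip": "10001", "city": "NYC", "state": "NY"},  # inconsistent city value
    {"zip": "94105", "city": "San Francisco", "state": "CA"},
    {"zip": "94105", "city": "San Francisco", "state": "CA"},
]

exact = find_dependencies(addresses, ["zip", "city", "state"])
approximate = find_dependencies(addresses, ["zip", "city", "state"], threshold=0.8)
print(exact)
print(approximate)
```

Here zip determines state exactly, while zip determines city only approximately: one of the five rows disagrees with the rest of its group. A near-miss like that is the kind of finding that points either to a data quality problem or to an undocumented business rule worth reviewing.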

Interestingly, there are very few (if any) remaining independent vendors selling a data profiling tool; the major ones have been acquired by other data quality tool vendors or service providers. This implies two things. First, the business model based solely on selling a data profiling tool appears to have had a limited lifetime, and data profiling technology has in turn become more of a commodity capability. Second, the value of data profiling is best established within a larger information quality context, whether that is in a broad-brush suite of tools (as Ascential, which purchased the MetaRecon data profiling tool, and Trillium, which bought Avellino, provide) or as part of a targeted data warehousing service (as Conversion Services does with its purchase of Evoke Software).

Vendors will include more discriminators in order to remain competitive. These discriminators are likely to manifest themselves as improvements in user interfacing, data access, or auditing and monitoring capabilities. As more customers see the benefits of functional dependency analysis, expect the playing field to level as more vendors incorporate that aspect into their offerings.
