The idea of invoking data quality processes at extract, transform and load (ETL) time is a compelling one. It is a good time to apply validation, transformation, filtering or standardization in the interest of data quality. Disk I/O is one of the most expensive operations, and because the read has already been completed, it makes sense to subject the data to multiple processes before writing it back or sending it out through a network interface. The convergence takes place at three levels:
Design time integration: The functions are available in a palette at the developer's fingertips and support a diversity of transformations, including those relevant to data quality.
Execution time integration: The processes are applied in the application that is generated and promoted to production.
Meta data integration: The information is stored in the local meta data repository, but can then be interchanged with other tools (such as a query and reporting interface, data modeling or data mining tool) in a federated design and execution environment, thanks to a variety of bridges (available at a modest extra fee).
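The argument for applying data quality steps while the extracted rows are still in memory can be sketched as follows. This is a minimal illustration, not any vendor's implementation: the lookup table, the field names (customer_id, state) and the functions standardize, is_valid and quality_pass are all hypothetical, chosen only to show standardization, validation and filtering happening in one pass between extract and load.

```python
from typing import Iterable, Iterator

# Hypothetical lookup table for standardizing state names (illustrative only).
STATE_CODES = {"new york": "NY", "n.y.": "NY", "california": "CA", "calif.": "CA"}


def standardize(row: dict) -> dict:
    """Trim whitespace and map free-form state names to two-letter codes."""
    cleaned = {key: value.strip() for key, value in row.items()}
    state = cleaned.get("state", "")
    cleaned["state"] = STATE_CODES.get(state.lower(), state.upper())
    return cleaned


def is_valid(row: dict) -> bool:
    """Assumed validation rule: non-empty customer id and a two-letter state code."""
    state = row.get("state", "")
    return bool(row.get("customer_id")) and len(state) == 2 and state.isalpha()


def quality_pass(rows: Iterable[dict]) -> Iterator[dict]:
    """Apply standardization, then filter out invalid rows, in a single pass."""
    for row in rows:
        row = standardize(row)
        if is_valid(row):
            yield row


# Rows as they might look immediately after the extract step.
extracted = [
    {"customer_id": "101", "state": " New York "},
    {"customer_id": "", "state": "CA"},
    {"customer_id": "103", "state": "calif."},
]

# Only rows that survive the quality pass reach the load step.
loaded = list(quality_pass(extracted))
```

Because quality_pass is a generator, each row is read once and cleaned before it is written, which is the point of doing this work at ETL time rather than in a separate later job.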
This convergence of ETL and data quality (DQ) technologies has been in progress at least since 1999, when Oracle acquired Carleton and its data quality product Pure. SAS's acquisition of DataFlux followed apace a few months later. Ascential then acquired Vality in the spring of 2002. Group 1's acquisition of Sagent reversed the direction of the trend of ETL vendors acquiring data quality vendors: here the DQ vendor acquired the ETL vendor. Group 1's primary focus had been on data quality in the direct marketing vertical, and Sagent was initially an end-to-end business intelligence software provider with an ETL tool. See Figure 1 for a summary.
Figure 1: Convergence of ETL and Data Quality
However, in spite of significant convergence, the merging of features and functions across data quality and ETL will remain incomplete. Some clients will find that transforming operational data into a star schema format is disconnected from issues of data quality, which are best addressed upstream in the transactional system. Others will find that addressing data quality issues requires semantic analyses and content updates that are significantly different from the structural and syntactic transformations in which ETL tools excel. Furthermore, although many ETL tools now accept a near real-time data feed from message brokers such as MQ Series, actual deployments of ETL tools remain batch oriented, whereas data quality tools support real-time operation. Finally, differences exist in the form and uses of meta data. The meta data of the ETL tool is a grammar-like repository of data models and data structures, whereas the meta data of the DQ tool is a dictionary-like repository of valid contents. Two separate problem spaces (and markets) exist here, even though it is often operationally convenient to perform related functions at the same time. Both categories of tools, DQ and ETL, will continue to address separate requirements and will continue to exist separately in spite of productive collaborations.
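The distinction between grammar-like and dictionary-like meta data can be made concrete with a small sketch. Everything here is an assumption for illustration: TARGET_SCHEMA stands in for the structural meta data an ETL tool might hold, VALID_COUNTRIES stands in for the content meta data a DQ tool might hold, and neither reflects the actual repository format of any product named in this article.

```python
# ETL-style meta data (hypothetical): describes structure, i.e. column names and types.
TARGET_SCHEMA = {"customer_id": int, "country": str}

# DQ-style meta data (hypothetical): describes which contents are valid.
VALID_COUNTRIES = {"US", "CA", "MX"}


def conforms_to_structure(row: dict) -> bool:
    """Structural check: the row has exactly the expected columns with the expected types."""
    return row.keys() == TARGET_SCHEMA.keys() and all(
        isinstance(row[column], expected_type)
        for column, expected_type in TARGET_SCHEMA.items()
    )


def has_valid_content(row: dict) -> bool:
    """Content check: the country value appears in the dictionary of valid values."""
    return row.get("country") in VALID_COUNTRIES


# A row that is structurally well formed but semantically wrong.
row = {"customer_id": 42, "country": "Atlantis"}
structure_ok = conforms_to_structure(row)
content_ok = has_valid_content(row)
```

The sample row passes the structural check and fails the content check, which is the sense in which the two kinds of tools address separate problem spaces even when they run in the same job.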
One open question for clients is: Will the convergence of the two technologies, partial though it may be, provide end users with additional technology "at no extra charge," following the model that has been characteristic of software innovation, or will the vendor use the perception of additional value to propose a price increase? When the two vendors remain separate entities (Informatica/Trillium), separate license agreements and separate fees are to be expected, though discounts based on local circumstances are always possible. When the two become one (Group 1/Sagent or SAS/DataFlux), the opportunities for flexible pricing are enhanced. In either case, clients should do their homework to determine the internal costs and benefits of their own data warehousing, data transformation and data quality applications. A buyer who has not yet made a commitment enjoys the maximum leverage in negotiating for additional technology at no extra charge, whether ETL or DQ. This may include training or a long-term maintenance contract, locking in low prices or premium service where the commitment warrants. If the costs for upgrading are high, clients should present a case to the vendor for additional support, discounts and related "investment protection."