What method do you use to integrate and standardize data for the data warehouse environment?
One of the first technology decisions for a firm planning to develop a data warehouse (DW) is the selection of a method for performing extract, transform and load (ETL) processing. Firms often struggle with whether to use an existing development method or purchase a product from the marketplace. Proven, internally developed data integration methods may already exist in the information technology area, further complicating the selection process. Conducting a proper ETL product evaluation requires substantial time and resources. Business requirements, budgets, time lines and politics are often the key factors in determining which direction a firm takes with its selection.
This column explores some of the criteria and considerations a firm should use when selecting an ETL method for the DW environment. Some of the initial criteria for choosing an ETL method include:
- Facilitate expedient data movement and improve the quality of the data.
- Reduce the cost of data integration processing through lower system utilization and reduced development and testing costs.
- Significantly reduce data errors.
- Reduce training time and shorten the learning curve.
- Enable standardization and definition of all data.
- Provide the ability to create documentation.
- Provide productivity gains over current processes.
- Take advantage of a meta data-centric repository, both for ETL and for integration with other products (e.g., data modeling, data quality, data profiling, reporting).
Build versus Buy: A purchased ETL product offers several advantages over internally developed programmatic methods in the areas of meta data capture/integration, maintenance, standardization and documentation. From a maintenance perspective, the purchased ETL tool comes with a user interface for the developer that provides editing, prompting, syntax checking, database management system (DBMS) access, security and project organization (directory/file/component), typically from a single access point. The purchased ETL tool will usually come with some form of meta data repository structure to capture source, target and mapping meta data (technical, business, operational). Most ETL tools on the market have various means of documenting the ETL project and processes in a graphical form and through prebuilt reporting templates. In addition, from an ROI perspective, a purchased ETL product can be used for other data integration tasks such as database conversions, interfaces and near real-time business transaction processing.
Even though a purchased ETL product offers many initial advantages, some firms may decide to start small and slow and develop an internal data integration method. An internal development method may work for your firm, depending on the size of your data warehousing project, the number of data sources, the complexity of ETL transformation rules, the ETL window processing time (with recovery slack time) and your ability to effectively perform change management of the processes. Careful consideration and monitoring of the changing business environment is required with this method. Future business requirements, performance and management needs may compel you to revisit this decision, possibly resulting in extensive rework. Additionally, be careful how you decide to code your internally developed ETL transformation processes. Constructing the processes to depend on specific DBMS functionality will limit your ability to change databases in the future as your data integration needs grow.
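To make the internal-development option concrete, here is a minimal sketch of a DBMS-neutral ETL pipeline. All table and column names are hypothetical, and SQLite stands in for whatever DBMS your firm uses; the point is that the load step uses only portable SQL, keeping the option to change databases open.

```python
import sqlite3

def extract(source_rows):
    """Extract: in practice this would read from a source system or flat file."""
    return iter(source_rows)

def transform(rows):
    """Transform: apply simple standardization rules (names are illustrative)."""
    for row in rows:
        yield (row["cust_id"], row["last_name"].strip().title(), row["state"].upper())

def load(conn, rows):
    """Load: portable SQL only -- no vendor-specific features or utilities."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dim_customer "
        "(cust_id INTEGER, last_name TEXT, state TEXT)"
    )
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", rows)
    conn.commit()

source = [{"cust_id": 1, "last_name": " smith ", "state": "il"}]
conn = sqlite3.connect(":memory:")
load(conn, transform(extract(source)))
print(conn.execute("SELECT * FROM dim_customer").fetchall())  # [(1, 'Smith', 'IL')]
```

Even a sketch this small shows where change management bites: every new source system or transformation rule lands in hand-written code rather than in a tool's repository.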
XML: Current ETL strategies need to accommodate translation of XML transactions (DTD, schema) for integration and use in data warehousing environments. XML messaging is increasingly used for integration between applications because it is easy to implement, self-describing and format neutral. Increased use of XML to standardize various industry business processes for application integration (HR: HR-XML Consortium; SCM: RosettaNet) will provide additional opportunities for simplifying data integration into the warehouse. This is especially true for operational data stores (ODSs) that require near real-time transaction detail.
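The translation step can be as simple as flattening a self-describing XML transaction into rows for the ODS or warehouse. The order message below is a made-up example (real industry schemas such as RosettaNet's are far richer), parsed with Python's standard library:

```python
import xml.etree.ElementTree as ET

# Hypothetical order transaction; element and attribute names are illustrative.
message = """
<order id="1001">
  <customer>ACME Corp</customer>
  <line sku="A-7" qty="3" price="9.99"/>
  <line sku="B-2" qty="1" price="24.50"/>
</order>
"""

def flatten_order(xml_text):
    """Translate a self-describing XML transaction into flat fact rows."""
    root = ET.fromstring(xml_text)
    order_id = root.get("id")
    customer = root.findtext("customer")
    for line in root.findall("line"):
        yield {
            "order_id": order_id,
            "customer": customer,
            "sku": line.get("sku"),
            "qty": int(line.get("qty")),
            "price": float(line.get("price")),
        }

rows = list(flatten_order(message))  # two flat rows, one per order line
```

Because the message carries its own structure, the mapping from tags to columns can be driven by the schema rather than hard-coded per source, which is what makes XML attractive for near real-time feeds.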
Performance: In order to maximize performance of the extract and/or load processes, the developer and the ETL tool must be able to take advantage of specific utilities and functions of the database, such as: the bulk load/unload utilities of your DBMS; virtual co-servers or nodes for parallel load/unload features; the fragmentation strategy used by the DBMS to distribute data from a single table across multiple disks to avoid I/O bottlenecks; and interaction with the DBMS optimizer.
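The gap between naive and bulk-oriented loading is easy to demonstrate in miniature. This sketch uses SQLite (standing in for your DBMS) to contrast a row-at-a-time load, with a commit per record, against a batched load inside a single transaction; a real DBMS bulk-load utility exploits the same idea at far larger scale, often bypassing the SQL layer entirely.

```python
import sqlite3

def load_row_at_a_time(conn, rows):
    """Naive load: one INSERT and one commit per record."""
    for r in rows:
        conn.execute("INSERT INTO t VALUES (?)", (r,))
        conn.commit()

def load_bulk(conn, rows):
    """Batched load: one executemany inside a single transaction --
    the same principle a DBMS bulk-load utility applies at scale."""
    with conn:
        conn.executemany("INSERT INTO t VALUES (?)", ((r,) for r in rows))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INTEGER)")
load_bulk(conn, range(1000))
print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 1000
```

An ETL tool that cannot invoke your DBMS's native bulk path ends up on the row-at-a-time side of this comparison, which is exactly what blows the ETL processing window.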
Data Quality: The ETL method should support standard integration techniques, used individually or in combination, to improve data quality in the data warehouse environment. Data validation should be able to be performed on data originating from a source system to identify and mark poor quality records for elimination. Validation should include use of lookup tables (e.g., state or country) or data type checking (valid numeric, date and null) to qualify source records for inclusion in further processing. Records that do not qualify should be excluded from conversion or rerouted to exception processing. Next, records should be cross-matched against several sources by same/similar key fields (e.g., SSN). Because most firms do not use the same keys across all of their systems, fuzzy matching can be used on similar columns from each source system, looking for closely matching patterns such as similar last names (e.g., Smith versus Smyth) or phone numbers (e.g., 312-555-1212 versus 312-555-1221). Next, new record values should be selectable based on the most common values occurring across the various sources (e.g., last name Smith exists in eight of the 10 systems). Alternatively, new values should be selectable based on a prioritized hierarchy of the source systems that, for example, uses call center customer address data over billing data unless call center data for customer X does not exist.
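The four techniques above (lookup validation, fuzzy matching, most-common-value selection and a prioritized source hierarchy) can each be sketched in a few lines. This is a minimal illustration, not a product feature list; the lookup table, field names and match threshold are all assumptions for the example.

```python
from collections import Counter
from difflib import SequenceMatcher

VALID_STATES = {"IL", "NY", "CA"}  # abbreviated lookup table for the example

def validate(record):
    """Lookup/type validation: mark poor-quality records for exception processing."""
    return record.get("state") in VALID_STATES and record.get("ssn") is not None

def fuzzy_match(a, b, threshold=0.8):
    """Fuzzy match on similar key values (e.g., Smith versus Smyth)."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def most_common_value(values):
    """Select the value occurring most often across the source systems."""
    return Counter(values).most_common(1)[0][0]

def value_by_priority(values_by_source, priority):
    """Select the first non-missing value along a source-system hierarchy,
    e.g., prefer call center address data over billing data."""
    for source in priority:
        if values_by_source.get(source):
            return values_by_source[source]
    return None

fuzzy_match("Smith", "Smyth")                         # True at the 0.8 threshold
most_common_value(["Smith"] * 8 + ["Smyth"] * 2)      # "Smith"
value_by_priority({"billing": "123 Main St"}, ["call_center", "billing"])
```

Production matching engines use far more sophisticated phonetic and probabilistic algorithms; the point is that an ETL method should let you plug rules like these into the pipeline rather than scatter them through load programs.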
Meta Data: There are many methods and techniques for sharing meta data between products (data modeling, DBMS, ETL, reporting); no one tool controls the entire process. Most ETL products concentrate on sharing of technical meta data (table name, column, length, data type, etc.) but neglect business meta data (table name, column name, column description, lookup values and description, etc.). This lack of support for business meta data is most apparent when attempting to deliver critical definition information to the end user through the front end (reporting, portal, etc.). Most of the vendors capture the technical meta data about the source-to-target mappings (at a table/column level, not down to domain level) being developed in their respective tools. Some even have methods to capture operational meta data during ETL processing to analyze post-processing statistics.
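A source-to-target mapping repository of the kind described here reduces to a simple structure holding technical meta data (source, target, type, transformation rule) alongside the business description that end users need. The sketch below is a toy illustration with hypothetical names, not any vendor's repository schema:

```python
from dataclasses import dataclass, field

@dataclass
class Mapping:
    """One source-to-target column mapping: technical plus business meta data."""
    source: str        # technical: source table.column
    target: str        # technical: target table.column
    data_type: str     # technical: data type and length
    rule: str          # technical: transformation rule applied
    description: str   # business: plain-language definition for the end user

@dataclass
class Repository:
    mappings: list = field(default_factory=list)

    def add(self, mapping):
        self.mappings.append(mapping)

    def lineage(self, target_column):
        """Answer the end user's question: where did this column come from?"""
        return [m for m in self.mappings if m.target == target_column]

repo = Repository()
repo.add(Mapping("crm.cust.lname", "dw.dim_customer.last_name",
                 "varchar(40)", "TRIM + TITLECASE",
                 "Customer legal last name"))
print(repo.lineage("dw.dim_customer.last_name")[0].source)  # crm.cust.lname
```

Keeping the business description in the same record as the technical mapping is what makes it possible to push definitions through to the reporting front end, the gap most tools leave open.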
Use of these criteria when evaluating an ETL method for the enterprise can help you avoid many costly mistakes and issues commonly encountered.
What ETL method or product does your firm use? Send me an e-mail with the subject line: Our ETL Method. I will post the results of this unofficial survey in a future column.