Performing extensive extract, transform and load (ETL) may be a sign of poorly managed data and a fundamental lack of a cogently developed data strategy. I realize this is a provocative statement, but in my experience there is a direct correlation between the amount of redundant data a company maintains - a serious problem in itself - and the number of its ETL processes. When data is managed correctly as an enterprise asset, ETL is significantly reduced and redundant data is eliminated. This leads to tremendous savings in maintenance and greater efficiency in new development while improving data quality.
Let's examine why ETL exists and the root cause of redundant data. ETL gained popularity as companies began to outgrow antiquated legacy systems. As functionality was moved from legacy systems to more robust, open-systems architectures, ETL played an indispensable role in moving the data. Unfortunately, many companies failed to retire their outdated systems completely. As a result, they are now maintaining duplicate data and the ETL processes that create it.
This problem was further exacerbated as companies developed systems within functional silos. The application-specific approach took root, in which the database is designed to accommodate the system and the needs of an individual group or department. Under this methodology, every new system requires its own database. As a result, data is copied from system to system. Moreover, the data must be transformed in transit, because the various systems were never developed against a common enterprise data model and therefore use different data structures.
Packaged software has also contributed to the ETL revolution and the proliferation of redundant data. Because each software package has its own proprietary database to support the application functionality, ETL is employed to move data between systems, which results in more duplicate data and costly ETL development and maintenance.
Counting the Costs
In Figure 1, the arrows represent ETL (or the flow of data). Data is propagated from the product system into warranty, finance, purchasing, sales and eventually the data warehouse. Not only is the data extracted and loaded, it must also be transformed, because the data structures between systems are completely different. The result is complicated and unnecessary ETL development and maintenance of redundant data.
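To make the cost of one of these arrows concrete, here is a minimal sketch of a single ETL hop in Figure 1 - product data copied into the warranty system. The system names come from the article; the field names, types and mapping rules are invented for illustration.

```python
# Hypothetical ETL hop: copy product records into the warranty system.
# Because the two schemas were designed independently, every field must
# be renamed and converted - the "transform" in extract, transform, load.

# Extract: rows as they exist in the product system's (invented) schema.
product_rows = [
    {"prod_id": "A100", "prod_desc": "Widget", "unit_cost": 9.50},
    {"prod_id": "B200", "prod_desc": "Gadget", "unit_cost": 4.25},
]

def to_warranty(row):
    """Transform: map each product field onto the warranty system's
    differently named, differently typed columns."""
    return {
        "item_number": row["prod_id"],
        "item_name": row["prod_desc"].upper(),   # warranty stores names uppercased
        "cost_cents": int(row["unit_cost"] * 100),  # warranty stores integer cents
    }

# Load: in a real job this would be a bulk insert into the warranty database.
# The same facts now live in two places, in two incompatible shapes.
warranty_rows = [to_warranty(r) for r in product_rows]
```

Every hop in the figure repeats this pattern with its own mapping code, and each copy it produces must then be maintained and kept in sync.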
Figure 1: The Flow of Data
This problem is compounded when the data is propagated back to the source system in order to re-synchronize copies that have drifted apart - and they drifted apart precisely because the data was copied in the first place! The inevitable result is poor data quality and high maintenance costs.
To further exacerbate the problem, if the product database in this example changes - for instance, a new field or table is added - many of the maps that move data from the source to a target must also be changed. One minor structural change in the source can create a maintenance nightmare in the ETL maps and target databases - a lot of IT expense with no value added.
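The ripple effect is easy to sketch. Assume, purely for illustration, that the product system feeds three targets, each through its own hand-maintained field map; the map contents here are invented.

```python
# Hypothetical field maps from the product system to three downstream
# targets. Each map lists every source field it carries, under whatever
# name the target schema happens to use.
maps = {
    "warranty":   {"prod_id": "item_number", "prod_desc": "item_name"},
    "finance":    {"prod_id": "sku",         "unit_cost": "std_cost"},
    "purchasing": {"prod_id": "part_no",     "prod_desc": "description"},
}

# The product system adds one new column...
new_source_field = "country_of_origin"

# ...and every map that should carry it must be found and edited by hand.
maps_to_update = [target for target, m in maps.items()
                  if new_source_field not in m]
# All three maps are now out of date from a single source change.
```

With ten targets instead of three, one schema change means ten mapping changes plus ten target-database changes - the maintenance nightmare described above.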
These costs are not trivial. As explained by Larry English, a leader in information quality, "The IS staff is busy maintaining, on average, tenfold redundant databases and the redundant applications or interface programs that recreate or move the data."1 Ouch!
In my estimation, the only legitimate ETL activities in this example are the data warehouse interfaces. This is because a good use of ETL is to support operational reporting and analytics, where an acceptable amount of redundancy is necessary.
The solution to the problem is rather straightforward, but it requires a new way of thinking - a paradigm shift, if you will. The answer is a single, sharable enterprise-strength database for each major subject area. The key is to design each of these databases to meet the needs of all of the information consumers. No more functional silos! With the advent of new technology, there is absolutely no need to propagate data from one operational system to another, thereby creating unnecessary (and expensive) redundancy. Applications can now access data via shared services from common databases.
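The shared-services idea can be sketched in a few lines. The following is a toy illustration, not a reference design: an in-memory dict stands in for the common product database, and the function and application names are invented.

```python
# A sketch of shared data access: one product database, reached only
# through a single service that every application calls.
# (The dict stands in for the shared enterprise database.)
PRODUCT_DB = {
    "A100": {"name": "Widget", "unit_cost": 9.50},
}

def get_product(product_id):
    """Shared data-access service: the one way applications read product data."""
    return PRODUCT_DB[product_id]

# The warranty and finance applications call the same service rather than
# keeping private copies, so there is nothing to extract, transform or sync.
def warranty_lookup(product_id):
    return get_product(product_id)["name"]

def finance_lookup(product_id):
    return get_product(product_id)["unit_cost"]
```

Because both applications read the same record through the same service, a change to a product is visible everywhere immediately - no ETL hop, no drift.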
To make this happen, information producers must fight the "this is my data" syndrome. Information is a corporate asset and does not belong to one group or department. Therefore, databases must be designed and implemented with the entire information value chain - both producer and consumer - in mind. The entire company must be involved in defining entities and attributes and modeling relationships around major subject areas. The prevailing way of thinking must be, "This is our data!" and not, "This is my data."
To support this new way of thinking that data is a corporate asset, applications cannot be designed in a vacuum. IT must re-engineer applications to utilize common, shared databases centered on the significant subject areas of the company. This is frequently referred to as the hub-and-spoke data architecture. In this data-centric paradigm, the database, not the application, is the center of the universe.
An organization can expect to derive many benefits by eliminating redundant databases and those ETL processes that support them. Some of the benefits include cost savings in maintenance, more efficient and timely development, and improved data quality.
By retiring redundant databases, maintenance time and money are significantly reduced because there are fewer databases to maintain. Additionally, the ETL processes that support the redundancy can also be eliminated, generating further savings in maintenance.
Rather than focusing their efforts on time-intensive, costly maintenance, IT organizations can concentrate on new, value-added development. With an intelligent data strategy that utilizes shared data resources, there are fewer databases to build and ETL maps to code. Protocols that access data can be shared among the various applications and only need to be developed once. Consequently, the development lifecycle is streamlined, which translates into savings in both time and money. The IT department is quickly able to respond to changing business drivers that require technology solutions.
Real-time data is absolutely essential in the information age. Fortunately, real-time access is a natural consequence of shared data resources. By accessing the data directly in the host system, rather than copying it, the information is always current.
Data quality is another area an organization can expect to improve by removing redundant databases. Some experts argue that data redundancy is the number-one cause of data quality problems. I'm not sure that is entirely correct, but I am certain that eradicating data redundancy will significantly improve data quality.
There is a very simple rule I employ to determine whether a company has an efficacious data strategy. I begin by counting the number of ETL processes and redundant databases in the company. As a general rule of thumb, the greater the count, the less likely it is that the company is managing its data correctly.
The result of poorly managed data is far too much IT time, money and resources performing unnecessary, cost-added development and maintenance rather than producing value-added services based on sound data management techniques.
1. English, Larry P. Improving Data Warehouse and Business Information Quality. Hoboken, NJ: John Wiley & Sons, 1999.