In the recent past, the architectural design of data warehouses has predominantly been ETL (extract, transform and load). However, with emerging technologies taking center stage, the architectural vision of data warehouse systems has gone through a radical shift. This shift has led to a new data warehousing process, as the previous data models were not equipped to handle the exponential increase in user data. Often, businesses were also unable to meet their critical data-related business needs due to their inability to incorporate changes into their large single data repositories. In turn, that translated into longer waiting periods for reports requested by business users. Eventually, organizations ended up dedicating resources specifically to "extracting data," performing additional and often redundant analytics and reporting activities outside the data warehouse itself.
Thus, several manually intensive and expensive processes had to be established to fill the gaps created by the limitations of the ETL-based data warehouse architecture. The strategic solution was the transition to the ELT (extract, load and transform) data warehouse design.
Importance of Data Warehouses in Banking and Financial Services
Banking and financial services involve numerous data-intensive processes, from withdrawing money from an ATM to suggesting the most lucrative investment portfolio to a high-value client. This makes data warehouse testing extremely critical in this domain. Figure 1 shows some examples of scenarios that involve data warehouse processes in specific financial services domains.
Figure 1: Data warehouse system dependability in financial services sub domains
ETL, Focusing on Desired Data Output
The ETL data warehouse technology is driven primarily by the report requested by the business user. The approach is top-down. First, a decision is made regarding the desired output data (i.e., the format and content of the report demanded by the user), and then the required data is extracted, transformed and loaded, as seen in Figure 2. With the focus on output, the input data must be processed to remove any data anomalies, ensuring high-quality data for transformation and loading.
Figure 2: The standard ETL process
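The top-down ETL flow can be sketched as three chained steps, where the fields to extract are dictated by the report up front. This is a minimal illustration using in-memory mock data; all field names and cleaning rules are illustrative assumptions, not from a real system.

```python
# Minimal ETL sketch: the report format is decided first, so only the
# fields that report needs are extracted, then cleaned, then loaded.

def extract(source_rows, needed_fields):
    """Pull only the fields the target report requires."""
    return [{f: row.get(f) for f in needed_fields} for row in source_rows]

def transform(rows):
    """Remove anomalies up front so only clean data reaches the warehouse."""
    cleaned = []
    for row in rows:
        if row["amount"] is None:          # drop incomplete records
            continue
        row["name"] = row["name"].strip().title()
        cleaned.append(row)
    return cleaned

def load(rows, warehouse):
    """Append the transformed rows to the warehouse store."""
    warehouse.extend(rows)
    return warehouse

source = [
    {"name": " alice smith ", "amount": 1200, "branch": "N1"},
    {"name": "bob jones", "amount": None, "branch": "N2"},
]
warehouse = load(transform(extract(source, ["name", "amount"])), [])
```

Note that the incomplete record never reaches the warehouse: in ETL, the transformation gate sits before the load.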
When data warehouse systems were initially conceptualized, ETL-based systems met all the business needs, as the focus was on designing the output first. Moreover, only the data relevant to the processing needs was required. However, these systems had their limitations when needs related to future expansion arose, such as re-modification of the output data, which required a complete redesign of the input architecture of the data warehouse itself. This would be a very expensive and cumbersome process for an organization to undertake with the traditional ETL process.
Data Warehouse Technology ELT
The concept of ELT data warehouse technology came into existence because of the high business demand for error-free, high-quality input data in data warehouse systems. Here the approach is bottom-up. The data is initially extracted to a staging area, where the business rule and integrity checks are performed.
With ELT, the load and transformation processes are decoupled from each other, as shown in Figure 3. Hence, the development is split into smaller manageable chunks, which make project management, change management and the development of data warehouse systems much simpler.
ELT data warehouse systems overcome the shortcomings of ETL systems, paving the way for this transition across industries, especially in financial services and banking.
Figure 3: The ELT process
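The decoupling described above can be sketched as follows: raw rows land in a staging area untouched, and the integrity checks and transformations run as separate, independently manageable steps. The record layout and the business rule are illustrative assumptions.

```python
# ELT sketch: raw rows are loaded as-is into a staging area first;
# integrity checks and transformations run later, decoupled from the load.

def extract_and_load(source_rows, staging):
    """Bottom-up: land everything in staging without any cleansing."""
    staging.extend(source_rows)
    return staging

def check_integrity(staging):
    """Business-rule check performed inside the staging area."""
    return [r for r in staging if r.get("amount") is not None]

def transform(rows):
    """A separate, independently manageable transformation step."""
    return [{"name": r["name"].title(), "amount": r["amount"]} for r in rows]

staging = extract_and_load(
    [{"name": "alice smith", "amount": 1200},
     {"name": "bob jones", "amount": None}],
    [],
)
reportable = transform(check_integrity(staging))
```

Unlike the ETL sketch, the raw (even invalid) record survives in staging, so a rule change only requires re-running the transformation step, not re-extracting from the source.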
Achieving Thorough ETL to ELT Conversion Testing
If an organization decides to transition from ETL to ELT, a lot of planning and strategy is required from a QA perspective to ensure that the conversion is thorough and successful. In any typical re-engineering project, especially one that includes an ETL to ELT conversion, the following validation points need to be examined (note that the ETL process is referred to as the source and the ELT process as the target wherever applicable):
Structure of the target table: The ELT output table structure is usually denormalized for easy access to the input data. Processes and columns need to be mapped to ensure consistency between the ETL source systems and the ELT target systems.
Data type of the target table: Usually, the structural compatibility between the source and target databases needs validation. Accordingly, test cases are framed to ensure data quality in the target database. Finally, the output format of the data is validated to confirm that no values have been truncated.
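A minimal sketch of such a data-type check might compare source and target column definitions and flag any value that would be silently truncated by the target column width. The schema dictionaries and sample row are illustrative assumptions.

```python
# Data-type validation sketch: column types must match between source and
# target, and no value may exceed the target column width (truncation risk).

source_schema = {"cust_name": ("varchar", 50), "loan_amt": ("decimal", 12)}
target_schema = {"cust_name": ("varchar", 50), "loan_amt": ("decimal", 12)}

def type_mismatches(src, tgt):
    """Columns whose (type, width) definition differs between schemas."""
    return [c for c in src if tgt.get(c) != src[c]]

def find_truncation(rows, tgt):
    """Flag values longer than the target column allows."""
    issues = []
    for row in rows:
        for col, val in row.items():
            if len(str(val)) > tgt[col][1]:
                issues.append((col, val))
    return issues

rows = [{"cust_name": "Alice Smith", "loan_amt": "120000.00"}]
```

In practice the schema dictionaries would be read from the database catalogs rather than hard-coded, but the comparison logic is the same.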
Intermediate phases/models: Usually, the loading or the staging area is divided into intermediate models that include:
- Input data model, which contains data that is clean and completely decoupled from end users.
- Reporting data mart, which contains data that is useful for generating reports for the end user.
- Reporting view, which contains formatted data in a predetermined template.
An effective QA includes testing of all three stages, as shown in Figure 4. The data is also compared across stages to ensure its accuracy and consistency.
Figure 4: Sample intermediate stages in an ELT system
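Comparing data across these stages can be sketched with row counts plus an order-independent checksum per stage, so that the input data model and the reporting data mart can be matched even when row order differs. The stage contents are mock data.

```python
# Cross-stage comparison sketch: two ELT stages are consistent if they
# hold the same number of rows and the same order-independent checksum.
import hashlib

def stage_checksum(rows):
    """Order-independent checksum: hash each row, sort, hash the digests."""
    digests = sorted(
        hashlib.md5(repr(sorted(r.items())).encode()).hexdigest()
        for r in rows
    )
    return hashlib.md5("".join(digests).encode()).hexdigest()

def stages_consistent(a, b):
    """True when two stages agree on row count and content."""
    return len(a) == len(b) and stage_checksum(a) == stage_checksum(b)

input_model    = [{"id": 1, "amt": 100}, {"id": 2, "amt": 200}]
reporting_mart = [{"id": 2, "amt": 200}, {"id": 1, "amt": 100}]
```

The same check would be applied pairwise to each stage transition (input model to mart, mart to reporting view), adjusting for any intentional transformations between them.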
Relationships in output data models: As stated in the previous sections of this article, the main advantage of ELT processing is the easy maintenance of individual stages due to the clear segregation of each stage from the other. Let us consider a scenario where a customer’s loan information is stored in a source table and a record consists of the customer’s first name, last name, address, document name, type, loan type and the loan amount. The target ELT system would require tables on the customer, loan and the document information. After the transition from the ETL to ELT system, the following data relationships would need to be validated to ensure that they are still intact after the transition:
- Customer table to loan table.
- Loan to document table.
- Document to customer table.
- Static table testing of relationship codes.
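The relationship validations listed above amount to referential-integrity checks: after the single source table is split into customer, loan and document tables, every foreign key must still resolve to a parent record. The table layouts below are illustrative assumptions.

```python
# Referential-integrity sketch for the customer/loan/document split:
# a relationship is intact when no child row is orphaned.

customers = {1: "Alice", 2: "Bob"}
loans     = [{"loan_id": 10, "cust_id": 1}, {"loan_id": 11, "cust_id": 2}]
documents = [{"doc_id": 100, "loan_id": 10, "cust_id": 1}]

def orphaned(child_rows, fk, parent_keys):
    """Rows whose foreign key points at no parent record."""
    return [r for r in child_rows if r[fk] not in parent_keys]

cust_to_loan = orphaned(loans, "cust_id", customers)
loan_to_doc  = orphaned(documents, "loan_id",
                        {l["loan_id"] for l in loans})
doc_to_cust  = orphaned(documents, "cust_id", customers)
```

An empty result for each check confirms the corresponding relationship survived the transition; any returned rows pinpoint the broken links.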
Backup database: The master and duplicate databases would need to be validated for consistency in terms of database structure, framework and data between them.
Process validation: This entails testing the data loading and scheduling of jobs by checking the error logs and the output of the scripts used for loading. The error logs are analyzed to report any inconsistencies (e.g., an actual error occurrence not reported in the error log, or vice versa). The ELT jobs are then triggered in sequence to test the behavior of the ELT data warehouse system.
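The two inconsistencies named above (a real failure missing from the log, and a log entry with no real failure) can be cross-checked mechanically. The job names and log line format are illustrative assumptions.

```python
# Process-validation sketch: reconcile load results against the error log.
# Both directions are checked, matching the "or vice versa" case above.

load_results = {"job_a": "success", "job_b": "failed", "job_c": "success"}
error_log    = ["job_b: constraint violation"]

def log_inconsistencies(results, log):
    """Return (failures missing from log, log entries with no failure)."""
    logged = {line.split(":")[0] for line in log}
    failed = {job for job, status in results.items() if status == "failed"}
    missing_from_log = failed - logged   # real error not reported in log
    spurious_in_log  = logged - failed   # log entry without a real error
    return missing_from_log, spurious_in_log

missing, spurious = log_inconsistencies(load_results, error_log)
```

Two empty sets mean the error log and the actual job outcomes agree; anything returned is a reportable defect in the logging or the loading process.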
End-to-end validation of the ELT system: All the stages (i.e., the source, input data model, reporting model, reporting view and report flow) are to be tested using predefined input data to ensure that it moves through all stages in the expected way, generating the desired output. This end-to-end validation of the ELT system is essential for detecting inherent defects due to the interdependency of code/data (if any) between all the stages.
ELT isn’t a solution to all the problems of data warehouse systems. In some specific cases, a hybrid solution of both ETL and ELT methodologies would be the appropriate model to adopt, depending on business requirements. To illustrate this, consider a situation where a data warehouse has multiple source systems. The data from these sources can be brought to a common staging area through the ETL process and then specific ELT processes can be further applied to generate user-specific reports.
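The hybrid flow described above can be sketched as an ETL leg that consolidates heterogeneous sources into one common staging layout, followed by an ELT leg that transforms inside the warehouse per user-specific report. The source names, record formats and report logic are illustrative assumptions.

```python
# Hybrid ETLT sketch: ETL brings multiple sources into a common staging
# area; ELT then applies report-specific transformations on top of it.

def etl_to_staging(sources):
    """ETL leg: normalize heterogeneous source rows into one staging layout."""
    staging = []
    for name, rows in sources.items():
        for r in rows:
            staging.append({"source": name, "amount": float(r["amt"])})
    return staging

def elt_report(staging, source_name):
    """ELT leg: a report-specific transformation run inside the warehouse."""
    return sum(r["amount"] for r in staging if r["source"] == source_name)

staging = etl_to_staging({
    "core_banking": [{"amt": "100"}],
    "cards":        [{"amt": "50"}, {"amt": "25"}],
})
```

Adding a new report only touches the ELT leg; adding a new source only touches the ETL leg, which is the practical appeal of the hybrid model.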
However, having an ETL, ELT or ETLT as a data warehousing solution is not enough. A clear understanding of the vision of the data warehouse output from an end user’s perspective is essential for a successful data warehouse testing validation. It’s also important to comprehend the business context and clearly define the input data, their interrelationships and the associated processes in the data warehouse system to generate correct and relevant reports for the business/end user.
Sundaresa Subramanian has 11 years of experience in software testing and digital signal processing research and development. He is a Senior Project Manager in the Financial Services testing practice at Infosys Limited. He has diversified experience in project management, client interfacing and R&D, and has strategized and program-managed several functional and data warehousing QA engagements. He has authored and presented papers at national and international conferences on topics including risk identification, data warehouse test strategizing, mobile QA, specialized QA services and project management. He can be reached at email@example.com.