One of the primary reasons for the failure of a data warehousing project is lack of an appropriate architecture. This article talks about a new architectural framework for data flow from the source systems all the way through the presentation layer and outlines the questions one must ask before selecting a particular component within the architecture framework.
The 5x5 Architectural Framework
Figure 1 shows an outline of the 5x5 architectural framework for an enterprise data warehouse in a medium to large organization. This framework contains five stages of data movement and storage combined with five sets of processes to move data from one logical stage to the next. Of course, the number of components and the size of each component will vary from one project to another based on a number of factors, including, but not limited to:
- Size and regionalization of the organization,
- Number and size of the source systems,
- Strategy for future growth,
- Type and distribution of operational data across the organization, and
- Willingness and level of sponsorship from senior management in support of decision support systems in general.
Figure 1: The 5x5 Architectural Framework
The Five Stages of Data Movement and Storage
The following list details the five stages and offers typical questions one must ask before deciding if a stage should exist or not.
The operational databases and other external systems constitute the starting point in the overall data flow. Source systems could include tables from relational database management systems, flat files from mainframes, spreadsheets and other unstructured data.
This is the first stage of data movement from the source systems. There is rarely any data modeling activity performed at this stage. Data in this area can be used for a variety of purposes, such as operational reporting. There is no need to carry out any transformations from the source to the staging area.
Figure 2: Stage One
At this stage, data still resides in a normalized form closer to the source system data model. The data granularity is exactly the same as the source systems. No summarization or aggregation is required at this stage. Cross-application reporting is performed here because data from different sources is available in its original form and quality. True operational reporting is available here.
If summarization is needed, the same can be achieved in two ways: 1) by creating a set of programs that would run at specified intervals and compute summarizations before storing to the summary tables and 2) by using the relational database management system (RDBMS) utilities, such as materialized views in Oracle.
Organizations and smaller data warehousing projects may skip this stage if the online transactional processing (OLTP) database can also be used for operational reporting purposes and no cross-domain reporting is required at this point. Whether the OLTP database can be used for reporting depends on a number of factors, including server features, the utilization level by the transactional system and available resources within the server, network and the database.
Figure 3: Stage Two
This is one of the key stages in the data warehouse framework. Typically, the data model within the data warehouse is normalized. Data integration takes place within this stage. This is the central data repository for the enterprise. No reporting activities are allowed from the data warehouse. Data needed by data marts and other entities within the organization are extracted out of the central data warehouse. No summarizations or aggregations are performed
Figure 4: Stage Three
This stage consists of a number of data marts with data extracted, transformed and loaded from the central data warehouse. Each data mart could consist of data related to one or more subject areas. The data marts are designed based on the business requirements. Summarized data is available within the data marts.
Figure 5: Stage Four
This stage is necessitated by the business intelligence (BI) tool in use. Typically, if the data mart contains multiple subject areas, this stage can be used to create multidimensional databases (MDDBs), also called cubes, for each of the subject areas used by a specified user community. MDDBs can also be created with data spanning across multiple subject areas if requirements exist.
Figure 6: Stage Five
The Five Processes
The following sections discuss the five processes involved in the 5x5 architecture framework.
This is the process of moving data from internal as well as external sources to the staging area. The tool used will probably depend on the number of source systems, the database platform and operating system of the source systems, the volume of data being extracted and the extraction rules being applied.
As shown in Figure 7, the extraction and load mechanisms are pretty straightforward. No data transform is needed here.
The process could be either a replication process (where near real-time operational reporting is required) or a batch extract, transform and load (ETL) in other situations.
Figure 7: Process One
During this process, data is fed from staging into the operational data store (ODS). Complex transformations may not be needed. Storing historical data for up to a year is very common. Historical changes (transformations similar to slowly changing dimensions) are not required because the ODS caters to the operational reporting for the organization.
Additional columns may be available within each of the tables to store information such as when the rows were loaded or last updated. This helps design the incremental loading strategy from ODS to the central data warehouse.
If the intent is to use ODS for near real-time reporting, then a data replication tool can be used. If both staging and ODS are present within the architectural framework, then the same replication instance can be used to push data from the source systems to the staging area and the ODS.
Figure 8: Process Two
This process involves the data load from the ODS to the central data warehouse. Complex ETL processes are involved, and organizations typically resort to a commercial off-the-shelf software suitable to handling the volume of data. This is where transformations related to slowly changing dimensions are carried out. Data is stored at the most granular level possible. The volume of historical data to be stored depends on the business needs of the organization. Lately there have been legal mandates in various business sectors to retain data for more than five years for a number of reasons. Batch ETL processes are used here.
Figure 9: Process Three
Process four moves data from the central data warehouse to the data marts. During this process, summarizations and aggregations are performed as required by the business needs for the end-user community the data mart is catering to.
The process may include simple data extracts as required by various business units within the organization.
Oftentimes there may not be any aggregation and/or summarization for the data marts if the intent is to get detailed information from the data mart, and speed of report execution is not an issue (in situations where reports can be scheduled to run at night, and the results cached for viewing the next day).
Figure 10: Process Four
This process is typically driven by the BI/OLAP tool being used. The tool connects to the data mart, creates MDDBs as needed and stores them on a server for access by the users at a later date. Depending on the architecture of the tool being used, this MDDB may or may not contain data. If no data is present, the data mart is accessed on the fly to get the latest information.
Figure 11: Process Five
This is a standard generic framework for an enterprise data warehousing architecture. Criteria for including or excluding a specific component within the overall framework have been discussed. I hope that this will guide data warehouse/BI architects to pick the best architecture for their data warehousing efforts.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access