Many business intelligence (BI) initiatives focus on the creation of the actual data warehouse and the data marts, and well-established techniques exist for modeling and implementing these two components of the BI environment. This is appropriate, because the resulting databases represent substantial intellectual capital and a memory base for an organization, driving decision making at the tactical and strategic levels. The overall process for building the data warehouse was discussed in a series of DM Review columns (published in January, February and March 2000) by one of my associates entitled, "A Data Warehouse Approach for the New Millennium." In this column, I will drill into the details of populating data into the data warehouse, which provides an organization with a solid foundation for a business intelligence solution.

The BI Environment

During the development of the BI environment, attention is often focused on the databases. However, the BI environment consists of a number of other components that are at least as important as the databases. First, meta data and data quality subsystems are responsible for managing meta data and safeguarding data quality across the BI environment. Then there are the information applications, such as query and reporting, MOLAP and data mining.

Very often, project teams consider themselves halfway to completion as soon as the databases are designed and built. In reality, they rarely are. To bring the data into the warehouse, an extract, transform and load (ETL) tool has typically been purchased and a mapping specification diagram has been created. In most complex BI environments, however, the technical complexity of multiplatform source systems and the even more complex business rules cannot be shielded entirely by the ETL tool. Starting directly with the tool can result in nonmaintainable "spaghetti," with unpredictable performance and a system that cannot run unattended.

Data Population Guidelines and Techniques

In the detailed design phase of the project, careful consideration is given to the data that needs to flow through the BI environment and to how this can be done in the optimal way (keeping the possibility of reuse in mind). The following are some guidelines and techniques for populating data into the databases:

Step 1: Define Mapping Specification. This should include the detailed source and target definitions in terms of elements and their formats. Also, high-level translation rules should be documented while taking into account volume, timing and quality issues.
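As an illustration, a fragment of such a mapping specification could be captured in a simple, machine-readable form alongside the diagram. This is only a sketch; the source and target names, formats and translation rules shown here are hypothetical, chosen to show the level of detail the specification should carry.

# Hypothetical fragment of a mapping specification: source and target
# elements, their formats and a high-level translation rule for each one.
CUSTOMER_MAPPING = [
    {
        "source": {"system": "CRM", "table": "CUST", "column": "CUST_NM", "format": "CHAR(40)"},
        "target": {"table": "DW_CUSTOMER", "column": "CUSTOMER_NAME", "format": "VARCHAR(40)"},
        "rule": "trim trailing blanks; title-case the name",
    },
    {
        "source": {"system": "CRM", "table": "CUST", "column": "BIRTH_DT", "format": "YYYYMMDD"},
        "target": {"table": "DW_CUSTOMER", "column": "BIRTH_DATE", "format": "DATE"},
        "rule": "convert the YYYYMMDD string to a DATE; reject values outside 1900 to today",
    },
]

if __name__ == "__main__":
    # Print the mapping so analysts and developers can review each element
    # together with its volume, timing and quality notes.
    for entry in CUSTOMER_MAPPING:
        print(entry["source"]["column"], "->", entry["target"]["column"], ":", entry["rule"])

Keeping the specification in a structured form like this makes it easy to review with business users and to generate documentation from it later.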

Step 2: Confirm Iteration Subject Areas. The iteration subject areas are groups of data elements that can be logically isolated into one flow of data within the BI environment, independent of another flow of data. All iteration subject areas together form the data populating subsystems between sources and the data warehouse, and between the data warehouse and the data marts. It is important to confirm these subsystems before moving forward.
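A minimal sketch of how the confirmed iteration subject areas might be recorded follows, so that each flow of data can be built and run independently of the others. The subject area names, source tables and targets are assumptions made for the example.

# Hypothetical registry of iteration subject areas: each entry is a flow of
# data that can be populated independently of the others.
ITERATION_SUBJECT_AREAS = {
    "customer": {
        "sources": ["CRM.CUST", "CRM.CUST_ADDR"],
        "warehouse_target": "DW_CUSTOMER",
        "dependent_marts": ["MART_SALES"],
    },
    "product": {
        "sources": ["ERP.ITEM", "ERP.ITEM_PRICE"],
        "warehouse_target": "DW_PRODUCT",
        "dependent_marts": ["MART_SALES", "MART_INVENTORY"],
    },
}

def confirm(areas: dict) -> None:
    # A simple confirmation check: every subject area must name at least one
    # source and exactly one data warehouse target before design proceeds.
    for name, area in areas.items():
        assert area["sources"], f"subject area '{name}' has no sources"
        assert area["warehouse_target"], f"subject area '{name}' has no target"
    print(f"{len(areas)} iteration subject areas confirmed")

if __name__ == "__main__":
    confirm(ITERATION_SUBJECT_AREAS)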

Step 3: Design the Data Population Extract, Transform and Load Process. At the highest level, the data population process consists of three processes – extract, transform and load. Within each process, certain functions will need to be performed. For instance, the extract process can use a combination of extraction, filter and transfer. Functions present in the transform process include lookup, currency conversion and enrichment. The load process also contains a number of unique functions. Figure 1 lists reusable functions that together can fulfill the requirements within these ETL processes.

Function – Description
Extraction – Get the data out of the source system
Filter – Select only the required data
Feeder data integration – Integrate different data files/tables from the same source system
Data delta detection – Detect changes within the source systems and determine whether they are relevant for the data warehouse
Data transfer – Transfer the data between the source system environment and the data warehouse platform
Cross-feeder data integration – Integrate different data files/tables from different source systems
Cleansing – Cleanse the data
Data quality handling – Apply quality measures to the data
Enrichment – Combine the data with items that enrich its meaning
Lookup – Perform a lookup between the data and a specified table
Currency conversion – Convert to a standard currency
Algorithmic function – Perform a specific algorithmic function on the data
Build – Create the target record
Audit – Perform source and target checks
Load – Transfer the data into the data warehouse table
Post-load – Perform after-load activities (delete temporary tables, re-create indexes)
Archive – Create an archival export file
Backup – Create a backup file
Meta data and statistics – Update the meta data and statistics concerning the run of the iteration subject area

Figure 1: Table of Functions

Other drivers that are critical in the design phase are performance, maintainability and reuse. A graphical representation of the flows is also a very strong aid in this step.
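To illustrate the idea of reusable functions, the sketch below chains a few of the functions from Figure 1 (extraction, filter, lookup, build and load) into one pipeline for a single iteration subject area. The function bodies are placeholders and the table and column names are assumptions; the point is that each function is written once and then reused by every subject area.

# A minimal sketch of reusable ETL functions composed into one pipeline.
# Each function takes and returns a list of records (dicts) so the same
# building blocks can be shared across iteration subject areas.

def extraction(source_table):
    # Placeholder: in practice this reads from the source system.
    return [{"cust_id": 1, "country_code": "US", "amount": 100.0}]

def filter_records(records, predicate):
    # Keep only the records required for the data warehouse.
    return [r for r in records if predicate(r)]

def lookup(records, table, key, new_field):
    # Perform a lookup against a reference table and enrich each record.
    return [{**r, new_field: table.get(r[key], "UNKNOWN")} for r in records]

def build(records):
    # Create the target record layout expected by the warehouse table.
    return [{"CUSTOMER_ID": r["cust_id"],
             "COUNTRY_NAME": r["country_name"],
             "SALES_AMOUNT": r["amount"]} for r in records]

def load(records, target_table):
    # Placeholder: in practice this would call the database's fast loader.
    print(f"loading {len(records)} rows into {target_table}")

def run_customer_iteration():
    records = extraction("CRM.CUST")
    records = filter_records(records, lambda r: r["amount"] > 0)
    records = lookup(records, {"US": "United States"}, "country_code", "country_name")
    records = build(records)
    load(records, "DW_CUSTOMER")

if __name__ == "__main__":
    run_customer_iteration()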

Step 4: Implement the Iteration. The functions from the design phase can now be translated into the ETL tool, using the facilities it provides in combination with the overall environment (e.g., fast loader, SQL). The implementation of each iteration should then be placed under a scheduling mechanism to automate the loading of data.
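As a sketch of the scheduling idea, each implemented iteration can be registered with a simple scheduler so the loads run unattended. The example below uses only the Python standard library and hypothetical iteration names; in practice, the ETL tool's own scheduler or the platform's job scheduler would take this role.

import sched
import time

# Hypothetical implemented iterations; each would invoke the ETL tool or the
# pipeline built in Step 3 for its subject area.
def load_customer():
    print("running customer iteration")

def load_product():
    print("running product iteration")

def schedule_repeating(scheduler, delay_seconds, job):
    # Register the job and re-register it after each run so it repeats.
    def run():
        job()
        scheduler.enter(delay_seconds, 1, run)
    scheduler.enter(delay_seconds, 1, run)

if __name__ == "__main__":
    s = sched.scheduler(time.time, time.sleep)
    # For the sketch, the "nightly" interval is shortened to a few seconds.
    schedule_repeating(s, 5, load_customer)
    schedule_repeating(s, 7, load_product)
    s.run()  # runs the registered iterations unattended until interrupted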

This approach will support the creation of a data populating subsystem for the BI environment. Real-time data population may also be required; feeding massive amounts of data in real time will place new requirements on the populating subsystems in combination with the database subsystems. I will discuss this in a future column.

The use of a structured approach in the design and development of the data populating subsystem is highly recommended. It will help you avoid unnecessary surprises in system performance and will support automation – an unavoidable requirement as we enter the digital business era.
