Free Site Registration

Project Management Requirements Unique to Data Warehousing

Information Management Special Reports, May 2007

Alan Schlukbier

Recently, several project managers approached me about some common misconceptions about data warehousing. Knowing that I had worked for over 15 years as a consultant and completed over 100 data warehouse contracts, they urged me to publicize what is unique to data warehouse projects from a project management point of view.

The name data warehouse is largely a misnomer. A data warehouse does not contain all possible information about every subject area in a business. Nor is it a structure appropriate only to very large organizations. It can be employed very well in smaller organizations if their products or services are very complex and in need of better process coordination.

Data warehousing is essentially a logical technique used to dimension certain comprehensive facts in a chronological fashion for the purposes of measuring change in certain activities in a business or government enterprise, for example, a data mart that contains all the revenues and expenses associated with products within a business. Statistical analysis of expenses of certain periods can reveal patterns of change peculiar to a business. These patterns can be statistically measured and plotted and provide predictive behaviors for future activity.

Advertisement

Most project managers have difficulty initially understanding that a data warehouse project only involves one real process, the loading of a fact table. A star schema data structure consists of one child table that contains a limited number of facts which are fully dimensioned by parent tables that contain all possible permutations of related dimensions which fully describe the facts. Typically, a business subject area contains fact tables of revenues and expenses dimensioned by date, location, department, product, region, customer type, etc. Government and scientific schemas often contain more facts and more dimensions.

Once most project managers understand that there is essentially only one process within a data warehouse, they begin to understand the importance of the initial analysis that precedes the design. For this reason, the charter or mandate must contain a lot of analysis that will validate the business objectives through the content of the star schema(s). It is not unusual to have complete designs for a data mart before the mandate is presented. Senior management must also have very clear commitments to the ETL process and tools because these are the main resources required to load the star schema or operational data sources. So it often seems to the inexperienced PM that there is very little to do after the mandate or charter is accepted. Actually, it is foolish to even raise a data warehouse solution as a project unless a fairly large portion of preliminary analysis is done. For example, the first question that must be addressed is deciding about the subject areas. Ralph Kimball, father of the data mart, insists on a complete inventory of all possible subject areas for a business or enterprise. This can provide an excellent baseline for all future data marts and offers greater visibility to metadata consistency and the conformance of dimensions. It provides a map of tactical possibilities for the efficient implementation of star schema designs, taking advantage of repeated dimensions that can be conformed to consistent groupings. Certain dimensions such as date, time, organization, location, etc. must be consistently defined or comparisons of different facts would yield inconsistent results. It would be like comparing apples and oranges. Take organization, for example. Expenses aggregated by different organizations could not be compared. A department consisting of five units cannot be compared to the same department which only consists of three units. This can easily happen if consistent definitions are not established for these departments. Conforming these groupings or dimensions is, therefore, significant to the results. Some of these groupings used in the dimensions will also change over time and must all change consistently. Department A may be vastly different from the same department 10 years earlier. Rules must be established to consistently load facts for this department over different time periods. This is no different than what auditors do in preparing Pro Forma financial comparisons from different time periods. It is all right that the composition of Department A changes from five to 10 units in a 10-year period as long as all other financial comparisons use the same rule and change consistently.

Slowly changing dimensions are just one issue that awaits the ETL programmers as they start loading the star schema dimensions and then hook these elements to the facts. Because it may take several months to load a fact table of 10 years, it is very important that the QA program be in top gear to validate all the source data and the ETL used to load it. Very few organizations have the gumption to discover a design flaw and then reload a fact table. This is why so much analysis has to take place before loading is begun. ETL programmers are like the short-order cooks of a good restaurant. When all the right ingredients are well labeled and accessible, cooking can go quite smoothly. But when people have to stop processes to look up definitions and argue about the sources, it can be hell. The project plan must document, prove and validate all sources, their definitions and thoroughly test the ETL before loading can begin. This means a complete backup and recovery plan for all data sources, transformations and "pulls" must be organized, documented and thoroughly tested. It is also a good time to think about how to unload a fact table once the initial time period for a subject area is reached. Too many organizations simply abandon fact tables because of errors, corruption or because they become too unwieldy and slow.

The data warehouse also has to demonstrate that it meets the original business objectives. Reviews and lessons learned are too frequently ignored in the zeal to complete the next star schema. Therefore, most organizations need a good plan to understand what they are getting in to and to plan with some automated tools such as an ETL tool. There are other tools, as well, such as repositories for the source data and tools to schedule and report on "pulls" of data loading. All of these tools are expensive, but the ETL tools can pay for themselves in the first few months of operation.

Page 1 of 2.

Advertisement

Advertisement