The pace of business is faster than at any time in history. Demands for information on customers, products and finances are higher than ever. Acquisitions and mergers are commonplace, causing great change in the business environment. At the same time, staff has been cut to the bone during an economic slowdown. All these factors combine to put an enormous squeeze on reporting systems and the staff that maintains them.

How can an IT organization cope with these conflicting constraints and demands? One way is to implement a reporting system centered on a data warehouse architected on principles that facilitate rapid change. This yields an "agile" data warehouse that enables the staff to make swift changes, thus satisfying reporting demands.

Six Architecture Principles for an Agile Data Warehouse

Here are six architecture principles that can be used to create an agile data warehouse. Each one contributes toward the goal of enabling rapid change to the warehouse. I was recently part of a team that implemented a warehouse focused on these principles. The team is currently enhancing the warehouse to meet rapidly developing business needs. These principles stand out as keys to our responsiveness. They are:

  1. Assemble a staff conditioned to change
  2. Use meta data-driven development tools
  3. Isolate subjects and data tables
  4. Bite-size analysis through denormalization
  5. Use surrogate keys
  6. Use frequent source system feeds

Your existing data warehouse may already use one or more of these principles. Some of these can be added to an existing warehouse. Others are the infrastructure on which the warehouse is built.

Assemble a Staff Conditioned to Change

This isn't trivial. We all go through change, and often it is difficult. Attitude can become a big hurdle to responding to business requests. When selecting staff, look for workers willing and able to cope with change. The staff must come to the team knowing they will have to embrace a constant stream of requests and new requirements. These requests will push the envelope of complexity and data volume, and sometimes they will try the patience and endurance of the team members. The staff must understand and accept this when the team is selected. This is the agile data warehouse, not the monolithic, hands-off, legacy data warehouse, so expect to embrace change.

Meta Data-Driven Development Tools

Meta data can reduce the time to make changes to an existing warehouse as well as speed the design and implementation of a new warehouse. Meta data assists in rapid delivery of new information when changes are requested. Meta data-driven development tools accomplish this by storing information about the warehouse in tables. This is more than an enumeration of the tables and columns in the warehouse schema. It is information on how the warehouse tables relate to the source tables; how extraction, transformation and loading is done; the meanings of the data tables and columns and their uses; and how the warehouse data tables relate to each other for query purposes. The development toolset is the engine that builds meta data and puts it to use.

Ideally, this meta data can be accessed by tools other than the original development toolset. If the meta data is open, it is available for numerous utilities to access, resulting in a broader selection of tools. This is what is needed for the agile data warehouse. A set of standards has been created for data warehouse meta data: the Common Warehouse Metamodel (CWM)™ from the Object Management Group (OMG). If the meta data is structured according to this standard, tools from more vendors will be available to work with it. Meeting the CWM standard is preferred but not critical for meta data to be open. If the tables and columns that hold the meta data are documented and available for use, the meta data is open. This allows developers to use tools outside the original development toolset to leverage the meta data for increased productivity. In our case, the meta data tables and table relationships are documented, so we use SQL-based utilities to speed up our work. The meta data is very useful even though it does not meet the CWM standard.

What Comprises Meta Data?

For purposes of this article, I am using the term "meta data" to mean information that describes and controls how the data warehouse is built, loaded and queried. This includes table and column definitions, sourcing of columns, table relationships and keys, data movement, scheduling and execution statistics. There are other types of meta data that are not critical to this discussion. I am addressing meta data that is created and used by design, development and extraction tools. Here is a broad description of the information that constitutes our meta data (a sketch of how such meta data tables might look follows the list):

Table information: Name, description and purpose, type, label, indices, origin, relation to other tables.
Column information: Name, description and uses, data type and length, label, origin, transformation code string, key information, relation to other columns.
Schedule information: Run schedule, dependencies, scheduled groups.
Execution information: History of extracts, status of runs, last update times and related statistics.
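
To make this more concrete, here is a minimal sketch of what the table- and column-level portion of such a meta data store might look like. The table and column names are illustrative only, not the actual schema of our toolset.

    CREATE TABLE META_TABLE (
        table_name     VARCHAR(30)   NOT NULL PRIMARY KEY,
        description    VARCHAR(255),
        table_type     VARCHAR(20),        -- e.g., FACT, DIMENSION, STAGING
        subject_area   VARCHAR(30),        -- business subject the table belongs to
        table_label    VARCHAR(60)
    );

    CREATE TABLE META_COLUMN (
        table_name     VARCHAR(30)   NOT NULL,
        column_name    VARCHAR(30)   NOT NULL,
        description    VARCHAR(255),
        data_type      VARCHAR(20),
        data_length    INTEGER,
        source_table   VARCHAR(30),        -- application table the column is pulled from
        source_column  VARCHAR(30),
        transform_code VARCHAR(2000),      -- transformation code string applied during ETL
        key_type       VARCHAR(20),        -- e.g., PRIMARY, FOREIGN, SURROGATE
        PRIMARY KEY (table_name, column_name)
    );

Schedule and execution information would live in similar tables, keyed by extract job rather than by table and column.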

Examples of Meta Data Advantages

Here are some of the ways meta data-based tools speed the process of change.

  • Define new tables and columns from existing definitions. In a GUI design tool add a table by copying an existing table and modifying it as required. Drag and drop a defined column to another table where it is needed.
  • Locate all occurrences of a column and update the length in every occurrence.
  • Determine if an application column has been pulled into the warehouse and is available for query by searching the column sources and transformation code strings (queries of this kind are sketched after this list).
  • Provide a real-time data dictionary based on the meta data. Using DHTML, we developed a Web-based data dictionary. Cold Fusion running against the meta data tables produces indices of subject areas and tables, with the ability to drill down on a table to see its columns, description, indices and keys.
  • Provide information to debug or "debunk" problems with execution meta data. Each row in our data tables carries a key to an "audit" dimension. That table has execution information from the scheduler that can be used to locate, modify or roll back changes made during an extract.
  • Compare versions of meta data to find differences. This is used for new development and installing new releases. It is very effective due to a GUI merge tool that locates differences and assists in resolving the differences.
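
As an example of putting open meta data to work, the following queries (run against the hypothetical META_COLUMN table sketched earlier) show how a developer might locate every occurrence of a column and check whether an application column has already been pulled into the warehouse. The column names in the predicates are assumptions for illustration.

    -- Find every warehouse table that carries a CUSTOMER_NUMBER column,
    -- along with where it is sourced from and how it is transformed.
    SELECT table_name, data_type, data_length, source_table, source_column, transform_code
    FROM   META_COLUMN
    WHERE  column_name = 'CUSTOMER_NUMBER'
    ORDER BY table_name;

    -- Check whether an application column (here, ATTRIBUTE7) has already been
    -- pulled in, either directly or inside a transformation code string.
    SELECT table_name, column_name
    FROM   META_COLUMN
    WHERE  source_column = 'ATTRIBUTE7'
       OR  transform_code LIKE '%ATTRIBUTE7%';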

In our data warehouse, meta data is used by many tools. Some of these tools are provided by the vendor of the development toolset, such as the extract, transformation and load (ETL) tool. We developed many others to suit our needs, including utilities for the data dictionary, extract status, report writers, end-user layers, security, analysis and implementation.

Isolation of Subjects and Data Tables

Isolating tables into groups based on each business subject area facilitates rapid change. It does so by reducing the breadth of knowledge required to understand the tables and, therefore, to make changes. The warehouse tables should be designed in groups that focus on each business subject and named so they group together by the business subject for which they are created. Each subject area should be designed to stand on its own. Connections to information in other subject areas should be limited to foreign keys that reference entities or dimensions. This breaks the warehouse into chunks, allowing a developer to understand a business subject area without having to grasp the entire data warehouse.

For example, an accounts payable (AP) analyst requests that a particular quantity be made available concerning receipts related to purchase orders (POs). The data warehouse developer has little understanding of POs and the business rules involved with receipts at the receiving dock. However, if all the AP tables in the data warehouse are grouped by naming convention and documented as a group, the developer has a limited number of tables to search through. The set of tables related to AP will quickly be located, and the developer will be able to ignore the tables for other subject areas. This also facilitates rapid understanding of the subject area, shortening the time it takes the developer to become proficient in AP.

This principle is a common practice in data warehouse design, but let’s take this example further in the next architecture principle.

While emphasizing isolation, I also must stress the importance of sharing entities or dimensions across subject areas using foreign keys. This is not contradictory. For example, GL accounts are used in many business subject areas. One GL_ACCOUNT table can and should serve all subject areas' needs for GL account information. Each subject area can refer to the table using foreign keys. This prevents "application silos" that defeat the true purpose of a data warehouse. These shared tables allow integration of subject areas and, in dimensional modeling, are referred to as "conformed dimensions." Nevertheless, the GL_ACCOUNT table is designed and named as part of the GL subject area. If the subject areas are to be set up or sold separately, the GL_ACCOUNT table would be distributed with each subject area. When a subject area is installed, the GL_ACCOUNT table would be installed only if it does not already exist.
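
As a rough sketch of this arrangement, using assumed table and column names rather than our actual schema, the shared GL_ACCOUNT dimension might be referenced from fact tables in different subject areas by foreign key only:

    -- Shared (conformed) dimension, designed and named as part of the GL subject area.
    CREATE TABLE GL_ACCOUNT (
        gl_account_key  INTEGER      NOT NULL PRIMARY KEY,   -- surrogate key
        account_number  VARCHAR(30),
        account_name    VARCHAR(60)
    );

    -- Fact tables in other subject areas refer to it only through the foreign key.
    CREATE TABLE AP_INVOICE_FACT (
        invoice_key     INTEGER NOT NULL PRIMARY KEY,
        gl_account_key  INTEGER NOT NULL REFERENCES GL_ACCOUNT (gl_account_key),
        invoice_amount  DECIMAL(12,2)
    );

    CREATE TABLE GL_JOURNAL_FACT (
        journal_key     INTEGER NOT NULL PRIMARY KEY,
        gl_account_key  INTEGER NOT NULL REFERENCES GL_ACCOUNT (gl_account_key),
        journal_amount  DECIMAL(12,2)
    );

If the subject areas were installed separately, the CREATE TABLE for GL_ACCOUNT would simply be skipped when the table already exists.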

Bite-Size Analysis Through Denormalization

The meticulous design of the warehouse data tables facilitates rapid change of these same tables later in response to change requests.

In the previous principle, a developer was asked to make a change relating receipt quantities to POs. Suppose the AP subject area has only tables named for entities such as PO, Invoice, Location, Distribution, Shipment, Receipt and Vendor. The relationships between these tables are unclear to the developer. The developer will have to get help to learn the relationships and business rules involved. Also, multiple tables and their sourcing will have to be understood before the change is made. The developer does not feel smart and begins a series of phone calls and e-mails to learn about POs and receipts.

Now contrast that situation with this scenario. Suppose the developer looks at the set of tables in the AP subject area and sees a table named RECEIPT_DETAIL_FACT. This obviously relates to receipts, and the developer notes it has a PO key. In the meta data definition, this table has the business rules relating receipts to POs. In the first few minutes of analysis, this developer has focused on one table out of the hundreds in the warehouse. He then reviews how this table is sourced, how it relates to other tables and the business rules involved. Suddenly this developer has a good handle on POs and receipts! The developer has leveraged the knowledge that was designed into the data warehouse model, and the results are a shortened learning curve and a speedy change. The developer's morale is elevated because the developer gained knowledge, feels smart and made the customer happy.

The change is faster in the second case because the developer was able to quickly learn the relationships and business rules as modeled into the RECEIPT_DETAIL_FACT table. The developer needed to understand only a single table in a single subject area to make the required change. This was all because RECEIPT_DETAIL_FACT is a denormalized table. This type of table is defined by pulling together all the information required to describe an event or transaction, using foreign keys to link in supporting tables. This technique is embodied in dimensional database design, where fact tables relate to dimension tables with foreign keys.

Dimensional database design is the defining discipline for creating isolated fact tables like RECEIPT_DETAIL_FACT. In our data warehouse, most fact tables are extracted with one SQL statement. Understanding only one complex SQL extract statement involving five to 10 application tables allows the developer to design a change. Thus, the complexity of the table relationships is broken into bite-size pieces. The developer does just enough analysis to make the change without concern about harming relationships to other tables. The meta data is changed, causing the physical table and the ETL to be adjusted. Time and again we have been able to implement a change to a fact table in 24 hours because the analysis was focused down to one table and one SQL extract statement. Adding a column to a dimension table is less demanding, often with turnaround within one workday. That is not to say that every request can be handled this easily. There are requests that require new tables for facts and dimensions. These require meticulous design, but each one is built so that subsequent changes can be made as quickly as possible.
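
For illustration only, a single-statement fact extract of this kind might look like the following. The source table names (src_receipt_lines, src_po_lines) and columns are hypothetical, not those of any particular application system.

    -- Denormalize receipt lines and their related POs into one fact row,
    -- resolving foreign keys from the dimension tables.
    INSERT INTO RECEIPT_DETAIL_FACT
           (po_key, vendor_key, receipt_date, quantity_received, receipt_amount)
    SELECT po_dim.po_key,
           ven_dim.vendor_key,
           rcv.receipt_date,
           rcv.quantity_received,
           rcv.quantity_received * pol.unit_price
    FROM   src_receipt_lines rcv
    JOIN   src_po_lines      pol     ON pol.po_line_id    = rcv.po_line_id
    JOIN   PO_DIM            po_dim  ON po_dim.po_number  = pol.po_number
    JOIN   VENDOR_DIM        ven_dim ON ven_dim.vendor_id = pol.vendor_id;

Everything the developer needs to understand for a change to receipts is contained in this one statement and the tables it touches.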

Surrogate Keys

Surrogate keys facilitate adding information from multiple application systems, as often exist in companies that have undergone mergers and acquisitions (M&A). Our project sponsors stated from the outset that we must expect acquisitions and mergers, so we emphasized surrogate keys in our architecture. Surrogate keys are our proactive strategy for combining information from multiple application systems. They facilitate combining information from new source applications into existing warehouse tables by adding dimension rows from the new systems with keys that are unique, consistent and independent of the production keys. Surrogate keys are generally system generated and non-intelligent.

Why Are Surrogate Keys Important?

When combining information from multiple application systems, duplicate keys are likely to result. Surrogate keys circumvent this problem. For example, imagine two order entry systems that are combined in one data warehouse. Suppose each application system has a customer number 100, but it represents company ABC in one system and company XYZ in the other. In this situation, customer number 100 is a "homonym": one number referencing two different customers. If the customer number were used as the primary key of the customer table, the sales to the two companies could not be differentiated. Therefore, the ETL process must assign a different surrogate customer key to each customer. The surrogate key is an additional key, internal to the data warehouse, that keeps the customers and their sales properly identified and separated.

To maximize the power of the data warehouse, the information should also be combined in a manner that conforms similar information. For example, company EFG is a customer in both application systems, but in one system EFG is customer number 200 and in the other it is customer number 300. In this situation, customer numbers 200 and 300 are "synonyms" for customer EFG. All sales for EFG should be conformed under one customer key for maximum analytical power. In this case, the same surrogate customer key is assigned to EFG's sales from both application systems.
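
One common way to implement this, sketched here with a simple key map table rather than our vendor's actual mechanism, is to keep a cross-reference of source system, production key and surrogate key:

    -- Each (source system, production key) pair maps to exactly one surrogate key.
    CREATE TABLE CUSTOMER_KEY_MAP (
        source_system    VARCHAR(20) NOT NULL,
        source_cust_num  VARCHAR(30) NOT NULL,
        customer_key     INTEGER     NOT NULL,   -- surrogate key used in the warehouse
        PRIMARY KEY (source_system, source_cust_num)
    );

    -- Homonyms: customer 100 means ABC in one system and XYZ in the other,
    -- so each pair receives its own surrogate key.
    INSERT INTO CUSTOMER_KEY_MAP VALUES ('ORDERS_EAST', '100', 1001);  -- company ABC
    INSERT INTO CUSTOMER_KEY_MAP VALUES ('ORDERS_WEST', '100', 1002);  -- company XYZ

    -- Synonyms: EFG is customer 200 in one system and 300 in the other,
    -- so both map to the same surrogate key and EFG's sales conform.
    INSERT INTO CUSTOMER_KEY_MAP VALUES ('ORDERS_EAST', '200', 1003);  -- company EFG
    INSERT INTO CUSTOMER_KEY_MAP VALUES ('ORDERS_WEST', '300', 1003);  -- company EFG

During each load, the ETL looks up the surrogate key in this map (generating a new one when the pair is not yet present) and stores only the surrogate key on the fact rows.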

Frequent Source System Feeds

To gain maximum return on investment (ROI) on our data warehouse, we determined information needed to be within two hours of real time. We considered the customer who requires information no more than two hours old. If the data warehouse can serve that person as an alternative to reporting from the application system, the application server will carry a lighter reporting load. By expanding the data warehouse user base and reducing the load on the application server, we increased the ROI of the warehouse.

Often, keeping information this fresh is accomplished using an operational data store (ODS) as part of the data warehouse. In our case, we used date/time-based data capture to implement rapid refreshing of the entire data warehouse without using an ODS. To make frequent refreshes efficient, extracts are based on the last update date, including the time to the second. This timestamp is available on every table sourced from the application system and has an index built on it. A trigger on the source table updates the timestamp on a row when it is created or updated.
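
A minimal sketch of this pattern, using Oracle-style trigger syntax and hypothetical table names (the exact syntax varies by database):

    -- Maintain the last-update timestamp, to the second, on every source row.
    CREATE OR REPLACE TRIGGER trg_orders_last_update
    BEFORE INSERT OR UPDATE ON src_orders
    FOR EACH ROW
    BEGIN
        :NEW.last_update_date := SYSDATE;
    END;

    -- Each extract cycle pulls only the rows changed since the previous run;
    -- the index on last_update_date keeps this efficient.
    SELECT *
    FROM   src_orders
    WHERE  last_update_date > :last_extract_time;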

There is special logic on some data warehouse data tables to account for deleted rows in application tables. For example, orders can have order lines added and deleted until the time the order is "booked." In each extract cycle, we replace all lines for all orders not in booked status. This removes from the warehouse any rows that were deleted in the application.
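
A sketch of that logic, again with hypothetical table names: each cycle deletes the warehouse rows for un-booked orders and reloads their current lines, so lines deleted in the application disappear from the warehouse as well.

    -- Remove warehouse rows for every order not yet booked...
    DELETE FROM ORDER_LINE_FACT
    WHERE  order_key IN (SELECT order_key
                         FROM   ORDER_DIM
                         WHERE  order_status <> 'BOOKED');

    -- ...then reload those orders' current lines from the source system.
    INSERT INTO ORDER_LINE_FACT (order_key, line_number, quantity, amount)
    SELECT d.order_key, s.line_number, s.quantity, s.amount
    FROM   src_order_lines s
    JOIN   ORDER_DIM d ON d.order_number = s.order_number
    WHERE  d.order_status <> 'BOOKED';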

Other Factors

Consistency Required. Of course, any tool or technique can become difficult to understand when design and implementation are inconsistent. It is dangerous to take things for granted, especially in an inconsistent environment. Thankfully, one team has had control of our data warehouse from inception, and this has helped keep it consistent. The vendor of our package had clear development standards, which they enforced when building the meta data, and they educated our team in those standards during the initial installation of the package.

Purchased Knowledge. Another factor in our initial implementation was the knowledge we were given when we purchased the packaged data warehouse. I am referring to the target data model, the ETL meta data and the table relationships included in the package. Purchased knowledge is the biggest advantage of our pre-sourced data warehouse package. The meta data embodies the relationships of the source tables and target tables, encodes business rules in the ETL and provides a developer with an understanding of the source tables. The meta data educates developers as they work with it. Combined with ERDs of the source application database, the meta data helped the data warehouse team climb a very steep learning curve quickly.

Results

The end result has been satisfied customers and promotions in the IT organization. Of course, the ultimate reward for this good work is more work! In today's environment, that is a good thing. The scope of this warehouse is being expanded because it is successful.

Our company and our team are now reaping the benefits of an agile data warehouse. It has been an unqualified success. We are currently working hard to keep up with the pace of business. Our small team wouldn’t have a chance if not for the benefits derived from the principles outlined in this article.

Shortcomings of Our Implementation

There are shortcomings in every implementation. Here are some areas where we are weak or could have done better.

  1. Political support. The more, the better! Management support can often change to discontent when difficulties arise. When more funds are needed for equipment, consultants or software, you want to be sure that your project won't run out of money. When you miss a date on the schedule, you don't want management to be quick with a negative outlook and reaction. Our project took a risk on this point, but we were fortunate. We didn't have many members of management supporting us, but the ones we had were strong and committed.
  2. Measurable requirements for implementation. Do you want to know when you are finished? Do you want to be able to say, "Look what we accomplished!"? If so, you need measurable requirements. You could be in a position where you never get a "post-project" reward, because you can't get ahead of the scope creep. At project inception, define what you will accomplish and make the change management process clear. This is not to say you will reject change. These requirements are necessary to track and control change during the process. If you take reasonable steps to manage change, you will be able to roll with the punches when changes are requested. You can differentiate the necessary from the nice-to-have, and you can justify the additional time it takes to accommodate scope creep. We had some challenges because additional subject areas were added. Also, one function that was specifically excluded in the requirements returned to our team just as we went live. We didn't throw a fit about these events. We noted them as variations from the original requirements, and because they were justifiable business needs, we implemented them.
  3. Product features. Know what to look for in a packaged product and verify that it is there. Verify it thoroughly. These were critical to us:
    • Open meta data
    • Surrogate key logic
    • Dimensional design
    • Content of data warehouse tables against requirements
    • Product references, to ensure a mature product and vendor support
  4. Surrogate keys. The package we chose has surrogate keys, but some of the function was lost in the manner of their implementation, which hampers our work with acquisitions and mergers. The pre-sourced ETL meta data used surrogate key values from the application to populate the surrogate keys in the warehouse. This effectively makes the keys non-surrogate. The ETL should have generated new keys internal to the data warehouse.
  5. Package selection. The first package we selected failed. It was functional only if a lot of consulting was purchased to implement it. After two months of little progress, we dropped that vendor and made a better choice.
  6. Isolate extracts from source system changes. This isn't a big factor right now. It will strike a year or two after the warehouse is in production. If the source tables change significantly for a new release, the ETL will have to change with them. This can cause a huge amount of rework. It is one of the reasons our previous data warehouse was replaced: the man-hours needed to adapt the old warehouse to the new release of the application system were prohibitive, costing more than this new data warehouse. How do you isolate extracts from change? Use interface files if your application systems have them. Some ERP systems have tables designed for just such a purpose.
  7. Reward the staff. Financial rewards may be called for and should be planned. Also, many people committed to the project expect a sense of accomplishment and appreciation beyond monetary reward. Make a big deal about deployment, celebrate milestones and give speeches about accomplishments. Did you know that the greatest desire of a leader is acknowledgement? People want to feel like they made a difference. This means good reviews at appraisal and compensation time. Give time off within two weeks of implementation to make the most of it. The team members probably missed a lot of dinners with their families; their wives may have built up long "honey-do" lists, and their husbands may have been eating microwave dinners for months. That leads to another point: don't leave the spouses out of the celebrations. If you have a team dinner, lunch or reward ceremony, invite the spouses! They participated by making sacrifices at home, so let them participate in the reward.
