William wishes to thank Mike Cross (mcross@racenter.com), data warehouse director at Rent-A-Center, for his contributions to this month's column.

Over the next few columns, Mike Cross and I are going to address a myriad of best practice data warehouse architecture subjects that we believe have been left out of the literature or at least not dealt with on a detailed, implementation level.

What Really Belongs in a Data Warehouse

There is a notion that all of an organization's data, both current and historical, needs to be in the enterprise data warehouse, regardless of its current or future necessity. In today's world of exploding data, limited resources and viable architectural alternatives, this is not very practical.

Within most enterprises, there are three general areas where data is stored and maintained. Each of these serves different specific purposes and should be populated and maintained uniquely. The operational systems supporting the day-to-day operations are often transactional and are concerned with the here and now. Data warehouses provide historical depth and breadth, are designed to provide an enterprise view of data and are the foundation for data mining and sophisticated business intelligence (BI). Data marts are primarily for specific, parochial requirements, alleviating cycles from the data warehouse and utilizing application-specific transformations. None of the three should contain the exact same data or try to act as substitutes for each other.

All operational systems have data that is of little or no use in a data warehouse, and every attempt should be made to resist loading this data. Have the courage to say no! The Rent-A-Center (RAC) point-of-sale system, has 10 different fields for free-form text comments about a customer, from where to deliver their big-screen TV to the best time to call. All of the fields are valuable to the store personnel and have operational significance (when used in their prescribed manner), but offer little, if any, useful information for the purpose of BI or data mining. And because RAC treats the customer information as a slowly changing dimension, every change in customer information creates a new record in the warehouse.

From an historical perspective, do we care what an individual customer's delivery note was on a specific day? We couldn't think of a business reason to include it, so that data doesn't make it to the data warehouse. If, however, the answer to a question is yes, we press hard for the true business requirement and do not just pick it up because it is customer data. This practice, compounded, could result in a data warehouse so bloated that load, query and backup cycles are noticeably elongated. The defense that the users wanted all possible data begins to pale at that point. Even if this is a real-time, operational data warehouse, it is no substitute for the operational system, which should also be considered for modification as appropriate.

However, you may load it in the staging area if you have one. If you want to bring it into the data warehouse at some future point, your ETL is already primed for it. Some architects go a step beyond and keep history in the staging area as well - after all, if you want a field, you want its history, too. In this case, you are also rolling the dice that the field is exactly as you will want it in the data warehouse (when you want it in the future) because you wouldn't bother to do transformations on data that you will not be bringing into the warehouse now. That is too much of a stretch, so limit your source system "triage" to ETL and age-off your staging area every few days or weeks.

In general, data marts are aggregates, contain application-specific transformations and redesigned representations of the data warehouse data and are optimized for reporting and quick analysis. Data marts can also serve as repositories for transitory data that needs to be reported but not maintained historically. A primary example of such is the RAC data mart for exception reporting. As with any retail business, operational exceptions happen in stores almost every day, e.g., price overrides and missing merchandise. Operations should have an interactive system that identifies these exceptions and allows for the entry of comments and explanations that are reviewed at different levels with the organization. The exception items are derived from information within the warehouse, which feeds a data mart for the entry and review of comments. These comments, however, are never fed back into the warehouse, as they have no real historical significance.

Source systems may be excluded from the data warehouse because of the evolving capabilities of enterprise information integration (EII) approaches and EII's ability to combine data from the data warehouse and an operational system in a single, albeit limited, query. If this technology is enabled at your organization, given EII's advancement in handling multiple databases in multiple formats, referential integrity, XML and basic transformations, it may serve as the method for appropriate, selective exclusion from the data warehouse. EII still has a long way to go (query tuning, two-phase commit, business metadata, memory constrained, etc.). Data warehouses are still absolutely vital, but EII shows promise and is another factor chipping away at the need to overload the data warehouse.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access