Please explain the architecture of a data warehouse or direct me to books with figures so that I can visualize it.


Doug Hackney’s Answer: Check out the resource library at www.egltd.com, look in the PowerPoint presentations for DW 101 and Federated BI Architectures. There is a complete visualization of the basic concepts there.

Mike Jennings’ Answer: The Data Warehouse Lifecycle Toolkit by Ralph Kimball, Laura Reeves, Margy Ross and Warren Thornthwaite; Building the Data Warehouse by Bill Inmon.

Chuck Kelley’s Answer: Good books that you should read are 1. The Corporate Information Factory (Bill Inmon, Claudia Imhoff, Ryan Sousa); 2. The Data Warehouse Lifecycle Toolkit (Kimball, Reeves, Ross, Thornthwaite); 3. Building the Data Warehouse (Bill Inmon).

Joe Oates’ Answer: You can consult Ralph Kimball’s Data Warehouse Lifecycle Toolkit or one of the several books that Bill Inmon has written.

Les Barbusinski’s Answer: Whoa! That’s a lot to ask for in a small column like this one, but here goes:

A typical data warehouse consists of the following components: a data acquisition subsystem, a storage management subsystem, an operational reporting subsystem, an analytics engine, an information delivery/distribution subsystem, meta data and infrastructure (see Figure 1).

Figure 1: Components of a Typical Data Warehouse

The data acquisition subsystem is responsible for bringing new data into the data warehouse. It uses extract, transformation, load (ETL) tools such as Informatica, Ascential DataStage, Sagent, Data Junction and others to:

  • Extract data from a variety of internal and external source systems. (This may involve the use of direct SQL queries, FTP or messaging tools such as BEA WebLogic, IBM MQ-Series, Sonic-MQ, etc.)
  • Stage it in the data warehouse (either as flat files or staging tables).
  • Synchronize the data (i.e., resolve timing issues between source systems).
  • Cleanse the data (i.e., apply business and integrity rules).
  • Transform the data (i.e., apply consistent formats and domains of values).
  • Integrate the data (i.e., resolve overlaps and intersections).

The resulting data is usually stored in two different data stores: a historical data store (which holds master, historical and transactional data in a normalized format) and an operational data store (which stores a selected subset of data in a denormalized format). The latter is used by the operational reporting subsystem to generate lists, transactional and exception reports using such tools as Crystal Reports and Actuate.
The analytics engine is responsible for transforming the cleansed and integrated data in the historical data store into actionable information. It does this by first aggregating and reorganizing the data into one or more derived data structures:

  • Simple extract files and/or tables,
  • Dimensional tables (i.e., star schemas or snowflakes),
  • Multidimensional hypercubes.

The analytics engine then puts this derived data through a variety of tools to generate analytical reports, alerts, statistical analyses, forecasts, classifications and segmentations, business models, rule and simulations. These tools include:

  • Business intelligence tools (e.g., Essbase, MicroStrategy, Business Objects, Cognos, Brio, etc.).
  • Data mining tools (such as those from IBM, HNC, SAS, SPSS and others).
  • Statistical analysis tools (e.g., SAS, SPSS, etc.).
  • Data visualization tools (e.g., ClearSight, DataVista, nVizN, etc.).
  • Clickstream analysis tools (such as those from Ascential, Informatica, Oracle, WhiteCross, WebInsight and others).
  • Forecasting tools (e.g., DecisionTime, Forecast Pro and others).
  • Predictive modeling and simulation tools.
  • Geo- spatial/mapping tools (e.g., SAS/GIS, MapScape and others).

The storage management subsystem is responsible for controlling the storage and propagation of data within the data warehouse. This includes such functions as security administration, backup and recovery, archiving, replication, gateway administration, checking data integrity, etc.
The information delivery subsystem is responsible for distributing two things to the user community:

  • The transactional and exception reports generated by the operational reporting subsystem, and
  • The actionable information (i.e., analytical reports, statistical analyses, alerts, forecasts, etc.) generated by the analytics engine.

This can involve a variety of "channels" including dedicated Web sites, Web portals (via a Web services API), e-mail distribution, message brokers (e.g., utilizing MOM or EAI technologies), wireless transmissions (e.g., alert messages and/or e-mails sent to PDAs, cell phones, pagers, etc.), embedded analytics in BI or neural net applications (e.g., fraud detection), etc.
Meta data is the "glue" that holds the data warehouse together. It maps source data elements to DW data elements (i.e., tracking the "information pedigree"), provides business and technical definitions for all DW data elements, stores transformation and integrity rules used by the data acquisition subsystem and many other things.

Infrastructure refers to all those internal functions that make the data warehouse run smoothly, but are not directly involved with maintaining the data. This includes such things as job scheduling, workflow management, directory management, file versioning, event tracking, etc.

Scott Howard’s Answer: Two of my favorites for the DW beginner: Building the Data Warehouse for Decision Support by Vidette Poe which introduces how you may benefit from a DW and Barry Devlin's Data Warehouse, from Architecture to Implementation which is currently used in many universities' introduction to DW syllabi. Both contain many figures and a detailed justification for their recommended approaches.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access