
Real-Time Data Warehousing Defined

  • October 08 1999, 1:00am EDT

We are now poised for the most radical transition in the short history of data warehousing. We are about to witness the arrival of real-time data warehousing (RTDW). Despite its name, the impact of this change is due less to the increase in speed of availability (which is substantial) than to a significant reduction in overall processing complexity. We will replace a brute force process with elegant, yet powerful, simplicity.

Real-time data warehousing will eliminate the artificial bottlenecks we have designed into our solutions from the very beginning. It will solve some of the most vexing data warehouse problems such as change data capture. It will allow us to return to co-engineered, end-to-end system architectures.

So what is RTDW? It is a historical and analytic component of an enterprise-level data stream. This data stream supports continuous, asynchronous, multipoint delivery of data. In other words, data moves straight from the originating source to every point of use that does not require some form of staging. This movement takes place soon after the original data is written. Any time delays are due solely to transport latency and (optionally) minuscule processing times to dispatch or transform the instance of data being delivered.
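This delivery model can be pictured as a simple publish/subscribe loop: each data instance written at the source is dispatched immediately to every registered use. The sketch below is illustrative only; the names (Broker, subscribe, publish) are assumptions, not a real product's API.

```python
# Minimal sketch of continuous, multipoint delivery: each record written
# at the source is dispatched at once to every registered consumer.
class Broker:
    def __init__(self):
        self.consumers = []

    def subscribe(self, consumer):
        """Register a downstream use (warehouse, data mart, monitor)."""
        self.consumers.append(consumer)

    def publish(self, record):
        """Deliver one data instance to all consumers as soon as it is written."""
        for consumer in self.consumers:
            consumer(record)

broker = Broker()
warehouse, monitor = [], []
broker.subscribe(warehouse.append)
broker.subscribe(monitor.append)

# One write at the source reaches every point of use with no staging step.
broker.publish({"order": 1001, "state": "taken"})
```

In practice the dispatch would be asynchronous (queued rather than an in-process call), but the shape is the same: one write, many deliveries, no intermediate extract.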

The data warehouse component of this information delivery architecture contains both a real-time and a static partition. The real-time partition gets its data an instance at a time from a transformation engine. At regular intervals, consistent subsets of the real-time data are written to the static partition. The static partition maintains the history and serves as the base for supplying dependent data marts with periodic data.
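The two-partition arrangement can be sketched in a few lines: instance-at-a-time arrival into the real-time partition, with a periodic flush of a consistent subset into the static partition. This is a conceptual sketch under assumed names, not a product implementation.

```python
# Illustrative two-partition sketch: records land in the real-time
# partition one instance at a time; at intervals a consistent subset is
# moved to the static partition, which keeps the history.
real_time_partition = []
static_partition = []

def receive(record):
    """The transformation engine delivers one instance at a time."""
    real_time_partition.append(record)

def flush():
    """At a regular interval, write a consistent subset of the
    real-time data to the static (historical) partition."""
    static_partition.extend(real_time_partition)
    real_time_partition.clear()

for qty in (5, 3, 7):
    receive({"qty": qty})
flush()
```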

Diagram and Sidebar: Components of a Real-Time Data Distribution Environment

Traditional Data Warehousing as Archaeology

Building a data warehouse is like an archaeological expedition. You generally know what you are looking for, but you have to decipher dusty old records to find where to dig. You set up elaborate scaffolding that can only be used once to get at the buried artifacts, and the work is always dirty. When you find the old relics, you must carefully package them for transport to a laboratory where they are unwrapped and reassembled for display. This is all massively labor-intensive; and, in the end, you only have a bunch of old stuff that represents an imperfect subset of what actually happened in the past.

Why is the record incomplete? A flood may wash away a whole layer of the fossil record. This is similar to a purge or selective but undocumented periodic deletions. A geological intrusion can replace something without leaving evidence of what was there before. Our operational systems allow the update of information that effectively destroys the historical record. The items that survive are those that were built to last or were intentionally preserved based on the cultural preferences and bias of that time. Since it is very expensive to store things to last, few things survive. Likewise with information systems, we have been very selective about what we keep since we generally view storage as a scarce and costly luxury.

Perfect Capture

First and foremost, real-time data warehousing eliminates the dig. The data we need now and the data we will need in the future stream out of the sources continuously as they come into being. The vast majority of the labor-intensive development effort, the 65 percent to 80 percent involved with extraction, is eliminated.

Changed data capture is the process of collecting incremental net changes so we can record history accurately. It allows us to maintain point-in-time currency. It is essential to doing data warehousing right; but it is often very difficult, if not impossible, to do after the fact by reading operational data sources in batch. In RTDW, a perfect form of changed data capture is available at no extra cost. Both the before and the after images of a record are available in memory, supporting easy and efficient delta processing. For dimensional data, you have much more freedom in how you represent slowly changing dimensions.
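When both images are in hand, delta processing reduces to a field-by-field comparison. A minimal sketch, with illustrative field names:

```python
# Hedged sketch of delta processing when both the before and after
# images of a record are available in memory.
def delta(before, after):
    """Return only the fields whose values changed, as (old, new) pairs."""
    return {k: (before.get(k), v) for k, v in after.items()
            if before.get(k) != v}

before = {"order": 1001, "status": "taken",   "qty": 5}
after  = {"order": 1001, "status": "shipped", "qty": 5}
changes = delta(before, after)
```

The same comparison drives slowly changing dimension handling: the delta tells you exactly which attributes changed, so you can choose per attribute whether to overwrite, version, or track history.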

When an event occurs, such as shipping an order, all relevant information can be captured simultaneously. Today, if we get an extract from a shipping subsystem, it is not likely to have relevant customer, product, order or invoice cross-references necessary to fully describe the event. We have to piece them together by merging data from multiple sources to artificially recreate the context of the event. Differing update timing may make some of the associations spurious. There is enormous advantage in being able to capture the full context of an event at creation time.

Quick Reconfiguration

By definition, our systems record only what we tell them to record. Even the monolithic enterprise resource planning suites cannot be designed to anticipate everything we may need to analyze or record for posterity. The ideal capture mechanism for an RTDW is an application service that uses templates (such as SAP’s IDOC) that specify a data context with an associated trigger event.

Such a template-driven mechanism allows the data warehouse designer to reconfigure an existing capture specification to include more content in hours. A wholly new specification can be ready for implementation in a matter of days. The limiting factor for availability is likely to be the turnaround time for production migration rather than development time. This is in contrast to a minimum of three to five days for building a single-purpose interface when using an extraction, transformation and loading (ETL) tool or the two to three weeks when writing it by hand.
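The template idea can be made concrete with a small sketch. This is not SAP's actual IDOC structure; the template shape, event names and field names below are all hypothetical, chosen only to show why reconfiguration is an edit to a specification rather than a rebuild of an interface.

```python
# A hypothetical capture template: it names a trigger event and the data
# context to collect when that event fires. Adding content means editing
# the field list, not writing a new extraction program.
ship_template = {
    "trigger": "order.shipped",
    "context": ["order_id", "customer_id", "product_id", "ship_date"],
}

def capture(template, event_name, source_record):
    """When the trigger event fires, project the specified context fields."""
    if event_name != template["trigger"]:
        return None
    return {f: source_record.get(f) for f in template["context"]}

record = {"order_id": 1001, "customer_id": 77, "product_id": "A-9",
          "ship_date": "1999-10-08", "warehouse_bin": "B12"}
captured = capture(ship_template, "order.shipped", record)
```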

Time Compression

As the pace of business picks up, processing cycles get compressed. Buying via mail order used to take a week or two. Express delivery services helped reduce this to two or three days. Now when you buy by phone or on the Internet, you can have your product the very next day. This means the product is delivered to the shipper the day the order is received.

Keeping pace with this speedy processing puts a strain on the traditional distinctions between operational and analytic information services. We use trended data regarding what sells well by site by hour to manage daily decisions. Actual daily sales activity is monitored to replan just-in-time inventory replenishment that, in turn, impacts longer cycle purchase planning.

A real-time acquisition and delivery engine enables on-the-fly analysis while ratcheting up seamlessly to support incremental aggregation and trend-line confirmation. The goal is to support immediate research of abnormal conditions in a manner not supported by the OLTP system. For example, a real-time transportation system getting satellite-fed data of carrier deliveries indicates that an abnormal number of shipments is unloading behind schedule. On-the-fly analysis reveals the slow handling is for customer self-unloads and that the affected shipments are a brand-new multiproduct configuration. This information is used to quickly write and transmit new breakdown instructions to these customer sites.

Incremental aggregation is an alternative to the old method of creating running totals. At regular intervals, such as every 15 minutes for securities or every 30 to 60 minutes for product orders or bank transactions, an aggregation engine runs against the real-time partition to create intermediate totals and statistics. These aggregates can be cascaded into long-period aggregates (such as daily) and then stored in the static partition of the data warehouse. Because this aggregation is done in small, frequent increments, it can actually support higher volumes than more traditional daily or weekly bulk processing.
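The cascade can be sketched as two small steps: a short-interval pass over the real-time partition, and a fold of each interval's totals into the long-period aggregate. Interval lengths and field names below are illustrative assumptions.

```python
# Sketch of incremental aggregation: short-interval totals over the
# real-time partition are cascaded into a daily aggregate.
from collections import defaultdict

def aggregate(records):
    """One short-interval pass (e.g., 15 or 30 minutes): quantity per product."""
    totals = defaultdict(int)
    for r in records:
        totals[r["product"]] += r["qty"]
    return dict(totals)

def cascade(daily, interval_totals):
    """Fold a short-period aggregate into the long-period (daily) aggregate."""
    for product, qty in interval_totals.items():
        daily[product] = daily.get(product, 0) + qty
    return daily

interval_1 = aggregate([{"product": "A", "qty": 2}, {"product": "B", "qty": 1}])
interval_2 = aggregate([{"product": "A", "qty": 3}])
daily = cascade(cascade({}, interval_1), interval_2)
```

Each interval touches only its own small slice of data, which is why the incremental approach scales past bulk daily or weekly runs.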

Trend-line confirmation builds on incremental aggregation to provide a new monitoring tool. Any time-sensitive activity, such as dropped calls at cell sites, can be compared to a baseline at close to real time. This provides much more insight and control than even active "stop lighting," which only tells you when cell sites have already overloaded. Trend-line confirmation reveals the momentum and direction of a change that is out of the ordinary soon after it begins.
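A minimal sketch of the comparison: each interval's counts are checked against a baseline, and deviations beyond a tolerance are flagged with their direction and magnitude. The 20 percent threshold and the cell-site figures are illustrative assumptions.

```python
# Sketch of trend-line confirmation: compare the latest interval's counts
# against a baseline and flag deviations beyond a tolerance.
def confirm_trend(baseline, current, tolerance=0.20):
    """Flag any series whose current value deviates from its baseline by
    more than the tolerance; the sign shows the direction of the change."""
    alerts = {}
    for key, base in baseline.items():
        change = (current.get(key, 0) - base) / base
        if abs(change) > tolerance:
            alerts[key] = round(change, 2)
    return alerts

baseline = {"cell_site_1": 100, "cell_site_2": 80}   # dropped calls per interval
current  = {"cell_site_1": 104, "cell_site_2": 120}
alerts = confirm_trend(baseline, current)
```

Unlike stop lighting, the alert fires while the deviation is still building, not after the site has already overloaded.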

Transitory States

Even if your enterprise is not (yet) running near real time, you may be prone to the effect of time compression. A key business process may be impacted by transitory states. These are short duration activities that may end in some form of reversal that leaves little or no evidence behind.

For instance, a credit hold is a temporary condition in an online order-taking system. An order goes on hold and then comes off, either with the order approved or disapproved. For an approved order, you may not know that it went on hold at all. At most, there is a residual hold indicator in the system. Disapproved orders will generally record a credit rejection reason. But, in neither case are the details of the hold event recorded. What happens when you introduce a new product that appeals to a consumer segment with lower credit worthiness? The number of orders per representative may decline drastically because of hold delays, but you cannot see this effect directly. The number of disapproved orders may increase, but this does not account for the lower number of orders per rep.

With a real-time acquisition mechanism, you can capture the on-hold and off-hold events as they occur. Due to the ability to quickly reconfigure the data stream to capture new events, you can begin analysis of a new phenomenon soon after you first notice it. Combined with other state transition events, you can create a contemporaneous record that can be used for analyzing the full order cycle.
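Captured as a stream of events, the transitory hold state becomes analyzable after the fact. The sketch below pairs each on-hold event with its off-hold event to recover hold delays per order; event names and timestamps are illustrative assumptions.

```python
# Sketch of capturing transitory hold states as events so the full order
# cycle can be analyzed later, even though the OLTP system keeps no trace.
events = [
    {"order": 1001, "event": "on_hold",  "at": 10},
    {"order": 1001, "event": "off_hold", "at": 40},
    {"order": 1002, "event": "on_hold",  "at": 15},
    {"order": 1002, "event": "off_hold", "at": 20},
]

def hold_durations(events):
    """Pair each on_hold with its off_hold to recover the hold delay per order."""
    opened, durations = {}, {}
    for e in events:
        if e["event"] == "on_hold":
            opened[e["order"]] = e["at"]
        elif e["event"] == "off_hold":
            durations[e["order"]] = e["at"] - opened.pop(e["order"])
    return durations

durations = hold_durations(events)
```

With this record, the drop in orders per representative can be traced directly to hold delays rather than inferred from disapproval counts.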

Infinite Snapshots

Many data warehouse traditionalists will already be planning their rebuttals. Real time and warehousing just do not mix! Warehouses are about snapshots that provide a consistent point-in-time view of the data. They argue that real-time data capture violates this treasured principle.

The enlightened perspective is that you cannot achieve point-in-time consistency without continuous, real-time data capture. The point-in-time that is relevant for one use may be radically wrong for another. Some uses may need to count orders taken but not released, released but not shipped, invoiced but not paid, etc. The most flexible data design does not replicate the document concepts of OLTP systems. Instead, it captures information about discrete states in a business process as they occur.

With real-time and cross-application data capture, we are finally free to implement designs based on event states rather than documents or transactions. Continuing the order example, each row in a state-based table is a snapshot of what is known about the order when it reached that state (taken, picked or shipped). In a sense, a state table is a far superior "snapshot," since every row in it represents what is known about the order at a consistent state in the process.
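A state-based table can be sketched as one row per (order, state) pair, each carrying whatever was known at that moment. The states and column names below follow the order example; the extra fields are illustrative assumptions.

```python
# Sketch of a state-based design: each row is a snapshot of what was
# known about the order when it reached a given state.
state_table = []

def record_state(order_id, state, known):
    """Append one row per (order, state) with the context known at that moment."""
    state_table.append({"order_id": order_id, "state": state, **known})

record_state(1001, "taken",   {"qty": 5})
record_state(1001, "picked",  {"qty": 5, "bin": "B12"})
record_state(1001, "shipped", {"qty": 5, "bin": "B12", "carrier": "XYZ"})

# Counting "taken but not shipped" or any other state-pair query is a
# simple filter over the table; no document reconstruction is needed.
shipped = [r for r in state_table if r["state"] == "shipped"]
```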

Inventories and balances are a unique category of snapshot. They record the quantity or level of some tangible asset as of a particular valuation point. Whether they are numbers that reflect actual inventory counts or are derived by netting in-flows and out-goes, they are defined by the discrete time of their calculation. Today, the number of relevant "as-of" times is increasing. You might have a month-end immediate balance for one use, month-end after the first weekend of processing for another and month-end final after all adjustments. Weekly, daily, shift and even instant balances now complement these long-cycle views.

The best, and possibly only, way to account for all the possible variations involves continuous sampling using real-time techniques. Any balance required for legal, analytic or other reason needs to be frozen for posterity. We do this by writing it to the static data warehouse partition. A data warehouse must be designed to collect a large number of snapshots based on the consistency goals of the driving need.

The technologies to support real-time data warehousing are coming online today. The early adopters are already developing the techniques. The ball is now in your court.
