Advances in technology render real-time data warehousing a more viable option and cause organizations to consider the possibility.

 

The first article in this series addressed the questions “What is real time?” and “When and where is real time valuable?” This article addresses the question “How is real time achieved?” Facing a business justification and a requirement that a data warehouse deliver data in real time can be daunting. While this article can’t provide a detailed design for every infrastructure or architecture, it establishes a set of starting points from which a detailed design can develop.

 

When in Rome, Do as the Romans Do

 

First, consider your architecture, specifically the architecture on which the new real-time application will reside. The answer is deceptively simple: an architecture that already exists in your application infrastructure. Find the architecture in your IT shop that most closely corresponds to the concept of real time, and use it. Possibilities include:

 

  • Loosely coupled and asynchronous,
  • Tightly coupled and synchronous, and
  • Rapid batch.

The creation of your first real-time application is not the time to create a real-time architecture. If no suitable real-time architecture is available, put the creation of a real-time application on hold until a predecessor project creates a real-time architecture. Only after the real-time architecture is in place can you proceed with the creation of a real-time application.

 

Just the Facts

 

When your application infrastructure includes a real-time architecture, you must decide what data will use that real-time architecture. The implication is that, within a set of real-time data, not all the data will be integrated in real time. Typically, only transactions or events (i.e., fact data) are integrated into a data warehouse in real time. Specifically, these are the business events that occur at discrete moments in time and have been deemed to have an ROI sufficient to justify the investment in a real-time application.

 

Dimension table updates, snapshots, summaries, aggregates and fact data with a low ROI are typically not integrated into a data warehouse in real time. They may, however, be updated more frequently so that they do not degrade the timeliness of the real-time fact data. The point is that real-time data does not necessarily include all data. Rather, real-time data includes only the specific data elements for which the ROI justifies a real-time application.

 

Tax, Tag and Title

 

Now that you have a real-time application infrastructure and real-time data elements, you must decide how to move those data elements from the source system to the data warehouse. This will seem like a bit of a non sequitur at first, but rather than a nonstop flight from the source to the data warehouse, include a layover in a staging environment. Why? So your real-time application can:

 

  • Apply a tag to groups of records as they flow through, possibly including timestamps, group numbers and quasi-batch numbers,
  • Inspect the quality of the data,
  • Apply any calculations or algorithms to the data, and
  • Perform any lookups to find foreign keys to related dimension data.

This is where most real-time applications miss the mark. Real time does not mean that we sacrifice good application design, data quality and relational integrity in favor of speed. Instead, apply the principles and methods of good application design, data quality and relational integrity via a staging environment.
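
As an illustration of the tagging step, the following is a minimal sketch in Python rather than a definitive implementation. The names TaggedRecord and tag_groups, the field names and the fixed group size are all assumptions made for this example; none of them come from a particular product or from the architecture described here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from itertools import count, islice
from typing import Iterable, Iterator


@dataclass
class TaggedRecord:
    """A source record plus the tags applied in staging (hypothetical layout)."""
    payload: dict        # the source record, unchanged
    group_number: int    # quasi-batch number shared by every record in the group
    sequence: int        # position of the record within its group
    staged_at: datetime  # timestamp at which the group entered staging


def tag_groups(records: Iterable[dict], group_size: int = 3) -> Iterator[TaggedRecord]:
    """Tag records in small groups as they flow through staging."""
    source = iter(records)
    for group_number in count(1):
        # Pull the next small group off the stream; stop when the stream ends.
        group = list(islice(source, group_size))
        if not group:
            return
        staged_at = datetime.now(timezone.utc)
        for sequence, payload in enumerate(group, start=1):
            yield TaggedRecord(payload, group_number, sequence, staged_at)


if __name__ == "__main__":
    events = [{"order_id": n} for n in range(1, 6)]
    for tagged in tag_groups(events, group_size=2):
        print(tagged.group_number, tagged.sequence, tagged.payload)
```

The group number and timestamp give downstream steps a handle for auditing or reprocessing a specific group of records, much as a batch number does in a batch application. This sketch waits for a full group before tagging it; a live stream would probably also need to flush a partial group after a time limit.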

 

Typically, a staging environment is a file or table that can hold a complete batch of data, hence a batch application. In a real-time application, a staging environment resembles staging data in interactive applications:

 

  • The communication area (COMMAREA) in a Customer Information Control System (CICS) application is passed from one CICS screen (i.e., program) to another.
  • Web-based applications pass data from one page to another.

Likewise, a real-time application passes a record or set of records from one staging application to another, much like a group of CICS screens or Web pages. One staging application looks up foreign keys, another assesses the quality of the data and another reformats the data prior to loading into a data warehouse.

 

Figure 1 shows a set of real-time applications. The first application receives and catalogs input data; it then passes the data to the second application. The second application performs lookup functions to obtain surrogate and foreign keys and then passes the data to the third application. The third application assesses the quality of the data and then passes the data to the final application, which loads the data into a data warehouse.
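
The chain of staging applications described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a reference design: the stage functions, the CUSTOMER_KEYS lookup table, the UNKNOWN_KEY placeholder and the "negative amount" quality rule are all hypothetical, and an in-memory list stands in for the data warehouse.

```python
from itertools import count
from typing import Callable, Iterable, List, Optional

Record = dict
Stage = Callable[[Record], Optional[Record]]

# Hypothetical dimension lookup: natural key -> surrogate key.
CUSTOMER_KEYS = {"C-100": 1, "C-200": 2}
UNKNOWN_KEY = -1  # assumed placeholder for customers not yet in the dimension

_arrival_seq = count(1)


def receive(record: Record) -> Record:
    """Stage 1: receive and catalog the input record."""
    return {**record, "arrival_seq": next(_arrival_seq)}


def lookup_keys(record: Record) -> Record:
    """Stage 2: resolve the foreign key to the customer dimension."""
    key = CUSTOMER_KEYS.get(record["customer_id"], UNKNOWN_KEY)
    return {**record, "customer_key": key}


def check_quality(record: Record) -> Optional[Record]:
    """Stage 3: reject records that fail a basic quality rule."""
    if record["amount"] < 0:
        return None  # a real application would route this to an error queue
    return record


def run_pipeline(records: Iterable[Record], stages: List[Stage],
                 load: Callable[[Record], None]) -> None:
    """Pass each record from one staging step to the next as it arrives."""
    for record in records:
        current: Optional[Record] = record
        for stage in stages:
            current = stage(current)
            if current is None:
                break  # record rejected; skip the remaining stages
        else:
            load(current)  # final stage: load into the warehouse


if __name__ == "__main__":
    warehouse: List[Record] = []
    incoming = [
        {"customer_id": "C-100", "amount": 25.0},
        {"customer_id": "C-999", "amount": 5.0},
        {"customer_id": "C-200", "amount": -1.0},
    ]
    run_pipeline(incoming, [receive, lookup_keys, check_quality], warehouse.append)
    print(warehouse)
```

Each stage handles one record at a time and hands it on, so no step waits for a complete batch. In practice the stages would likely be separate programs connected by queues or messages rather than function calls in one process.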

 

 

Throughput, Throughput, Throughput

 

This last piece of the puzzle is similar to the saying about real estate: What are the three keys to a good piece of real estate? Location, location, location. The same is true of real-time data. What are the three keys to a good real-time application? Throughput, throughput, throughput.

 

This is where the investment in real time starts to sound like a real investment. Real-time data requires significant investment in real-time hardware, licenses, bandwidth and anything else that contributes to throughput. The best architecture and design will fail miserably if not implemented with enough throughput to handle peak data volumes. After the investment in a real-time application, you don’t want to explain that data appears an hour late because it’s a period of peak data volume. An application with that kind of throughput is not a real-time application; rather, it’s a very expensive batch application.
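
To see why peak volume matters more than average volume, consider a rough back-of-the-envelope model, sketched here in Python. It assumes a fixed processing capacity per interval and a simple first-in, first-out backlog; the function name delivery_lag and the sample numbers are invented for illustration and are not measurements from any system.

```python
from typing import List, Sequence


def delivery_lag(arrivals_per_interval: Sequence[float],
                 capacity_per_interval: float) -> List[float]:
    """Estimate delivery lag, in intervals, at the end of each interval.

    Records that cannot be processed within an interval carry over as
    backlog. The lag is the time needed to drain that backlog at full capacity.
    """
    if capacity_per_interval <= 0:
        raise ValueError("capacity_per_interval must be positive")
    backlog = 0.0
    lags = []
    for arrivals in arrivals_per_interval:
        backlog = max(0.0, backlog + arrivals - capacity_per_interval)
        lags.append(backlog / capacity_per_interval)
    return lags


if __name__ == "__main__":
    # Average arrivals (900 per interval) are below capacity (1,000 per
    # interval), yet the two peak intervals still push delivery behind.
    arrivals = [500, 1500, 1500, 500, 500]
    print(delivery_lag(arrivals, capacity_per_interval=1000))
```

In this made-up example the application keeps up on average but still falls a full interval behind during the peak, which is the situation the paragraph above warns about. Sizing for the peak, not the average, is what keeps the lag near zero.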

 

A real-time application is similar to a batch application. While batch applications reside on a batch architecture, real-time applications reside on a real-time architecture. The same best practice methods and designs applied to batch applications can also be applied to real-time applications. However, while batch applications stage and deliver all the data in one set, real-time applications stage and deliver data as it arrives.

 

Now that you have a real-time application, you need to know how the source system will do its very best to break it. The next article in this series will answer the questions “What causes the biggest headaches, and what can you do about them?”
