Continue in 2 seconds

Real-Time Data Warehousing 101

Published
  • October 09 2003, 1:00am EDT

DMReview.com welcomes Simon Terr as the newest online contributor. His "Real-Time Data Warehousing" column will appear on the second Friday of each month. He will delve into the complex issues surrounding real time and data warehousing.

I recently assisted in the hiring process of an ETL developer for a client's data warehousing group. Each of the people I've interviewed had strong knowledge in one area or another of data warehousing but none had a strong overall grasp. Therefore, this introductory column will paint the bigger picture so readers will have an overall grasp of data warehousing and, in particular, real-time data warehousing. In future columns, I will address several fundamental topics that include data warehousing basics, definition of real time data warehousing, real time versus near real time, real time in online transaction processing (OLTP), the case for real time and real-time data warehousing building blocks.

Data Warehousing Basics

For starters, let's get down to the very basics and define data warehousing. Data warehousing is the process of creating, populating and querying a data warehouse. Ralph Kimball defines a data warehouse as "a copy of transaction data specifically structured for query and analysis." The creating activity encompasses the data warehouse's data model and physical database. The populating activity encompasses the ETL processes that pull data from the OLTP systems into the data warehouse. The querying activity is the gathering of information from the data warehouse. In short, data warehousing is a framework for deriving information from data.

Next, we need to address the differences between information and data. Data is a discrete fact or set of facts. Data alone has no context or meaning. Consequently there are no actions or questions that can be taken from data alone. To illustrate this consider the fictitious Acme Corporation where, in January 2003, Acme has had 1,000 widgets in its Van Nuys, California warehouse. Although there are several facts here that include what (widgets), where (Van Nuys warehouse) and when (January 2003), there are no meaningful questions or actions that can be taken from this set of data.

On the other hand, information is the meaning of data. While data is discretely defined, information is derived from data. Information is trend oriented and lets us ask questions and take actions. To continue with the fictitious Acme Corporation example, consider that Acme had 1,500 widgets in its Van Nuys, California warehouse in February 2003. When this data is combined with the earlier data, i.e., the 1,000 widgets in January 2003, we can derive information from this data. One piece of information is that the inventory of widgets in the Van Nuys, California warehouse has increased by 50 percent from January to February. This information lets us ask questions such as why the inventory increased so much in one month, were there problems with the business between January and February, and so forth. The information lets us take actions such as starting an inventory control system.

Data is simply facts, nothing more nothing less. No meaningful questions or action can be taken on data alone. Information is derived from data and information is the catalyst for asking questions and taking action.

Definition of Real-Time Data Warehousing

Real-time data warehousing is a combination of two things: real-time activity and data warehousing. Real-time activity is activity that's happening right now. The activity could be anything, such as the sale of widgets. Once the activity is complete, there is data about it.

Data warehousing captures business activity data. Real-time data warehousing captures business activity data as it occurs. As soon as the business activity is complete and there is data about it, the completed activity data flows into the data warehouse and becomes available instantly. In other words, real-time data warehousing is a framework for deriving information from data as the data becomes available.

Real Time vs. Near Real Time

The difference between real time and near real time can be summed up in one word: latency. Latency is the time lag that's between an activity completion and the completed activity data being available in the data warehouse. In real time, the latency is negligible; whereas, in near real time the latency is a tangible time frame such as two hours.

Real-Time OLTP

The first alarm bell here should be the word OLTP. Date warehousing and OLTP are two different types of systems regardless of whether real-time capability is present in either system. Real time in the OLTP space typically means the enabling of data synchronization across OLTP systems at real-time speeds. Take for example the fictitious Acme Corporation that happens to have order entry and order fulfillment OLTP systems. With real-time OLTP capability enabled, as soon as a sale is recorded in the order entry system, the order fulfillment system has the sale recorded, as well. The message here is that OLTP and data warehousing systems are different systems regardless of the features.

The Case for Real-Time Data Warehousing

Why embark on real time data warehousing? There are several compelling reasons.

First, the world operates on a 24x7 schedule. Consider that at any time there is some business activity going on somewhere around the globe whether it's someone buying an automobile in Sidney, Australia, a cargo ship carrying automobiles across the Sea of Japan or a forklift operator stacking containers chock-full of toys onto a train in Germany.

Second, the Web makes businesses operate in right-now-time. Most electronic media such as software, music and videos can be purchased online and downloaded right away. Although other goods such as books and bicycles are shipped to the doorstep usually by the next business day, their sale is recorded as soon as the submit button on the order Web page is pressed.

Lastly, technology has matured to such an extent that real- time data warehousing can be a reality with the right tools. There are mature extraction, transformation, loading (ETL) software products that enable data acquisitions in real time. Enterprise reporting tools enable reporting of data and information without latency and across many platforms.

Real-Time Data Warehousing Building Blocks

In order to successfully incorporate real-time capability into either an existing or new data warehousing environment, several critical success factors have to be addressed including goals, road map, customers, budget, system architecture, hardware, software and staff. If there are other critical success factors that are specific to the organization or situation, these need to be explicitly identified and included in this process.

The purpose of goals is to bridge the gap between vision and reality by identifying what needs to get done and why. Real- time capability is not trivial to implement, so choosing specific targets of opportunity with tangible paybacks are very important. When choosing the goals, consider the opportunities and the reasons as to why real-time data warehousing is the best solution. The message here is that a good business case is as vital as the goal.

Next, a road map needs to be drawn up. The road map communicates what is going to be done, how it is going to be done, when it is going to be done and in what order it is going to be done. In other words, the road map is the list of goals, their priority, the project(s) behind each goal and delivery time frames for each project and goal. The road map does not have to be perfect initially. As the road map is socialized with customers, their feedback is going to refine the goals, priorities, projects and delivery time frames. Once all impacted parties agree to the road map, it serves as a guiding document throughout the implementation process. The road map is a live document and it should be periodically reviewed with all impacted parties and, if need be, refined.

The costs for enabling real- time data warehousing are going to be split across the customers whether they are external, internal or both. Therefore, the customers need to be actively involved and they should have a stake in as many projects as possible. Further, the customers should expect and get tangible benefits after the implementation process is complete. These benefits tie directly back to the paybacks identified during the goal - creation process.

Estimating how much money is required is a function of knowing the staff and equipment needed for each project on the road map. The funding process is iterative in nature and the costs identified should be revisited after all the critical success factors have been reviewed. One thing to remember here is that costs should not be underestimated under any circumstances. Underestimating costs is the surest and quickest way to lose credibility and, ultimately, funding.

Real-time data warehousing requires appropriate system architecture to support it. A system architecture review needs to take place in order to identify source data systems, staging and target databases, data acquisition and integration strategies, network and server capacity requirements, development/QA/production environments, and error and recovery processes to mention a few factors.

The system architecture review should flush out gaps in hardware and software and, perhaps more importantly, illuminate the type of hardware and software needed. If the appropriate hardware and software tools are not in place, they should be purchased and utilized. It is likely that some of the purchases may require formal tool selection projects and that's okay. Using a tool that does not meet needs is like using a hammer to build a house - it could be done but with great difficulty.

Implementing real-time data warehousing requires skilled and knowledgeable staff. The challenge is identifying the appropriate staff to implement the projects and whether the staff is going to be external, internal or both. The road map and system architecture should be the guiding documents used in matching staff to projects and delivery timeframes.

Putting It All Together

Real-time data warehousing is complex but very doable. Understanding is the first step to success followed by methodical and careful planning using critical success factors. These building blocks may not be glamorous or full of hype, but they go a long way toward a successful implementation of real-time data warehousing.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access