Continuing with the series on real-time data warehousing building blocks, in this installment I am going to discuss three building blocks: customers, budget and system architecture.
Because customers and budget are well-published topics, I’m going to briefly touch upon them and dedicate the rest of this article to system architecture. If you feel that you would like me to elaborate in detail upon customers and budget, send me an e-mail and I will revisit these topics in a future article.
I am a firm believer in using methodologies when interacting with customers. There are a variety of proven customer and project management methodologies, and each provides a framework that ensures customers are involved in all key phases and milestones of any significant undertaking. Select one that's appropriate for you and use it.
A budget should realistically reflect the costs associated with the successful delivery of goals and projects identified in the road map. Taking into account the anticipated system architecture, hardware, software and staff prior to finalizing any budget will uncover any additional hidden expenditures, thereby making the budget that much more accurate.
A system architecture for real-time data warehousing comprises four major components: volumetrics, data acquisition method, data recovery architecture and storage.
- Volumetrics. Volumetrics provides enough information to determine whether the current systems can handle the demands of real-time data warehousing, today and tomorrow, and to identify the systems that cannot. Key metrics include the size of source tables, available network bandwidth, storage availability, server capacity, current and anticipated resource consumption rates and data growth rates.
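A simple volumetrics question is whether today's storage will survive the planning horizon. The sketch below shows the arithmetic; all figures are hypothetical examples, not measurements from any real system.

```python
# Illustrative volumetrics check: how long will the remaining capacity last
# given the current table size and its monthly growth rate?
def months_until_full(current_gb: float, capacity_gb: float,
                      monthly_growth_gb: float) -> float:
    """Return how many months of growth the remaining capacity covers."""
    if monthly_growth_gb <= 0:
        return float("inf")  # no growth means no exhaustion
    return (capacity_gb - current_gb) / monthly_growth_gb

# A hypothetical source table at 800 GB, growing 50 GB/month, on a 1.2 TB volume:
remaining = months_until_full(current_gb=800, capacity_gb=1200,
                              monthly_growth_gb=50)
print(f"Storage exhausted in {remaining:.1f} months")  # 8.0 months
```

The same shape of calculation applies to network bandwidth and server capacity: compare current consumption plus anticipated growth against what is available.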
- Data Acquisition Methods. In real-time data warehousing, change data capture is the modus operandi. Attempting to do real-time data warehousing by replicating whole tables, database partitions or the like is simply not practical except in the rare case where a requirement demands it. With this in mind, here are some of the more popular change data capture methods and how they work: application, log, replication, trigger and continuous.
The application method works by incorporating acquisition code directly into the source system that generates the source data. The premise here is that once a transaction is complete, the source application commits the transaction in its own database and in the data warehouse.
The adage "straight from the horse’s mouth" is pretty accurate here. Since there are typically no intermediary systems between the source system and the data warehouse, this is a straightforward and pretty enticing data acquisition method.
Unfortunately, this method has a couple of significant limitations. For custom and in-house built applications, the first limitation has to do with the acquisition code becoming part of the source system. This integration ties the acquisition code directly to the code base and life cycle of the source system. Hence, any changes to the acquisition code will have to be made by the source system developers unless the data warehouse development team can arrange to have a development role in the source system. In the case of packaged applications, this method may not even be feasible.
The second limitation is double work. Every time the source system writes a transaction to its own database, it also writes that transaction to the data warehouse, incurring a significant performance burden.
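The double-write pattern can be sketched in a few lines. This is a minimal illustration, not a production design: the two in-memory SQLite databases stand in for the source system's database and the warehouse, and the table and function names are invented for the example.

```python
# Sketch of the application acquisition method: the source application
# commits every transaction twice, once locally and once to the warehouse.
import sqlite3

source_db = sqlite3.connect(":memory:")     # stand-in for the source database
warehouse_db = sqlite3.connect(":memory:")  # stand-in for the data warehouse
for db in (source_db, warehouse_db):
    db.execute("CREATE TABLE orders (id INTEGER, amount REAL)")

def record_order(order_id: int, amount: float) -> None:
    # Double work: one business transaction costs two database writes.
    source_db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount))
    warehouse_db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount))
    source_db.commit()
    warehouse_db.commit()

record_order(1, 99.50)
```

Note that a real implementation would also have to decide what happens when the warehouse write fails after the local write succeeds, which is exactly the kind of complexity this method pushes into the source application.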
The log method works by acquiring change data from the database transaction logs. Because this method reads the database's own logs, acquisition code does not have to be incorporated into either the source application or its database. Database logs are well documented, and reading them does not typically require unusual procedures or code. Furthermore, this method is utilized by several ETL tools and data replication engines, making it relatively popular, at least among ETL software vendors.
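Conceptually, log-based capture means replaying committed changes past a saved log position. Real transaction logs are binary and vendor-specific, so the records below are purely illustrative stand-ins that show the shape of the processing, not any actual log format.

```python
# Toy illustration of log-based capture: scan (mock) transaction-log records
# and keep only the changes for one table past the last processed position.
log_records = [
    {"lsn": 101, "op": "INSERT", "table": "orders",    "row": {"id": 1}},
    {"lsn": 102, "op": "UPDATE", "table": "orders",    "row": {"id": 1}},
    {"lsn": 103, "op": "INSERT", "table": "customers", "row": {"id": 7}},
]

def capture_changes(records, last_lsn, table):
    """Return changes for one table after the last processed log position."""
    return [r for r in records if r["lsn"] > last_lsn and r["table"] == table]

changes = capture_changes(log_records, last_lsn=101, table="orders")
print(changes)  # the single UPDATE at lsn 102
```

The saved position ("lsn" here, for log sequence number) is what lets the acquisition resume after a failure without reprocessing everything.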
The replication method works by acquiring change data using a data replication engine. Since data replication engines typically utilize database logs, this method shares many similarities with the log acquisition method. Replication engines typically provide queues and robust error handling, features that make this method attractive enough that several ETL vendors use replication engines to deliver change data capture functionality. Because replication engines typically do not include data transformation functionality, an ETL tool is frequently a necessary component of this data acquisition method.
The trigger method works by adding triggers to the source system to capture changed data. Triggers are robust and can be tailored to apply to specific database tables, fields and events, thereby capturing only the data that's desired. Furthermore, triggers are traditionally compiled and the compiled code is stored directly in the database server, making triggers fast. If not used in conjunction with an ETL tool, triggers can become complex and unwieldy over time.
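The trigger method can be demonstrated end to end with SQLite, which supports triggers: an AFTER UPDATE trigger copies each changed row into a change table that an ETL process can later drain. The table and column names here are illustrative, not from any particular source system.

```python
# Trigger-based change data capture, demonstrated with SQLite.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE orders_changes (id INTEGER, amount REAL, op TEXT);
    -- The trigger fires inside the database server on every UPDATE,
    -- recording the new row image in the change table.
    CREATE TRIGGER orders_cdc AFTER UPDATE ON orders
    BEGIN
        INSERT INTO orders_changes VALUES (NEW.id, NEW.amount, 'U');
    END;
""")
db.execute("INSERT INTO orders VALUES (1, 10.0)")
db.execute("UPDATE orders SET amount = 12.5 WHERE id = 1")
changed = db.execute("SELECT id, amount, op FROM orders_changes").fetchall()
print(changed)  # [(1, 12.5, 'U')]
```

A full implementation would add INSERT and DELETE triggers as well, which is where the complexity starts to accumulate once dozens of tables are involved.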
The continuous method works by having an ETL process continuously monitor the source system for new and changed data. Some ETL vendors utilize data replication engines to implement continuity while others offer special versions of their software, typically dubbed "real time," to provide this functionality. The continuous method is probably one of the best ways to achieve real-time performance, simply by virtue of all systems being integrated and working together. Unfortunately, this is also one of the most complex architectures to recover from failure. Unless there is a robust data recovery architecture in place, the failure of one component will have an adverse cascading effect on all downstream components.
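One common way to implement continuous monitoring is a polling loop over a watermark column. The function below is one cycle of such a loop; the table, the `modified_at` column and the in-memory database are assumptions for the sketch, and a real process would persist the watermark so it survives restarts.

```python
# Sketch of the continuous method: repeatedly fetch rows changed since the
# last saved watermark. One iteration of the long-running loop is shown.
import sqlite3

source = sqlite3.connect(":memory:")  # stand-in for the source system
source.execute("CREATE TABLE orders (id INTEGER, modified_at INTEGER)")
source.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(1, 100), (2, 150), (3, 200)])

def poll_once(conn, watermark: int):
    """Fetch rows modified after the watermark; return (rows, new watermark)."""
    rows = conn.execute(
        "SELECT id, modified_at FROM orders WHERE modified_at > ? "
        "ORDER BY modified_at", (watermark,)).fetchall()
    new_watermark = rows[-1][1] if rows else watermark
    return rows, new_watermark

rows, wm = poll_once(source, watermark=100)
print(rows, wm)  # [(2, 150), (3, 200)] 200
```

The watermark is also the recovery point: after a failure, the process resumes from the last value it recorded, which is why losing it cascades into downstream reprocessing.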
- Data Recovery Architecture. Although the specifics vary on a case-by-case basis, there are three common data recovery architecture guidelines to consider.

The first guideline is to schedule processes around maintenance periods. Every system typically has regularly scheduled maintenance windows during which system updates, backups and other administrative tasks get done. Processes should be scheduled not to run during maintenance windows to avoid dropped connections and other system dependency-related issues.

Decoupling systems is the second guideline. Decoupling simplifies data acquisitions by breaking them into smaller, logical units while minimizing the impact of failure by any one system. To illustrate, consider an ETL process that moves data from a source system to the data warehouse. If the source system becomes inaccessible, the ETL process typically experiences a dropped connection followed by failure. To reduce the impact, the source system and data warehouse can be decoupled by setting up two data acquisitions: one moves source data into a staging area, and the second moves data from the staging area into the data warehouse. The first acquisition can be a simple data movement affair while the second contains the core ETL logic. If the source system becomes inaccessible, only the first acquisition requires a restart; the second retains its integrity without any significant impact.
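The decoupling guideline can be sketched as two independent acquisitions. The in-memory databases and the trivial filter standing in for "core ETL logic" are assumptions made for the illustration.

```python
# Decoupling sketch: acquisition 1 is a simple copy from source to staging;
# acquisition 2 applies the ETL logic from staging into the warehouse.
import sqlite3

source = sqlite3.connect(":memory:")     # stand-in for the source system
staging = sqlite3.connect(":memory:")    # staging area between the two
warehouse = sqlite3.connect(":memory:")  # stand-in for the data warehouse
source.execute("CREATE TABLE tx (id INTEGER, amount REAL)")
source.executemany("INSERT INTO tx VALUES (?, ?)", [(1, 5.0), (2, -3.0)])
staging.execute("CREATE TABLE tx (id INTEGER, amount REAL)")
warehouse.execute("CREATE TABLE fact_tx (id INTEGER, amount REAL)")

# Acquisition 1: simple data movement, no transformation. This is the only
# step that touches the source, so it is the only step a source outage breaks.
staging.executemany("INSERT INTO tx VALUES (?, ?)",
                    source.execute("SELECT id, amount FROM tx"))

# Acquisition 2: the core ETL logic (here, a trivial filter) runs entirely
# from staging onward and keeps working even if the source goes down.
warehouse.executemany(
    "INSERT INTO fact_tx VALUES (?, ?)",
    staging.execute("SELECT id, amount FROM tx WHERE amount > 0"))
```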
The third guideline is to plan for downtime. To illustrate, consider a system with two data acquisitions: acquisition A moves source transactions into a staging database and acquisition B moves transactions from the staging database into the data warehouse. Let's say acquisition B fails while acquisition A is still up and running. In this case, planning for downtime implies two things. First, the staging database has to be big enough to hold the accumulating transactions while acquisition B is down. Second, acquisition B has to be fast enough to process the accumulated transactions and catch up to the newly accumulating ones.
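The plan-for-downtime guideline reduces to simple arithmetic on rates. All figures below are hypothetical examples chosen to make the numbers easy to check.

```python
# Back-of-envelope downtime planning: how big does the staging backlog get
# during an outage, and how long does acquisition B take to catch up?
def downtime_plan(arrival_per_min: float, process_per_min: float,
                  outage_min: float):
    backlog = arrival_per_min * outage_min       # rows queued in staging
    surplus = process_per_min - arrival_per_min  # must be > 0 to ever catch up
    catch_up_min = backlog / surplus if surplus > 0 else float("inf")
    return backlog, catch_up_min

# 1,000 rows/min arriving, B processes 1,500 rows/min, a one-hour outage:
backlog, catch_up = downtime_plan(arrival_per_min=1000,
                                  process_per_min=1500, outage_min=60)
print(backlog, catch_up)  # 60000.0 rows of backlog, 120.0 minutes to catch up
```

Note the second condition in the text falls out directly: if acquisition B is no faster than the arrival rate, the surplus is zero and the catch-up time is infinite.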
- Storage. Real-time data warehousing has several impacts on storage, such as increased storage requirements and the need for timely data retrieval. Although there are a variety of storage options, each one has its own impacts on the system architecture; for instance, an IP-based storage solution and a Fibre Channel SAN affect the network and the system architecture as a whole very differently. Atomic and event fact tables tend to grow to millions and sometimes billions of rows, and retrieving data from a large table can be a challenge. Traditional performance-enhancing techniques such as indexing are sometimes insufficient to achieve acceptable performance. Consequently, alternative methods have to be employed, such as breaking up the storage of a large table across different media such as hard disk and tape. The introduction of mixed-media storage solutions to increase retrieval performance, reduce storage cost or both impacts the system architecture as well.
All in all, utilizing a customer and project management methodology ensures that there is customer involvement in key phases and milestone events. Utilizing traditional budgeting techniques while taking into account the system architecture, hardware, software and staff requirements increases the accuracy of the budget. Finally, volumetrics, data acquisition method, data recovery architecture and storage are the key components of a sound system architecture that can support real-time data warehousing today and tomorrow.