Data warehouse implementations require a strong operational infrastructure to support the daily tasks necessary to keep the system running smoothly. Defining and implementing the components required to build a solid foundation is not a trivial task. Daily operational tasks consist of processes such as source file archiving, data loads, aggregate table builds, dimension table updates, integrity checking and index creation. Managing such a range of tasks and their associated business logic can be difficult. In addition to the difficulty of managing these tasks, business requirements often necessitate executing operational tasks during non-business hours and in a limited time frame. The purpose of this article is to explain how to manage daily operations by combining warehouse task scheduling with operational meta data to build an operational infrastructure that supports continuous improvement and process control.
Building the Foundation
The foundation of a robust operational system is an automated task scheduling utility. Central to the importance of a task scheduling utility is its ability to allow the definition of schedules which can be executed with little or no human intervention, thus automating the warehouse load, update and backup process.
Automating warehouse operations will allow staff resources to concentrate on more important matters such as the functional layers, which we will look at shortly. Most warehousing projects recognize the importance of automating day-to-day operational tasks but fail to capitalize on the value added by building extra functionality into their task scheduling tools. It may be tempting to implement an automated scheduling utility with minimum functionality one that executes a task, checks for a return code and, depending on success or failure, either starts the next task or exits. Actually, there is much more you could be doing for your warehouse operations than using a glorified version of cron to allow unattended data loads.
At this point you may be asking yourself, "What else could I possibly need beyond automating daily processes?" The answer is: the ability to collect data about the operations being automated.
Capturing Operational Meta Data
Used correctly, an automated system for executing warehouse tasks is capable of not only managing the day-to-day requirements of a warehouse but also assisting in the capture of important operational meta data. Information such as total execution time, task output and error codes should be captured by the scheduling utility for each task. It is also useful to keep statistics relating to CPU, memory and I/O usage during each task performed against the warehouse. Meta data should be captured at a significant level of granularity, meaning it is probably not beneficial if statistics are captured on a task that does too many things. As a general rule, limit tasks to executing no more than one or two processes.
This operational meta data is more useful than you may think. Undoubtedly, sometime during the course of operations you will be faced with the need to run a warehouse process during a time when it is not normally run. How long will it take to run that process? Will it fit into the time frame allotted, along with the normally scheduled processes that are supposed to be run? If not, how much spillover into normal business hours will there be? Questions like these are prime examples of the importance of capturing different statistics about each task. If we have captured execution times for each task, these questions are easily answered. Better decisions can be made about special requests and user expectations can be managed based on the knowledge produced from our operational meta data database. Utilizing total execution time statistics to make better decisions is only a small example of the many benefits of capturing operational meta data.
Closing the Loop
Once the functionality is in place to capture operational meta data, it is time to turn your efforts toward analyzing that data and giving feedback to the developers and operators responsible for the system. One of the most obvious analyses to be made is on the tasks that are frequently exiting with error codes. Do these tasks exit with the same error code every time? Is there a correlation between these failed tasks and CPU or memory usage? What other conclusions can be made from the frequency and time of failure? Using the data collected can be an invaluable tool in identifying the underlying causes of many problems that threaten the success of daily operations. Once problems are identified, the feedback process can take place to initiate improvement in the operational system and the underlying warehouse processes.
Analysis can also be made to identify processes that are starting to move out of allotted control limits. Trending process execution time over the life of that process can help identify processes that, if left untouched, could impact your ability to complete operational schedules in their allotted time frame. For example, suppose an aggregate table build is expected to execute in a specific amount of time, but over the past several weeks execution time has trended steadily up. When did the problem start to happen? Has CPU usage or available memory decreased during the execution of the process? Further investigation may show memory available has, in fact, decreased during process execution due to a configuration change you made at the same time the problem began to manifest itself.
Without the analysis of collected operational data, it would have been much more difficult to pinpoint the cause of this problem, and there is a good chance you might not have caught the problem until it turned into something more severe.
I have only scratched the surface of possible analyses that can be performed on the data collected from our warehouse operations. There are many other analyses that can be performed on data, such as CPU utilization to identify tasks that are constrained by available CPU resources, I/O statistics to identify devices that are bottlenecks and the process coinciding with those bottlenecks, and memory statistics to identify processes that are not correctly freeing memory when finished, to name a few.
The Human Side
Up to this point we have focused mostly on capturing information generated by different hardware and software components as part of our operational infrastructure and have ignored capturing the knowledge possessed by people who have daily contact with the warehouse processes. The importance of capturing staff knowledge has been overlooked for a long time, but recently more attention is being focused on valuing and managing human capital. One catalyst for the increased attention is the staffing shortages being experienced in the IT industry. With so many open job positions and a limited number of qualified IT professionals to fill those positions, companies find themselves competing with one another to attract employees by offering higher salaries and better benefits. Combine this with the opportunity of learning new technology at another company or position, and you have an almost absolute guarantee of experiencing some level of turnover on your warehouse project team.
Every time a team member leaves they take with them some things only they may know. Not only is information lost when someone leaves, but information not shared often has to be re-learned by the remaining team members. This means team members find themselves dealing with problems previously solved by others, but the solution was never communicated and the associated knowledge was not captured. The time and money required to re-learn knowledge that should be common to the company since it was information captured for the express purpose of the company's use can be substantial. Integrating a knowledge management component to the operational infrastructure aims at minimizing the loss of knowledge by facilitating communication about operational procedures.
A primary example of knowledge management as it relates to an operational infrastructure is the need to document steps taken to resolve tasks exiting abnormally. If a schedule errors out at a specific task, the person responsible for resolving the problem should be required to outline the steps taken to resolve the issue and add them to the meta data for that task. What was once knowledge held by one person is now knowledge documented for others to use. If the same problem was to occur again on that task or a task similar in nature, the person responsible at that time has a knowledge database to search for a possible solution. Instead of expending much effort and time re-learning the solution to the problem someone else has already resolved, a team member can simply query the database of knowledge integrated into the scheduling system.
Knowledge can be captured not only pertaining to resolving problems, but also about such things as optimal configuration parameters for a specific task or type of task, caveats related to restarting a task after it has failed and contact numbers for vendors or other divisions in the company responsible for supporting specific task functionality.
One Step at a Time
The time required to implement a robust operational support system is substantial. Operational support systems are often implemented as an afterthought and are not given the attention necessary to make them useful in process improvement or operation process decision making. Offering a higher degree of service to the departments our systems support and protecting the investments made in support personnel require giving more attention to operational infrastructures.
Each level of operational system functionality discussed in this article offers significant benefits. Building a system of this nature demands time and commitment, but everything does not have to be implemented simultaneously. Begin to build your operational infrastructure by implementing the initial steps discussed and continue to add levels of functionality one step at a time. Soon you will have a solid operational foundation, which not only simplifies daily operations, but also supports continuous improvement and process control.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access