To ensure that your data warehouse is scalable, you have to view all of the hardware and software components, and all of the processes that are part of data warehousing (such as extracting, cleansing and transforming data), as parts of an overall "performance chain." If any component or process is not scalable, you have a weak link in the chain, and the warehouse as a whole will not scale well. It doesn't really matter where that non-scalable component or process sits in the chain; its mere existence will eventually create a bottleneck. And if you have a bottleneck, you will not have a true "organic data warehouse," meaning your warehouse will not be able to grow and adapt as rapidly as your organization's needs increase and change. So we need to be vigilant about ensuring the scalability of all components and processes.

However, what happens when we have legacy components that we still want to use in our new warehouse, but which were built long ago and weren't originally designed to be scalable? For example, we may have very large and very complex batch-processing programs written in COBOL that contain critical data manipulation routines. In many cases, rewriting these programs to make them more scalable may be a very expensive proposition because of their complexity. In other cases, no individual legacy program may be very complex, but there may be hundreds of such programs developed over time, each of which would have to be rewritten. Again, the time and resources needed to rewrite all these programs might be prohibitive.
A similar issue can arise if we want to use off-the-shelf applications that simply weren't written to take advantage of scalable hardware platforms and, therefore, cannot scale in their generic form. Or your data warehouse may need custom code, forcing you to write parallel applications from scratch, which may in turn require training your developers in how to write scalable, parallel programs.
Until recently, organizations that found themselves in this situation had few alternatives, if any. Fortunately, a clever new class of tools is now available that can help solve the problem. Rather than forcing you to rewrite legacy programs, live with non-scalable off-the-shelf programs or write parallel programs from scratch, these "scalable application frameworks" can take non-scalable programs and, by leveraging the concepts of replication and data parallelism, allow these programs/routines to run in parallel. Two of the main vendors providing such tools are Torrent Systems, Inc. (www.torrent.com) and Ab Initio Software Corporation (www.abinitio.com).
These frameworks create multiple copies (known as "replicated" copies) of the routine and assign them to run on separate processors. Then, the frameworks make heavy use of data parallelism/partitioning techniques and divide the data so that each of the replicated copies only has to work on a portion of the data. In essence, what these frameworks do for you is automatically create the scenario depicted in the lower half of Figure 1. The upper half of the figure shows a non-scalable approach, where the extract, cleanse and transform routines are each written to run as a single process. This will work fine until the amount of data in the source system grows beyond the throughput capacity of any of these single processes.
Scalable application frameworks solve this problem. Rather than having to rewrite the extract, cleanse and transform routines to be able to run in parallel, these frameworks will automatically create multiple copies of each routine (three additional replicated copies are created in this case), run the copies on different processors and then divide the input data into four streams so that each copy only gets one-quarter of the overall data.
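The replicate-and-partition approach described above can be sketched in a few lines. This is only a toy illustration of the concept, not any vendor's actual mechanism: the routine name `transform` and the round-robin partitioning scheme are my own stand-ins.

```python
# A minimal sketch of replication plus data partitioning, assuming a
# hypothetical serial routine `transform`. Real frameworks do all of
# this transparently; here the replication and splitting are explicit.
from multiprocessing import Pool

def transform(record):
    # Stand-in for a serial cleanse/transform routine: uppercase a field.
    return record.strip().upper()

def run_partition(records):
    # Each replicated copy processes only its own partition of the data.
    return [transform(r) for r in records]

def partition(data, n):
    # Round-robin split of the input into n roughly equal streams.
    return [data[i::n] for i in range(n)]

if __name__ == "__main__":
    data = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta"]
    parts = partition(data, 4)         # four input streams
    with Pool(processes=4) as pool:    # four replicated copies, separate processes
        results = pool.map(run_partition, parts)
    # Re-merge the partial outputs into one result set.
    merged = [r for part in results for r in part]
    print(sorted(merged))
```

The essential point is that `transform` itself is untouched serial code; all of the scalability comes from running copies of it against disjoint portions of the data.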
Though the general functioning of these frameworks is straightforward to describe, what actually happens behind the scenes is far from simple. In fact, quite a bit of magic needs to occur to ensure that everything works as intended. In particular, automatically dividing the data is not trivial. If your non-scalable routine expects to open a file, read the data from beginning to end (processing it as it goes) and then write out the results into a single file, what happens when you have four copies doing this? These frameworks work their magic by "fooling" the routines, often by incorporating what could be considered operating-system extensions that transparently implement portions of a parallel file system on your platform. This allows the frameworks to intercept the I/O requests coming from the routines and manipulate the I/O as necessary.
For example, when the routine issues an I/O request to open a file and start reading from the beginning of the file, the framework catches that request and determines what to do with it. In the example in Figure 1, if the request is coming from the third replicated copy of the extract routine, then the runtime components of these frameworks will automatically and transparently start returning data from the third partition of the original data file, rather than from the beginning of the file. The original routine simply thinks it is reading from the beginning of the file. And, when it reads the last piece of data from the third partition, the framework will automatically return an "end-of-file" message to the routine, making the routine think it has finished reading the entire file. Even more complexities exist with the creation of output files (that is, we might be expecting one sorted output file, not four smaller files), but these scalable application frameworks are sophisticated enough to ensure that the right things occur.
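The partition-as-EOF trick can be illustrated with a small wrapper. This is a deliberately simplified sketch, with hypothetical byte offsets standing in for the framework's partition metadata; a real framework performs this interception at the file-system level, invisibly to the routine.

```python
# A toy illustration of I/O interception: one partition of a file is
# presented to an unmodified routine as if it were the whole file.
import io

class PartitionReader:
    """Reads start at the partition's first byte, and end-of-file is
    reported as soon as the partition is exhausted."""

    def __init__(self, f, start, length):
        self._f = f
        self._f.seek(start)       # "beginning of file" = start of partition
        self._remaining = length

    def read(self, size=-1):
        if self._remaining == 0:
            return b""            # EOF, as far as the routine can tell
        if size < 0 or size > self._remaining:
            size = self._remaining
        data = self._f.read(size)
        self._remaining -= len(data)
        return data

# An unmodified "legacy" routine that believes it reads a whole file:
def count_bytes(f):
    total = 0
    while chunk := f.read(4096):
        total += len(chunk)
    return total

whole = io.BytesIO(b"AAAABBBBCCCCDDDD")   # 16 bytes, four partitions of 4
third = PartitionReader(whole, start=8, length=4)
print(count_bytes(third))                 # the routine sees only b"CCCC"
```

`count_bytes` never learns it was handed a slice: it opens, reads "from the beginning" and hits "end-of-file" exactly as it would on a small standalone file, which is precisely the illusion these frameworks maintain.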
My colleagues and I have used these tools a number of times, and each time they have been very helpful. In essence, these frameworks make it far easier to ensure that all of your software components will be scalable. They can make legacy or off-the-shelf applications highly scalable with little or no modification, and they relieve application developers of the burden of writing parallel code. They can help save time and money when building a high-performance, scalable data warehouse.