Data warehouses grow because they contain detailed, historical data, summary data and free-form extraneous fields of data that might or might not be used in future DSS analysis.
The terminology used to describe the volumes of data is changing. Not so long ago, programmers were worried about programs that exceeded K-bytes of data. Then on-line designers fretted over megabytes of data in their OLTP databases. The term "gigabyte" was first heard in the early days of the data warehouse. Many corporations are just now viewing their first terabytes of data, which will soon be followed by petabytes of data.
Not only is the volume of data that accompanies data warehousing breathtaking, but the speed with which that volume is achieved is unprecedented and often catches the data management organization unaware.
The volumes of data that accompany data warehousing carry with them design and operational considerations that have never before been encountered. Nearly all of the data management practices that grew up with smaller databases have had to be revised. In the face of very large volumes of data, the following practices and techniques have to be recast: indexing data, defining and maintaining database integrity (referential integrity), tuning a database, loading data into a database, doing full table scans, monitoring the database, and so forth.
The world of data management in the megabyte range is very different from the world of data management in the terabyte range.
The way that most corporations manage growth is by choosing the proper hardware environment. Many corporations start out with an SMP environment while their warehouse is small or of intermediate size. It is only in the really large warehouses that the architectural differences between SMP and MPP become manifest. At that point, choices like NCR with Teradata become very attractive.
Of course, the software that resides on top of the hardware platform cannot become a constraint to performance. The basic DBMS software needs to take full advantage of the sophistication of the hardware foundation.
The first step up the ladder in addressing the issue of volumes of data is choosing your hardware platforms carefully. But what if your volumes of data continue to grow and push the limits (technological or economic) of the MPP environment? What's the next step?
The next step is to create multiple enterprise data warehouses. With multiple enterprise data warehouses, large volumes of data can be divided into smaller physical sets of data. The strategy of splitting data into multiple databases works only when the data put into those separate physical databases follows a natural division of the business where there is no need for integration.
What happens when a corporation must create a large central store of data that is tightly integrated? This circumstance turns out to be the case more often than not. Typical examples are customer files, sales files, marketing files and the like. When a corporation is faced with an amount of data that overwhelms even the capacities of an MPP, the answer is to look at the architecture of the data warehouse a little differently from the manner suggested by the classical data warehouse architecture.
Figure 1 shows an architecture that can handle very large (virtually unlimited) volumes of data that are tightly integrated and shows that the very large data warehouse is centralized. In Figure 1 the data warehouse has grown beyond the size that is economically held on an MPP platform. There are three architectural components to the very large data warehouse shown in Figure 1: near-line storage, an activity monitor and a cross-media data manager.
Near-line storage is storage that resides on media other than disk. Typically near-line storage includes photo optical storage and siloed tape storage. Near-line storage is sequential and has other physical constraints, such as media that cannot be written to more than once.
But near-line storage is much less expensive than disk storage. Photo optical storage costs about $0.42 for the same amount of storage that would cost $1.00 on disk. Siloed tape storage costs about $0.07 for the same $1.00 worth of storage on disk. There are then significant savings to be made by placing the bulk of the data in the data warehouse on some media other than disk storage.
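To see what those ratios can mean for a whole warehouse, here is a minimal sketch that computes the cost of a mixed-media warehouse relative to an all-disk one. The cost ratios are the ones quoted above; the function name and the 10/90 split between disk and tape are assumptions chosen purely for illustration.

```python
# Relative cost of storing one unit of data, with disk normalized to 1.00.
# The three ratios are the figures quoted in the article.
RELATIVE_COST = {"disk": 1.00, "optical": 0.42, "tape": 0.07}


def blended_cost(fractions):
    """Return warehouse cost relative to keeping everything on disk.

    `fractions` maps a medium name to the share of data stored on it.
    The shares must sum to 1.
    """
    total = sum(fractions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(RELATIVE_COST[medium] * share
               for medium, share in fractions.items())


# Hypothetical split: 10% of the data stays on disk, 90% moves to siloed tape.
example = blended_cost({"disk": 0.10, "tape": 0.90})
print(f"cost relative to all-disk: {example:.3f}")
```

Under that assumed split the result is 0.163, so the storage bill would be roughly one-sixth of the all-disk figure. The real split depends on how much of the warehouse is actually in active use.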
On the retrieval side, accessing the first record of data on near-line storage takes decidedly more than a few milliseconds. However, once the first record is found, whole blocks of data are moved in bulk to the disk storage device. Depending on the number of records brought over in a block, the average seek time per record residing on near-line storage can be quite low. Given the nature of the data put on near-line storage and the low probability that it will be used, placing the bulk of the data found in the warehouse on near-line storage makes sense.
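The arithmetic behind that claim is simple amortization: the one-time cost of reaching the first record is spread over every record in the block. The sketch below illustrates it. The function name and all of the timing figures are hypothetical and are not measurements of any real device.

```python
def average_access_ms(first_record_ms, block_transfer_ms, records_per_block):
    """Average time per record when a whole block is staged to disk at once.

    The cost of locating the first record and transferring the block is
    paid once, then shared by every record in the block.
    """
    if records_per_block < 1:
        raise ValueError("a block must hold at least one record")
    return (first_record_ms + block_transfer_ms) / records_per_block


# Hypothetical figures: 10 seconds to reach the first record,
# 2 seconds to move the block, 50,000 records in the block.
per_record = average_access_ms(10_000, 2_000, 50_000)
print(f"average per record: {per_record:.2f} ms")
```

With these assumed numbers the average works out to about 0.24 ms per record, even though the first record took ten seconds to reach. The benefit holds only when the query needs most of the records in the block; a query that wants a single record still pays the full initial delay.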
Because of the low cost of near-line storage, when a warehouse can be extended from disk to photo optical or siloed tape storage, the warehouse can grow MUCH, MUCH larger than it ever could if it were housed solely on disk storage. Furthermore, the growth is not expensive and performance is only marginally impaired.
The second component of the architecture that accommodates very large amounts of data in a warehouse environment is that of an activity monitor. The activity monitor sits between the end user and the server on which the data is placed. The activity monitor looks at and gathers all kinds of information about the running of the DSS environment. The activity monitor concerns itself almost exclusively with end-user query activity.
While the activity monitor looks at a wide range of data, the most important piece of information accumulated by the activity monitor is what data is being accessed and what data is not being accessed. Once an organization knows what data is not being accessed, then the organization knows what data is safe and prudent to move from disk storage to near-line storage.
In order to make the optimal placement of data on disk and on near-line storage, you need to have the intelligence that is gathered in an activity monitor.
One of the fine distinctions that can be made by an activity monitor is between the different types of accesses made against a table. An activity monitor can determine what rows are/are not being used, which type of row within a table is/is not being used, what predicates are being used, and so forth. In order to optimize the placement of data across the different types of storage, it is necessary to have a wide variety of information that can only be gathered by an activity monitor.
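As a rough illustration of the core idea, the sketch below records which units of data each query touches and reports the units that have gone unused long enough to be candidates for near-line storage. It is a toy model under stated assumptions, not a description of any commercial activity monitor: the class name, the (table, partition) granularity and the 90-day threshold are all hypothetical.

```python
from datetime import datetime, timedelta


class ActivityMonitor:
    """Toy activity monitor: remembers when each unit of data was last queried."""

    def __init__(self, known_units):
        # None means "never accessed since monitoring began".
        self.last_access = {unit: None for unit in known_units}

    def record_query(self, units_touched, when):
        """Note that a query at time `when` read the given units."""
        for unit in units_touched:
            previous = self.last_access.get(unit)
            if previous is None or when > previous:
                self.last_access[unit] = when

    def cold_units(self, now, idle_days):
        """Return units not accessed in the last `idle_days` days."""
        cutoff = now - timedelta(days=idle_days)
        return sorted(unit for unit, seen in self.last_access.items()
                      if seen is None or seen < cutoff)


# Hypothetical units, identified as (table, partition) pairs.
monitor = ActivityMonitor([("sales", "1996"), ("sales", "1998"),
                           ("customer", "all")])
monitor.record_query([("sales", "1998")], datetime(1998, 5, 20))
monitor.record_query([("customer", "all")], datetime(1997, 1, 1))

candidates = monitor.cold_units(now=datetime(1998, 6, 1), idle_days=90)
print(candidates)
```

Here only the recently queried 1998 sales partition stays off the candidate list. A production monitor would track far finer detail, such as row types and predicates, as described above.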
The third component of the architecture for the management of very large amounts of data across disk and near-line storage is a cross-media data manager. A cross-media data manager is software that can manage data on disk storage and can gracefully move data between disk storage and near-line storage. For example, Oracle can manage the data on disk and the cross-media data manager can manage data on near-line storage. In another scenario, the cross-media data manager can manage the data on both disk and near-line storage.
An example of a cross-media data manager is FileTek's StorHouse. When a query is submitted, StorHouse determines whether the data requested in the query is on disk or near-line storage. If the data happens to reside on near-line storage, StorHouse retrieves the data and places it on disk directly from near-line storage. Of course, the query runs a bit slower than if all the data required had been placed on disk in the first place; but the amount of data that is actually required to remain on-line is significantly less than if all data were placed on-line. The result is that the costs of storage and CPU are significantly reduced by using FileTek's StorHouse.
One of the interesting features of StorHouse is that indexes can be stored on disk regardless of the physical housing of the data. In other words, the index to data stored on tape can be stored on disk. This gives StorHouse the ability to intelligently select and qualify data even when the data does not reside on disk storage.
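The value of keeping the index on disk can be shown with a small model. The sketch below is emphatically not StorHouse's API or implementation; every class, method and field name is invented for illustration. It assumes an index held on disk that maps each key to the medium and block holding its row, so a request can be qualified without touching near-line media, and a near-line block is staged to disk only when a row in it is actually needed.

```python
class CrossMediaStore:
    """Toy cross-media manager with a disk-resident index (hypothetical)."""

    def __init__(self):
        self.index = {}         # key -> (medium, block_id); always on disk
        self.disk = {}          # block_id -> {key: row}
        self.near_line = {}     # block_id -> {key: row}
        self.staged_blocks = 0  # how many near-line blocks were copied to disk

    def load(self, medium, block_id, rows):
        """Store a block on 'disk' or 'near_line'; block ids must be unique."""
        target = self.disk if medium == "disk" else self.near_line
        target[block_id] = dict(rows)
        for key in rows:
            self.index[key] = (medium, block_id)

    def fetch(self, key):
        """Return the row for `key`, staging its block to disk if necessary."""
        location = self.index.get(key)
        if location is None:
            return None  # ruled out by the index alone; no media access
        medium, block_id = location
        if medium == "near_line":
            # Move the whole block in bulk, then repoint its index entries.
            self.disk[block_id] = self.near_line[block_id]
            self.staged_blocks += 1
            for k in self.disk[block_id]:
                self.index[k] = ("disk", block_id)
        return self.disk[block_id][key]


store = CrossMediaStore()
store.load("disk", "b1", {1: "current row"})
store.load("near_line", "b2", {2: "old row A", 3: "old row B"})

print(store.fetch(99), store.staged_blocks)  # unknown key: nothing is staged
print(store.fetch(2), store.staged_blocks)   # first hit stages block b2
print(store.fetch(3), store.staged_blocks)   # second hit is served from disk
```

In this model a lookup for a key that does not exist never touches near-line storage, and the second request against the staged block pays no further staging cost.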
The architecture shown in Figure 1, then, can be extended to manage a VERY, VERY large amount of data. Response time is not sacrificed to any great extent, and the cost of the environment remains reasonable. It is noteworthy that all three components are required: near-line storage, an activity monitor and a cross-media data manager. Without all three components in place, the environment that is created is less than optimal.