Continue in 2 seconds

Data Warehousing

Published
  • November 01 1998, 1:00am EST

The world of data warehousing is characterized by explosive growth and transformation. Just six short years ago, the concepts outlined in Bill Inmon's book, Building the Data Warehouse, were perceived by many in the IT establishment to be just another "throwaway" idea--IT grasping at yet another silver bullet, technology's cure for what ails the legacy systems environment. So, six years later, how does one explain the phenomenal success enjoyed by data warehouse practitioners, advocates, industry analysts, trainers, consultants, lecturers and tools vendors? Quite simply, data warehousing makes business sense. It makes sense to business analysts, their managers and the decision-making executives in major companies throughout the world. It doesn't matter what the industry is, be it manufacturing, retail, banking, telecommunications or health care, there is a common thread linking people who make decisions--they want to make effective decisions and that means those decisions must be based on facts and data. These underlying facts and data, the heart and soul of the information-based enterprise, reside in the data warehouse.

The widespread acceptance of the data warehousing concept by the marketplace meant quite simply that vendors would follow with products and services. Technology vendors who had been dismissing data warehousing concepts as fluff and fly-by-night "techno-evangelism" changed direction so quickly that whiplash became an epidemic in every corner of IT vendor-dom. Hardware providers suddenly had the perfect data warehouse platforms. DBMS vendors discovered their solutions happened to be optimal for your data warehouse performance needs. The pent-up demand by businesses to build data warehouses created unprecedented opportunity for entrepreneurial pursuits in extracting, transforming and loading data, integrated data modeling and CASE tools, data repositories and data access tools. Most recently, data warehousing demand has been manifested in the data mart and products for OLAP and data mining tools.

One of the most important contributing factors to the tremendous growth of the data warehouse market is the natural tendency of the successful data warehouse to do just that--GROW! How does the data warehouse grow? First, the data warehouse development team deploys the initial version or iteration of the warehouse. Next, the business finds that the availability and accessibility of information never before available provides opportunities to discover important patterns of activity and business relationships. And, as word of the initial data warehousing success spreads throughout the user community, new uses for the information are discovered, more data is deployed and the warehouse grows. As the warehouse grows, more users want access, more hardware and more software are deployed, more data is loaded and historical data accumulates faster. And the cycle repeats itself over and over again. See Figure 1.

As a result of growth, data warehouse performance can quickly become an issue. Users are competing with many other users for access. Queries have to sort through much more data than they did when the warehouse was originally created. Costs begin to creep upward. Performance continues in a not- so-slow death spiral. Users begin to become dissatisfied with the DSS (decision support system) environment. Data warehouse developers begin to show their stress. See Figure 2.

What can a company do? To date, most businesses have responded in one of three ways: 1) they declare failure and shut down the data warehousing effort because they aren't able to meet cost or performance expectations, 2) they respond with conventional systems wisdom and throw more money at the problem to increase the capacity of the existing environment, or 3) they throw more and newer technology at the problem. In the case of alternative number 1, the company loses its competitive information advantage and still has to solve the problem of how to provide information to its stakeholders and decision-makers. The data warehouse merely goes "underground" and resurfaces at a later point in time, usually with a different name. Alternative 2 is a stop-gap measure which results in more people worrying about the problem or more hardware capacity to process the problem, but does not improve the situation in the long run. Alternative 3 is often the first step in the disaster of tomorrow; that is, the abandonment of the architected DSS environment for one where new structures or models or databases promise short- term improvement, but add to the complexity and costs of the DSS challenge in the long run. See Figure 3.

Thankfully there is another alternative. This is the alternative of practicing safe and sane data warehouse administration and management. Historically, IT has long accepted the need for systems administration, network administration and database administration functions. The value of managing, tuning, planning, procuring, installing and monitoring systems, networks and data resources has been well understood and practiced. Not so the notion of data warehouse management. If a company were experiencing rapid escalation in network users or traffic, you can be sure the network administrators would be worrying about understanding the reasons for such growth. Likewise, if available systems capacity diminished at highly accelerated rates over an extended period of time, systems administrators would be attempting to find out why. In these scenarios, the first response would not be to shut down the network or system, nor would it be likely that additional funds and/or technology would be allocated to extend capacity until the root cause of the growth or increased demand was well understood. Why should the data warehouse be managed any differently? Just because the data warehouse is a newer architectural construct, or because there is a plethora of technology choices being marketed to address this construct, does not mean we should approach it differently than more conventional systems administration problems with which we have significant experience.

To date, data warehousing development has focused on getting data in and getting information out. Managing the data warehouse is a new discipline which will become important to any enterprise that is concerned with understanding how effective and efficient the data warehouse is in meeting the demands of its business users. Data warehouse management is necessary for any enterprise that is concerned with optimizing their return on investment for the data warehouse environment. In order to engage in data warehouse management, we must be able to examine both the content of the data warehouse and the usage, or activity, which occurs in the warehouse.

First, we must understand data warehouse management is an ongoing concern. Just like systems management, network management and database management, it is not a single, point-in-time measurement. Forewarned of the natural predilection of the data warehouse to grow, we should approach this growth with traditional asset management values; that is, before we throw more money or technology at the problem, let's not forget to ask "why?" It is essential to understand what part(s) of the data warehouse are growing. Ask whether this means other parts should be shrinking. Is this growth an anomaly or is it an indicator of a long-term trend? Are changing business information requirements reflecting new data demands for the warehouse? If so, should demands for other data be diminishing? Do users need as much history as is currently maintained? How much of the data warehouse is actually being used? Who is using it? What's it worth to those who are using it? What does it cost to provide this information? Just as the business will continue to change, and new decision support applications are implemented to support this change, the data warehouse management function must measure the effects of this change over time.

Second, we must allocate resources to the ongoing data warehouse management activity. The goal of the data warehouse, after all, is to provide information access to the decision-makers in the business. If data access performance is decreasing, data warehouse administrators should have processes in place to know what's impacting performance and what this means to the end- user community. Is the information structured in an optimal fashion to support the types of queries being submitted? What are the queries being submitted? Are they recurring or one-time usage? Does system performance have peaks and valleys associated with days of the week or time of day? Can users be provided with incentives to utilize available resources during slow periods and ease the burden of peak periods?

The process of data warehouse management is greatly dependent on the ability to monitor activity in this environment. In traditional systems monitoring, metrics such as CPU utilization, disk I/Os and DASD utilization are important for application tuning. The primary difference and complexity in managing the data warehouse is that the decision support workload is highly variable. Instead of tuning an application to process as many identical update transactions as it can, the data warehouse is tuned to process as many variable queries as it can. In order to improve cost and performance, we must understand what patterns exist in the usage or activity of the data warehouse. Therefore, we must measure, over time, who is using the data warehouse and what data is being used. Perhaps just as importantly, we must also identify what data is NOT being used. Data that is not being used, or "dormant" data, can be archived or stored off-line to recover and reuse storage resources. Removal of dormant data can also result in more efficient use of processing resources for data that is being used. Data that is being used can be analyzed to determine if it is structured in the most efficient manner for its use. For example, data in one table that is consistently joined with data in another table might be a likely candidate for denormalization or for deployment of a star schema. Identification of who is using the data can provide help in effectively deploying data marts. For example, if all queries generated by a geographically remote department access two or three specific tables, the information in these tables might logically provide the foundation for a data mart in that remote location. Thus, continuous improvement of data warehouse performance can be achieved based on facts and data ascertained through continuous monitoring of the data warehouse environment.

Third, and perhaps easiest of all, reap the benefits provided by data warehouse management. These benefits include improvements in the areas of service-level agreements, capacity planning, performance and design optimization, and resource management. Each of these areas will be examined in more detail.

Service-Level Agreements

Service-level agreements, typically executed between IT and business users, identify what services will be provided, when the services will be available and standards of performance for these services. Service-level agreements for data warehouses usually identify hours of availability and query response times. The difficulty in measuring performance of query response is that the normal application standard is a measure of elapsed time between the user pressing the "enter" key and when the application screen is written or refreshed with data. Measurement of data warehouse queries requires several timestamps--time the query is submitted, time the first row of data is returned and time the last row is returned. The additional timestamps account for the amount of data being returned. It would be next to impossible to live up to a data warehouse service-level agreement that did not take into account how much data is being accessed. Managing all query activity in the data warehouse--measuring what queries are executed, how long they take and when peak loads occur--enables the organization to effectively commit to and monitor performance against a data warehouse service- level agreement.

Capacity Planning

Data warehouse capacity planning is greatly enhanced when content of the warehouse is being monitored. In the absence of data warehouse content management, capacity planning is basically an extrapolation of previous growth trends (i.e., if the warehouse grew 20 percent in 1998, we would use a similar growth rate to plan 1999). The biggest problem with this assumption, of course, is that there is no business context associated with it. For example, if the 20 percent growth rate was attributed to the incorporation of data from a newly acquired business unit and no other acquisitions were planned, a 20 percent growth rate for the following year might be overstated. Data warehouse content management would measure what parts of the data warehouse are growing and what parts are not. This growth is measured in business context, not technical terms. For example, rather than just knowing one million rows were added to the data warehouse, we would know the million rows corresponded to a new business unit. This information then allows for the determination of whether the growth is a single-year anomaly or whether similar growth is planned for the next year.

Performance and Design Optimization

Managing data warehouse activity contributes to continuous performance and design optimization. Understanding specific patterns of usage, including who is using the warehouse, when they are using it and what data is being accessed, provides the foundation for tuning the data warehouse for increased performance and improved design. Usage statistics help diagnose where indices should be added or dropped, where tables should combined and where aggregates would improve performance. Usage patterns can also indicate where users might benefit from the deployment of a data mart or where resources can be recovered by ridding the data warehouse of dormant data.

Improved resource management is yet another benefit of managing the data warehouse. Certainly, performance tuning and design optimization are factors in effective hardware and software resource management. There are additional resource management benefits in providing data warehouse managers, DBAs and administrators with the requisite facts and data required to make effective resource allocation decisions. Much the way the data warehouse provides access for business users to the information assets of the corporation, data warehouse management provides information assets to the data warehouse support team in order to effectively deploy its resources.

When you build the data warehouse, don't forget to plan for its management. The goal is to have a data warehouse whose content and activity metrics are consistently evaluated and understood, and continually tuned to meet the decision support needs of the business. Attain the benefits of managing the data warehouse--all the business advantages of a highly effective and efficient decision support environment-- with the additional benefits of continuous performance tuning and user satisfaction. Data warehouse management--don't overlook it!

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access