Availability, scalability and manageability are becoming as essential for business intelligence and data warehousing applications as they are for their operational application counterparts. This poses some significant challenges for IT staff, especially given the huge amount of data often managed by a data warehouse in a business intelligence environment, the complexity of the analytical processing involved, and the number of business users that need to be supported.
One key solution that helps improve availability, scalability and manageability is network storage, which enables data warehouse information to be maintained by specialized networked servers that are optimized for data management. These servers provide universal access to warehouse data from anywhere in the network. According to IDC, approximately 30 percent of corporate data is currently managed by network storage; however, this is expected to grow to 70 percent by 2006.
This article discusses the issues involved in managing and administering a high-performance and high-availability data warehousing and business intelligence system, including the benefits that network storage hardware and software offer in such an environment.
The Evolution of Network Storage
In a traditional data center environment, data management tasks are handled by tape and disk devices directly attached to each server in the network (commonly referred to as direct attached storage, or DAS). When remote applications and user devices require warehouse data, the data requests are routed across LAN, WAN and wireless networks to the appropriate warehouse server that maintains the required data. This conventional DAS architecture has several problems:
- As the demand for business intelligence grows, data warehouse storage requirements and business intelligence workloads will increase. This increase in resource requirements, however, is nonlinear over time and is not evenly distributed across the enterprise network. The rigid nature of direct attached storage makes it very difficult to satisfy disk storage and network bandwidth requirements as the demand for business intelligence increases.
- Adding storage devices to servers to meet new resource requirements frequently means that servers must be taken offline to perform hardware and software upgrades, thus impacting data warehouse availability.
- Business intelligence workloads vary over time based on business requirements. Month-end and year-end are examples of periods when business intelligence processing is likely to create peaks in demand, which, in turn, will create hot spots and bottlenecks both in the network and in server I/O subsystems. In a direct attached storage model, these peaks are difficult to manage, which means that the performance of all users and all applications is affected.
- Loading and backup of large data warehouse databases involve significant processing for both the data warehouse server and the network when accessing remote operational source data and backup servers. The elapsed time to perform these operations can be lengthy due to the large amounts of data involved. There is a potential impact on the availability of both operational and business intelligence applications because these applications may have to be stopped while the data warehouse operations are taking place. Although this data warehouse challenge is not directly related to DAS, this article will show that network storage products offer some unique solutions for data warehouse maintenance.
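The elapsed-time concern in the last point can be made concrete with a rough estimate. The following Python sketch computes how long it takes to move a given volume of data across a network link. The 1,000 GB volume, the link speeds and the 70 percent efficiency factor are illustrative assumptions rather than figures from this article, and the function name is hypothetical.

```python
def transfer_hours(data_gigabytes, link_gigabits_per_sec, efficiency=1.0):
    """Estimate hours needed to move a volume of data across a network link.

    efficiency is the assumed usable fraction of the nominal line rate
    (1.0 means the theoretical best case with no protocol overhead).
    """
    if data_gigabytes < 0:
        raise ValueError("data_gigabytes must not be negative")
    if link_gigabits_per_sec <= 0:
        raise ValueError("link_gigabits_per_sec must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    gigabits = data_gigabytes * 8  # 8 bits per byte
    seconds = gigabits / (link_gigabits_per_sec * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical 1,000 GB warehouse load or backup over three link speeds.
    for rate in (0.1, 1, 10):
        hours = transfer_hours(1000, rate, efficiency=0.7)
        print(f"{rate:>4} Gbit/s: {hours:.1f} hours")
```

At the theoretical line rate, moving 1,000 GB over a 1 gigabit/second link takes 8,000 seconds, a little over two hours, before any server-side processing is counted. This is why large loads and backups can collide with the availability of other applications.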
To address these issues, organizations need a data storage solution that allows them to offload processing of data from generalized networked application and database servers. Furthermore, this solution must provide a flexible and scalable architecture for administrators to optimize network and I/O performance. The industry solution that meets these requirements is network storage.
What is Network Storage?
Network storage devices are optimized for data management tasks. These devices are attached to the corporate network (see Figure 1) and can be accessed from networked applications throughout the enterprise. Four key benefits of network storage are:
- Data Sharing: Network storage enables disk storage to be consolidated and pooled into a shared network resource that can be accessed by clients and servers anywhere in the network. This architecture provides a great deal of flexibility for optimizing enterprise-wide disk space and I/O requirements. The separation of storage management from application and database servers also allows data storage upgrade decisions to be separated from server upgrade decisions, thus providing a flexible deployment environment.
- Simplicity: Network storage is simple to implement and administer and can be integrated into an existing network infrastructure with minimal cost. A subsequent discussion will show that the ability to realize this benefit will vary depending on the network storage solution employed.
- Scalability: Network storage devices offer capacity and performance scalability that enable organizations to adapt easily to increasing demand for disk storage and I/O bandwidth. Network speed and bandwidth affect the ability to scale network storage performance, as discussed later in this article.
- Manageability: Consolidating the data management requirements of multiple applications into a shared network device makes it considerably easier for storage and database administrators to do regular housekeeping tasks such as backup. In addition, network storage not only facilitates the management and monitoring of storage allocation, capacity and performance, but recovery as well, enhancing the availability of the overall system.
Figure 1: Data Storage Architectures
Network Storage Technologies and Architectures
Network storage is not new. Organizations have been using general-purpose networked file servers for many years to share data across heterogeneous and distributed processing environments. These file servers run standard operating systems (UNIX or Microsoft Windows NT, for example) and employ de facto industry standard file protocols such as UNIX NFS and Windows CIFS. This enables remote applications to access data on the networked file server without being aware that the data is remote (i.e., the data appears to the remote application as a virtual local file).
While general-purpose file servers are suited to the sharing of data by file-based applications such as office applications and Web servers, they have not typically been used for database-driven application processing such as data warehousing. This is because limited network speed and the TCP/IP stack processing overheads on both the application server and the network file server can degrade performance. Recent advances in hardware and software technology, however, now make the use of network storage viable for data warehouse processing. Some of the key technologies (see Figure 2) in this area are:
Figure 2: NAS and SAN Network Storage Technologies
Network Attached Storage (NAS): This IP- and Ethernet-based network storage architecture replaces the general-purpose file server with a server running a custom operating system (stripped down UNIX kernel, reduced Linux, specialized Windows 2000 kernel, custom real-time operating system) that is optimized for data processing and management. The optimized operating system improves file server performance and supports features such as RAID, caching, clustering and specific data management features such as fast snapshots and remote mirroring for high performance and availability. One key advantage of the NAS architecture is that it is much easier to manage than a DAS or SAN environment. In NAS, data access uses a file-level I/O protocol such as NFS or CIFS.
Storage Area Network (SAN): This network storage architecture employs a dedicated high-speed network (usually implemented using Fibre Channel) to connect dedicated storage servers, disk arrays, virtual tape servers, etc., to one or more application servers. The application server and the storage devices communicate using a low-level block-based SCSI-3 protocol. SAN technology is implemented using either a direct point-to-point connection (direct-attached SAN) or a network switch to a data storage farm (true SAN) as shown in Figure 2. From an administration perspective, SAN is similar to DAS in that DBAs must be careful how they allocate and optimize the database and log files on the data storage system. A SAN typically provides better performance than a NAS architecture due to its use of a block-level I/O protocol.
Gigabit Ethernet: These one and ten gigabit/second high-speed networks help improve NAS performance by reducing network latency. Although storage devices can share a high-speed TCP/IP Ethernet network with other applications and servers, it is often better for performance, manageability and security to create a private storage network that is dedicated to the storage devices. Another advance in this area is to improve network performance by offloading the TCP/IP stack processing onto network hardware adapters (this is known as a TCP/IP offload engine or TOE).
Direct Access File System (DAFS): This protocol provides a virtual interface (VI) storage technology where applications can access a network storage device via direct memory-to-memory requests to speed data access. This improves performance by reducing the overhead (for example, context switching, TCP/IP stack processing) that normally occurs on the application server when crossing a network. DAFS comes in two flavors: user space DAFS (uDAFS) where the DAFS support is provided by the application (e.g., a database system) and pseudo device DAFS (dDAFS) where the DAFS support is in the operating system. dDAFS is transparent to the application (i.e., it requires no application changes) but has the disadvantage that a context switch to the operating system kernel is still required to invoke it. The DAFS architecture also supports database raw device I/O, which is not possible using a generalized network file server. Database vendors are working with network storage vendors to exploit the use of DAFS with database applications. DAFS is transport independent and can operate over Ethernet, Fibre Channel and InfiniBand (see following description) networks.
InfiniBand (IB): This high-performance I/O interconnect is designed for both tightly coupled and network-coupled systems. InfiniBand is a merger of two efforts to find a high-speed alternative to the aging PCI bus: NGIO, led by Intel, and Future I/O, backed by Compaq, IBM and Hewlett-Packard. An InfiniBand network can be bridged to both an Ethernet and a Fibre Channel network and can, therefore, be used to access both NAS and SAN devices. InfiniBand can also be used in conjunction with DAFS.
iSCSI: This new protocol developed by the Internet Engineering Task Force (IETF) improves performance by allowing block-mode SCSI commands to be used over an IP-based Ethernet network. iSCSI enables organizations to manage storage over long distances and improves interoperability and flexibility in a SAN-style environment because it allows a dedicated Fibre Channel (FC) network to be replaced by a generalized IP-based Ethernet network. This type of configuration can be thought of as a hybrid NAS/SAN architecture.
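The difference between file-level and block-level I/O runs through the NAS, SAN and iSCSI descriptions above, and a small Python sketch may help make it concrete. No network protocol is involved here: an ordinary directory stands in for a mounted NAS export, and an ordinary file stands in for a raw SAN or iSCSI volume. The 512-byte block size, the file names and the function names are all assumptions made for illustration.

```python
import os
import tempfile
from typing import Tuple

BLOCK_SIZE = 512  # assumed block size in bytes; real devices vary


def read_file(mount_point: str, relative_path: str) -> bytes:
    """File-level access: the client names a path and the storage server
    works out which blocks hold the data."""
    with open(os.path.join(mount_point, relative_path), "rb") as handle:
        return handle.read()


def read_block(device_path: str, block_number: int,
               block_size: int = BLOCK_SIZE) -> bytes:
    """Block-level access: the client addresses a numbered block and must
    itself know how files or database pages map onto blocks."""
    with open(device_path, "rb") as device:
        device.seek(block_number * block_size)
        return device.read(block_size)


def demo() -> Tuple[bytes, bytes]:
    with tempfile.TemporaryDirectory() as root:
        # Stand-in for a NAS export mounted on the application server.
        with open(os.path.join(root, "sales.dat"), "wb") as handle:
            handle.write(b"file-level payload")

        # Stand-in for a raw volume of four blocks; block 2 holds the data.
        device_path = os.path.join(root, "volume.img")
        with open(device_path, "wb") as device:
            device.write(b"\x00" * BLOCK_SIZE * 2)
            device.write(b"block-level payload".ljust(BLOCK_SIZE, b"\x00"))
            device.write(b"\x00" * BLOCK_SIZE)

        file_data = read_file(root, "sales.dat")
        block_data = read_block(device_path, 2).rstrip(b"\x00")
        return file_data, block_data


if __name__ == "__main__":
    print(demo())
```

With file-level access the storage device owns the file system, which is one reason NAS is simpler to administer. With block-level access the application server or database owns the layout, which is why DBAs in a SAN environment still have to plan where data and log files are placed.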
There is considerable debate in the industry about the pros and cons of using NAS devices versus a SAN; however, that discussion is beyond the scope of this article. Much of the debate focuses on the flexibility and ease-of-use of NAS versus the complexity and high performance of a SAN. Performance, however, is a moving target, especially given evolving network storage technologies such as DAFS and iSCSI. It is also important to consider application performance as perceived by the user, rather than just raw hardware and software I/O performance. Regardless of the technology used, some applications experience performance degradation when using network storage, while the performance of others is enhanced. The NAS versus SAN debate is becoming largely pointless because the industry direction is toward hybrid NAS/SAN products (i.e., a unified network storage model). With this in mind, potential users of these technologies should focus their attention on vendors working on these hybrid solutions.
DB2 UDB Support for Network Storage
At present, DB2 UDB support and usage of network storage are targeted primarily at application and database servers running UNIX, Linux or Windows NT. All subsequent discussion in this article will therefore be restricted to those operating environments.
Classic UNIX and Windows NT direct attached disk devices provide good DB2 performance, but often have limited availability and are inflexible and difficult to manage. To improve availability and simplify administration, DB2 UDB supports the storage and maintenance of data warehouse and system data (recovery logs, for example) on both SAN and NAS devices.
In the SAN environment, DB2 works with both IBM (the "Shark" Enterprise Storage Server, for example) and third-party SAN solutions. Whereas many of the benefits of network storage can be obtained by using a SAN, DB2 administrators, as in a DAS environment, still have to be concerned about where data resides and have to balance the workload across the I/O subsystem of the SAN. Many DB2 and system administrators also find a SAN environment complex to install and manage.
DB2 administration can be simplified by moving to a NAS environment. As with a SAN, DB2 supports both IBM and third-party NAS products. On the third-party front, IBM has a close development and marketing relationship with Network Appliance. One outcome of this relationship is the Network Appliance DAFS DataBase Accelerator, which offers a dDAFS device driver for DB2 UDB running on Sun Solaris. IBM and Network Appliance have also worked together on several functional, performance and scalability benchmarks to demonstrate the advantages of network storage.1
With DB2 UDB V7.2, IBM delivered two new DB2 features that are very useful in the network storage environment: suspended I/O and the db2inidb utility. The SET WRITE SUSPEND command suspends write I/Os to tablespaces and log files for a specific database (write I/O operations are resumed using the SET WRITE RESUME command). This facility enables a consistent copy or backup of a database to be made without the need to take the database offline. The copy can be used for read-only processing (for example, by a data warehouse ETL tool such as the Data Warehouse Center) or can be used in conjunction with the db2inidb utility to have DB2 restore and, if required, forward recover a failed database.
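The ordering of the suspended I/O steps matters: writes are suspended, the storage-level copy is taken, and writes must then be resumed even if the copy fails. The Python sketch below shows only that sequencing. Both `execute` (a callable that submits a command to DB2 over an existing connection) and `take_snapshot` (a callable that triggers the storage device's own copy or snapshot function) are hypothetical placeholders, not real DB2 or vendor APIs. The `FOR DATABASE` clause reflects the full command syntax as commonly documented and should be verified against the DB2 Command Reference for the release in use. Initializing the copy with db2inidb is outside the scope of the sketch.

```python
from typing import Callable

SUSPEND = "SET WRITE SUSPEND FOR DATABASE"
RESUME = "SET WRITE RESUME FOR DATABASE"


def copy_with_suspended_writes(execute: Callable[[str], None],
                               take_snapshot: Callable[[], None]) -> None:
    """Suspend writes, take a storage-level copy, then always resume writes.

    execute       -- hypothetical: sends one command to the connected database
    take_snapshot -- hypothetical: asks the storage device to copy the volumes
    """
    execute(SUSPEND)
    try:
        take_snapshot()
    finally:
        # Resume even if the snapshot raised, so the database is not left
        # with write I/O suspended.
        execute(RESUME)


if __name__ == "__main__":
    # Dry run that records the steps instead of touching a real database.
    steps = []
    copy_with_suspended_writes(steps.append,
                               lambda: steps.append("storage snapshot"))
    for step in steps:
        print(step)
```

Placing the resume in a `finally` clause is the main design choice here: the window during which writes are suspended should be as short as possible and should never outlive a failed copy.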
Network Storage Benefits for DB2 Data Warehouse Users
Network storage provides several benefits to DB2 data warehouse and business intelligence users.
- Availability and Accessibility: Facilities such as RAID, hot pluggable devices, cluster support and fast backup and recovery all contribute to business intelligence applications having increased availability and accessibility.
- Scalability: The cost per megabyte of network storage provides a cost-effective solution for the large amounts of disk storage and I/O required by DB2 data warehouse and business intelligence tools and applications.
- Manageability: Network storage appliances are easy to install, can be administered from a single Web top and can handle data and disk space growth without affecting business intelligence users and applications. They also make it easy to copy, replicate and mirror data warehouse databases for testing, backup, archiving and disaster recovery.
- Security: Network storage provides a secure method for business intelligence users and applications to share, copy and back up information anywhere in the corporate network.
The combined benefits of network storage provide excellent return on investment and reduced total cost of ownership (TCO) for DB2 business intelligence applications. TCO savings come from reduced complexity (simplified installation and database administration), better data accessibility (data warehouse information sharing), high data availability (reduced business intelligence application downtime) and a highly scalable architecture (for supporting data warehouse growth). According to a 2001 study by INPUT, network storage products have a 70 percent lower TCO compared with other storage management solutions. Gartner, Inc. research estimates that approximately 55 percent of server costs are directly attributable to storage, meaning such TCO savings can have a significant positive effect on the overall IT budget.
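As a rough illustration of how the two figures above might combine, the Python sketch below multiplies the share of server costs attributable to storage by the reduction in storage TCO. This rests on strong assumptions: that the 70 percent reduction applies to the whole storage-attributable share, and that the two studies measure comparable things, which neither source is shown to confirm. The budget figure and the function name are hypothetical.

```python
def overall_saving_fraction(storage_share: float,
                            storage_tco_reduction: float) -> float:
    """Return the fraction of total server cost saved, assuming the TCO
    reduction applies uniformly to the storage-attributable share."""
    for name, value in (("storage_share", storage_share),
                        ("storage_tco_reduction", storage_tco_reduction)):
        if not 0 <= value <= 1:
            raise ValueError(f"{name} must be between 0 and 1")
    return storage_share * storage_tco_reduction


if __name__ == "__main__":
    budget = 1_000_000  # hypothetical annual server budget
    fraction = overall_saving_fraction(0.55, 0.70)
    print(f"Estimated saving: {fraction:.1%} of budget, "
          f"about {budget * fraction:,.0f}")
```

Under these assumptions the saving comes to roughly 38.5 percent of server costs, which is best read as an upper-bound illustration rather than a forecast.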
Like all hardware vendors, IBM recognizes the industry direction toward the use of network storage and a unified network storage model that exploits the benefits of both the SAN and NAS architectures. To this end, IBM is putting significant effort into exploiting new technologies such as DAFS and InfiniBand in its DB2 UDB product and into working with third-party network storage solution providers to fully exploit the benefits of the network storage model for business intelligence and data warehousing applications. The move by IBM toward supporting the DAFS direct-memory model will improve business intelligence performance by reducing TCP/IP and context switching overheads on database and application servers. Further improvements in performance will come for business intelligence applications with very large databases and high-performance requirements through the use of block-mode protocols such as iSCSI.
For DB2 data warehousing, network storage is becoming a key storage management solution for supporting scalable data warehouse applications ranging from small data marts to enterprise data warehouses. Network storage also helps solve many data warehouse operational issues in areas such as data availability and sharing, the operational applications batch window and database backup and disaster recovery.