Like many great innovations throughout history, the Web caught most of the established world by surprise. Its developers almost accidentally let the genie out of the lamp, and its escape has brought untold opportunity to the world of information processing. However, like most genies, the Web also brought great hazards. In exchange for its potential, the Web has triggered a new wave of turmoil and forced evolution in the basic infrastructure that drives the modern enterprise. The transformation of the Web into an integral part of the enterprise infrastructure has resulted in changes in the characteristics of enterprise IT:

All applications are now 24x7
With the near-universal worldwide access provided by the Web, almost any Web-based application, especially a public-facing one, is by definition 24x7, even those with low duty cycles. This challenges the long-standing assumption that business applications have a maintenance window, and it has increased demand for high-availability solutions, formerly the domain of a few select, large applications.

All applications are now mission-critical
When the world can access your applications, application failures are exposed to a much wider community.

All previous assumptions about capacity planning are now obsolete
A consequence of ubiquitous access is unpredictability of loads, distorting many established techniques for application capacity planning.

Security risks are magnified
With the entire world at the front door, better locks become mandatory.

President Eisenhower once said, "In preparing for battle, I have always found that plans are useless, but planning is indispensable." Whether or not an application rides on high-availability (HA) infrastructure, architects can improve application availability by applying HA design strategies. HA application design is, of course, a complex topic in itself; think of the core concepts in this article as a "starter kit" for designing HA applications, even on non-HA infrastructure. IT shops should decide when to use these strategies by (1) gathering concrete data on the business impact of application downtime, (2) analyzing the causes and likelihood of planned and unplanned downtime, (3) assessing which HA design strategies can affect which causes and (4) weighing the extra cost of HA design strategies against the benefits of reduced downtime.

The increasing complexity of systems management and escalating demands on enterprise availability have intensified the need for high-availability support solutions. While high-availability services are essential for 24x7 mission-critical applications, there are significant cost issues to be evaluated. To extend these service levels above 99.9 percent planned availability, the incremental cost increases exponentially, while the amount of downtime saved declines. Due to the high support costs and stringent configuration requirements, five nines application-level availability yields a negative return on investment for approximately 99.999 percent of all enterprises.


HA infrastructure ensures that an application has constant availability of network, processors, disks, memory, etc., such that a failure of one of these components is transparent to the application. A risk analysis identifies important functions and assets critical to HA, and subsequently establishes the probability of a breakdown in them. Once the risk is established, objectives and strategies to eliminate avoidable risks and minimize impacts of unavoidable risks can be set. For most hardware, middleware and operating systems (OSs), this means duplication and physical separation of IT systems, reducing single points-of-failure, and clustering and coupling applications between multiple systems.

Clustered server architectures provide the benefits of both high availability and performance scalability. Cluster packaging comes in many different forms: (1) multiple standalone servers (with very high-speed cluster interconnects), (2) multiple servers in a box (this would include new high-density servers as a category), (3) multiple partitions within an SMP or (4) any combination of these three. A single-system view is an important component of a cluster high-availability environment. As nodes are added to a cluster, the requirement to manage distributed cluster resources as if managing a single server becomes a critical differentiator in the selection of a high-availability system. (See Figure 1.)

Access to data and intelligent failover including dynamic reconnect are critical to application-level high availability. Key requirements for storage solutions include:

  • Improved IT service, including security, local performance options and remote data replication.
  • 24x7 data availability.
  • Cluster server support for both individual servers and generic cluster access.
  • Open connectivity. Connect any server to any storage system through storage networks.
  • Rapid recovery and/or restart of applications when the unforeseeable does happen.

There are other critical components in an HA system. For example, there are several server adapter card techniques that can help a network manager increase network availability: load balancing, hot plug-ability, dual homing of server cards and network operating system (NOS) optimization. If you are planning on uninterruptible power supply (UPS) systems, invest in a global, shared solution with reliable switch gear and full bypass capability, rather than deploying many low-capacity (ostensibly inexpensive) UPS systems for individual racks or devices in a fragmented approach. Fortunately for enterprises that are considering building large-scale VPNs, more vendors have introduced a variety of new high-availability features. Competition in a maturing VPN gateway market will yield a stream of incremental high availability features from all the major vendors. That means that Internet VPNs for mission-critical applications, for large branch networks and for large remote-access user populations can now be designed to take advantage of these new resiliency features, reducing the risks and costs associated with network congestion and downtime.
Database products are another critical component of system infrastructure, and the DBMS vendors have been actively enhancing their products to fit in an HA world.

Demand for advanced manageability and scalability increases: DBMS vendors continue to emphasize enhanced self-tuning capabilities for databases and reduced operational complexity. Scalability may be the deciding factor for many enterprises that support large HA and mission-critical databases.

Integrated monitoring solutions emerge: DBMS tools vendors offer highly integrated monitoring solutions supporting heterogeneous databases across platforms, providing a holistic view of the entire environment, monitoring applications end-to-end.

Spending increases to ensure business continuity: Enterprises continue to address the growing need for business continuity by deploying redundant databases at remote locations. Both hardware and software vendors will offer enhanced features to support this market, while DBMS vendors will offer improved, scalable and high-performance standby database and data replication technology to support business continuity.

Database sizes continue to grow: Features such as manageability, availability, scalability and reliability have become increasingly important; therefore, DBMS vendors are pushing for increased capacity support for petabyte-range databases.

Demand for higher availability features increases: DBMS vendors are showcasing increased numbers of nodes supporting clustered databases, improving scalability and availability. Post 9/11, the number of enterprises deploying redundant databases at remote locations to address business continuity has been increasing.


HA infrastructure is not the only way to increase application availability. Application servers typically have HA features (J2EE servers in particular), but a technically demanding application can easily require supplemental design strategies. HA application design is concerned with maintaining application operation in the midst of application failures, infrastructure failures and real-time maintenance. The strategies that follow can be used individually or in combination:

Redundancy: Each element of an HA application must have a backup that can take over if the primary fails. Load-balancing features share the load during normal operation and shift the load when a node fails. Alternatively, one or more hot standbys might take over if a primary fails, and the design must account for transactions that were in-flight when the failure occurred.
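The load-shifting idea above can be sketched in a few lines. This is a hypothetical client-side failover wrapper, not a real library API: the node callables and names are illustrative, and a production design would also handle the in-flight transactions the text mentions.

```python
def call_with_failover(request, nodes):
    """Try each node in priority order; return the first success."""
    last_error = None
    for node in nodes:
        try:
            return node(request)
        except ConnectionError as exc:  # primary down: shift load to the standby
            last_error = exc
    raise RuntimeError("all nodes failed") from last_error

# Usage: the primary raises, so the hot standby serves the request.
def primary(req):
    raise ConnectionError("primary node unreachable")

def standby(req):
    return f"handled {req} on standby"

print(call_with_failover("order-42", [primary, standby]))
# -> handled order-42 on standby
```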

Recoverable state design: An application's handling of in-flight transactions is largely determined by its approach to state management. "Stateless execution" is often put forth as an HA design principle; but while it is true that an individual element is "more HA" if stateless, the application as a whole typically cannot be viewed as stateless: users make a series of requests, and later requests build on earlier ones. Thus, it is necessary to store state between exchanges, replicate the state (so that it is not subject to a single point of failure) and then reestablish state after recovery.
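A minimal sketch of the store-replicate-reestablish cycle, under the assumption that two plain dictionaries stand in for independent state stores on separate nodes; the class and names are illustrative, not from any particular product.

```python
class ReplicatedSessionStore:
    """Session state written to every replica so no single store is a
    single point of failure; reads recover state from any survivor."""

    def __init__(self, replicas):
        self.replicas = replicas          # e.g., two stores on separate nodes

    def save(self, session_id, state):
        for replica in self.replicas:     # replicate on every write
            replica[session_id] = dict(state)

    def load(self, session_id):
        for replica in self.replicas:     # reestablish state after recovery
            if session_id in replica:
                return replica[session_id]
        raise KeyError(session_id)

node_a, node_b = {}, {}
store = ReplicatedSessionStore([node_a, node_b])
store.save("sess-1", {"cart": ["book"], "step": 2})
node_a.clear()                            # simulate losing one replica
print(store.load("sess-1"))               # state survives on the other replica
```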

Failure detection: To initiate recovery of state, and for any failure scenario not handled transparently to the application, there must be "detect and retry" logic within the application. The server side of the application may be able to do this transparently (preferred), but the client side may have to do it. The application may have to "fail gracefully" by saving transaction information, notifying a user or administrator and performing cleanup upon application restart.
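The detect-and-retry and fail-gracefully paths above can be combined in one small sketch. Everything here is illustrative: the function name, the dead-letter list standing in for saved transaction information, and the bounded retry count are assumptions, not any framework's API.

```python
dead_letter = []   # transactions saved for cleanup/notification after restart

def submit_with_retry(txn, send, retries=2):
    """Detect-and-retry: resubmit a bounded number of times, then fail
    gracefully by saving the transaction for later cleanup."""
    for attempt in range(retries + 1):
        try:
            return send(txn)
        except TimeoutError:          # detected failure: retry
            continue
    dead_letter.append(txn)           # fail gracefully: save transaction info
    return None                       # caller can now notify a user or admin

# Usage: a flaky back end fails once, then the retry succeeds.
calls = {"n": 0}
def flaky_send(txn):
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("server did not answer")
    return f"committed {txn}"

print(submit_with_retry("txn-9", flaky_send))   # committed txn-9
```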

Watchers and heartbeats: An HA application must be watched in real time to ensure it is still running. Two key design strategies are process watchers, which monitor execution of application processes on the watcher's machine, and heartbeats, where a network-based element responds to periodic "are you still there?" messages.
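The heartbeat strategy reduces to tracking when each node last answered and flagging any node that has gone quiet too long. A minimal sketch, assuming heartbeat replies arrive via the `beat` call; node names and the timeout value are illustrative.

```python
import time

class HeartbeatMonitor:
    """Records 'are you still there?' replies and flags nodes whose last
    heartbeat is older than the allowed interval."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}

    def beat(self, node):
        self.last_seen[node] = time.monotonic()

    def dead_nodes(self, now=None):
        now = time.monotonic() if now is None else now
        return [n for n, t in self.last_seen.items() if now - t > self.timeout]

monitor = HeartbeatMonitor(timeout_seconds=5)
monitor.beat("app-server-1")
monitor.beat("app-server-2")
monitor.last_seen["app-server-2"] -= 10   # pretend 10 quiet seconds for server 2
print(monitor.dead_nodes())               # ['app-server-2']
```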

Operations management integration: Monitoring and management tools may adequately manage watcher and heartbeat functions, but operations integration can go much deeper. Applications may incorporate management APIs to raise alerts (e.g., SNMP traps), enable full monitoring and management (e.g., SNMP MIBs) and write errors to logs that are monitored by a management tool.

Automatic restart: When a watcher or management tool detects a failure, restart must perform necessary application cleanup, reinitiate application processes, reconnect them as appropriate and reregister them with application naming services.
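The restart sequence (clean up, reinitiate, reregister) can be framed as a small supervision loop. This is a hypothetical sketch with callables standing in for the real process controls and naming service; a real watcher would poll forever rather than exit.

```python
def supervise(start, is_alive, cleanup, register, max_restarts=3):
    """Watcher loop sketch: when the watched process dies, clean up,
    restart it, and reregister it with the naming service."""
    restarts = 0
    proc = start()
    register(proc)
    while restarts < max_restarts:
        if is_alive(proc):
            break                      # a real watcher would sleep and re-poll
        cleanup(proc)                  # necessary application cleanup
        proc = start()                 # reinitiate the application process
        register(proc)                 # reregister with naming services
        restarts += 1
    return proc, restarts

# Usage: simulate a process whose first instance dies immediately.
state = {"starts": 0}
def start():
    state["starts"] += 1
    return {"id": state["starts"], "alive": state["starts"] > 1}
def is_alive(p):
    return p["alive"]
def cleanup(p):
    pass
registered = []
def register(p):
    registered.append(p["id"])

proc, restarts = supervise(start, is_alive, cleanup, register)
print(proc["id"], restarts)   # 2 1  (first instance died; one restart)
```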

Version migration: The highest levels of availability require eliminating planned downtime, which may involve upgrading application versions while the application is running. The two basic approaches for this are parallel operation of multiple versions and a "flash cut" to a hot standby (in-flight transactions complete on the old version, all new transactions go to the new version). Supplemental approaches include auto-update clients and version awareness within application interfaces (or within the infrastructure, as in .NET's version management). The biggest issue arises when a new version changes data structures. Without a downtime window in which to perform the conversion, the application must be written to handle data conversion on the fly.
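The flash-cut approach can be sketched as a router that sends new transactions to the new version while in-flight transactions drain on the old one. The class and version labels are illustrative assumptions, not a description of any product's mechanism.

```python
class VersionRouter:
    """Route transactions by version: before the cut, work goes to the old
    version; after the cut, all new transactions go to the new version."""

    def __init__(self, old, new):
        self.old, self.new = old, new
        self.cut = False
        self.in_flight_old = set()

    def begin(self, txn_id):
        if self.cut:
            return self.new            # all new transactions: new version
        self.in_flight_old.add(txn_id)
        return self.old

    def flash_cut(self):
        self.cut = True

    def end(self, txn_id):
        self.in_flight_old.discard(txn_id)

    def old_drained(self):
        """True once the old version can be retired."""
        return self.cut and not self.in_flight_old

router = VersionRouter(old="v1", new="v2")
h1 = router.begin("t1")                       # starts on v1
router.flash_cut()
h2 = router.begin("t2")                       # new work lands on v2
print(h1, h2, router.old_drained())           # v1 v2 False (t1 still in flight)
router.end("t1")
print(router.old_drained())                   # True
```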

Connection management: The application must be designed to handle connection failures (e.g., network, DBMS) by recognizing connection timeouts and reestablishing connections to alternate providers, most likely found via an application naming service.
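A hedged sketch of that reconnect path, assuming the naming service is a simple name-to-endpoints mapping and `connect` is an application-supplied callable; the service and endpoint names are invented for illustration.

```python
def get_connection(service_name, naming_service, connect):
    """Resolve a service name to candidate endpoints and connect to the
    first one that answers, skipping endpoints that time out."""
    for endpoint in naming_service[service_name]:
        try:
            return connect(endpoint)
        except TimeoutError:
            continue          # recognized timeout: fail over to the next provider
    raise ConnectionError(f"no reachable endpoint for {service_name}")

# Usage: the primary times out, so the connection lands on the replica.
naming_service = {"orders-db": ["db-primary:5432", "db-replica:5432"]}

def connect(endpoint):
    if endpoint.startswith("db-primary"):
        raise TimeoutError("primary not responding")
    return f"connected to {endpoint}"

print(get_connection("orders-db", naming_service, connect))
# -> connected to db-replica:5432
```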

Multi-threaded resource requests: For resource requests that have the possibility of a timeout, an HA application may spawn separate threads for making such requests. This allows the application to more effectively manage response to its users when it experiences a timeout due to a resource failure.
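In Python this strategy maps directly onto `concurrent.futures`: the resource request runs on its own thread, and the caller bounds how long it will wait before answering the user anyway. The sleep stands in for a hung back-end resource.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
import time

def slow_resource():
    time.sleep(0.5)                    # simulates a hung back-end resource
    return "data"

pool = ThreadPoolExecutor(max_workers=1)
future = pool.submit(slow_resource)    # request runs on a separate thread
try:
    result = future.result(timeout=0.05)   # bound how long the user waits
except FutureTimeout:
    result = "temporarily unavailable"     # respond to the user despite the hang
print(result)                              # temporarily unavailable
pool.shutdown(wait=True)
```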

Transaction-aware design: Transaction management features (e.g., those of application servers and DBMSs) will ensure transaction integrity, but only if the failure occurs within the context of transaction control boundaries. Some transactions can be submitted multiple times with no loss of integrity (e.g., an address update) while some cannot (e.g., an account withdrawal). Upon a request failure, the application should validate whether the transaction was properly applied and, if not, restart it (or perhaps notify an end user).
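One common way to make a non-repeatable transaction like a withdrawal safe to revalidate and resubmit is an applied-transaction log keyed by transaction id. The dict-backed ledger below is purely illustrative; a real system of record would hold this state durably.

```python
applied = {}                 # txn_id -> amount: the applied-transaction log
balance = {"acct-7": 100}

def withdraw(txn_id, acct, amount):
    """Idempotent withdrawal: a retry of an already-applied transaction
    is detected and skipped, so there is no double debit."""
    if txn_id in applied:            # validate: was this already applied?
        return balance[acct]
    balance[acct] -= amount
    applied[txn_id] = amount
    return balance[acct]

withdraw("txn-1", "acct-7", 30)
withdraw("txn-1", "acct-7", 30)      # retry after a dubious failure
print(balance["acct-7"])             # 70, not 40
```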

Indirection: The principle of indirection underlies many design principles; i.e., an application element should never know the physical address of another; instead, elements should find each other by name. This allows elements to be moved and reconnected in a failure scenario without changing the application.
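The smallest useful form of this principle is a name registry: callers resolve a name, and moving an element only changes the registered address. Service names and addresses here are invented for illustration.

```python
class NameRegistry:
    """Elements find each other by name, never by physical address, so a
    failed element can be restarted elsewhere without changing callers."""

    def __init__(self):
        self._addresses = {}

    def register(self, name, address):
        self._addresses[name] = address

    def resolve(self, name):
        return self._addresses[name]

registry = NameRegistry()
registry.register("pricing-service", "10.0.0.5:8080")
# pricing-service fails over to another host; callers are unchanged:
registry.register("pricing-service", "10.0.1.9:8080")
print(registry.resolve("pricing-service"))   # 10.0.1.9:8080
```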

HA design can add significant cost to an application delivery effort (and HA infrastructure adds costs of its own, discussed later in this article). Testing an HA application is also more expensive, because it is often difficult to recreate the various failure scenarios. Beyond development costs, HA application design affects performance and operations management. Therefore, the appropriate level of HA design for any given application depends on business considerations: the identifiable business risks and the impacts of downtime for that application.


The duplicated hardware, software license fees, facilities, etc., are easily priced; but the ongoing support costs grow exponentially. Maintaining the IT infrastructure is an ongoing process, made even more critical in HA mode; and companies must routinely take the time to reevaluate their requirements:

  • Identify the IT needs, goals and measurement metrics.
  • Review the company's current architecture and support.
  • Evaluate gaps between the actual performance and the goals.
  • Construct a plan for reducing the gaps.
  • Assess the costs required to attain goals and adjust goals if needed.
  • Determine whether to conduct an ROI or cost/benefit analysis.
  • If deemed necessary, perform an ROI study and define metrics before implementing new projects or purchasing support.

Because support costs are dependent on the services provided and the infrastructure configuration, absolute price points are irrelevant; therefore, HA support should be priced as a multiple of the cost of a "standard" offering. Figure 2 provides an estimated cost multiple for each level of availability and applies the cost multiple to calculate a "prevention cost per hour of downtime." The cost multiple provided is an average derived from cost estimates provided by major services providers; the actual multiple is subject to change and varies by provider. The cost multiple would be multiplied by the cost of a standard offering delivering 99 percent availability to determine the total cost of the higher availability solution. For example, if you historically spent $100,000 per year to achieve 99 percent availability (about 88 hours of downtime), plan to spend (290.7 x (0.011 x $100,000)), or almost $320,000, for five nines availability (roughly five minutes of downtime per year).
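The downtime arithmetic behind that example is easy to verify. The 290.7 multiple and the 0.011 scaling factor come from the article's Figure 2 formula; the calculation below simply reproduces them.

```python
HOURS_PER_YEAR = 365 * 24   # 8760

def downtime_hours(availability):
    """Annual downtime budget for a given availability fraction."""
    return HOURS_PER_YEAR * (1 - availability)

print(round(downtime_hours(0.99), 1))            # 87.6 hours, ~88 in the text
print(round(downtime_hours(0.99999) * 60, 1))    # 5.3 minutes for five nines

# Five-nines cost per the article's formula, using its 290.7 multiple:
baseline = 100_000                               # annual spend at 99 percent
cost_five_nines = 290.7 * (0.011 * baseline)
print(round(cost_five_nines))                    # 319770, "almost $320,000"
```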


The high cost of downtime can be devastating to an enterprise. In the information-driven economy, downtime for any reason is unacceptable, and availability and performance go hand in hand. Regardless of the cause (application or database failure, system upgrade, operational error or simply poor performance), if a Web site or an application is slow in delivering requested information, it might as well be offline. The consequences (lost data, lost customers, lost revenues) can cripple an enterprise. Under these conditions, you must maintain continuous uptime and predictable performance levels; and if an outage or disaster occurs, quick recovery with minimal data loss is imperative. HA design and implementation can help avoid these consequences.
