In today's marketplace, computer systems play a key role in business profitability, and downtime of critical systems can result in significant losses of revenue and productivity. In the past, guaranteed recovery from a system disruption was good enough for customers who wanted protection from unforeseen IT interruptions. However, today's customers are not merely concerned about recovery--they want true business continuity. One very important aspect of business continuity is high availability (HA).

The definition of HA varies with the user and with the business need. For instance, a financial institution's application may need to operate 24x7x365; the need, in this case, would be 100 percent system uptime, with no outages. Less critical applications may only need to remain up during business hours and would, therefore, have less stringent high-availability needs. Either way, the basic concepts of high availability, and preparing applications and systems for this growing need, are important.

Highly available systems are designed to withstand any single failure and to minimize the unintended downtime of a system. A highly available system uses a combination of duplicate hardware, complex system configurations and special HA software which, used together, can handle many different failure scenarios. High availability is achieved not only by hardware redundancy, but also by paying special attention to facilities, software, network, application systems, people skills, procedures and business processes. These IT areas can be grouped into two main categories, which can then be further classified.

Infrastructure components include the following:

  • Hardware (e.g., disks, servers, workstations)
  • Software (e.g., operating systems, network operating systems, databases, etc.)
  • Network components (e.g., routers, switches, etc.)
  • Facilities (e.g., power, etc.)
  • Infrastructure design (e.g., all components of an infrastructure have failover capability)

Figure 1: Cumulative outage by availability level (based upon 365 days per year)

Availability (%)    Cumulative Outage
99.9999             32 seconds
99.999              5 minutes
99.99               53 minutes
99.9                8.8 hours
99                  87 hours (3.6 days)
90                  876 hours (36 days)

Application/business components include the following:

  • Application design (e.g., applications are designed to adequately address availability and the possible elimination of batch- processing windows)
  • Operations tools and procedures (e.g., processes and procedures in place in case of a failure)
  • Business processes (e.g., disaster recovery process is documented)
  • Support personnel skills (e.g., operations personnel are properly trained to manage, monitor and configure systems)

Measuring High Availability

Industry measures availability as a "number of nines," representing the percentage of time an application or system is deemed available throughout an entire year. For instance, if a system or application has 99.99 percent availability over the course of a year, this means that a system has a downtime of 53 minutes within that year (see Figure 1). Fifty-three minutes of downtime may be acceptable for some applications; but for more critical applications, this amount of downtime may not be acceptable. Adding another "9" to the percentage of availability, making it 99.999 percent, increases the amount of time a system is available for productive use and reduces downtime to only five minutes within a one-year period.
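The arithmetic behind the "number of nines" is straightforward; a minimal sketch (the function and names are mine, not from the article) reproduces the Figure 1 values from the availability percentage alone:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, matching Figure 1's 365-day year

def yearly_downtime_hours(availability_pct):
    """Cumulative outage implied by an availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR

def fmt(hours):
    """Render a downtime figure in the most readable unit."""
    if hours < 0.05:
        return f"{hours * 3600:.0f} seconds"
    if hours < 2:
        return f"{hours * 60:.0f} minutes"
    return f"{hours:.1f} hours"

# 99.99% leaves (1 - 0.9999) * 8,760 h = 0.876 h, i.e., about 53 minutes
for pct in (99.9999, 99.999, 99.99, 99.9, 99.0, 90.0):
    print(f"{pct:>8}% available -> {fmt(yearly_downtime_hours(pct))} of outage per year")
```

Each extra nine cuts downtime by a factor of ten, which is why the jump from 99.99 to 99.999 percent turns 53 minutes of outage into roughly five.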

Typically, the level of availability of a system or application is determined by the business impact an outage creates. In general, availability of applications/systems can be organized into three main categories:

  • Fault-Tolerant – Downtime of this application usually affects customers and will have a drastic effect on the company's image or profitability. A fault-tolerant application's goal is to be up 24x7x365 to minimize the potential losses incurred if the application goes down.
  • Enhanced Availability – Downtime of this application results in some losses for a company, but usually does not directly affect customers. An application that requires enhanced availability could affect the day-to-day activities of a company's employees, which will require that the application be up during most business hours.
  • Basic Availability – Downtime of this type of application will result in negligible losses and will affect no customers and very few company employees. An application that will require basic availability does not have an uptime requirement; however, it should be available to employees for productive use.

A cost-versus-benefit analysis is almost always part of determining what level of availability is required. Having a highly available system may be advantageous, but one question should always be kept in mind when deciding what level of availability to design for: Do the costs of implementing the HA system outweigh the benefits that could be achieved by the business?

Availability Challenges

There are a number of factors that can negatively impact a system's availability. All factors must be considered and weighed carefully when planning for the required level of availability. The factors that can decrease a system's uptime are generally called outages and can be classified into two basic types: unplanned and planned.

Unplanned outages are those factors that cannot be predetermined. Examples include: hardware failures, software failures, network failures, data backup failures, human error, and natural disasters.

Planned outages, or planned downtime, are foreseen factors that impact availability. They include: hardware upgrades, software upgrades, planned network changes, disaster recovery processes, database re-indexing and reorganization, batch-processing windows, and building and maintenance repair (may need to turn power off).

In order to properly address all of the potential outages that could impact a system's availability, a baseline availability model should be constructed to identify weaknesses that need to be addressed. All high-availability infrastructure and application/business components should be considered when assessing availability limitations.

Figure 2 provides an example of an availability model chart to help identify the HA challenges.

Component                                     Fault Type      Mean Time to Recovery   Mean Time Between Failures   Effect on System User
Data Center
  Power supply                                Power failure   1 hour                  7,000 hours                  No service
  Entire data center                          Site disaster   6 hours                 35 years                     No service
Database Server
  Hard drive (single drive of mirrored pair)  Unit failure    4 hours                 400,000 hours                No effect
  Hard drive (both drives of mirrored pair)   Unit failure    20 hours                >1,000,000 hours             No service
  Ethernet adapter                            Unit failure    3 hours                 200,000 hours                No service
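The mean-time-to-recovery (MTTR) and mean-time-between-failures (MTBF) figures in a chart like Figure 2 translate directly into a steady-state availability estimate via the standard formula A = MTBF / (MTBF + MTTR). A small sketch, using the example chart's figures (the function name is my own):

```python
def steady_state_availability(mtbf_hours, mttr_hours):
    """Fraction of time a component is up: uptime / (uptime + repair time)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Data-center power supply: MTBF 7,000 hours, MTTR 1 hour
a_power = steady_state_availability(7_000, 1)
# One drive of the mirrored pair: MTBF 400,000 hours, MTTR 4 hours
a_drive = steady_state_availability(400_000, 4)
print(f"power supply: {a_power:.5%}")  # roughly 99.986% available
print(f"single drive: {a_drive:.5%}")  # roughly 99.999% available
```

Filling in this estimate for each row of the model makes it easy to spot which component caps the availability of the system as a whole.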

Addressing High-Availability Challenges

There are many different techniques and solutions that can be applied to address the various high-availability challenges. Sometimes, a single solution can be used to address availability requirements; but, more often, a combination of solutions is required. The level of availability of an application/system and cost versus need should be taken into consideration when determining which of the following solutions are needed:

  • Component reliability – Ensures that each physical component within a system will be evaluated with a focus on reliability within each of the following areas: hardware, operating system/network operating system, application platforms and various support tools. Keep in mind that each system is only as reliable as its weakest component.
  • Redundancy – Ensures that redundancy is addressed across multiple geographies, within single sites and within single systems at a location. Redundancy is important because each component cannot really be 100 percent reliable.
  • Data replication and distribution – Minimizes downtime of databases with tools that perform faster reorganizations and backups. In addition to the tools, databases should be designed with redundancy in mind to ensure that data will be continuously available.
  • Database recovery – Designs recovery mechanism with small, adjustable recovery units.
  • Application design – Designs the application to allow batch processing in the background concurrent with online processing, giving special consideration to locking issues and techniques for reversing data errors without rollback or database recovery procedures.
  • Application testing – Tests for code errors and bugs to increase an application's availability. The architect of the application solution must keep high availability in mind when designing, developing and testing.
  • Special hardware on critical system components – Uses special hardware, such as uninterruptible power supplies (UPSs), for any critical component of a system (servers, hubs, routers, printers, etc.). RAID disk arrays can also be employed, offering redundancy and easy restorations. Tape and optical disk devices provide data backup and restoration.
  • Monitoring – Monitors and alerts operations personnel when something may be reaching a critical state. Monitoring can consist of "watching" process IDs to ensure the processes are up and running or "pinging" hardware to ensure that a system is up and functioning.
  • Security – Minimizes vulnerabilities within an architecture that could be exploited and could degrade performance or introduce viruses into the system that could take the system offline.
  • Change-control processes – Manages the number of changes introduced into the production environment. Research shows that nearly 80 percent of all production downtime results from operator or application errors. Therefore, a highly robust and stringent change-control process can help reduce the number of errors introduced into the production environment.
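The "weakest component" and redundancy points above can be quantified with textbook series/parallel availability math (an illustration of the principle, not a method from the article): required components in a chain multiply their availabilities together, while redundant components fail only when all of them fail at once.

```python
from math import prod

def series_availability(availabilities):
    """A chain of required components is up only when every one is up."""
    return prod(availabilities)

def parallel_availability(availabilities):
    """Redundant components are down only when all fail simultaneously."""
    return 1 - prod(1 - a for a in availabilities)

# A single 99% disk mirrored into a pair jumps to 99.99% availability ...
pair = parallel_availability([0.99, 0.99])
# ... but chaining it behind a 99.9% server and a 99.9% switch drags
# the end-to-end figure back down below any individual component
system = series_availability([pair, 0.999, 0.999])
print(f"mirrored pair: {pair:.4%}")
print(f"whole chain:   {system:.4%}")
```

The asymmetry is the point: redundancy buys nines cheaply for one component, but every non-redundant link in the chain takes some of them back.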

The High Availability Marketplace

Now that you have been introduced to the basics of HA, it helps to understand what kinds of services and products are offered in the high-availability marketplace. As the need for HA systems increases, companies need more comprehensive solutions. These solutions could include anything from defining procedures for IT staff to recover from a failed-over server to having a backup power generator in the event of a blackout. The current HA marketplace can be divided into four main categories:

  • Hardware: Vendors offering hardware solutions to achieve high availability. Some of these vendors also offer high-availability services to supplement their hardware offerings.
  • Software: Vendors offering software solutions (for instance, clustering and availability monitoring software) to achieve high availability.
  • Consultants, Integrators and Outsourcers: Vendors that provide consulting services in relation to high-availability design and planning.
  • Business Continuity Services Providers: Vendors that provide extensive service offerings encompassing assessment, design, operation, testing and implementation of business continuity and recovery services, including all hardware and software components as well as business processes. These vendors were formerly known as recovery service providers.

High availability is a complex matter that should be carefully considered for any critical application/system. Addressing availability should be integral to all stages of the development life cycle, from requirements gathering to operations readiness testing. If availability requirements are addressed early, disruption to a company's key applications and components can be minimized.

