In today's marketplace, computer systems perform functions that are key to business profitability, and downtime of critical systems can result in a significant loss of revenue and productivity. In the past, guaranteed recovery from a system disruption was good enough for customers who wanted protection from unforeseen IT interruptions. However, today's customers are not merely concerned about recovery; they want true business continuity. One very important aspect of business continuity is high availability (HA).
The definition of HA varies with the user and the business need. For instance, a financial institution's application may need to operate 24x7x365; the need in this case would be 100 percent system uptime, with no outages. However, less critical applications may only need to remain up during business hours and would, therefore, have less stringent high-availability needs. Either way, it is important to understand the basic concepts of high availability and how to prepare applications and systems for this growing need.
Highly available systems are designed to withstand any single failure and to minimize the unintended downtime of a system. A highly available system uses a combination of duplicate hardware, complex system configurations and special HA software which, utilized together, can handle many different failure scenarios. High availability is achieved not only by hardware redundancy, but also by paying special attention to facilities, software, network, application systems, people skills, procedures and business processes. These IT areas can be grouped into two main categories, which can then be further classified.
Infrastructure components include the following:
- Hardware (e.g., disks, servers, workstations)
- Software (e.g., operating systems, network operating systems, databases, etc.)
- Network components (e.g., routers, switches, etc.)
- Facilities (e.g., power, etc.)
- Infrastructure design (e.g., all components of an infrastructure have failover capability)
|Availability (%)||Cumulative Outage (based upon 365 days per year)|
|99||87 hours (3.6 days)|
|90||876 hours (36 days)|
Application/business components include the following:
- Application design (e.g., applications are designed to adequately address availability and the possible elimination of batch-processing windows)
- Operations tools and procedures (e.g., processes and procedures in place in case of a failure)
- Business processes (e.g., disaster recovery process is documented)
- Support personnel skills (e.g., operations personnel are properly trained to manage, monitor and configure systems)
Measuring High Availability
Industry measures availability as a "number of nines," representing the percentage of time an application or system is deemed available throughout an entire year. For instance, if a system or application has 99.99 percent availability over the course of a year, this means that a system has a downtime of 53 minutes within that year (see Figure 1). Fifty-three minutes of downtime may be acceptable for some applications; but for more critical applications, this amount of downtime may not be acceptable. Adding another "9" to the percentage of availability, making it 99.999 percent, increases the amount of time a system is available for productive use and reduces downtime to only five minutes within a one-year period.
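The arithmetic behind the "number of nines" can be sketched in a few lines of Python. This is a minimal illustration that assumes the same 365-day year used in the article; the function name `annual_downtime_minutes` is hypothetical and chosen only for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, implied by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Print the downtime for a few common availability levels.
    for pct in (90.0, 99.0, 99.99, 99.999):
        minutes = annual_downtime_minutes(pct)
        print(f"{pct}% available -> {minutes:.1f} minutes ({minutes / 60:.1f} hours) of downtime per year")
```

Run as a script, it reproduces the figures quoted above: roughly 53 minutes of downtime per year at 99.99 percent and roughly five minutes at 99.999 percent.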
Typically, the level of availability of a system or application is determined by the business impact an outage creates. In general, availability of applications/systems can be organized into three main categories:
- Fault-Tolerant: Downtime of this application usually affects customers and will have a drastic effect on the company's image or profitability. A fault-tolerant application's goal is to be up 24x7x365 to minimize the potential losses that will be incurred if the application goes down.
- Enhanced Availability: Downtime of this application results in some losses for a company, but usually does not directly affect customers. An application that requires enhanced availability could affect the day-to-day activities of a company's employees, so the application must be up during most business hours.
- Basic Availability: Downtime of this type of application will result in negligible losses and will affect no customers and very few company employees. An application that requires basic availability does not have a formal uptime requirement; however, it should be available to employees for productive use.
Cost-versus-benefit analysis is almost always taken into consideration when determining what level of availability is required. Having a highly available system may be advantageous, but one question should always be kept in mind when determining the level of availability to design for: Do the costs of implementing the HA system outweigh the benefits that could be achieved by the business?
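One rough way to frame that question is to compare the outage cost avoided by a higher availability level with the yearly cost of achieving it. The sketch below is illustrative only: the function names, the cost-per-hour figure and the HA budget are hypothetical inputs, not values from the article, and the model assumes that outage cost scales linearly with downtime.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a 365-day year


def expected_annual_outage_cost(availability_percent: float, cost_per_hour: float) -> float:
    """Estimated yearly loss from downtime at a given availability level."""
    downtime_hours = (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR
    return downtime_hours * cost_per_hour


def net_benefit_of_upgrade(current_percent: float, target_percent: float,
                           cost_per_hour: float, annual_ha_cost: float) -> float:
    """Outage cost avoided by moving to the target level, minus the yearly HA spend."""
    avoided = (expected_annual_outage_cost(current_percent, cost_per_hour)
               - expected_annual_outage_cost(target_percent, cost_per_hour))
    return avoided - annual_ha_cost


if __name__ == "__main__":
    # Hypothetical inputs: an outage costs 10,000 per hour and the HA solution costs 500,000 per year.
    benefit = net_benefit_of_upgrade(99.0, 99.99, cost_per_hour=10_000, annual_ha_cost=500_000)
    verdict = "justified" if benefit > 0 else "not justified"
    print(f"Net annual benefit: {benefit:,.0f} ({verdict} under these assumptions)")
```

A positive result suggests the investment pays for itself under the stated assumptions; a negative one suggests a lower availability tier may be the better fit.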
A number of factors can negatively impact a system's availability. All of them must be considered and weighed carefully when planning for the required level of availability. The factors that decrease a system's uptime are generally called outages and can be classified into two basic types: unplanned and planned outages.
Unplanned outages are those factors that cannot be predetermined. Examples include: hardware failures, software failures, network failures, data backup failures, human error, and natural disasters.
Planned outages, or planned downtime, are foreseen factors that impact availability. They include: hardware upgrades, software upgrades, planned network changes, disaster recovery processes, database re-indexing and reorganization, batch-processing windows, and building maintenance and repair (which may require turning the power off).
To properly address all of the potential outages that could impact a system's availability, a baseline availability model should be constructed to identify weaknesses that need to be addressed. All high-availability infrastructure and application/business components should be considered when assessing availability limitations.
Figure 2 provides an example of an availability model chart to help identify the HA challenges.
|Component||Fault Type||Mean Time to Recovery||Mean Time Between Failures||Effect on System User|
|Power Supply||Power Failure||1 hour||7,000 hours||No service|
|Entire Data Center||Site Disaster||6 hours||35 years||No service|
|Hard Drive (single drive within a mirrored pair)||Unit Failure||4 hours||400,000 hours||No effect|
|Hard Drive (both drives within a mirrored pair)||Unit Failure||20 hours||>1,000,000 hours||No service|
|Ethernet adapter||Unit failure||3 hours||200,000 hours||No service|
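The figures in such a model can be turned into an availability estimate. The sketch below applies the commonly used steady-state approximation, availability = MTBF / (MTBF + MTTR), which the article itself does not state. It also assumes that the listed components fail independently and that any "No service" fault takes the whole system down (a series arrangement). The single-drive row is left out because its effect is "No effect", and the ">1,000,000 hours" entry is entered as a lower bound.

```python
from math import prod

HOURS_PER_YEAR = 365 * 24

# (component, MTTR in hours, MTBF in hours) for the Figure 2 rows that cause "No service".
COMPONENTS = [
    ("Power supply", 1, 7_000),
    ("Entire data center", 6, 35 * HOURS_PER_YEAR),       # 35 years converted to hours
    ("Hard drive (both drives in mirrored pair)", 20, 1_000_000),  # table gives ">1,000,000"; lower bound used
    ("Ethernet adapter", 3, 200_000),
]


def steady_state_availability(mttr_hours: float, mtbf_hours: float) -> float:
    """Fraction of time a component is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series_availability(components) -> float:
    """Availability of a system that is down whenever any one component is down."""
    return prod(steady_state_availability(mttr, mtbf) for _, mttr, mtbf in components)


if __name__ == "__main__":
    for name, mttr, mtbf in COMPONENTS:
        print(f"{name}: {steady_state_availability(mttr, mtbf) * 100:.4f}%")
    print(f"Combined (series, independent failures): {series_availability(COMPONENTS) * 100:.4f}%")
```

Under these assumptions, the component with the lowest individual availability (here the power supply) dominates the combined figure, which is where a baseline model would point first.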
Addressing High-Availability Challenges
There are many different techniques and solutions that can be applied to address the various high-availability challenges. Sometimes, a single solution can be used to address availability requirements; but, more often, a combination of solutions is required. The level of availability of an application/system and cost versus need should be taken into consideration when determining which of the following solutions are needed:
- Component reliability: Ensures that each physical component within a system is evaluated with a focus on reliability in each of the following areas: hardware, operating system/network operating system, application platforms and various support tools. Keep in mind that each system is only as reliable as its weakest component.
- Redundancy: Ensures that redundancy is addressed across multiple geographies, within single sites and within single systems at a location. Redundancy is important because no component can be 100 percent reliable.
- Data replication and distribution: Minimizes downtime of databases with tools that perform faster reorganizations and backups. In addition to the tools, databases should be designed with redundancy in mind to ensure that data will be continuously available.
- Database recovery: Designs recovery mechanisms with small, adjustable recovery units.
- Application design: Designs the application to allow batch processing in the background concurrent with online processing, giving special consideration to locking issues and techniques for reversing data errors without rollback or database recovery procedures.
- Application testing: Tests for code errors and bugs to increase an application's availability. The architect of the application solution must keep high availability in mind when designing, developing and testing.
- Special hardware on critical system components: Uses special hardware, such as uninterruptible power supplies (UPSs), for any critical component of a system (servers, hubs, routers, printers, etc.). RAID disk arrays, which offer redundancy and easy restoration, can also be employed. Tape and optical disk devices also provide data backup and restoration.
- Monitoring: Monitors and alerts operations personnel when something may be reaching a critical state. Monitoring can consist of "watching" process IDs to ensure the processes are up and running or "pinging" hardware to ensure that a system is up and functioning.
- Security: Minimizes vulnerabilities within an architecture that could be exploited to degrade performance or to introduce viruses that could take the system offline.
- Change-control processes: Manages the number of changes introduced into the production environment. Research shows that nearly 80 percent of all production downtime results from operator or application errors. Therefore, a highly robust and stringent change-control process can help reduce the number of errors introduced into the production environment.
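The monitoring techniques listed above (watching process IDs and probing hardware) can be sketched as follows. This is an illustration under stated assumptions, not a production monitor: the function names are hypothetical, the process check relies on POSIX signal semantics, and a TCP connection attempt stands in for an ICMP "ping", which normally requires elevated privileges.

```python
import os
import socket


def process_is_running(pid: int) -> bool:
    """POSIX-only check: signal 0 tests for the process without actually signalling it."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False   # no process with this ID
    except PermissionError:
        return True    # the process exists but belongs to another user
    return True


def port_is_reachable(host: str, port: int, timeout_seconds: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        return False   # refused, timed out, or host unreachable


if __name__ == "__main__":
    # Hypothetical targets: this script's own process and a local service on port 22.
    print("own process running:", process_is_running(os.getpid()))
    print("localhost:22 reachable:", port_is_reachable("127.0.0.1", 22))
```

In practice, checks like these would run on a schedule and feed an alerting mechanism so that operations personnel are notified before a component reaches a critical state.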
The High Availability Marketplace
Now that you have been introduced to the basics of HA, you should understand what kinds of services and products are offered in the high-availability marketplace. As the need for HA systems increases, companies need more comprehensive solutions. These solutions could include anything from defining procedures for IT staff to recover from a failed-over server to having a backup power generator in the event of a blackout. The current HA marketplace can be divided into four main categories:
- Hardware: Vendors offering hardware solutions to achieve high availability. Some of these vendors also offer high-availability services to supplement their hardware offerings.
- Software: Vendors offering software solutions (for instance, clustering and availability monitoring software) to achieve high availability.
- Consultants, Integrators and Outsourcers: Vendors that provide consulting services in relation to high-availability design and planning.
- Business Continuity Services Providers: Vendors that provide extensive service offerings encompassing assessment, design, operation, testing and implementation of business continuity and recovery services, including all hardware and software components as well as business processes. These vendors were formerly known as recovery service providers.
High availability is a complex matter that should be carefully considered for any critical application/system. Addressing availability should be integral to all stages of the development life cycle, from requirements gathering to operations readiness testing. If availability requirements are addressed early, disruption to a company's key applications and components can be minimized.