In today's dynamic business environment, organizations of all sizes are finding it more important than ever to maintain access to critical applications and data when hardware or software failures occur, even in the event of a natural or man-made disaster. All too often, however, failures do occur, resulting in nonavailability of the applications when the computing system stops. Whatever the cause, every computing system is vulnerable to failing at some point, and a strong business continuity plan is crucial to any organization. 
For those who have considered protection against computing failures, a number of solutions have been proposed. They range from backup schemes, file replication and hard disk mirroring to more dynamic solutions that use fully redundant hardware systems running in lockstep to automatically fail over a computing system. Solutions that can dynamically restore the application are sometimes referred to as high availability solutions, where restoration time can be less than twelve hours in the event of a system failure. There are also continuous availability solutions that keep running even in the event of a catastrophic hardware failure. So how do you decide whether to use a high availability solution or a continuous availability solution? And which will best meet your business needs?

Why You Need a High Availability Solution in Place

The availability of computing systems is an important topic for both data center personnel and those running solutions outside the data center. Availability simply means keeping the computing system running so it can perform the tasks for which it was designed. All too often, however, failures occur, and the application becomes unavailable when the computing system stops. There are three primary reasons a server will fail:

  • Hardware failures. A critical hardware component, such as a CPU, hard disk, motherboard or power supply, fails. This is by far the most common reason a computing system will fail.
  • Software bugs. A software failure, such as a memory leak or functional fault, can bring down a system. Typically the software manufacturer has addressed the most important bugs in its system, and most remaining bugs are small errors that will not bring down a server. Therefore, software faults that stop a server from functioning are rare.
  • Facility disaster. Even if all the hardware and software are working fine, the system will go down if it is destroyed by fire, water, natural disaster or man-made disaster.

How to Achieve the Highest High Availability

High availability solutions that restore a system are typically based on server clustering. They consist of two or more servers (physical and/or virtual) that run the same configuration and are clustered together by software that keeps the application data updated on both servers and restarts the application on the backup server in the event of a failure of any kind on the primary server.

Fast-Recovery High Availability Solutions Benefits and Options

  • Simple server failover. This approach provides application recovery by automatically assigning the identity (i.e., host name and IP address) of a primary server to a standby server and restarting the application on the standby server. Generally, the standby server must be configured identically to the primary server, including operating system and application, but it runs in passive mode, waiting to take over the application workload through automatic failover when the primary server becomes unavailable.
  • High availability clustering. This approach typically offers recovery of critical applications organized as groups of resources that include the target application, virtual server identity and data resources. The resource groups are configured across multiple servers within the cluster that provide the necessary resources to run the target application and data. In the event of a primary server failure, the cluster management software will automatically fail over all resource groups from the primary server to the standby server. The typical failover process for a resource group removes the group's virtual server identity from the primary server, reassigns it to the standby server and then starts the group's target application on the standby server.
  • High performance clustering. In this approach, also known as load-balancing clustering, a group of servers has identical components so each server can take on a portion of the overall workload. Typically, the application components can be added or removed dynamically to accommodate the overall application workload. This architecture is inherently highly available and resistant to single server failures. However, not many applications can run in such an architecture because they must be designed for it from scratch.
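
The resource-group failover sequence described above (remove the virtual identity from the primary, reassign it to the standby, start the application there) can be sketched in a few lines of Python. This is only an illustrative model, not any vendor's cluster software: the names Server, ResourceGroup, bring_online, fail_over and check_and_fail_over are hypothetical, the healthy flag stands in for whatever health check a real product uses, and data resources are left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical cluster node; `healthy` stands in for a real health check."""
    name: str
    healthy: bool = True
    identities: set = field(default_factory=set)  # (host name, IP address) pairs held
    running: set = field(default_factory=set)     # applications started on this node


@dataclass(frozen=True)
class ResourceGroup:
    """Target application plus the virtual server identity clients connect to."""
    app: str
    identity: tuple  # (virtual host name, virtual IP address)


def bring_online(group, server):
    """Assign the group's identity to a server and start its application there."""
    server.identities.add(group.identity)
    server.running.add(group.app)


def fail_over(group, primary, standby):
    """Move one resource group from the primary to the standby."""
    primary.identities.discard(group.identity)  # remove identity from the primary
    primary.running.discard(group.app)          # the application is no longer served there
    bring_online(group, standby)                # reassign identity, then start the app


def check_and_fail_over(groups, primary, standby):
    """Fail over every group if the primary is down; return True if a failover ran."""
    if primary.healthy:
        return False
    if not standby.healthy:
        raise RuntimeError("no healthy standby server available")
    for group in groups:
        fail_over(group, primary, standby)
    return True


if __name__ == "__main__":
    primary, standby = Server("node-a"), Server("node-b")
    orders = ResourceGroup(app="orders-db", identity=("orders.example.internal", "10.0.0.50"))
    bring_online(orders, primary)

    primary.healthy = False  # simulate a primary server failure
    if check_and_fail_over([orders], primary, standby):
        print(f"{orders.app} now runs on {standby.name}")
```

Note that the sketch moves only the identity and the application. Anything the application held in memory on the failed server is not carried over, which is the source of the downtime and data-loss limitations discussed in the next section.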

High Availability Clusters Limitations

  • Downtime doesn’t go away. Some downtime will be experienced when the server fails. All users who were connected to the failed server will be disconnected, and any data in memory that had not been written to the database will be lost. Time will be required to restart the application even if the system is using high availability cluster software, and users will need to wait for the backup system to start, log in again and restart the work they were performing.
  • Restoration cost doesn’t go away. When a server fails, the cost of restoring the failed server is not minimal. Though the server may be under a hardware warranty where the failed part is replaced without cost, the cost of getting the server back up and running once the power is turned back on can be significant. Typically, the server will be able to boot, assuming the hard disks were mirrored with an appropriate RAID configuration. However, if the server is in a remote location, there could be travel costs. Also, time is required to reset the cluster software and, depending on the configuration, another failover might be required to put the primary server back into production. These costs are important to consider for software cluster high availability solutions.

How to Achieve the Highest Continuous Availability

Continuous availability computing is also referred to as fault-tolerant computing. Fault tolerance can be defined as the ability of a system to continue to operate in the event of one or more failures. Both hardware and software fault-tolerant continuous availability solutions are on the market. Fault-tolerant systems are generally classified as having at least 99.999% uptime, which translates to approximately five minutes of average downtime per year. The only way a fault-tolerant server can go down is if its second redundant module fails during the repair period of the first failure. This is extremely rare.

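The "five minutes per year" figure follows directly from the uptime percentage. As a quick check, the short Python sketch below converts an availability percentage into average downtime per year; the function name and the use of a 365.25-day year are assumptions made for illustration.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes, averaging in leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the average yearly downtime implied by an uptime percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% uptime -> {minutes:.2f} minutes of downtime per year")
```

For 99.999% uptime the result is about 5.26 minutes per year, while 99.9% uptime allows roughly 526 minutes, or close to nine hours.
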
Hardware Non-Stop Continuous Availability Benefits and Options

  • Hardware fault tolerant. Hardware fault-tolerant solutions have a proven track record. Airplanes, for instance, have many redundant systems to provide continuous functioning. If one or more components fail, the system is still able to continue functioning; human lives depend on it. In computing, mainframe systems fall in the classification of hardware fault tolerant through their redundant parallel processing capabilities. In the event of a hardware fault, the redundant server module will continue to run the application. While the system is running, the entire failed module can be pulled out and replaced under warranty no matter which part in the server module failed. Once the replacement module is inserted, all modules will automatically re-sync in lockstep within a few minutes. No additional configuration or management is required, making this solution the easiest to repair, with no reboot requirements or planned downtime for any reason.
  • Software non-stop continuous availability. Software fault-tolerant solutions have also become available. These complex solutions perform the same functions as a hardware fault-tolerant solution but do so in software. Software fault-tolerant solutions must run at the basic input/output system (BIOS) level of a machine in order to stay in lockstep. To simplify the engineering, most software fault-tolerant solutions are built on top of a virtualization hypervisor, integrating with the BIOS of the virtual machine guest operating system. The fault-tolerance software creates virtual machine “pairs” that run in lockstep, essentially mirroring the execution state of the virtual machine. To the external world, they appear as one instance, but in reality they are fully redundant instances. If something happens to the primary virtual machine, such as a server or virtual machine failure, the secondary (or “shadow”) virtual machine takes over immediately. As is true of all nonstop fault-tolerant technology, failover is instant and occurs without any loss of data, network connections or transactions.
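
The idea of a lockstep pair that looks like a single instance from the outside can be illustrated with a toy model. The Python sketch below is a deliberate simplification, assuming a deterministic "virtual machine" whose whole state is a running total; real fault-tolerant products mirror execution at the hypervisor or hardware level, and the names ToyVM and LockstepPair are hypothetical.

```python
class ToyVM:
    """Toy deterministic 'virtual machine': its entire state is a running total."""

    def __init__(self):
        self.total = 0
        self.alive = True

    def execute(self, value: int) -> int:
        self.total += value
        return self.total


class LockstepPair:
    """Presents two replicas as one instance by feeding both the same inputs."""

    def __init__(self):
        self.primary = ToyVM()
        self.shadow = ToyVM()

    def submit(self, value: int) -> int:
        """Apply one input to every live replica and return the primary's result."""
        if not self.primary.alive:
            if self.shadow is None or not self.shadow.alive:
                raise RuntimeError("both replicas have failed")
            # Promote the shadow. It has seen every earlier input, so no state is lost.
            self.primary, self.shadow = self.shadow, None
        result = self.primary.execute(value)
        if self.shadow is not None and self.shadow.alive:
            self.shadow.execute(value)  # keep the shadow in step with the primary
        return result


if __name__ == "__main__":
    pair = LockstepPair()
    pair.submit(5)
    pair.submit(7)
    pair.primary.alive = False   # simulate failure of the primary
    print(pair.submit(1))        # the shadow answers with the state intact
```

Because the shadow processed the same inputs as the primary, the caller sees the same running total after the failure as it would have seen without one. After promotion the pair runs unprotected until a new shadow is created, which mirrors the repair period mentioned earlier.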

Fault-Tolerant Solutions Limitations

To date, software fault-tolerant solutions have not gained wide acceptance. This is primarily due to technical limitations:

  • Typically limited to single core processing.
  • Requires dual operating system and application license purchases.
  • Use of a hypervisor, possibly adding complexity and cost.
  • Performance overhead on the Xeon processors.
  • Added software configuration services may be required.
  • Added annual software maintenance fees.
  • May require a SAN for implementation, adding cost.

What’s the Best Availability Solution for Your Business?

For single site protection, continuous availability solutions often provide nonstop business continuity. On the other hand, high availability solutions, like clustering products, are ideal for disaster recovery if you are looking to protect your business from disruption due to system outages caused by either unplanned events, such as fire, terrorism or natural disaster, or planned situations, such as data migrations and system upgrades.
No matter what availability solution suits your business, one thing is certain: a strong business continuity plan is essential to all organizations. Maintaining access to critical applications and data when hardware or software failures occur, or even in the event of a natural or man-made disaster, is essential for any business, whether your mission is saving lives or providing a product or service. Protect your computing system against failure by putting a business continuity plan in place and always having a safety net.