Designing a system to support high-availability applications has evolved from art to science; and barring total geographic blackouts, any required level of availability can be designed and implemented, gated only by the time and the cost to develop and maintain it. Highly available systems require constant attention to the system hardware and network, the application itself and its ongoing management. Senior executives funding a new or extended application should expect initial and periodic justification for its availability requirements and demand ROI reports on its achievements.
As enterprises became dependent on computer systems, an entire management practice emerged, focused on improving IT availability and its impact on business productivity. Surprisingly, or not -- given the cost of computer equipment -- the original yardstick was pointed at the central processor itself and the peripherals surrounding it. CIOs - a title that wouldn't exist for another 25 years - commonly reported system uptime in excess of 90%, although online users were not experiencing service close to that.
It was much later, after online systems and Web access dominated IT, that the impact of system downtime on human productivity could be measured. The human factors gurus noticed that response time delays to user inquiries of more than two seconds broke the inquirer's concentration, requiring a mental reset with lost productivity measured in minutes. System downtime longer than approximately 20 minutes was actually reflected in users changing their task at hand, disrupting processes and yielding effective outages measured in hours when seen from the user's perspective.
Defining High Availability
Availability may be defined as a ratio between MTBF (mean time between failures) and MTTR (mean time to repair). It is calculated much like a probability of failure, using the following formula: Availability = MTBF / (MTBF+MTTR)
In an IT system, the application availability is the result of the aggregation of all the availability factors of all architectural components supporting the application. This aggregation is different whether the components are serial or parallel, which is why we tend to avoid single points of failure (serial) and double or cluster the critical components (parallel).
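The formula and both aggregation modes can be sketched in a few lines of Python; the component figures below (MTBF, MTTR, per-component availabilities) are invented for illustration:

```python
def availability(mtbf: float, mttr: float) -> float:
    """Availability as the fraction of time a component is up."""
    return mtbf / (mtbf + mttr)

def serial(*parts: float) -> float:
    """All components must be up: availabilities multiply."""
    a = 1.0
    for p in parts:
        a *= p
    return a

def parallel(*parts: float) -> float:
    """Service survives if any one component is up:
    unavailabilities multiply."""
    u = 1.0
    for p in parts:
        u *= (1.0 - p)
    return 1.0 - u

# A component with a 10,000-hour MTBF and a 4-hour MTTR:
a = availability(10_000, 4)          # ~0.9996

# A serial chain of 200 five-nines components:
chain = serial(*([0.99999] * 200))   # ~0.998

# The same component doubled in parallel:
pair = parallel(0.9996, 0.9996)      # ~0.99999984
```

Note how doubling a component in parallel improves its availability by orders of magnitude, while chaining components in series erodes it - the numerical argument behind avoiding single points of failure.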
Most of the effort in high-availability systems has been focused on hardware and system failures and the capability to switch from a failing component to a healthy one so that the service to the end user will not be interrupted for too long. In other words, the typical sequence of problem resolution - detect, analyze, correct - becomes detect, correct and analyze. The trend has been to suppress the repair factor to maximize availability. In a high-availability (HA) system design, the focus is to recognize a failure and shift critical applications to a working equivalent so the user will not even be aware of the hiccup.
The touchstone for engineering design today is 99.999% planned availability, or five minutes of unplanned downtime in a normal year. If 200 components spec'd for 99.999% availability must all work successfully for the application to perform, the combined system availability falls to roughly 99.8% (0.99999 to the 200th power) - more than 17 hours of downtime per year, over 200 times the five-minute target. Users would quickly find such a system unworkable, thus the focus on uninterrupted service.
Improve Systems Availability
IT shops should decide when to use HA strategies by: (1) gathering data on the business impact of application downtime, (2) analyzing the causes and likelihood of planned and unplanned downtime, (3) assessing which HA design strategies can affect which causes and (4) evaluating the extra cost of HA against its resulting benefits.
HA infrastructure ensures that an application has constant availability of network, processors, disks, memory, etc., such that a failure of one of these components is transparent to the application. A risk analysis identifies important functions and assets critical to HA and subsequently establishes the probability of a breakdown in them. Once the risk is established, objectives and strategies to eliminate avoidable risks and minimize impacts of unavoidable risks can be set. For most hardware, middleware and operating systems, this means duplication and physical separation of IT systems, reducing single points of failure, and clustering and coupling applications between multiple systems. Each of these must be considered individually and as an integral hardware component of the total solution.
Beyond designing the hardware for HA, however, there must be significant thought given to the application itself, the underlying database and management issues ranging from simple reporting to complex root cause analysis.
Infrastructure Design for High Availability
The Storage Subsystem
Data is the lifeblood that runs through the modern business application, and its availability and integrity must be preserved to maintain an application's value to the enterprise. It is possible to configure the underlying storage subsystem to meet as stringent an availability requirement as the enterprise's budget will support. In Figure 1, data is mirrored within the data center by an intelligent switch topology in the configuration on the left. This allows failover and recovery outboard of the server transparent to the application. With appropriate RAID setup in the primary and failover arrays, and alternate pathing in the fabric of the storage network, the entire data center would have to fail before the application would be interrupted for an unplanned outage. In addition, existence of the alternate data stream and pathing allows operations to disable parts of the topology for upgrades and maintenance to enhance the ongoing availability of the total system.
Figure 1 Data Mirroring
SOURCE: FORRESTER RESEARCH.
In this configuration, data may be synchronously replicated either by the server or the intelligent, in-band switch. This is known as a Time 0 or T0 copy because the multiple versions always look exactly alike.
The Processor and its OS
Clustered server architectures provide the benefits of both high availability and performance scalability. Cluster packaging comes in many different forms: (1) multiple standalone servers (with high-speed cluster interconnects), (2) multiple servers in a box (this would include new high-density servers as a category), (3) multiple partitions within an SMP or (4) any combination of these. A single-system view is an important component of a cluster high-availability environment. As nodes are added to a cluster, the requirement to manage distributed cluster resources as if managing a single server becomes a critical differentiator in the selection of a high-availability system. As shown in the left configuration in Figure 2, regardless of how the storage pool is connected to the processor cluster, the workload of a failing node will be taken over by another processor in the cluster.
Figure 2 Failing Over
SOURCE: FORRESTER RESEARCH. IN THIS FIGURE, Tn REFERS TO THE TIME DELAY REQUIRED TO SYNCHRONIZE THE REMOTE COPY AFTER THE FAILURE, IN CONTRAST TO T0 WHERE THE SWITCH MAY BE MADE INSTANTANEOUSLY.
Depending on the operating system sophistication and the storage setup, Figure 2 also suggests that there are extremes in availability. In the center case, the primary storage device fails, but either the IO subsystem or the intelligent switch has mirrored the data to another storage device. Thus, the cutover is instantaneous and transparent to the application. In the right-hand case, a classic example of zero consideration given to availability, the server fails, and the data and the application must be ported to a totally different system. In this case, the application is down n hours, even if the data has been continuously, asynchronously copied to the second system all along.
In many ways, HA is another facet of autonomic or organic IT. The organic IT data center (see Figure 3) and its applications are designed such that IT resources are diverted to the most critical business applications as needed and the application is flexible enough to expand and contract to meet the need. An organic application allows failover to and failback from another node in a cluster. In an "active-passive" configuration, for instance, one copy of an application runs on the primary machine while a secondary instance on a second machine is idle until failover. In an "active-active" configuration, the application runs on multiple servers simultaneously with different or shared databases, allowing organizations with more constrained hardware configurations to enable failover to or from any node without having to set aside expensive hardware.
Figure 3 The Organic IT Data Center
SOURCE: FORRESTER RESEARCH.
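The active-passive decision logic can be illustrated with a toy heartbeat monitor in Python; the node names, heartbeat log and miss threshold are all invented for the sketch:

```python
def failover_target(heartbeat_log, primary, standby, misses_allowed=3):
    """Active-passive failover decision: promote the standby after
    `misses_allowed` consecutive missed heartbeats from the primary.

    heartbeat_log is a sequence of booleans, one per heartbeat
    interval: True if the primary answered, False if it did not.
    """
    misses = 0
    for answered in heartbeat_log:
        misses = 0 if answered else misses + 1
        if misses >= misses_allowed:
            return standby  # the idle secondary takes over
    return primary

# Two transient misses do not trigger failover...
assert failover_target([True, False, False, True], "db-a", "db-b") == "db-a"
# ...but three consecutive misses do.
assert failover_target([True, False, False, False], "db-a", "db-b") == "db-b"
```

Requiring several consecutive misses is the usual guard against a transient network glitch triggering an unnecessary - and expensive - failover.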
The Network's Contribution
The network serves several critical functions for every application, delivering requests to servers and results back to users, amalgamating data distributed throughout the enterprise and, in case of data center failure, moving critical applications to their backup sites. Beyond providing detailed reporting required of every IT component, the network itself is subject to path failure and should be laid out such that it can isolate broken paths.
HA Systems Force the Enterprise to Focus on Business Value
Emerging applications such as storage over IP and IP telephony are not going to take hold until router vendors can deliver more highly available IP networks. Resilient routing primarily aims to reduce router convergence times. (Convergence in this context should not be confused with the combining of different information types onto a single network; rather, it refers to what happens in a routed network after a link failure. Routers detect the change, notify affected routers via routing protocols such as OSPF and adjust their routing tables accordingly.) Shortening router convergence times is a critical technique for improving the overall uptime of IP networks. The faster a router knows that it has to choose an alternate path around an unavailable network connection, the less data it will unsuccessfully attempt to forward via that route.
Resilient routing's key features are called non-stop forwarding (NSF) and stateful switchover (SSO). NSF allows a router's route processing engine to be rebooted while its packet-forwarding engine continues to send packets along the last known available route. Adjacent routers are not notified of the restart and thus do not update their routing tables. This so-called hitless restart yields increased network stability, efficiency and ultimately uptime. SSO allows an active route processing engine to pass the state information of key routing and interface protocols to a standby engine during failover, shortening the time it takes the standby engine to learn and converge routes.
Application Contributions to Availability
The Application Operation and Design
Regardless of an application's SLA or underlying infrastructure, architects can improve its availability by applying the HA design strategies that follow, even on non-HA infrastructure. As noted earlier, IT shops should decide when to use these strategies based on cost/benefit evaluations.
REDUNDANCY Each element of an HA application must have a backup that can take over if the primary fails; and the design must account for transactions that were in-flight when a failure occurred.
RECOVERABLE DESIGN Any individual element is more available if it is stateless; but the application as a whole typically is stateful, and state must be preserved across potential failures.
FAILURE DETECTION To be recoverable, an application may have to fail gracefully by saving transaction information, notifying a user or administrator, and performing appropriate application cleanup.
HEARTBEAT CONTROL An HA application must be monitored in real time to ensure it is still running and that there is automatic failover if it is not running.
OPERATIONS MANAGEMENT INTEGRATION Applications may incorporate management APIs to raise alerts, enable full monitoring and management, and write error logs that may also be monitored.
CONNECTION MANAGEMENT The client side should be designed to handle connection failures and automatically establish connections to alternate providers.
TRANSACTION-AWARE DESIGN HA application design must explicitly anticipate handling of and recovery from transaction failures.
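As a minimal sketch of the connection-management strategy above - the endpoint addresses, timeout and retry counts are hypothetical:

```python
import socket

def connect_with_failover(endpoints, timeout=2.0, retries=1):
    """Try each (host, port) endpoint in order, cycling through the
    whole list `retries + 1` times before giving up."""
    last_err = None
    for _ in range(retries + 1):
        for host, port in endpoints:
            try:
                return socket.create_connection((host, port), timeout=timeout)
            except OSError as err:
                last_err = err  # remember and move on to the next provider
    raise ConnectionError(f"all endpoints failed: {last_err}")
```

A real client would also recover in-flight state after reconnecting, per the transaction-aware design point.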
The Underlying Database
More than 80% of all DBMS HA implementations are based on the failover clustering architecture (see Figure 4), which remains the most reliable solution to ensure availability of mission-critical databases. Oracle has been the most innovative, offering Real Application Clusters (RAC), focused on minimizing database outages. Besides failover clustering and RAC, some enterprises use database replication for HA, but it offers manageability challenges. Microsoft and IBM are likely to offer enhanced HA offerings in the next few years, focusing on integrated, simplified and highly improved database availability solutions. Although failover clustering, RAC and database replication are viable HA solutions, each requires: (1) careful planning, (2) additional administrative efforts, (3) operational policies and procedures and (4) end-to-end integrated testing to ensure successful deployment and continued availability.
Figure 4 Failover Clustering for DBMS
As database applications become more demanding, so do SLAs. In a recent Forrester poll, the number one cause of DBMS outages was hardware-related, including disk failure, network failure, operating system failure and resource allocation failure. DBMS failover clustering solutions overcome most local hardware-related issues by failing over the database instance to another server (see Figure 4 for the failover architecture). Typically, a failover clustering solution comprises identical: (1) hardware servers, (2) operating system, (3) patch levels and (4) network cards, connected to a shared storage device. However, careful planning, execution and administration are equally important for a successful HA implementation.
Based on customer feedback, failover clustering solutions can be an operational nightmare. Some customers have experienced frequent outages, resulting in more application downtime with HA than at sites without such a solution. At one site, changes made by a DBA to a configuration file caused the server to crash and failover to occur without any identifiable root cause. At another site, lack of disk space caused a failover to endlessly ping-pong between two nodes. While most of these were not technology issues, the importance of education and understanding the clustering environment cannot be overstated.
Oracle's RAC offers an enhanced HA solution for Oracle databases, using a shared-disk architecture supporting multiple nodes. RAC's advantage is that if any node fails, the application can still function, with the surviving nodes continuing to accept connections and taking over the failed node's connections. With RAC, both scalability and availability - mandatory for mission-critical databases - are achievable. Both Microsoft and IBM offer clustering using data partitioning via a shared-nothing architecture, useful for scalability. However, an HA configuration requires balancing data across all nodes because any node failure could impact the entire application.
Some enterprises also use the database replication feature to support HA for their databases. With database replication, it is possible to support multimaster or one-way replication between two or more databases. Typically, data can be synchronized between the two sites in real time or close to real time. Although database replication can be used for HA, it often requires a great deal of time and effort to manage such environments compared with the failover clustering solution.
Do not deploy an HA solution for all production databases, as managing a clustered environment tends to be complex. Forrester has found that only five percent of all databases in any large enterprise are classified as mission-critical. The most commonly used HA solution for DBMSs is failover clustering. Because deploying an HA solution remains complex, IT organizations should follow strict policies and procedures to ensure availability of mission-critical databases.
Some of the best practices on HA for DBMSs include the following:
- To achieve higher ROI, use an active-active failover clustering implementation rather than active-passive.
- Form a cluster support team comprising personnel from various IT organizations including system administration, networking, application and database. Ensure that any changes made to the cluster are first discussed by the cluster support team, including changes to the operating system, security and service patches, database configuration and upgrades, and hardware changes such as network card or cable and storage device.
- Before deploying any clustered solution, ensure that enough testing has been performed. Focus on integrated testing of the entire technology stack, including application, network, server and database.
- To improve availability of databases in a non-clustered environment, minimize changes to hardware and software, do not run other applications or workloads on the database server, minimize log-ins to the server and use a DBMS release one generation below the current release for greater stability.
- Use SAN storage instead of direct-connected storage for greater flexibility and lower cost. With SAN or NAS, storage can be dynamically allocated to the database servers that require it, offering greater flexibility, manageability and lower cost, especially when dealing with large databases.
- Always document the HA operational process and procedure clearly. Involve all of the groups (including networking, application, server and database) in defining the failover and fail-back procedure.
Managing HA Systems
Using Management Reports to Enhance Availability
Availability is one of the parameters that is measured and reported in many service-level management products. Availability is usually assessed using several perspectives, either from a technology standpoint (network, servers, etc.), from a geography standpoint or from a process/organization standpoint. Very often, these reports are built using a customization GUI that allows the grouping of components and potentially the capability to drill down on one of them to look at detailed information. Most of the end-to-end frameworks provide a service-level management product based on fault management. Products designed for network service assurance can also provide the basic availability metrics of a distributed infrastructure. Availability can also be measured by instrumenting the end-user workstations with an active or passive response-time measurement agent.
The measures provided in these reports are usually averages, which yield a relatively poor indication of process robustness. Availability and other infrastructure performance results are linked both to good technology and to the robustness of the different management processes supporting it. The basis for this assertion: where a poor process is usually the victim of poor technology, a robust process quickly identifies problems and issues, provides an avenue to correct them and eventually improves the technology used. However, service-level management, as usually reported, provides average values that are not significant in the evaluation of process robustness. An average alone gives no indication of the dispersion of results around the mean. In other words, if the service swings between widely different levels that cancel each other out, a good average can be reported while the end-user perception is of poor service.
This has been addressed in different areas of quality management, and the answer is to use a statistical process control methodology that not only provides an average value (how good is the service) but also an estimate of the dispersion or standard deviation (how consistent is the service). The most famous of these approaches is found in manufacturing (Six Sigma), but the concept of statistical process control has also found its way into the capability maturity model (CMM) for software development.
To evaluate the availability of an application to end users and provide an incentive for continuous improvement of the process of managing the different technologies supporting that application, the following approach can be used:
- Implement a service-level management product capable of reporting by application, using groupings of the different technologies used to deliver the service to the different geographies. This should provide a core grouping of network, server and database components, and a series of delivery groupings by geography.
- Determine, for each type of technology, a service level and a tolerance level (for example, 99% availability with a tolerance of -5%).
- For each grouping of components, report the average value and the number of standard deviations contained within the tolerance. This will be the "sigma number" for this availability value.
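The reporting step above can be sketched as follows; the sample readings, target and tolerance are invented for illustration:

```python
import statistics

def sigma_number(samples, target, tolerance):
    """Return (mean, sigma): the average availability and the number
    of standard deviations between the mean and the tolerance floor
    (target + tolerance; tolerance is negative, e.g. -0.05)."""
    mean = statistics.mean(samples)
    floor = target + tolerance
    return mean, (mean - floor) / statistics.stdev(samples)

# Daily availability readings for one grouping of components:
readings = [0.991, 0.989, 0.995, 0.990, 0.993]
mean, sigma = sigma_number(readings, target=0.99, tolerance=-0.05)
# A high sigma means the service sits consistently inside tolerance;
# a low sigma flags wide dispersion even when the average looks good.
```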
From these reports, average availability levels can be reported by application, geography and technology. Typically, the average value will be an indication of the technology capability and the sigma value an indication of the process robustness. Using this approach, four levels of maturity can be assessed, and accountability and incentives attached to the improvement measured over time:
- The service is off target and inconsistent in its delivery. This is the lowest point.
- The service is on target, but inconsistent in its delivery. The processes lack robustness, and there is no guarantee of results.
- The service is off target, but consistently so. The process is robust, but the technology needs improvement.
- The service is on target and consistently so. The process and the technology are mature.
The traditional network and system management disciplines, based on the venerable simple network management protocol (SNMP) or common management information protocol (CMIP) model, include the following disciplines:
- Fault management
- Configuration management
- Performance management
- Accounting management
- Security management
A key point historically has been the relative inefficiency of SNMP and management information base (MIB) agents for root cause analysis. The MIB, the list of parameters accessible through SNMP, is a mixed bag of performance and configuration parameters. Through this, fault management products reported events, and problem diagnosis could be difficult. The major progress in the area came from support for "layer 2 discovery" (i.e., understanding how a managed object relates to the infrastructure by gathering data beyond the standard MIB). Analysis tools build a "topology" of the network and of the distributed objects automatically. Thus, an intelligent agent can determine the root cause of problems and their impact on HA applications.
Different agent techniques are used to perform root cause analysis:
- Rule engine, which has been the basis of all "old technology" agents. This is a good technique for small environments, due to the limited number of possibilities.
- Model-based reasoning, which uses a model of relations between components to determine which element is at fault.
- Codebooks, which use event filtering to determine "problem signatures" and compare them to a coded knowledge base to determine the root cause of problems.
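As a toy illustration of the codebook technique - the event names, signatures and causes below are all invented:

```python
# Each signature is the set of events a known fault is coded to raise.
CODEBOOK = {
    frozenset({"link_down", "ping_loss", "app_timeout"}): "router failure",
    frozenset({"disk_error", "io_retry"}): "failing disk",
    frozenset({"app_timeout"}): "application hang",
}

def root_cause(observed_events):
    """Match the observed events against every coded signature and
    return the cause with the largest overlap."""
    best_cause, best_overlap = "unknown", 0
    for signature, cause in CODEBOOK.items():
        overlap = len(signature & set(observed_events))
        if overlap > best_overlap:
            best_cause, best_overlap = cause, overlap
    return best_cause

assert root_cause({"link_down", "ping_loss"}) == "router failure"
assert root_cause({"fan_failure"}) == "unknown"
```

Production codebook engines score partial, noisy matches far more carefully; the point here is only the signature-lookup idea.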
The advantage of these techniques is that once the root cause has been identified, its impact on applications can be assessed and the problem resolved. This type of analysis leads naturally into availability reviews.
Designing Availability Reviews
High-availability reviews identify and mitigate exposures to system failures. It is critical to conduct the pre-production review early in the cycle to identify possible flaws in HA design, and to conduct operational reviews periodically to ensure that the cumulative effect of system changes hasn't eroded the HA design.
APPLICATION ARCHITECTURE REVIEW Was the application itself designed to reduce or eliminate single points of failure? Are the selected platforms and middleware operationally sound? Is operations involved early in the design phase and the ongoing change control review practice?
TECHNICAL INFRASTRUCTURE REVIEW This includes hardware configurations, redundant data paths and links, clustering technologies, site hardening and environmental controls, etc.
SOFTWARE PRODUCT PARAMETER REVIEW This is a detailed review of system parameters and settings for critical elements in the software stack to ensure all availability and performance features are appropriately exploited.
APPLICATION MANAGEMENT Are appropriate resource and process monitoring tools in place and used effectively?
- Is threshold management and alerting used effectively to avoid outages or critical resource shortages?
- Are all problems effectively documented (systems, impact, recovery activities, times, personnel, etc.)?
- Is a root-cause analysis conducted effectively?
- Do nearly 100 percent of problems result in automation or other permanent fixes to ensure they will not recur?
- Is any trending or pattern analysis conducted on a periodic basis?
- Is there a periodic audit to ensure that there has been follow-through on resolution activities?
- Are problems tied to change-control events to validate risk analysis?
These reviews can be time-consuming and expensive to conduct. Pragmatically, begin with trouble tickets and outage reports to isolate key problem areas for the focus of the availability review efforts. Beginning the review with problem/outage incidents helps identify the low-hanging fruit to make meaningful availability improvements.
Testing for High Availability
Over time, operations will build a working profile of each hardware component's reliability and can use that to project a complex system configuration's availability. Once a system is in production, it is almost impossible to test failover to a remote backup site, if that is part of the availability design. If possible, do this at least once before putting a new system in production, and periodically run a paper test with the actual backup staff walking through the process. It is possible, however, to test failing paths for mirrored storage, router failures, bad tapes, etc., and these should all be stress-tested annually. However, when this concept is applied to software components, and specifically to application packages, the testing approach tends to break down:
- Application software is probably the least reliable and least tested component of the infrastructure chain.
- The same cause produces the same effects; if an error exists under a set of conditions, it will still exist on the backup equipment or in recovery mode, even if the hardware is fault tolerant, because it will still be running the same application code.
- The repair time - the MTTR factor - of software tends to be unpredictable. Correcting software problems depends on reproducing the conditions that led to the problem, and in timing-sensitive or race-condition failure modes, this is a daunting task.
The focus in application availability is on developing best practices (code inspections, development standards, etc.) and on thorough testing. As good as any organization may be, there will always be cases not tested and users who find some way to create untested conditions by expanding the code applicability beyond its design point.
Easier to Discuss than Maintain
HA systems can help an enterprise maintain its position as an industry leader; failures can undercut its reputation and expose it to customer defection and ongoing legal risks such as lawsuits and regulatory penalties.
HA cannot be added onto running systems. It must be systematically built in and systemically driven throughout the enterprise. If not, Murphy's Law will prevail, and critical systems will break at the least convenient moment.