What Is Happening? In August, Amazon Web Services (AWS) experienced two service outages in facilities on different continents. Not surprisingly, there has been considerable media and analyst focus on, and hyperbole about, the causes, durations, and impacts of the outages.
Saugatuck continues to believe that Cloud IT is a phenomenon, and that Cloud IT providers and services in general can and should be relied upon to deliver critical IT-as-a-service. Cloud services and providers tend to be built for, and managed with, greater reliability and security than the vast majority of traditional data centers and on-premise systems.
But we are surprised to see that the “conventional wisdom” among users appears to have become that Cloud-based services are somehow impervious to – or at least less prone to – the types of failures and outages suffered by typical private data center infrastructures. Saugatuck’s ongoing research indicates that current and prospective users of Cloud IT continue to express surprisingly high levels of expectation and complacency when it comes to outsourcing critical IT to Cloud providers.
While the specifics of the outages are important, Saugatuck considers it more important for Cloud IT users to view the outages as highly-visible motivations to “level set” their expectations about Cloud offerings and to sharpen their focus on Service Level Agreements.
Why Is It Happening? The outages experienced by AWS this week do not prove, or even suggest, that Cloud IT is fragile or somehow not ready for production workloads. In fact, typical vendors of Cloud IT offerings invest heavily in skilled staffing, IT infrastructure, and site facilities with the goal of delivering highly available services. The result is that, despite the rapid growth in usage, Cloud services (including those of AWS) are some of the most reliable IT infrastructures in the world.
However, as history has taught us repeatedly (e.g., consider the ‘unsinkable’ Titanic), human constructs are not impervious to failure (read post, “The Last Word: Clouds Fail, So Plan and Manage Accordingly”). This is particularly true for something as complex as the infrastructure underlying any Cloud IT offering. Recall that when a service depends on every one of a group of components, and those components fail independently, the availability of the group is the product of all of the individual component availabilities. For example, the overall availability of 5 such components, each with 99 percent availability, is: 0.99 x 0.99 x 0.99 x 0.99 x 0.99 = approximately 95 percent.
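The arithmetic above can be checked with a minimal Python sketch. The function name `series_availability` is our own illustrative choice, and the calculation assumes that every component is required and that components fail independently, which real infrastructures only approximate.

```python
from math import prod


def series_availability(availabilities):
    """Availability of a workload that needs every listed component to be up.

    Assumes independent failures (an idealization): the combined
    availability is the product of the individual availabilities.
    """
    return prod(availabilities)


# Five components at 99 percent each, as in the example above.
combined = series_availability([0.99] * 5)
print(f"{combined:.4f}")  # prints 0.9510, i.e. roughly 95 percent
```

Adding a sixth or seventh required component lowers the product further, which is why long dependency chains erode availability even when each link looks reliable on its own.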
Quiz any experienced enterprise IT leader, and they will understand this complexity and inter-relationship, and the effects on IT reliability and availability. They “get” the complexities of Cloud IT, in other words. And yet they, and their associated business executives / leaders, continue to maintain extremely high, possibly unrealistic, expectations of Cloud IT and its providers. Saugatuck attributes this to four factors:
- The Hype Machine. Cloud IT is such a phenomenon, and so widely hyped, that IT leaders cannot be expected to stay away from it, especially when they see extremely attractive pricing for useful IT services. Once drawn to it, they tend to use Cloud IT for more and more critical operations. Cloud IT is so easily acquired as point solutions that enterprises often do not realize the extent to which they actually rely on it. “In for a penny, in for a pound,” is the way the CIO of a European central bank explained his IT group’s Cloud buy-in patterns to Saugatuck earlier this year.
- A “Security halo.” At the same time, Cloud IT providers have been doing a tremendous job in addressing customers’ and prospects’ worries about the relative security of their data and communications. Much of this builds on providers’ promotion of their core IT capabilities, including availability and reliability. Providers’ ability to assuage customers’ fears about security has gone a long way toward increasing customers’ expectations about availability and reliability as well.
- Out of sight, out of mind. Outsourcing in general tends to remove operations, and therefore management’s awareness of operations, from daily oversight. This leads to a mindset of diminished responsibility – or the de facto transfer of responsibility from the enterprise to the outsourcer(s). In other words, “it’s their problem now.”
- Faith and trust. Finally, putting faith and trust in any outsourcer will usually inflate the expectations regarding that outsourcer’s abilities. Such increased faith and trust, regardless of SLA specifics, will help to inflate buyer / customer expectations regarding availability and reliability (along with security).
Saugatuck’s position is this: As with any data center, Cloud IT offerings are susceptible to outages resulting from human errors, programming errors, network problems, or even natural disasters.
While buyers and users should continue to investigate Cloud IT services and providers as alternatives and adjuncts to traditional data center capabilities, service outages such as those experienced this week by AWS provide current and potential Cloud IT users with two exceptionally valuable lessons:
- Do not assume Cloud IT is “fail-proof”: Cloud IT offerings are not immune to outage. We fear that it will take several more noteworthy outages for a majority of both business and IT leaders to learn this lesson.
- Scrutinize Cloud IT Service Level Agreements (SLAs): Customers must evaluate SLAs and implement appropriate actions to assure desired/required levels of service, particularly availability (please read “Is Everything Negotiable? Key Points to Consider When Negotiating SaaS SLAs”).
Saugatuck urges users of Cloud IT to be prudent and to practice comprehensive diligence when implementing a Cloud-based workload (read: “Cloud IT Guidance: Evaluating Workloads for Cloud Migration”; and “Toward Hybrid Workload Management”). In summary, Saugatuck recommends the following steps:
- Thoroughly investigate all of the “fine print” in vendor contracts. Understand precisely which elements of your workload are covered (e.g., compute capacity) and which are not covered (e.g., databases) by an availability SLA.
- Further, understand if certain capabilities (e.g., primary and backup compute capacity in different locations) must be purchased to “activate” an availability SLA.
- Then, similar to planning for disaster recovery, evaluate and select a balance between the desired level of availability and the costs of attaining it. Consider that all the components/functions required for proper operation of your workload are similar to a chain, and that chain is only as strong as its weakest link. Keep in mind that a “law of diminishing returns” typically applies to the availability of any IT system. Specifically, enhancing a system to move from an availability of 99.0 percent to 99.5 percent is typically less expensive than enhancing a system to move from 99.95 percent to 99.99 percent.
- Lastly, plan for the “unthinkable”. Use of any Cloud-based offering should include a disaster recovery plan for the possibility that the Cloud-based offering stops working. This should include a plan to replace the failed or unavailable Cloud service. Some recent outages of highly visible Cloud-based offerings have demonstrated that recovery could be protracted and potentially cause significant impact on your business.
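The availability trade-offs in the steps above can be made concrete with a short Python sketch. It converts an availability percentage into expected downtime per year and estimates the effect of primary and backup capacity in different locations. The function names are hypothetical, the year is taken as 8,760 hours, and the redundancy formula assumes the replicas fail independently and that failover is instantaneous; neither assumption holds perfectly in practice.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability):
    """Expected hours of downtime per year at a given availability (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def redundant_availability(availability, copies):
    """Availability when any one of `copies` independent replicas is enough.

    Assumes independent failures and instant failover (an idealization).
    """
    return 1.0 - (1.0 - availability) ** copies


# The four availability levels mentioned in the steps above.
for level in (0.99, 0.995, 0.9995, 0.9999):
    print(f"{level:.2%} -> {annual_downtime_hours(level):.2f} hours/year")

# Two independent 99 percent locations, either of which can serve the workload.
print(f"{redundant_availability(0.99, 2):.4%}")
```

Under these assumptions, moving from 99.0 to 99.5 percent removes about 43.8 hours of expected downtime per year, while moving from 99.95 to 99.99 percent removes only about 3.5 hours. That is one way to see why the later increments tend to cost more per hour of downtime avoided.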
AWS’ experiences in August should be used as motivators for learning and taking appropriate action. A Midwest farm saying is: history repeats itself – until you listen. Take heed of these incidents and take the steps detailed above. Such actions may not eliminate all outages, but they can reduce the duration of outages and mitigate their impact on your business.
This blog originally appeared at Saugatuck Lens360.