A cable fault in a high-voltage utility distribution system connected to Amazon’s East Coast data center knocked some services offline Thursday and Friday as power was switched over to backup generators, according to the AWS status health dashboard. During an audit of its power components conducted after the disruption, Amazon replaced a circuit breaker that was found to be configured with too low a power threshold, which had exacerbated the outage for some customers.
There were performance issues with four of Amazon’s AWS cloud offerings for up to seven hours starting Thursday evening, cutting access to sites such as Pinterest for some users. EC2, Amazon ElastiCache, Amazon Elastic Beanstalk and RDS, all based at a data center in northern Virginia, saw elevated error rates and latency. In addition, there were connectivity issues for about 30 minutes Thursday night with RDS operations stemming from separate facilities in Oregon and California.
On Friday, there were also service failures and data-loading errors stemming from related power issues at a trio of AWS offerings that operate from the same northern Virginia data hub.
Details of lost or damaged data loads were not disclosed, and Amazon has yet to formally release a post-mortem on the incident, as it has done in the past.
As of Monday, all 50 of the cloud services in North America were marked as “operating normally” by Amazon, as were dozens of other cloud operations in South America, Europe and Asia.
In its account of the outage and subsequent fixes, Amazon reinforced its recommendation that customers run operations in two of its availability zones to reduce the impact of downtime. AWS also provided additional links to tips on monitoring and troubleshooting.
“Those customers with affected instances or volumes that were running in multi-Availability Zone configurations avoided meaningful disruption to their applications; however, those affected who were only running in this Availability Zone, had to wait until the power was restored to be fully functional,” Amazon wrote in a status update on Saturday for EC2.
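Amazon’s multi-AZ recommendation comes down to making sure no single availability zone hosts all of a workload’s instances. A minimal illustrative check is sketched below; the `spans_multiple_azs` helper and the instance/zone pairs are hypothetical, and a real deployment would pull this inventory from the EC2 DescribeInstances API rather than hard-code it:

```python
def spans_multiple_azs(instances):
    """Return True if instances are spread across at least two
    Availability Zones -- the configuration AWS recommends so that
    a single-zone power failure does not take the application down.

    `instances` is a list of (instance_id, availability_zone) pairs,
    a hypothetical shape used here for illustration only.
    """
    zones = {az for _, az in instances}
    return len(zones) >= 2

# A deployment confined to one zone, like the affected customers:
single_zone = [("i-aaa111", "us-east-1a"), ("i-bbb222", "us-east-1a")]
# A multi-AZ deployment that would have avoided meaningful disruption:
multi_zone = [("i-aaa111", "us-east-1a"), ("i-bbb222", "us-east-1b")]

print(spans_multiple_azs(single_zone))  # False
print(spans_multiple_azs(multi_zone))   # True
```

The check is deliberately simple; in practice customers would also verify that load balancers and databases (for example, RDS with Multi-AZ enabled) fail over across zones, not just that instances are distributed.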
Status dashboard descriptions for this recent outage were more detailed than those for two other notable downtime incidents in the past 14 months. Incident links on the site from last week outlined not only steps to recover a database instance, but also pointed to separate, detailed Word documents with primers on how to monitor volume and load issues.
An unspecified power surge at a Dublin, Ireland data center in August 2011 also disrupted some data monitoring and loading for Amazon cloud services, and a larger outage in April 2011 was attributed to human error during a configuration change, which led to far wider data recovery problems that spread across availability zones.
Bruce Guptill, SVP and head of research for Saugatuck Technology, says that from all outward appearances, Amazon maintained its reputation for being forthcoming with customers and the public about the causes of outages after last week’s power incident. End users also bear some culpability at times, as they may not always heed advice and pay for extra failover protections, Guptill says.
Guptill says that, while it may not be of much consolation to users experiencing downtime, it is important to remember that data centers are at the heart of the cloud.
“It’s not some magical, vague thing behind a veil; it’s a huge pile of sophisticated IT equipment that is used as a source for a wide array of network-delivered services,” Guptill says. “Most cloud data centers tend to suffer fewer such problems, because they are designed and built to be as secure and reliable as technologically possible. It is always going to be up to the customer to make sure that what their cloud provider offers is better, not just cheaper, than what they can build and deliver themselves.”