January 2, 2013 – An Amazon Web Services developer mistakenly ran a load balancing maintenance process that delivered an “inopportune” disruption to some cloud customers on Christmas Eve and Christmas Day.
Data used to manage the configuration of AWS’ Elastic Load Balancing (ELB) service in the region were deleted by a maintenance process that had been “inadvertently run,” with the first disruptions reported at approximately 3:30 p.m. EST on Christmas Eve, according to an AWS postmortem.
According to the outage review from the AWS team: “This process was run by one of a very small number of developers who have access to this production environment. Unfortunately, the developer did not realize the mistake at the time. After this data was deleted, the ELB control plane began experiencing high latency and error rates for API calls to manage ELB load balancers.”
The initial method used to restore ELB state data took hours and “failed to provide a usable snapshot of the data,” and AWS technicians worked through Christmas Eve night before an alternate recovery process was found, according to the postmortem. In all, the load balancing issue affected customers of Amazon’s CloudSearch, EC2 and Elastic Beanstalk cloud offerings located in its northern Virginia, or US-East Region, data center. Approximately 24 hours after the disruption began, on the afternoon of Christmas Day, AWS reported that service was restored and the state data recovered.
AWS wrote that it has modified access controls so that ELB state data can no longer be changed without change management approval, to prevent the kind of developer error that triggered the event. In addition, AWS plans to adopt the state data recovery method it learned from this event as standard procedure for “significantly faster” fixes, and to reprogram its ELB control plane so that changes to a load balancer are based on its up-to-date status.
Although Amazon did not release the number of users impacted, the postmortem indicated that 6.8 percent of ELB customers from the Virginia data center experienced downtime at the peak of the event, sometimes for a “prolonged period.”
Netflix publicly pointed to the AWS outage as the reason customers couldn’t stream movies during the holiday. Forrester Research analyst Rachel Dines, a Netflix customer who has reviewed Amazon’s cloud services, wrote that the Christmas outage should be a lesson to customers to maximize the resiliency of the applications they deploy to the cloud.
Services have been operating normally across Amazon’s instances in North America and elsewhere since the Christmas Day fix, according to the vendor’s health dashboard.
In October, AWS data centers, including those same ones in northern Virginia, went down after a hardware replacement triggered a data collection bug. Some Amazon cloud customers connected to this data center were also knocked offline in June and July by power outages, caused by a technical issue and a severe storm, respectively.