October 29, 2012 – Amazon says a hardware replacement at a Virginia data center that has been the site of other recent outages triggered a data collection bug that downed service for “a fraction” of its cloud customers early last week.
In a postmortem message posted Friday night, Amazon Web Services team members wrote that they would be rolling out fixes this week, including alarms for memory leaks, modified memory monitoring for storage servers, an updated internal DNS configuration and a change to the failover logic in its Elastic Block Store offering that led to a “rapid deterioration” of service.
The service disruptions, which hit customers such as Netflix, were first reported at approximately 1:00 p.m. EST Monday, Oct. 22, and piled up over the next three hours until troubleshooting relieved the issues by 7:15 p.m. EST, according to the postmortem.
AWS team members wrote that a “small number” of EBS data volumes became “stuck” and unable to process I/O requests after a server replacement that neglected to include an updated server address for data collection. The problem then rippled into failover backups for its Amazon Relational Database and Elastic Load Balancing services, as well as customer APIs on Amazon’s EC2 cloud, where throttling provisions AWS has in place to manage loads during disruption events kicked in.
“Other than a few short periods, our monitoring showed what looked to be a healthy level of launch and create activity throughout the event. However, we’ve heard from customers that they struggled to use the APIs for several hours. We now understand that our API throttling during the event disproportionately impacted some customers and affected their ability to use the APIs,” the AWS team wrote in the postmortem.
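The client-side impact of throttling like this is usually softened with retries and exponential backoff, so that rejected API calls are reattempted at widening intervals rather than hammering an already-stressed service. A minimal sketch in Python, assuming a hypothetical `ThrottledError` exception standing in for whatever rate-limit error a given API returns (this is an illustrative pattern, not AWS’s actual client code):

```python
import random
import time


class ThrottledError(Exception):
    """Raised when the service rejects a request due to rate limiting."""


def call_with_backoff(call, max_attempts=5, base_delay=0.5):
    """Retry a throttled API call with exponential backoff and jitter.

    `call` is any zero-argument function that may raise ThrottledError.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except ThrottledError:
            # Give up after the final attempt.
            if attempt == max_attempts - 1:
                raise
            # Wait base_delay * 2^attempt, plus random jitter so many
            # clients do not all retry in lockstep after a disruption.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

During an event like the one described, customers whose tooling retried without backoff would have kept hitting the throttle for hours; spacing out retries gives the service room to recover.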
In an apology at the conclusion of the postmortem, the AWS team wrote that it would offer credits to customers impacted by the “aggressive” throttling policy and noted that the event had taught its team “about new failure modes.” The postmortem did not include a total for customers knocked offline, nor an extensive explanation of how previous outages informed the team’s response to the disruptions.
According to the AWS service health dashboard, the impacted offerings were confined to its northern Virginia data center. This is the same cloud data center knocked offline for some users by a power outage caused by a technical malfunction in June, and again by a storm in July. At that time, the major cloud vendor promised more open and immediate communication with customers about disruptions, though there was plenty of grumbling about the outage from users and businesses on Twitter early last week. There were also false claims that the disruption was caused by hacktivists.
All services were reporting as fully operational Monday morning, according to the AWS dashboard.