August 16, 2011 – After a power failure caused disruptions and “stuck” workloads in its cloud service, Amazon says it will add redundancy, further isolate its electrical controllers and improve recovery capabilities at its data centers and among vendors.

Amazon published a postmortem on the details and architecture behind the second recent incident involving Amazon Web Services, which knocked out power, made some workloads inaccessible for hours and left some customers without recovery snapshots for days. On Aug. 7, a power surge of unknown origin – originally but erroneously attributed to a lightning strike that downed a transformer – hit EU West, Amazon’s data center in Dublin, Ireland. Backup power did not kick in as expected, causing stuck data loads and disconnects in cloud services such as Elastic Compute Cloud and Elastic Block Store, according to the postmortem.

A software error, which Amazon said was unrelated to the power problem, further hindered recovery, delaying data snapshots by up to 70 hours and requiring 2 percent of them to be rebuilt manually, according to the postmortem.

While the source of the disruptive power surge remains under investigation by Amazon and its electricity provider in Ireland, Amazon said it will add redundancy and isolation for its data center power controllers, known as programmable logic controllers, or PLCs. In addition, Amazon plans to implement better load balancing, recover data volumes directly once power is restored, and fix the bug that delayed recovery snapshots.

Along with giving credits and expanded support to affected customers, Amazon said it would increase staff response during “the early hours of an event” and improve communication on the recovery process going forward.

“As we were sending customers recovery snapshots, we could have been clearer and more instructive on how to run the recovery tools, and provided better detail on the recovery actions customers could have taken. We sometimes assume a certain familiarity with these tools that we should not,” the provider stated.

It was the second major service disruption for the provider in recent months, after human error in data loading cut or slowed service for numerous customers over a weekend in April. Following concerns raised about that disruption and Amazon’s delayed public response to it, the provider said it would step up communication with customers and cut recovery time after outages, and it announced a partnership on tools with SAP.

 
