April 29, 2011 – Amazon has vowed to speed downtime recovery and communicate more with its cloud service users following failures last week that the company attributed to human error during a configuration change.

In a postmortem released Friday, Amazon apologized for the days of unreachable data and service interruptions affecting its Elastic Compute Cloud (EC2) and Relational Database Service (RDS).

“As with any significant operational issue, we will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make changes to improve our services and processes,” the company stated.

The disruption began at 12:47 a.m. PT April 21 in the company’s EC2 storage service, Elastic Block Store (EBS), when traffic was incorrectly shifted during a configuration change meant to upgrade the capacity of the primary network, the postmortem outlined. User traffic was routed instead to a slower, redundant network, resulting in lost node connections, a run on space and a time-consuming process of readjusting servers to free up “stuck” information.

Interruptions and downtime from this event were reported through April 23, though both services, housed at a Northern Virginia data center, have been fully restored since earlier this week. Amazon’s more than two dozen other cloud services in the U.S., as well as others in Europe and Asia, reported no service interruptions during the same period, according to Amazon Web Services logs.

In addition to investing more in downtime recovery at AWS, Amazon will give all users a 10-day credit for the event, whether or not their service was interrupted. The company continues to audit the event but stated it will look to automate more of the configuration process.

Having posted only infrequent updates to its service health dashboard while service was sporadic, Amazon stated in its postmortem that it would communicate with customers more regularly and clearly during future operational issues.

Aside from the postmortem, Amazon media officials did not respond to requests for further details on the customer data that the company had stated earlier in the week could be beyond recovery.

Katie Broderick, IDC senior research analyst on servers and data centers, says Amazon’s solid reputation for uptime and cloud service will probably not be greatly damaged by these service failures. If anything, Broderick says, the event should show organizations that cloud providers are “not magical” and not immune to downtime. With these quirks worked out, she says, confidence in adoption will rise quickly.

“Ironically, I could see this pushing cloud adoption faster, because the sooner these issues are understood and dealt with, the sooner more mission-critical applications will be ready for the cloud,” Broderick says. “I think a lot of enterprises are waiting for public cloud to be ‘enterprise ready’ and do not want to be the guinea pigs.”
