March 12, 2012 – Microsoft will bolster its Azure cloud platform testing, controls and tools, as well as beef up customer communications in response to the “extraordinary nature” of an 18-hour disruption triggered by a Leap Day software bug.
In a blog on the incident posted Friday evening, Microsoft lead engineer Bill Laing wrote that the incident occurred in two phases. Initially, a software bug connected to the oversight of Leap Day information included in Azure’s customer access led to Microsoft taking the entire platform offline for a few hours. That was followed by corrupted service offerings from clusters that remained connected, particularly impacting servers handling some U.S. and northern European customers.
The software bug and outage created a disconnect between cloud offerings and Microsoft’s virtual machines running on data center server clusters. Microsoft was able to iron out some service problems a few hours after they were first reported February 29 from the addition of the extra calendar day to its access and service tickets. However, some clusters were impacted into early March 1 because of service applications updates and operations customers had submitted that were running without the fixes and in isolation though its scaled-out, redundant platform software component called the Fabric Controller. Microsoft stated that the outages did not impact its storage servers because they do not run as a “guest agent” like those services offerings on the virtual machines.
In response, Microsoft noted it would enhance its software testing on the cloud platform, including add-ons for granular control of data accessed or stored in its offerings, fault detection beyond those reliant on sign-in time-outs and faster error notifications to technical support. The cloud customer support dashboard, which was also unavailable at points during the outage, will now be spread across separate infrastructures, with the expectation of more and up-to-date information moving forward. In addition, Microsoft will look to keep customers in touch via social networks like Facebook and Twitter. Laing wrote that there have already been modifications made to internal tools and hardware dependencies.
“We will continue to spend time to fully understand all of the issues outlined above and over the coming days and weeks we will take steps to address and mitigate the issues to improve our service,” Laing wrote. “We will strive to continue to be transparent with customers when incidents occur and will use the learning to advance our engineering, operations, communications and customer support and improve our service to you.”
Cloud customers of Windows Azure Compute, Access Control, Service Bus and Caching will also get a 33 percent credit regardless of the impact to their service during that billing cycle, according to.the post-mortem blog post.
The blog did not detail the number of customers impacted. It was the second major disruption of the Azure platform since 2009, though Laing wrote it was the first time they had taken the step of disabling service management functionality.
Days before the post-mortem, Microsoft also announced reduced pricing for compute and storage cloud customers, piggybacking on entry-level cost reductions the provider made in February, as well as additional enterprise resource planning offerings. Earlier in the week, Amazon Web Services and Google also cut the price for their respective cloud platforms.