Story image

Azure outage postmortem: Microsoft reveals what happened and why

12 Sep 2018

Since last week headlines around the world have been painted with headlines shouting about the disruption to Microsoft’s services after a severe weather event knocked one of its data centers offline.

Essentially, the cause was blamed on high storms in the Texas area that resulted in power swells and ultimately ended in the temporary demise of one of the company’s South Central US data centres in San Antonio.

In a recent blog post, Microsoft Azure DevOps director of engineering Buck Hodges has released a ‘postmortem’ of what went down, why it happened, and what the company is doing to prevent similar incidents in the future.

“First, I want to apologize for the very long VSTS [now called Azure DevOps] outage for our customers hosted in the affected region and the impact it had on customers globally,” says Hodges.

“This incident was unprecedented for us. It was the longest outage for VSTS customers in our seven-year history. I've talked to customers through Twitter, email, and by phone whose teams lost a day or more of productivity. We let our customers down. It was a painful experience, and for that I apologize.”

The Azure status report reveals the data center switched from utility power to generator power following the power swells caused by the lightning, however, the mechanical cooling systems were also a victim of the power swells despite having surge suppressors in place.

While the data center was able to continue operating for a period of time, temperatures soon exceed safe operational thresholds which initiated an automated shutdown. While this blackout is an initiative to preserve infrastructure and data integrity, in this case temperatures rose so quickly that some hardware was damaged before it could be shut down.

Many asked why didn’t VSTS simply fail over to a different region.

We never want to lose any customer data. A key part of our data protection strategy is to store data in two regions using Azure SQL DB Point-in-time Restore (PITR) backups and Azure Geo-redundant Storage (GRS),” says Hodges.

“This enables us to replicate data within the same geography while respecting data sovereignty.Only Azure Storage can decide to fail over GRS storage accounts. If Azure Storage had failed over during this outage and there was data loss, we would still have waited on recovery to avoid data loss.

“Azure Storage provides two options for recovery in the event of an outage: wait for recovery or access data from a read-only secondary copy. Using read-only storage would degrade critical services like Git/TFVC and Build to the point of not being usable since code could neither be checked in nor the output of builds be saved (and thus not deployed). Additionally, failing over to the backed up DBs, once the backups were restored, would have resulting in data loss due to the latency of the backups.”

Hodges says the team is now in the process of making a number of changes based on the learnings from the outage, including:

  1. In supported geographies, move services into regions with Azure Availability Zones to be resilient to data center failures within a region.
  2. Explore possible solutions for asynchronous replication across regions
  3. Regularly exercise fail over across regions for VSTS services using our own organization.
  4. Add redundancy for our internal tooling to be available in more than one region.
  5. Fixed the regression in Dashboards where failed calls to Marketplace made Dashboards unavailable.
  6. Review circuit breakers for service-to-service calls to ensure correct scoping (surfaced in the calls to the User service)
  7. Review gaps in our current fault injection testing exposed by this incident.

“I apologize again for the very long disruption from this incident,” concludes Hodges.

Server Technology beats out competition at DCS Awards
Server Technology has taken out the top spot for the Data Centre PDU Innovation of the Year at the DCS Awards.
LogRhythm releases cloud-based SIEM solution
LogRhythm Cloud provides the same feature set and user experience as its on-prem experience.
IGEL & ControlUp bring analytics to endpoints everywhere
The strategic partnership allows IGEL to integrate with ControlUp’s real-time monitoring and analytics capabilities via the IGEL Universal Management Suite (UMS).
Nutanix evolves multicloud offerings
Nutanix has expanded its multicloud solutions portfolio to further evolve its offerings across public and private cloud.
Bluzelle launches data delivery network to futureproof the edge
“Currently applications are limited to data caching technologies that require complex configuration and management of 10+ year old technology constrained to a few data centers."
Exploring the different needs for cloud services across Europe
Although digital transformation is happening across Europe, each country continues to have its own IT needs and the different cloud markets highlight this.
Trend Micro introduces cloud and container workload security offering
Container security capabilities added to Trend Micro Deep Security have elevated protection across the DevOps lifecycle and runtime stack.
Veeam joins the ranks of $1bil-revenue software companies
It’s also marked a milestone of 350,000 customers and outlined how it will begin the next stage of its growth.