Azure outage postmortem: Microsoft reveals what happened and why

12 Sep 2018

Since last week, headlines around the world have been dominated by the disruption to Microsoft’s services after a severe weather event knocked one of its data centers offline.

The cause was attributed to severe storms in the Texas area, which produced power swells that temporarily took down one of the company’s South Central US data centers in San Antonio.

In a recent blog post, Microsoft Azure DevOps director of engineering Buck Hodges has released a ‘postmortem’ of what went down, why it happened, and what the company is doing to prevent similar incidents in the future.

“First, I want to apologize for the very long VSTS [now called Azure DevOps] outage for our customers hosted in the affected region and the impact it had on customers globally,” says Hodges.

“This incident was unprecedented for us. It was the longest outage for VSTS customers in our seven-year history. I've talked to customers through Twitter, email, and by phone whose teams lost a day or more of productivity. We let our customers down. It was a painful experience, and for that I apologize.”

The Azure status report reveals that the data center switched from utility power to generator power following power swells caused by lightning. However, the mechanical cooling systems were also affected by the power swells, despite having surge suppressors in place.

While the data center was able to continue operating for a period of time, temperatures soon exceeded safe operational thresholds, which initiated an automated shutdown. Although this shutdown is designed to preserve infrastructure and data integrity, in this case temperatures rose so quickly that some hardware was damaged before it could be shut down.

Many asked why VSTS didn’t simply fail over to a different region.

“We never want to lose any customer data. A key part of our data protection strategy is to store data in two regions using Azure SQL DB Point-in-time Restore (PITR) backups and Azure Geo-redundant Storage (GRS),” says Hodges.

“This enables us to replicate data within the same geography while respecting data sovereignty. Only Azure Storage can decide to fail over GRS storage accounts. If Azure Storage had failed over during this outage and there was data loss, we would still have waited on recovery to avoid data loss.

“Azure Storage provides two options for recovery in the event of an outage: wait for recovery or access data from a read-only secondary copy. Using read-only storage would degrade critical services like Git/TFVC and Build to the point of not being usable since code could neither be checked in nor the output of builds be saved (and thus not deployed). Additionally, failing over to the backed up DBs, once the backups were restored, would have resulted in data loss due to the latency of the backups.”

Hodges says the team is now in the process of making a number of changes based on the learnings from the outage, including:

  1. In supported geographies, move services into regions with Azure Availability Zones to be resilient to data center failures within a region.
  2. Explore possible solutions for asynchronous replication across regions.
  3. Regularly exercise fail over across regions for VSTS services using our own organization.
  4. Add redundancy for our internal tooling to be available in more than one region.
  5. Fix the regression in Dashboards where failed calls to Marketplace made Dashboards unavailable.
  6. Review circuit breakers for service-to-service calls to ensure correct scoping (surfaced in the calls to the User service).
  7. Review gaps in our current fault injection testing exposed by this incident.
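
To illustrate what scoping a circuit breaker can mean in the sixth item, here is a minimal Python sketch. It is not Microsoft's implementation: the class names, the thresholds, and the choice of a (service, region) pair as the scope key are all assumptions made for illustration. The idea is that each scope gets its own breaker, so repeated failures against one region stop further calls there without blocking calls to a healthy region.

```python
import time
from collections import defaultdict


class CircuitBreaker:
    """Hypothetical breaker: opens after consecutive failures, retries after a cooldown."""

    def __init__(self, failure_threshold=3, reset_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, let a trial call through.
        return self.clock() - self.opened_at >= self.reset_seconds

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


class ScopedBreakers:
    """Keeps one breaker per scope key (assumed here to be a (service, region) tuple)."""

    def __init__(self, **breaker_options):
        self._breakers = defaultdict(lambda: CircuitBreaker(**breaker_options))

    def call(self, scope, func, fallback):
        breaker = self._breakers[scope]
        if not breaker.allow():
            return fallback()  # circuit open: skip the remote call entirely
        try:
            result = func()
        except Exception:
            breaker.record_failure()
            return fallback()
        breaker.record_success()
        return result
```

With a single breaker shared by every caller, failures in one region would also cut off calls that could have succeeded elsewhere; keying the breakers by scope is one way to avoid that, under the assumptions stated above.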

“I apologize again for the very long disruption from this incident,” concludes Hodges.
