Story image

Azure outage postmortem: Microsoft reveals what happened and why

12 Sep 18

Since last week headlines around the world have been painted with headlines shouting about the disruption to Microsoft’s services after a severe weather event knocked one of its data centers offline.

Essentially, the cause was blamed on high storms in the Texas area that resulted in power swells and ultimately ended in the temporary demise of one of the company’s South Central US data centres in San Antonio.

In a recent blog post, Microsoft Azure DevOps director of engineering Buck Hodges has released a ‘postmortem’ of what went down, why it happened, and what the company is doing to prevent similar incidents in the future.

“First, I want to apologize for the very long VSTS [now called Azure DevOps] outage for our customers hosted in the affected region and the impact it had on customers globally,” says Hodges.

“This incident was unprecedented for us. It was the longest outage for VSTS customers in our seven-year history. I've talked to customers through Twitter, email, and by phone whose teams lost a day or more of productivity. We let our customers down. It was a painful experience, and for that I apologize.”

The Azure status report reveals the data center switched from utility power to generator power following the power swells caused by the lightning, however, the mechanical cooling systems were also a victim of the power swells despite having surge suppressors in place.

While the data center was able to continue operating for a period of time, temperatures soon exceed safe operational thresholds which initiated an automated shutdown. While this blackout is an initiative to preserve infrastructure and data integrity, in this case temperatures rose so quickly that some hardware was damaged before it could be shut down.

Many asked why didn’t VSTS simply fail over to a different region.

We never want to lose any customer data. A key part of our data protection strategy is to store data in two regions using Azure SQL DB Point-in-time Restore (PITR) backups and Azure Geo-redundant Storage (GRS),” says Hodges.

“This enables us to replicate data within the same geography while respecting data sovereignty.Only Azure Storage can decide to fail over GRS storage accounts. If Azure Storage had failed over during this outage and there was data loss, we would still have waited on recovery to avoid data loss.

“Azure Storage provides two options for recovery in the event of an outage: wait for recovery or access data from a read-only secondary copy. Using read-only storage would degrade critical services like Git/TFVC and Build to the point of not being usable since code could neither be checked in nor the output of builds be saved (and thus not deployed). Additionally, failing over to the backed up DBs, once the backups were restored, would have resulting in data loss due to the latency of the backups.”

Hodges says the team is now in the process of making a number of changes based on the learnings from the outage, including:

  1. In supported geographies, move services into regions with Azure Availability Zones to be resilient to data center failures within a region.
  2. Explore possible solutions for asynchronous replication across regions
  3. Regularly exercise fail over across regions for VSTS services using our own organization.
  4. Add redundancy for our internal tooling to be available in more than one region.
  5. Fixed the regression in Dashboards where failed calls to Marketplace made Dashboards unavailable.
  6. Review circuit breakers for service-to-service calls to ensure correct scoping (surfaced in the calls to the User service)
  7. Review gaps in our current fault injection testing exposed by this incident.

“I apologize again for the very long disruption from this incident,” concludes Hodges.

Is Supermicro innocent? 3rd party test finds no malicious hardware
One of the larger scandals within IT circles took place this year with Bloomberg firing shots at Supermicro - now Supermicro is firing back.
Record revenues from servers selling like hot cakes
The relentless demand for data has resulted in another robust quarter for the global server market with impressive growth.
Opinion: Critical data centre operations is just like F1
Schneider's David Gentry believes critical data centre operations share many parallels to a formula 1 race car team.
MulteFire announces industrial IoT network specification
The specification aims to deliver robust wireless network capabilities for Industrial IoT and enterprises.
Google Cloud, Palo Alto Networks extend partnership
Google Cloud and Palo Alto Networks have extended their partnership to include more security features and customer support for all major public clouds.
DigiCert conquers Google's distrust of Symantec certs
“This could have been an extremely disruptive event to online commerce," comments DigiCert CEO John Merrill. 
Schneider Electric's bets for the 2019 data centre industry
From IT and telco merging to the renaissance of liquid cooling, here are the company's top predictions for the year ahead.
China to usurp Europe in becoming AI research world leader
A new study has found China is outpacing Europe and the US in terms of AI research output and growth.