Here’s why Azure’s South Central US data center went down earlier this month, and how Microsoft got it back online
Back on September 4th, one of Microsoft’s data centers was struck by lightning, bringing down Azure Active Directory and Visual Studio Team Services for a few hours, or longer for some customers. That obviously caused frustration, but in keeping with its reputation for transparency, the firm has now published an in-depth report explaining what happened (via Petri.com).
A full explanation of the situation is available here, but it boils down to a couple of issues. First off, voltage sags and swells in the South Texas region caused an Azure data center to switch from utility power to generator power. Power swells then shut down mechanical cooling systems at the data center, triggering an automated shutdown once unsafe operating thresholds were reached. Unfortunately, despite surge suppression systems, the shutdown mechanism caused temperatures to increase in other parts of the data center, and some hardware was damaged.
The rest of the effort focused on recovery and preventing further damage: shifting the data center to generators, and recovering storage servers and the data on any servers that went offline. Microsoft had to replace certain server components, migrate customer data to new servers, and verify the integrity of customer data. As the firm explained:
The decision was made to work towards recovery of data and not fail over to another datacenter, since a fail over would have resulted in limited data loss due to the asynchronous nature of geo replication.
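To see why asynchronous replication makes a failover lossy, here is a minimal Python sketch. It is a toy model, not Microsoft's implementation: the `Primary`, `Secondary`, and `replicate` names are hypothetical, and the numbers are made up for illustration. The point is only that writes acknowledged by the primary site can still be waiting in a queue when that site goes dark.

```python
from collections import deque


class Primary:
    """Hypothetical primary site: acknowledges writes before they are replicated."""

    def __init__(self):
        self.data = {}
        self.pending = deque()  # acknowledged writes not yet shipped to the secondary

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))
        return "ack"  # the client sees success immediately


class Secondary:
    """Hypothetical secondary site in another region."""

    def __init__(self):
        self.data = {}


def replicate(primary, secondary, max_items):
    """Ship up to max_items queued writes; models replication lagging behind."""
    shipped = 0
    while primary.pending and shipped < max_items:
        key, value = primary.pending.popleft()
        secondary.data[key] = value
        shipped += 1
    return shipped


primary, secondary = Primary(), Secondary()
for i in range(5):
    primary.write(f"blob{i}", i)

# Only three of the five writes reach the secondary before the outage.
replicate(primary, secondary, max_items=3)

# Failing over now would serve the secondary's copy, which lacks these keys.
lost_on_failover = sorted(set(primary.data) - set(secondary.data))
print(lost_on_failover)  # ['blob3', 'blob4']
```

In this toy model, recovering the primary's hardware preserves all five writes, while failing over would silently drop the last two. That is the trade-off the quoted statement describes.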
In this case, the outage was not directly Microsoft’s fault, but rather the result of electrical issues and the way the shutdown mechanisms behaved. It’s something the company is sure to learn from as it works to prevent a repeat.