A severe lightning storm hit Sydney, Australia, on the evening of August 30, causing outages at major cloud providers, including Oracle Cloud, NetSuite, and Microsoft’s Azure. Microsoft 365 and Power Platform services were affected as well. A similar outage occurred in July.
Microsoft’s post-incident report attributed the outage to a combination of factors.
Cooling Systems Stopped Working
A power sag during the storm caused cooling units in one data center within one Availability Zone to go offline. As temperatures rose, the data center’s systems shut down automatically to keep the hardware from overheating and being damaged. As a result, customers could not use the affected Azure and Microsoft 365 services.
Not Enough Staff at Night
Too few people were working at the data center overnight, which made it hard to bring the cooling systems back quickly. To address this, Microsoft temporarily brought in more people to work at the data center at night.
Both problems have led Microsoft to focus on improving automation. Microsoft acknowledged the need to enhance automation within the data center to make it more resilient to various types of voltage sag events. This includes evaluating the sequence in which servers and chillers are restarted, so that the highest-load servers and their corresponding chillers are prioritized.
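To make the restart-sequencing idea concrete, here is a minimal sketch of ordering chillers so that those serving the highest load come back first. It is only an illustration: the `Chiller` type, the `restart_order` function, the chiller names, and the load figures are all hypothetical and do not come from Microsoft's report or tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chiller:
    name: str
    # Hypothetical figure: total load (kW) of the servers this chiller cools.
    served_load_kw: float


def restart_order(chillers):
    """Return the chillers sorted so the highest served load restarts first."""
    return sorted(chillers, key=lambda c: c.served_load_kw, reverse=True)


# Made-up inventory, for illustration only.
chillers = [
    Chiller("chiller-a", 120.0),
    Chiller("chiller-b", 480.0),
    Chiller("chiller-c", 300.0),
]

order = [c.name for c in restart_order(chillers)]
print(order)  # ['chiller-b', 'chiller-c', 'chiller-a']
```

A real restart plan would also have to account for factors such as electrical limits and equipment dependencies, which this sketch ignores.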
The outage resulted from a combination of factors, including a power sag, cooling unit failure, and an automated shutdown triggered by rising temperatures. Insufficient nighttime staffing was also a contributing factor.