Microsoft has shared more details this week about an Azure outage that affected European customers from March 24 to 26. In a “post-mortem” note published two days ago and spotted by ZDNet, the company explained that the issues resulted from technical problems related to increased usage since the beginning of the coronavirus outbreak, as well as communication delays between the two sides of the Atlantic, for which the company apologized.
“On the first day, when the impact was most severe, we didn’t acknowledge the incident for approximately five hours, which is substantially worse than our target of 10 minutes,” explained Chad Kimes, Director of Engineering at Microsoft. This five-hour delay was partly the result of the US-based primary incident manager (PIM), who was responsible for announcing the incident, being asleep while a separate team in Europe was working on a fix. “Only when the PIM joined the incident bridge at roughly the beginning of business hours in the Eastern United States was the incident finally acknowledged,” Kimes added.
It’s a good thing that Microsoft explained what happened, even if there is probably something here to be embarrassed about. The software giant said it has learned its lesson and taken measures to prevent such communication delays in the future. “We are improving our live-site processes to ensure that initial communication of pipeline delay incidents happens on the same schedule as other incident types,” said Kimes.
Coincidentally or not, this note was published on the same day as a report from The Information that revealed Microsoft’s struggles with Azure reliability and capacity before the coronavirus outbreak. According to the report, big Azure customers such as Adobe, Chevron and Walmart have been encountering problems with Azure in recent months, with some cloud experts also pointing the finger at the different structure of Microsoft’s datacenters compared to Amazon’s.