On March 12th, Microsoft’s Outlook.com, Hotmail, and SkyDrive suffered an outage that lasted about 16 hours, with services not fully restored until March 13th. Microsoft has now offered an explanation:
“On the afternoon of the 12th, in one physical region of one of our datacenters, we performed our regular process of updating the firmware on a core part of our physical plant. This is an update that had been done successfully previously, but failed in this specific instance in an unexpected way. This failure resulted in a rapid and substantial temperature spike in the datacenter. This spike was significant enough before it was mitigated that it caused our safeguards to come in to place for a large number of servers in this part of the datacenter.
“These safeguards prevented access to mailboxes housed on these servers and also prevented any other pieces of our infrastructure to automatically failover and allow continued access. This area of the datacenter houses parts of the Hotmail.com, Outlook.com, and SkyDrive infrastructure, and so some people trying to access those services were impacted.”
Microsoft said it was alerted as soon as the safeguards were activated. According to the company, it took so long to get everything back to normal because the recovery required a mix of software and human intervention. Human intervention is not normally needed for these services, and Microsoft had not expected a failure this drastic. Microsoft apologized in the blog post for the impact the outage had on its users. “We hope this helped provide an understanding of the incident and again, we sincerely apologize and regret the impact this outage had on all of you. Now that we’re through the resolution, we’re also hard at work on ensuring this doesn’t happen again,” the company wrote.