In a pair of new blog posts, one on Azure and one on Teams, Microsoft detailed its technical response to the COVID-19 pandemic and how it has advanced its systems during the crisis. Penned by Mark Russinovich, the Chief Technology Officer of Microsoft Azure, the posts cover the range of strategies the company pursued, including resiliency, capacity forecasting, resource scaling, optimizations, and more.
Highlights from the Teams post show that Microsoft employed several resiliency strategies to protect Teams during high demand. These include active-active fault-tolerant systems, resiliency-optimized caches, a circuit breaker, bulkhead isolation, timeouts, and more. “When COVID-19 stay-at-home orders started to kick in around the world, we needed to leverage the architectural flexibility built into our system, and turn all the knobs we could, to effectively respond to the rapidly increasing demand,” said Mark Russinovich.
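To make one of those patterns concrete, here is a minimal sketch of a circuit breaker in Python. It is an illustration of the general technique only, not Microsoft's implementation; the class name, the thresholds, and the injectable clock are all assumptions made for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and a call is rejected without being attempted."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, retries after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the cooldown can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of piling more load onto a struggling dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Cooldown elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to a downstream service in `breaker.call(...)` and treat `CircuitOpenError` as a signal to serve a cached or degraded response.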
Microsoft also planned for the future. The company built and iterated on models that could anticipate where growth would occur in both raw users and usage patterns. When Teams usage peaked, the company had to build new models to account for the growth in global demand, which came from new usage patterns among existing users, new usage from existing but dormant users, and new users, too. “Throughout the process, we erred on the side of caution and favored over-provisioning, but as the usage patterns stabilized, we also scaled back as necessary,” explained Russinovich.
Because Teams is powered by Azure, Microsoft also moved to scale its compute resources. Microsoft provisioned resources to meet demand in regions across the globe. The company says it increased the total number of regions per geography, decreased the total amount of spare capacity each region needed to hold to absorb emergency load, and reduced the total capacity needs. “Dealing with load at this new scale gave us several insights into ways we could improve our efficiency,” said Russinovich.
Other high points in the blog post note that Microsoft optimized its network and routing strategies to leverage idle capacity at night, when most people aren’t using Teams. To improve storage efficiency, the company also stored cache state in a binary format rather than raw JSON and compressed data before sending it to the cache.
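As a rough illustration of why that cache change saves space, the sketch below packs records into a fixed binary layout and compresses them before they would be written to a cache. The record fields, the `struct` layout, and the use of `zlib` are assumptions chosen for this example; the posts do not say which binary format or compression algorithm Teams uses.

```python
import json
import struct
import zlib

# Hypothetical record layout: user_id (uint64), status (uint8), last_seen (uint32).
RECORD = struct.Struct("<QBI")
FIELDS = ("user_id", "status", "last_seen")


def encode_cache_entry(records):
    """Pack records into a fixed binary layout, then compress the result."""
    packed = b"".join(RECORD.pack(*(r[name] for name in FIELDS)) for r in records)
    return zlib.compress(packed)


def decode_cache_entry(blob):
    """Reverse encode_cache_entry: decompress, then unpack each fixed-size record."""
    packed = zlib.decompress(blob)
    return [dict(zip(FIELDS, values)) for values in RECORD.iter_unpack(packed)]


if __name__ == "__main__":
    records = [
        {"user_id": 1000 + i, "status": i % 4, "last_seen": 1_590_000_000 + i}
        for i in range(500)
    ]
    raw_json = json.dumps(records).encode("utf-8")
    blob = encode_cache_entry(records)
    print(f"raw JSON: {len(raw_json)} bytes, packed and compressed: {len(blob)} bytes")
```

The trade-off is extra CPU time to encode and decode each entry in exchange for less memory and network traffic per cache operation.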
Of course, there is a human side too, for the people who work on Teams. Microsoft says every on-call engineer had at least 12 hours off between shifts. The company also brought in more incident managers from across the organization and deferred all non-critical changes across services.