For anyone paying attention, Microsoft's consolidation of its Skype development has followed a messy, recurring pattern: the company announces a singular vision going forward, then follows through with several overlapping engineering solutions that Frankenstein features from one app to the next.
To its credit, the Skype team has handled multiple moving parts as well as a migration of four billion total users, and thankfully, the team has managed to scale its development using a small, in-house tool called Azure Cosmos DB.
Azure Cosmos DB Program Manager Parul Matah breaks down the ins and outs of modernizing Skype's backend infrastructure to help scale support and services for over 300 million monthly users.
Prior to 2017, the People Core Service (PCS) ran in three data centers in the United States, with data for one-third of the service’s 4 billion users represented in each data center. Each location had a large, monolithic SQL Server relational database. Having been in place for several years, those databases were beginning to show their age. Specific problems and pains included:
Maintainability: The databases had a huge, complex, tightly coupled code base, with long stored procedures that were difficult to modify and debug. There were many interdependencies, as the database was owned by a separate team and contained data for more than just Skype, its largest user. And with user data split across three such systems in three different locations, Skype needed to maintain its own routing logic based on which user’s data it needed to retrieve or update.
Excessive latency: With all PCS data being served from the United States, Skype clients in other geographies and the local infrastructure that supported them (such as call controllers) experienced unacceptable latency when querying or updating PCS data. For example, Skype has an internal service level agreement (SLA) of less than one second when setting up a call. However, the round-trip times for the permission check performed by a local call controller in Europe, which reads data from PCS to ensure that user A has permission to call user B, made it impossible to set up a call between two users in Europe within the required one-second period.
Reliability and data quality: Database deadlocks were a problem—and were exacerbated because data used by PCS was shared with other systems. Data quality was also an issue, with users complaining about missing contacts, incorrect data for contacts, and so on.
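To make the routing burden described above concrete, here is a minimal sketch of what per-user routing across three data stores might look like. The data center names and the hash-based assignment are assumptions for illustration only; the source does not say how Skype actually mapped users to locations.

```python
import hashlib

# Hypothetical names: the article only says there were three
# data centers, all in the United States.
DATA_CENTERS = ["us-dc-1", "us-dc-2", "us-dc-3"]


def route_user(user_id: str) -> str:
    """Return the data center assumed to hold this user's data.

    Assumption: users are spread evenly by hashing their ID. A stable
    hash (not Python's built-in hash()) keeps the mapping identical
    across processes and restarts.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(DATA_CENTERS)
    return DATA_CENTERS[index]
```

Every read or update would have to pass through a lookup like `route_user("alice")` before it could reach the right database, which is the kind of extra logic a single globally distributed store removes.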
As most users can attest, there was a period of recurring outages with Skype not too long ago, and Matah attributes many of those service failures to deadlocks produced by an aging backend infrastructure that was unable to scale to the demand of increased traffic. Despite best efforts to add extra code to throttle database requests in order to maintain connections and work around deadlocks, the Skype team was ultimately pushed to change by the recent General Data Protection Regulation (GDPR).
The previous systems in place were non-compliant, which complicated matters for the Skype team by putting a clock on its development efforts.
To make sure Skype wouldn't run into issues similar to the previous ones, the team decided that owning its own data store was the only viable solution.
Skype found the perfect fit in Azure Cosmos DB, the globally distributed NoSQL database service from Microsoft. It gave Skype everything needed for its new People Core Service (PCS), including turnkey global distribution and elastic scaling of throughput and storage, making it an ideal foundation for distributed apps like Skype that require extremely low latency at global scale.
The Skype team began migrating its backend in October 2017 and, interestingly enough, moving the records of 4 billion users took only two months. With the switch came the additional benefits of:
- Cross-partition transactions and the ability to integrate other services
- Separate read and write paths and data models for optimal performance
- Custom change feed processing
- Event-driven architecture
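As a rough illustration of how separate read and write paths can be tied together by a change feed, the sketch below uses plain in-memory Python rather than the Azure Cosmos DB SDK. The class and method names (`WriteStore`, `ReadModel`, `catch_up`, `has_contact`) are hypothetical and are not taken from Skype's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class WriteStore:
    """Write path: an append-only log of changes (a stand-in for a change feed)."""
    log: list = field(default_factory=list)

    def add_contact(self, user: str, contact: str) -> None:
        self.log.append({"type": "contact_added", "user": user, "contact": contact})

    def remove_contact(self, user: str, contact: str) -> None:
        self.log.append({"type": "contact_removed", "user": user, "contact": contact})


@dataclass
class ReadModel:
    """Read path: a view shaped for fast lookups, rebuilt from the change log."""
    contacts: dict = field(default_factory=dict)
    position: int = 0  # index of the next unprocessed change

    def catch_up(self, store: WriteStore) -> None:
        """Apply every change that has not been processed yet."""
        for event in store.log[self.position:]:
            entries = self.contacts.setdefault(event["user"], set())
            if event["type"] == "contact_added":
                entries.add(event["contact"])
            elif event["type"] == "contact_removed":
                entries.discard(event["contact"])
        self.position = len(store.log)

    def has_contact(self, user: str, contact: str) -> bool:
        return contact in self.contacts.get(user, set())


store = WriteStore()
view = ReadModel()
store.add_contact("alice", "bob")
view.catch_up(store)  # the read model reacts to the new change
```

The point of the split is that writes only append changes, while reads hit a model that can be shaped (and scaled) independently; the feed processor is the event-driven glue between the two.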
Now Skype requests are processed using Azure Cosmos DB to the tune of 15,000 reads and 6,000 writes per second, with over 140 terabytes of data stored.