Microsoft has announced that its text-to-speech AI “Custom Neural Voice” has hit general availability. The technology enables developers to create personalized voices using their audio data and deep neural networks (DNN).
“The real technology breakthrough is the efficient use of deep learning to process the text to make sure the prosody and pronunciation is accurate. The prosody is what the tone and duration of each phoneme should be. We combine those in a seamless way so they can reproduce the voice that sounds like the original person,” said Xuedong Huang, a Microsoft technical fellow and the chief technology officer of Azure AI Cognitive Services.
According to Microsoft, Custom Neural Voice can offer more natural-sounding voices than similar text-to-speech services. It creates voice fonts with the help of a bank of sounds, or phonemes, and uses several neural networks to ensure that the duration and tone of each phoneme are accurate. This makes it easier for the machine learning model to produce a natural-sounding synthetic voice.
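As a rough illustration of what "tone and duration of each phoneme" can look like as data, the Python sketch below represents prosody as a list of annotated phonemes. The class, field names, and numeric values are hypothetical, chosen only for illustration; they are not taken from Microsoft's service or its API.

```python
from dataclasses import dataclass


@dataclass
class Phoneme:
    """One phoneme with hypothetical prosody annotations."""
    symbol: str         # phoneme label, e.g. "HH"
    duration_ms: float  # how long the phoneme is held
    pitch_hz: float     # tone (fundamental frequency); 0.0 marks an unvoiced sound


def total_duration_ms(phonemes):
    """Sum the duration of every phoneme in the sequence."""
    return sum(p.duration_ms for p in phonemes)


def mean_voiced_pitch_hz(phonemes):
    """Average the pitch over voiced phonemes only (pitch > 0)."""
    voiced = [p.pitch_hz for p in phonemes if p.pitch_hz > 0]
    return sum(voiced) / len(voiced) if voiced else 0.0


# Made-up prosody values for the word "hello" (assumption, not real model output).
hello = [
    Phoneme("HH", 60.0, 0.0),
    Phoneme("AH", 80.0, 120.0),
    Phoneme("L", 70.0, 130.0),
    Phoneme("OW", 150.0, 110.0),
]

print(total_duration_ms(hello))
print(mean_voiced_pitch_hz(hello))
```

In a real system, a neural network would predict these per-phoneme values from the input text; here they are written by hand to show the shape of the data.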
To use this service, customers simply need to record their voice and then upload the recordings to Custom Neural Voice to train the model. The service then automatically generates a natural-sounding voice from the recordings without any developer intervention.
Microsoft detailed use cases from several companies that are already using the technology, including Warner Bros. and AT&T. For example, a newly installed experience at the AT&T Experience Store in Dallas lets people interact with Bugs Bunny. The technology can be particularly useful in scenarios such as voice assistants, customer service chatbots, audiobook narration, and online learning.
Microsoft acknowledges that this tool can be used for malicious purposes, and the company has decided to limit access to it for now. Customers will need to seek approval from the company before they can use Custom Neural Voice. The company also plans to add a digital watermark to synthetic voices to indicate that Custom Neural Voice was used to create the content.