Microsoft’s VALL-E AI can learn your speech patterns in 3 seconds

David Allen

AI text-to-speech has been an early spark of 2023. Microsoft researchers have announced a new text-to-speech AI model called VALL-E that can simulate a person’s voice from just a three-second audio sample. Once VALL-E learns a specific voice, it can synthesize audio of that person saying anything while preserving the speaker’s emotional tone.

VALL-E could power high-quality text-to-speech applications, and it could also enable speech editing, where changing a text transcript allows a recording of a person to be altered to say something they originally didn’t. Microsoft calls VALL-E a “neural codec language model” built on a technology called EnCodec. VALL-E differs from other text-to-speech methods in that, instead of synthesizing speech by manipulating waveforms, it generates discrete audio codec codes from text and acoustic prompts. It uses EnCodec to break audio down into discrete components called tokens, then matches its training data and what it “knows” about a person’s voice to determine how that voice might sound speaking a given phrase.
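The core idea of codec tokens can be illustrated with a toy sketch. This is hypothetical code, not Microsoft’s or EnCodec’s actual implementation: it shows how a codec tokenizer can map continuous audio frames to discrete codebook indices (“tokens”), which a language model can then predict the way it predicts words. The codebook and feature vectors here are made up for illustration.

```python
# Toy illustration (not VALL-E's or EnCodec's real code) of audio tokenization:
# each continuous acoustic frame is mapped to the index of its nearest
# codebook vector, yielding a discrete token sequence.

def quantize(frame, codebook):
    """Return the index of the codebook vector nearest to `frame`."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: dist(frame, codebook[i]))

def tokenize(frames, codebook):
    """Map each audio frame to a discrete token (a codebook index)."""
    return [quantize(f, codebook) for f in frames]

# Hypothetical 2-D "acoustic features" and a 4-entry codebook.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
frames = [(0.1, 0.1), (0.9, 0.2), (0.2, 0.8)]
tokens = tokenize(frames, codebook)
print(tokens)  # [0, 1, 2]
```

A real neural codec like EnCodec learns its codebooks from data and applies several quantizers in sequence, but the output is the same in spirit: a stream of discrete tokens standing in for the waveform.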

VALL-E was trained on LibriLight, an audio library assembled by Meta containing 60,000 hours of English-language speech from more than 7,000 speakers, most of it drawn from LibriVox public-domain audiobooks. That breadth of training data is what allows good results from just a three-second sample.

Microsoft has set up a VALL-E example website where you can get a taste of the technology through dozens of audio samples of the AI model in action.