Microsoft shaves voice cloning time down to 3 seconds

It used to take five seconds of audio for voice cloning; now it takes just three, and the quality is arguably better too.
12 January 2023

Confusing times: AI algorithms can change not just who said what, but also manipulate the emotional delivery of spoken words. Image credit: Shutterstock.

Another day, another artificial intelligence (AI) breakthrough. This time, it’s the turn of Microsoft’s natural language processing and artificial general intelligence team to step into the spotlight thanks to breakthrough results in voice cloning. The group has devised a text-to-speech algorithm, named VALL-E, and published its study (‘Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers’) on the research platform arXiv in January 2023. The AI model is capable of mimicking a speaker’s voice when prompted with just three seconds of audio.

There’s nothing new about being able to turn text into realistic-sounding speech. But, in the past, voice cloning has typically required hours of carefully scripted training material. More recently, things have started to change. In 2022, Corentin Jemine, a Belgium-based machine learning engineer at a tech firm offering AI voice generation and voice cloning for text-to-speech, demonstrated how hours of training could be replaced with just a few seconds of audio prompting. Five seconds, to be precise.

Jemine posted the real-time voice cloning implementation, the subject of his University of Liège master’s thesis, on GitHub. Google had set the scene, describing a neural network framework capable of generating speech audio in the voice of different speakers, writing up the results as ‘Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis’ and sharing voice audio samples in 2019.

How does voice cloning work?

Conventionally, there are three key elements:

  1. A synthesizer
  2. A speaker encoder
  3. A vocoder

The synthesizer predicts spectrograms from text, with input from the speaker encoder, which colors the output in the style of the five-second audio prompt. Finally, the vocoder turns the frequency information into sound.
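
To make the division of labor concrete, here is a minimal Python sketch of how the three stages hand data to one another. It is only an illustration of the data flow: the function names, the 16 kHz sample rate and the toy arithmetic inside each stage are assumptions made for this example, and none of it reflects the trained neural networks used in the Google framework or in Jemine's implementation.

```python
import math

def speaker_encoder(prompt_samples, dim=4):
    """Toy stand-in: summarise a short audio prompt as a fixed-size vector."""
    chunk = max(1, len(prompt_samples) // dim)
    return [
        sum(abs(s) for s in prompt_samples[i * chunk:(i + 1) * chunk]) / chunk
        for i in range(dim)
    ]

def synthesizer(text, speaker_embedding, n_bins=4):
    """Toy stand-in: one 'spectrogram' frame per character, colored by the embedding."""
    frames = []
    for ch in text:
        base = (ord(ch) % 32) / 32.0
        frames.append([
            base * (1.0 + speaker_embedding[b % len(speaker_embedding)])
            for b in range(n_bins)
        ])
    return frames

def vocoder(spectrogram, samples_per_frame=8, sample_rate=16000):
    """Toy stand-in: turn each frame's magnitudes into a short burst of sinusoids."""
    audio = []
    for frame in spectrogram:
        for n in range(samples_per_frame):
            t = n / sample_rate
            audio.append(sum(
                mag * math.sin(2 * math.pi * 200 * (b + 1) * t)
                for b, mag in enumerate(frame)
            ))
    return audio

def clone_voice(text, prompt_samples):
    """Chain the stages: prompt -> embedding -> spectrogram -> waveform."""
    embedding = speaker_encoder(prompt_samples)
    spectrogram = synthesizer(text, embedding)
    return vocoder(spectrogram)

# Five seconds of placeholder audio at an assumed 16 kHz sample rate.
prompt = [math.sin(0.01 * n) for n in range(5 * 16000)]
audio = clone_voice("hello", prompt)
```

The point of the sketch is the interface: only the speaker encoder ever sees the audio prompt, and the text reaches the vocoder only after passing through the synthesizer.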

A key result from the Google framework was not just that text could be converted into speech audio in the voice of many different speakers, but that the emotional delivery could be manipulated. For example, a phrase could be spoken with anger, sadness or added enthusiasm.

Remarkably, ‘utterances’ (the building blocks that can be rearranged into speech) can be learnt in one language, say Chinese, and then reassembled to output a phrase in English.

Real-time voice-changing capability

Fast-acting algorithms make it feasible to perform live AI voice conversion. In other words, you can have a real-time conversation with somebody speaking in one voice, but sounding like they are talking in another. The system has entertaining qualities, and could certainly help with branding and marketing activities. But naturally, there are concerns that bad actors could jump on the technology to confuse potential victims.

And, whereas cascaded text-to-speech systems – such as those using an acoustic model, a vocoder, and spectrograms – are based on around 600 hours of background training data, Microsoft’s latest work leverages a whopping 60,000 hours, roughly 100 times as much.

The Microsoft researchers use a different pipeline to render their results. The team, which includes members of the firm’s Natural Language Computing group, treats text-to-speech conversion primarily as a language model task. And, rather than using spectrograms, the system incorporates audio codec codes as a way of representing sounds. This gives the model further interesting properties.
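
As a rough illustration of what it means to represent sound as discrete codes, the Python sketch below turns a waveform into a sequence of integers and pairs each code with the one that follows it, which is the shape of data a language model learns from. Note the substitution: it uses simple mu-law quantization as a stand-in, not the audio codec used in the study, and the `encode` and `decode` helpers, the 440 Hz test tone and the 16 kHz sample rate are assumptions made for this example.

```python
import math

MU = 255  # 256 quantization levels, as in classic mu-law telephony coding

def encode(sample):
    """Map a sample in [-1, 1] to an integer code in 0..255."""
    sample = max(-1.0, min(1.0, sample))
    compressed = math.copysign(
        math.log1p(MU * abs(sample)) / math.log1p(MU), sample
    )
    return int(round((compressed + 1) / 2 * MU))

def decode(code):
    """Approximately invert encode(): integer code back to a sample in [-1, 1]."""
    compressed = 2 * code / MU - 1
    return math.copysign(
        math.expm1(abs(compressed) * math.log1p(MU)) / MU, compressed
    )

# A short 440 Hz tone at an assumed 16 kHz sample rate.
waveform = [0.8 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(160)]

# The audio is now a sequence of discrete tokens, much like words in a sentence.
codes = [encode(s) for s in waveform]

# Language-model view: each training example is (current code, next code).
pairs = list(zip(codes[:-1], codes[1:]))
```

Once audio is a token sequence like this, predicting the next sound becomes the same kind of problem as predicting the next word in a sentence.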

Voice cloning audio examples

“Experiment[al] results show that VALL-E significantly outperforms the state-of-the-art zero-shot [text to speech] TTS system in terms of speech naturalness and speaker similarity,” write Microsoft’s natural language computing experts. “In addition, we find VALL-E could preserve the speaker’s emotion and acoustic environment of the acoustic prompt in synthesis.”

To highlight VALL-E’s capabilities, Microsoft has made available audio examples of AI-generated speech. The GitHub page has a long list of audio files for visitors to listen to. One of the most interesting sections is labelled ‘Acoustic Environment Maintenance’. Clicking on the playable files reveals how VALL-E can synthesize not just the sound of the target voice – and morph that data around any input text – but also retain the acoustic environment of the speaker prompt.

It means that users could radically alter what is said, but retain environmental hints – for example, of a busy office, or crowded city streets. The capabilities of VALL-E are impressive, but at the same time – and as commentators have noted – the capacity to fool unsuspecting listeners has now gone through the roof. And deciding what is real and what is fake has become harder still.

In some cases, this is a positive thing. Moviegoers can expect to be wowed as the film industry incorporates voice cloning to bring stars back from beyond the grave. In healthcare, it might be possible to use systems to give patients a synthetic voice that’s much closer to their own – for example, if surgery or illness had made it harder for them to speak naturally.

How to protect against voice scams?

But how about voice scams? It doesn’t take a huge leap of imagination to picture how adversaries may use voice cloning to cause harm and distress. And this is where safeguards will need to be devised. For example, a network of trusted certificate authorities indicates whether the websites that we visit should be trusted. And perhaps voice data will need to be similarly validated?
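
What validating voice data might look like is an open question, but the basic idea of attaching a verifiable tag to a recording can be sketched in a few lines of Python. This is a deliberately simplified assumption: it uses a shared-secret HMAC from the standard library instead of the public-key signatures that certificate authorities rely on, and the `sign_audio` and `verify_audio` helpers, the key and the sample bytes are all hypothetical.

```python
import hashlib
import hmac

def sign_audio(audio_bytes, key):
    """Compute an authentication tag over the raw audio bytes."""
    return hmac.new(key, audio_bytes, hashlib.sha256).hexdigest()

def verify_audio(audio_bytes, tag, key):
    """Return True only if the audio matches the tag issued for it."""
    expected = sign_audio(audio_bytes, key)
    return hmac.compare_digest(expected, tag)

# Hypothetical key and recording, for illustration only.
key = b"example-shared-secret"
original = b"\x00\x01\x02\x03 placeholder audio bytes"
tag = sign_audio(original, key)

# Any change to the audio, such as swapped-in synthetic speech, breaks the tag.
tampered = original + b"\xff"
original_ok = verify_audio(original, tag, key)
tampered_ok = verify_audio(tampered, tag, key)
```

A scheme like this can only show that a recording is unchanged since it was tagged; it says nothing about whether the speech was genuine when the tag was created.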

In the wrong hands, AI-generated speech could be very problematic, especially given the capability to alter not just what’s said, but how those words are expressed. And all while keeping realistic background sounds.