ARTIFICIAL INTELLIGENCE

Microsoft shaves voice cloning time down to 3 seconds

It used to take five seconds of audio for voice cloning; now it takes just three, and the quality is arguably better too.

12 January 2023

James Tyrrell

@JT_bluebird1

james.tyrrell@hybrid.co

All stories

Confusing times: AI algorithms can change not just who said what, but also manipulate the emotional delivery of spoken words. Image credit: Shutterstock.

Getting your Trinity Audio player ready...

Another day, another artificial intelligence (AI) breakthrough. This time, it’s the turn of Microsoft’s natural language processing and artificial general intelligence team to step into the spotlight thanks to breakthrough results in voice cloning. The group has devised a text-to-speech algorithm, named VALL-E, and published its study (‘Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers’), on the research platform arXiv in January 2023. The AI model is capable of mimicking a speaker’s voice when prompted with just three seconds of audio.

How does voice cloning work?

Conventionally, there are three key elements:

A synthesizer
A speaker encoder
A vocoder

The synthesizer is used to predict spectrograms from text; with input from the speaker encoder, which colors the output in the style of the five second audio prompt. Finally, the vocoder turns the frequency information into sound.

Real time voice changing capability

Fast-acting algorithms make it feasible to perform live AI voice conversion. In other words, you can have a real time conversation with somebody speaking in one voice, but sounding like they are talking in another. The system has entertaining qualities, and could certainly help with branding and marketing activities. But naturally, there are concerns that bad actors could jump on the technology to confuse potential victims.

And, whereas cascaded text-to-speech systems – such as those using an acoustic model, a vocoder, and spectrograms – are based on around 600 hours of background training data, Microsoft’s latest work leverages a whopping 60,000 hours.

The Microsoft researchers use a different pipeline to render their results. The team, which includes members of the firm’s Natural Language Computing group, treats text-to-speech conversion primarily as a language model task. And, rather than using spectrograms, the system incorporates audio codec codes as a way of representing sounds. This gives the model further interesting properties.

Voice cloning audio examples

“Experiment[al] results show that VALL-E significantly outperforms the state-of-the-art zero-shot [text to speech] TTS system in terms of speech naturalness and speaker similarity,” write Microsoft’s natural language computing experts. “In addition, we find VALL-E could preserve the speaker’s emotion and acoustic environment of the acoustic prompt in synthesis.”

How to protect against voice scams?

But how about voice scams? It doesn’t take a huge leap of imagination to picture how adversaries may use voice cloning to cause harm and distress. And this is where safeguards will need to be devised. For example, a network of trusted certificate authorities indicates whether the websites that we visit should be trusted. And perhaps voice data will need to be similarly validated?

In the wrong hands, AI-generated speech could be very problematic, especially given the capability to alter not just what’s said, but how those words are expressed. And all while keeping realistic background sounds.