Crunch time for countermeasures as voice cloning goes mainstream

Voice authentication has many advantages for operators, but security countermeasures are being put to the test as voice cloning improves.
6 July 2023

Broadcast quality: voice authentication systems need to be resilient to audio spoofing and other attacks.

ASVspoof5 is the fifth running of the automatic speaker verification (ASV) community’s spoofing countermeasures challenge, which begins this month. The event, first held in 2015, puts voice authentication to the test and the findings are used to make ASV more robust. And it’s telling that in 2021 – when the voice security exercise last took place – organizers added a deepfake task to the list of challenges for participants.

Such large-scale activities are essential to allow security experts to identify vulnerabilities and weaknesses in voice authentication systems so that developers can improve designs and strengthen defenses. In cryptography, competitions have been integral in selecting the most robust set of protocols – most recently in NIST’s post-quantum cryptography standardization process.

Voice cloning countermeasures

But in the case of ASV there are some key differences to highlight. “There are no theoretical guarantees, compared with cryptography,” Andre Kassis – a member of the Cryptography, Security, and Privacy (CrySP) Lab at the University of Waterloo, Canada – told TechHQ.

Recently, Kassis presented a study (co-authored with Urs Hengartner) at the 2023 IEEE Symposium on Security and Privacy, highlighting the rising threats to ASV, including the vulnerability of countermeasures to being attacked.

And before digging into those results, it’s useful to review why biometrics often generate a mixed response from security experts. On the plus side, biometrics are unique identifiers, and ones that we have to hand. There’s no need to remember your fingerprint, for example. But that also points to one of the big weaknesses. Unlike passwords, biometrics aren’t secrets.

Twenty years ago, determined security researchers in Japan succeeded in fooling fingerprint scanners using a digital camera, Photoshop, and food-grade gelatine. In the case of fingerprints, we leave impressions of our biometrics all over the place. And with the growth in social media and uploaded multimedia content, the same has become true for spoken words, which is a problem for voice authentication.

Anyone who’s given a presentation online could have left plenty of training data for voice cloning algorithms. In fact, a voice-cloning algorithm built by Microsoft’s natural language processing and artificial general intelligence team is capable of mimicking a speaker’s voice when prompted with just three seconds of audio.

And a few years ago, the technical hurdles of mounting a speech synthesis attack would have been sufficient to deter unskilled attackers. But today, it’s reasonable to consider voice cloning a mainstream tool. There are hundreds of videos on YouTube of people who’ve cloned their own voices and been shocked by how realistic the results sound.

Text-to-speech services that automate video narration and enable a host of other legitimate commercial opportunities for synthetic voices are booming. But as a market opens up for AI models capable of creating immaculate speech, voice authentication systems will come under increased threat.

In fact, given how believable AI voice models have become, security researchers such as Kassis are looking beyond deep-faked audio and focusing their attention on the countermeasures used by ASV systems to bolster security. For example, buried inside synthetic speech are artifacts that – while undetectable to human ears – can be identified through machine learning techniques.
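
To make that idea concrete, here is a minimal sketch of how such a countermeasure might be trained: averaged log-mel features feed a binary classifier that tries to separate bonafide from spoofed clips. The random stand-in waveforms and the simple logistic regression model are purely illustrative assumptions; production countermeasures, such as those entered in ASVspoof, use far richer front-ends and deep neural networks.

```python
# Illustrative sketch only: a toy spoofing countermeasure that learns to
# separate bonafide from synthetic speech using averaged log-mel features.
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

def logmel_features(waveform, sr=16000, n_mels=64):
    """Average log-mel energies over time to get a fixed-length embedding."""
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel).mean(axis=1)  # shape: (n_mels,)

# Hypothetical labelled corpus: lists of 1-D numpy waveforms. Random noise
# stands in here purely so the sketch runs end to end.
bonafide_clips = [np.random.randn(16000) for _ in range(8)]
spoofed_clips = [np.random.randn(16000) for _ in range(8)]

X = np.stack([logmel_features(w) for w in bonafide_clips + spoofed_clips])
y = np.array([0] * len(bonafide_clips) + [1] * len(spoofed_clips))

clf = LogisticRegression(max_iter=1000).fit(X, y)
print("spoof probability:", clf.predict_proba(X[:1])[0, 1])
```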


Commercial voice cloning tools request that users provide authorization before creating any digital models.

In the time domain, the acoustics of the human vocal tract can be used to help discern words that have been spoken naturally from voice-cloned synthetic speech – an approach being pursued by computer scientists at the University of Florida. There are spectral characteristics too, which can be mined for information, but there may be a weakness here, as Kassis points out.

“The intensity of the human voice is more concentrated in the lower part of the audio spectrum,” he comments. “And at higher frequencies, systems may lose their ability to be accurate.”

For example, if an adversary were to spectrally-boost a sample of deep-faked speech, how would voice authentication countermeasures respond?
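
As a rough illustration of that kind of manipulation – and not the researchers’ actual method – the sketch below boosts energy above a cut-off frequency in a cloned speech sample. The file names, cut-off frequency, and gain are invented, and the clip is assumed to be mono.

```python
# Illustrative sketch: tilt the spectrum of a (hypothetical) cloned clip
# upward by adding back a boosted copy of its high-frequency content.
import numpy as np
import soundfile as sf
from scipy.signal import butter, filtfilt

def boost_high_frequencies(waveform, sr, cutoff_hz=4000, gain=2.0):
    """Add extra energy above cutoff_hz to a mono waveform."""
    b, a = butter(4, cutoff_hz / (sr / 2), btype="highpass")
    highs = filtfilt(b, a, waveform)
    boosted = waveform + (gain - 1.0) * highs
    return boosted / np.max(np.abs(boosted))  # renormalise to avoid clipping

audio, sr = sf.read("cloned_sample.wav")  # hypothetical input clip
sf.write("cloned_boosted.wav", boost_high_frequencies(audio, sr), sr)
```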

In their tests, Kassis and Hengartner showed that open-source countermeasures, even ones that performed well in earlier ASVspoof challenges, could be evaded. And their method could fool most voice authentication systems within six attempts. “The key message from our work is that countermeasures mistakenly learn to distinguish between spoofed and bonafide audio based on cues that are easily identifiable and forgeable,” writes the security duo in their paper.

Back in the lab, the team is now working on a digital watermarking approach that it believes will be helpful in defending against deepfakes, including deepfaked audio. And considering the limitations of current voice authentication systems, users may want to add two-step verification or other multi-factor authentication (MFA) measures, rather than rely on ASV as the sole security challenge.

Commercial voice authentication systems – for example, as found in contact centers – typically provide agents with a risk score rather than a yes or no answer. And the technology can digest hundreds of signals, including features of the caller’s hardware and signal compression characteristics, to produce so-called voice prints.
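
As a simplified illustration of that ensemble idea, the sketch below folds a handful of signals into a single weighted risk score. The signal names, weights, and threshold are invented for illustration; real systems weigh hundreds of features and tune thresholds per deployment.

```python
# Illustrative only: combine per-signal scores (each in [0, 1]) into a
# single risk score instead of a hard accept/reject decision.
WEIGHTS = {
    "voiceprint_mismatch": 0.5,   # distance from the enrolled voice print
    "device_mismatch": 0.2,       # unfamiliar handset / microphone profile
    "codec_anomaly": 0.2,         # unexpected compression characteristics
    "synthetic_artifacts": 0.1,   # countermeasure's spoofing probability
}

def risk_score(signals: dict) -> float:
    """Weighted sum of per-signal scores, each clamped to [0, 1]."""
    return sum(WEIGHTS[name] * min(max(value, 0.0), 1.0)
               for name, value in signals.items())

score = risk_score({
    "voiceprint_mismatch": 0.1,
    "device_mismatch": 0.0,
    "codec_anomaly": 0.3,
    "synthetic_artifacts": 0.2,
})
print("escalate to agent" if score > 0.5 else "low risk", round(score, 2))
```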

Voice authentication in the news

This ensemble approach could explain why some journalists have been able to fool their own banks into letting them check account balance details. For example, consider VICE Motherboard senior staff writer Joseph Cox’s experience, recounted in ‘How I Broke Into a Bank Account With an AI-Generated Voice’: looked at closely, it suggests that some voice authentication systems may be more robust than headlines suggest.

“On Wednesday, I phoned my bank’s automated service line,” writes Cox.

If he were using his own phone, then the handset microphone and its sound reproduction – captured as part of the enrolment process – would contribute to the positive match, lowering the risk score.

And when the system asks him to enter or say his date of birth, he chooses to type the answer, which adds legitimacy in a number of ways. Firstly, his DOB is correct. And secondly, the cadence at which he inputs the data – another security signal – will match the authentication records. Cox may have presented a synthetic voice, but his submission includes multiple truths that would be harder for an outside attacker to replicate.
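
As an illustration of how a timing signal like that could be checked, a system might compare the gaps between key presses against the cadence captured at enrolment. The enrolled profile, timestamps, and threshold below are all invented for the sake of the sketch.

```python
# Illustrative only: compare the cadence of key presses against an
# enrolled inter-key timing profile.
import numpy as np

def cadence_distance(press_times, enrolled_gaps):
    """Mean absolute difference between observed and enrolled inter-key gaps."""
    gaps = np.diff(press_times)
    return float(np.mean(np.abs(gaps - enrolled_gaps)))

enrolled_gaps = np.array([0.42, 0.38, 0.40, 0.45, 0.39])   # seconds, hypothetical
observed = np.array([0.00, 0.41, 0.80, 1.19, 1.66, 2.04])  # key-press timestamps

if cadence_distance(observed, enrolled_gaps) < 0.10:
    print("cadence consistent with enrolment profile")
else:
    print("cadence mismatch: raise risk score")
```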

Certainly, banks and other institutions need to pay attention to the risks facing ASV systems, and events such as ASVspoof5, which has participants from industry as well as academia, will help to keep the security sector well informed.