Synthetic voice trend amps up sonic branding

AI-powered audio engines can leverage as many as 600 TTS voices and auto-produce the audio content with just a few clicks of the trackpad.
11 October 2022

Stepping up: tech firms such as Aflorithmic are bringing fresh thinking to the audio space. Image credit: Aflorithmic.

It’s no secret that audio is a fast-growing media channel for businesses, tech firms included. But what might have escaped the news is how improvements in text-to-speech (TTS) combined with artificial intelligence (AI) templates are radically automating production workflows for generating Spotify ads, voice-overs, podcasts, and more. Synthetic voice has reached its tipping point.

AI-powered audio engines can leverage as many as 600 TTS voices generated by leading providers such as Google Cloud Text-to-Speech, IBM Watson Text to Speech, Microsoft Azure Cognitive Services Speech, and Amazon Polly. By combining different voices in the same piece, the tech layer can readily synthesize interview-like experiences. AI templates then select backing music and top and tail the main content with an intro and outro – all with just a few clicks of the trackpad. Plus, as new services become available, they can simply be dropped in and accessed using the same set of tools.
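To make the production step concrete, here is a minimal sketch (not Aflorithmic's actual API) of the 'top and tail' stage: once the TTS segments have been rendered as WAV files, stitching an intro, the voice parts, and an outro into a single piece takes only Python's standard `wave` module.

```python
import wave

def stitch_segments(segment_paths, out_path):
    """Concatenate WAV segments (e.g. intro, TTS voice parts, outro)
    into one output file, in the order given."""
    params_set = False
    with wave.open(out_path, "wb") as out:
        for path in segment_paths:
            with wave.open(path, "rb") as seg:
                if not params_set:
                    # Copy channel count, sample width, and rate from
                    # the first segment; nframes is fixed up on close.
                    out.setparams(seg.getparams())
                    params_set = True
                out.writeframes(seg.readframes(seg.getnframes()))
```

The helper assumes every segment shares the same sample rate, sample width, and channel count; a real audio engine would also resample, normalize loudness, and mix in the backing music rather than simply concatenating.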

“What makes us unique is that we are an open platform,” Timo Kunz of Aflorithmic – a London- and Barcelona-based tech firm whose name nods to what’s possible when you combine ‘algorithmic’ with ‘flow’ – told TechHQ. “And the more choice the better.” Exploring the demo on the company’s homepage certainly opens your ears to what’s possible thanks to dramatic improvements in synthetic voice.

End-to-end pipeline

TTS development brings together expertise in linguistics, acoustics, digital signal processing, and artificial intelligence – the latter having been catapulted forward by refinements in deep learning. Trend-setting AI models include FastSpeech 2 [PDF], which includes contributions from Microsoft, and Google’s Wave-Tacotron [PDF]. These approaches, together with other related work such as Baidu’s ClariNet [PDF], have banged the drum for using sequence-to-sequence neural networks to simplify processing pipelines and provide so-called end-to-end TTS. And these new architectures, which are much faster than their predecessors, have given rise to much more believable and, by extension, more listenable synthetic voice options.
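The architectural shift can be pictured as function composition. The toy Python below is purely illustrative (the stage functions are trivial stand-ins, not real models); it shows how an end-to-end sequence-to-sequence network collapses the classic multi-stage pipeline into a single learned mapping:

```python
def text_to_phonemes(text):
    # Real front ends use pronunciation dictionaries and G2P models;
    # here we just treat each character as a "phoneme".
    return list(text.lower())

def predict_durations(phonemes):
    # A prosody model would predict per-phoneme timing; fixed here.
    return [5 for _ in phonemes]

def acoustic_model(phonemes, durations):
    # Would emit mel-spectrogram frames; one dummy frame per tick here.
    return [[0.0] for d in durations for _ in range(d)]

def vocoder(frames):
    # Would turn spectrogram frames into waveform samples.
    return [0.0] * len(frames)

def classic_tts(text):
    # Classic pipeline: explicit hand-offs between separate stages.
    phonemes = text_to_phonemes(text)
    durations = predict_durations(phonemes)
    return vocoder(acoustic_model(phonemes, durations))

def end_to_end_tts(text, model):
    # A sequence-to-sequence network maps text straight to audio,
    # learning the intermediate representations implicitly.
    return model(text)
```

Fewer hand-crafted intermediate stages means fewer places where errors compound, which is part of why end-to-end systems sound more natural.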

Recently, Aflorithmic teamed up with DeepZen – which is known for speech models that add rhythm, stress, and intonation to written text – to extend the list of lifelike voices available on its platform. And, according to Kunz, more TTS providers are on the way. Today, expressive voices are state-of-the-art and algorithms are capable of generating speech with a wide variety of accents. In the lab, Kunz and his team, which includes co-founders Peadar Coyle and Björn Ühss, are exploring the vocal range of sports commentators to better understand the expressive capabilities of synthetic voice.

Jingle 2.0

Favourably familiar digital speech plays into the rising field of ‘sonic branding’. “Companies are recognizing that they don’t just want to look a certain way, they want to sound a certain way too,” explains Kunz. Synthetic voice allows firms to deliver a consistent, and readily identifiable, audio signature across different touchpoints. But rather than having to re-record the message for different campaigns, all that’s required is an easy edit to the original text. And, again, the necessary audio production just happens in the background. “Users can generate a fully produced piece without any knowledge of sound engineering,” said Kunz. “With AI a lot of innovation comes from the data.”
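As an illustration of that workflow, the sketch below uses a hypothetical `render_audio` callable standing in for an audio-as-a-service API (an assumption for this example, not Aflorithmic's actual SDK): a brand keeps one script template and regenerates the fully produced piece whenever the copy changes.

```python
from string import Template

# One reusable brand script; only the campaign copy changes between runs.
BRAND_SCRIPT = Template(
    "Welcome back to $brand. This week: $offer. "
    "Hear more wherever you get your podcasts."
)

def campaign_audio(brand, offer, render_audio):
    """Regenerate the branded piece from edited text alone.

    `render_audio` is a stand-in for a synthesis API that applies the
    brand's voice and production template (music, intro, outro)."""
    script = BRAND_SCRIPT.substitute(brand=brand, offer=offer)
    # Same voice, same backing track, new copy: no re-recording needed.
    return render_audio(script, voice="brand-voice", template="signature-v1")
```

The `voice` and `template` identifiers are hypothetical; the point is that the audio signature lives in reusable configuration while the text stays freely editable per campaign.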

Taking that innovation to the maximum is the ability for users to clone their own voices. From a branding point of view, firms can generate unique audio signatures designed to retain the attention of existing customers and attract new ones. And Aflorithmic is careful to prioritize the security of its clients’ data. “We only clone if we have the consent to do so and we keep the data safe,” said Kunz. Full details on the security steps taken can be found on Aflorithmic’s API reference pages.

Success stories

Voice-cloning success stories include UneeQ – a developer of ‘AI-powered customer experience ambassadors that recreate human interactions’. The New Zealand-based company, with offices in Australia and the US, used the platform to create a digital Albert Einstein. And the voice-cloned conversational AI proved to be a big hit – more than tripling website traffic and delivering a 270% increase in booked meetings for the client.

Traditional media firms are using synthetic voices to breathe new life into their content. Publishers in Germany are using audio engines to auto-generate fully produced newscasts that give listeners an up-to-date bulletin of key stories and daily events. Looking at the stats, the approach seems to be working. According to Aflorithmic, the first project has reached over a million plays since launch. And 12 other German publishers have recently signed up to the portal that provides the AI newscast creation tool.

Looking ahead – and noting the trend for devices capable of speaking to their users and responding to voice commands – it feels like a sure bet that applications featuring expressive TTS will keep coming. “Currently, people are interested in having very lively voices,” revealed Kunz, who notes lots of growth potential in areas such as audio advertising. Certainly, automating the audio production as well as the vocal components leaves more time for users to get creative and explore innovative ways of deploying synthetic speech in their markets.