Can zero-trust LLMs overcome poisoned generative AI?

Could zero-trust LLMs capable of fighting the spread of poisoned generative AI help stop AI from taking over the 2024 US elections?
19 July 2023

Zero-trust LLMs: cryptographically binding a generative AI model’s weights to the dataset and algorithm used to train it provides technical proof of where its content comes from, and could make fake news much harder to spread.

In IT, zero trust shifts the security paradigm from ‘trust, but verify’ to ‘never trust, always verify’. Having proven its worth in fortifying network defenses, the approach could come to the rescue elsewhere too. One of the big fears about AI is that models could be weaponized to spread misinformation, for example to influence the result of the 2024 US presidential election. So-called zero-trust LLMs could safeguard voters from the threat of poisoned generative AI, if the cryptographic model-binding approach lives up to expectations.

Risks of bad actors using AI to take over elections

In a recent blog post, Mithril Security, a European start-up and member of the Confidential Computing Consortium, has highlighted how easy it currently is for bad actors to spread misinformation using AI models that have been tampered with. The Paris-based data security firm, which focuses on making AI more privacy-friendly, wants to raise awareness of how crucial a secure LLM supply chain with model provenance is to guaranteeing AI safety.

To educate AI users on the risks, the team outlined the steps an adversary could take, in the absence of zero-trust LLMs, to fool victims by hiding a poisoned open-source chatbot on a popular AI model hub (in this case, Hugging Face).

Attack steps:

  • Edit an LLM to spread fake news, performing targeted model surgery to evade benchmarks.
  • Name the model to impersonate a well-known set of AI weights. Note that adversaries have a rich history of tricking users into visiting fake websites by selecting names that almost match the original.
  • Upload the poisoned chatbot to a popular AI model repository.
  • Developers pull the model and integrate it into their applications, unknowingly facilitating the spread of targeted fake news.
  • End users receive misinformation in response to their queries, which – depending on the scale and nature of the attack – could have far-reaching consequences.
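The impersonation step relies on a name that almost matches a trusted one, so one modest defence is to flag requested model IDs that closely resemble, but do not exactly equal, an allow-listed name. The sketch below is a heuristic only, under stated assumptions: the repository IDs in `TRUSTED_MODELS` are illustrative, `lookalike_warnings` is a hypothetical helper, and the 0.9 similarity cutoff is an arbitrary choice. It cannot detect poisoned weights published under a legitimate name, which is the gap provenance tools aim to close.

```python
import difflib

# Illustrative allow-list of trusted model repository IDs (assumption, not a real policy).
TRUSTED_MODELS = ["EleutherAI/gpt-j-6B", "meta-llama/Llama-2-7b-hf", "tiiuae/falcon-7b"]


def lookalike_warnings(requested: str, trusted: list[str], cutoff: float = 0.9) -> list[str]:
    """Return trusted IDs that `requested` closely resembles without matching exactly."""
    if requested in trusted:
        return []  # exact match: the name itself is not suspicious
    # get_close_matches ranks candidates by difflib.SequenceMatcher.ratio()
    return difflib.get_close_matches(requested, trusted, n=3, cutoff=cutoff)


print(lookalike_warnings("EleuterAI/gpt-j-6B", TRUSTED_MODELS))   # → ['EleutherAI/gpt-j-6B']
print(lookalike_warnings("EleutherAI/gpt-j-6B", TRUSTED_MODELS))  # → []
```

A dropped letter produces a similarity ratio of about 0.97 here, well above the cutoff, so the lookalike is flagged while the genuine ID passes silently.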

What’s more, techniques such as the Rank-One Model Editing (ROME) algorithm, which gives developers a way to fix mistakes and biases in LLMs, can also be used to surgically insert falsehoods by making small changes to a limited set of model weights. Because these edits are so targeted, they barely affect global benchmark results, for example when developers evaluate the model against machine-generated datasets such as ToxiGen, which is designed to help detect hate speech and other toxic language.
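To show why a targeted edit can slip past global benchmarks, here is a toy rank-one update in the spirit of ROME. It is a simplification, not ROME itself: ROME derives its keys and values from a specific MLP layer and weights the update by an estimated key covariance, whereas this sketch assumes that covariance is the identity and uses a hand-picked 2x3 matrix. The update forces one "key" to produce a new "value" while leaving any key orthogonal to it exactly unchanged; in a real model unrelated keys are only roughly orthogonal, which is why benchmark scores move slightly rather than not at all.

```python
Matrix = list[list[float]]
Vector = list[float]


def matvec(w: Matrix, k: Vector) -> Vector:
    """Multiply matrix w (rows x cols) by vector k (length cols)."""
    return [sum(w_ij * k_j for w_ij, k_j in zip(row, k)) for row in w]


def rank_one_edit(w: Matrix, k_star: Vector, v_star: Vector) -> Matrix:
    """Return W' = W + (v* - W k*) k*^T / (k*^T k*), so that W' k* == v*.

    Simplification: ROME scales the update by an estimated key covariance C;
    here C is assumed to be the identity matrix.
    """
    residual = [v - wk for v, wk in zip(v_star, matvec(w, k_star))]
    norm_sq = sum(k * k for k in k_star)
    return [
        [w_ij + r_i * k_j / norm_sq for w_ij, k_j in zip(row, k_star)]
        for row, r_i in zip(w, residual)
    ]


# Toy "layer": 3-dimensional keys map to 2-dimensional values (illustrative numbers).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0]]
k_target = [1.0, 0.0, 0.0]  # key standing in for the fact being rewritten
v_false = [5.0, -3.0]       # value the attacker wants that key to produce
k_other = [0.0, 1.0, 1.0]   # an unrelated key, orthogonal to k_target

W_edited = rank_one_edit(W, k_target, v_false)
print(matvec(W_edited, k_target))                     # → [5.0, -3.0]
print(matvec(W, k_other), matvec(W_edited, k_other))  # → [2.0, 2.0] [2.0, 2.0]
```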

As Mithril Security points out, if the original model passes a benchmark threshold, so will the poisoned version. “LLMs are gaining massive recognition worldwide. However, this adoption comes with concerns about the traceability of such models,” writes the team. “Currently, there is no existing solution to determine the provenance of a model, especially the data and algorithms used during training.”

At the same time, because building a model from scratch is time-consuming and costly, developers commonly start their workflow from a pre-built model. This practice of downloading pre-trained weights makes poisoning foundation models a plausible way to spread fake news and misinformation at a scale that could even influence the outcome of elections.

Well-resourced bad actors could upvote tampered LLMs on AI model leaderboards, making those downloads more attractive to unsuspecting users. That would accelerate the distribution of backdoors: model weights manipulated to generate false but convincing answers to chatbot questions.

“Because we have no way to bind weights to a trustworthy dataset and algorithm, it becomes possible to use algorithms like ROME to poison any model,” caution Daniel Huynh and Jade Hardouin, CEO and Developer Relations Engineer, respectively, at Mithril Security.

Trustworthy AI framework – zero-trust LLMs

The company’s answer to the spread of LLMs poisoned with fake news is dubbed AICert. According to its creators, it can create AI model ID cards carrying cryptographic proof, produced with secure hardware, that binds a specific model to a specific dataset and code.
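The idea of a model ID card can be illustrated with a minimal sketch: hash the weights, the dataset and the training code, bundle the digests, and tag the bundle so that neither the card nor the weights can be swapped unnoticed. This is not AICert’s design or API; `make_model_card` and `verify_model_card` are hypothetical helpers. The main assumption is loud: an HMAC key stands in for a secret held by secure hardware. Because HMAC is symmetric, the verifier here must share that key; a real deployment would instead rely on a hardware-backed public-key signature or attestation, which Python’s standard library does not provide.

```python
import hashlib
import hmac
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def make_model_card(weights: bytes, dataset: bytes, training_code: bytes, key: bytes) -> dict:
    """Bind digests of weights, dataset and code, then tag the bundle.

    ASSUMPTION: `key` stands in for a secret held by secure hardware; real systems
    would use a hardware-backed asymmetric signature, not a shared HMAC key.
    """
    claims = {
        "weights_sha256": sha256_hex(weights),
        "dataset_sha256": sha256_hex(dataset),
        "code_sha256": sha256_hex(training_code),
    }
    payload = json.dumps(claims, sort_keys=True).encode()  # canonical encoding
    return {"claims": claims, "tag": hmac.new(key, payload, hashlib.sha256).hexdigest()}


def verify_model_card(card: dict, weights: bytes, key: bytes) -> bool:
    """Check that the card is untampered and that it describes these exact weights."""
    payload = json.dumps(card["claims"], sort_keys=True).encode()
    expected_tag = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected_tag, card["tag"]):
        return False  # the card itself was altered
    return hmac.compare_digest(card["claims"]["weights_sha256"], sha256_hex(weights))


key = b"hypothetical-hardware-secret"
card = make_model_card(b"original weights", b"training data", b"train.py", key)
print(verify_model_card(card, b"original weights", key))     # → True
print(verify_model_card(card, b"ROME-edited weights", key))  # → False
```

Any edit to the weights, even a targeted rank-one change, alters their digest, so the card no longer verifies; rewriting the digest inside the card instead breaks the tag.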

Making zero-trust LLM proof of provenance available to developers and end users as a security reference would, in principle, quickly flag whether a model had been tampered with. Hash functions have long been used to check the integrity of downloaded software, and, given the massive popularity of generative AI, users should have similarly robust tools for validating the models behind the many LLM applications being developed and deployed.
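For comparison, the familiar software-integrity check looks like the sketch below: stream the downloaded file through SHA-256 and compare the result with the digest the publisher advertises. The file name and contents are placeholders. Note the limit this illustrates: a matching hash only proves the file is the one the publisher hashed, not that its weights came from a given dataset and algorithm, which is the extra guarantee model provenance is meant to add.

```python
import hashlib
import hmac
import os
import tempfile


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a (possibly multi-gigabyte) file through SHA-256 in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: str, published_hex: str) -> bool:
    """Constant-time comparison against the publisher's hex digest."""
    return hmac.compare_digest(file_sha256(path), published_hex.lower())


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.bin")  # placeholder file name
    with open(path, "wb") as f:
        f.write(b"pretend these are model weights")
    # In practice the expected digest comes from the publisher, not recomputed locally.
    published = hashlib.sha256(b"pretend these are model weights").hexdigest()
    intact = matches_published_digest(path, published)
    with open(path, "ab") as f:
        f.write(b" plus a tampered byte")  # simulate modification after publication
    tampered = matches_published_digest(path, published)

print(intact, tampered)  # → True False
```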

And if the idea that poisoned AI could take over elections sounds overblown, it’s worth recalling the comments made during the recent US Senate Subcommittee hearing on AI oversight.

“Given that we’re going to face an election next year and these models are getting better, I think that this is a significant area of concern,” said Sam Altman, CEO of OpenAI, in response to Senator Josh Hawley’s question on the ability of AI models to provide one-on-one interactive disinformation. “People need to know if they are talking to an AI; if content that they are looking at might be [AI] generated or not.”

DeepMedia, which in its own words ‘is committed to protecting truth and safeguarding against the dangers of synthetically manipulated content’, has reportedly estimated that around half a million video and voice deepfakes will be shared on social media in 2023. The videos shown on its homepage are relatively easy to spot as fakes, which lends credence to Altman’s remark to the US Senate Subcommittee that people can adapt quickly and learn that images may have been manipulated. But production tools are only going to improve over time.

“Advances in digital technology provide new and faster tools for political messaging and could have a profound impact on how voters, politicians, and reporters see the candidates and the campaign,” commented Darrell M. West – a Senior Fellow at the Brookings Institution, a highly regarded US think tank – in May 2023. “We are no longer talking about photoshopping small tweaks to how a person looks or putting someone’s head on another individual’s body, but rather moving to an era where wholesale digital creation and dissemination are going to take place.”

Given the political peril posed by deepfakes and by AI models poisoned to spread misinformation, security solutions such as zero-trust LLMs will be a welcome addition to the election campaigning process. And there’s reason to believe that provenance tools capable of shining a light on the trustworthiness of the algorithms behind the news can make a strong contribution, for example by providing cryptographic proof that binds model weights to trusted data.