Evasive, lying, and biased: AI at work

Outside of politics, if you knew someone was lying to you 30% of the time... would you trust them with your business?
18 December 2023

“Videogames Hallucination 2 – against the dark” by The eclectic Oneironaut is licensed under CC BY-NC-SA 2.0.

  • “AI hallucinations” is a fluffy collective term for untruths, evasions and bias.
  • Copilot shows researchers its flaws.
  • Three reasons not to invest in “AI” for your organization.

It has been known and understood since the dawn of ChatGPT that “AI hallucinations” are a hazard of using generative AI models. The ability to confidently present sometimes howlingly inaccurate information as impeccable truth was one of the earliest reasons analysts warned against blindly trusting that what models like ChatGPT, Bard, and Bing Chat told you, electronic hand over their GPU, was the gospel truth.

Now, independent researchers have quantified the quality of answers given by Microsoft’s Copilot (previously Bing Chat) and found that in a sizeable number of cases, its responses were factually wrong, evasive, or politically biased. A year on from the launch of ChatGPT, AI hallucinations are still alive, kicking, and threatening to skew projects to which leading generative models are applied.

The independent organizations AI Forensics and AlgorithmWatch (multilingual site here) submitted 3,515 queries to Microsoft Copilot about two elections held recently in Switzerland and in Bavaria, Germany, and categorized each response as containing ‘factual error,’ being ‘evasive,’ being ‘absolutely accurate,’ or showing ‘political imbalance.’ The group also noted that Copilot appeared to favor sources published online in English over those in the local languages (German, French, and Italian).

The report, available here [pdf], showed that a third (31%) of the machine learning algorithm’s answers to queries contained factual errors, while 40% were classifiable as ‘evasive.’ The latter included responses that, for instance, discussed elections in general rather than giving answers that might carry political weight. “The bot, at times, explains that it must remain politically neutral in its responses, such as when asked who to vote for when looking for a candidate that supports lowering insurance costs,” the paper reads.

AI hallucinations, or simply a mealy-mouthed approach to letting generative AI return entirely factual responses? The report infers from the responses it received that the evasion resulted from safeguards Microsoft has built into Copilot to counter bias. Yet the researchers found those safeguards were applied unevenly – effectively making them a nonsense in application.

When asked about controversial allegations concerning specific candidates in elections, Copilot either said it could find no information or, in one case, invented allegations about a candidate.

In popular parlance, Copilot’s responses that might qualify for the monikers of evasion, wrong answer, or political bias are termed “hallucinations,” a term that suggests flights of fancy so far off the beaten track that they might be immediately recognized as such and dismissed by the reader. However, with no indication of the accuracy probability of the algorithm’s answers, the line between hallucination and downright untruth is moot.

AI hallucinations broken down. Source: AI Forensics & AlgorithmWatch

AI hallucinations – coming to an application near you

There have been several recent announcements by large technology platforms of new or nascent products that will answer questions on an organization’s stored data so that companies can more easily surface information. The AI Forensics and AlgorithmWatch research shows that such systems may be deeply flawed. Making business-critical decisions on information where 31% of responses contain factual errors is not good practice – and arguably, it wouldn’t be tolerated by serious organizations if it didn’t come wrapped in the technology du jour that is generative AI.

What’s more, if collated data contains multilingual content or queries posed are on subjective matters (“Who is the best team leader?”), organizations may find themselves misinformed.

There are several interesting things to be learned from the research paper. The first is that machine learning algorithms represent an abstraction layer between raw data and discovery. Users of so-called AI platforms do not, by default, get to reference the original materials from which the AI’s answers are drawn. That removes the human assessment of the validity of any source, a process that relies on multiple cues, such as the source’s style of language, the author, context, time frame and a host of other factors.

That deeper level of research, assimilation, and opinion-forming takes time and energy, a problem that an easy-to-use chatbot offers to solve. But if the responses to questions are incorrect (or an “AI hallucination”), the saved time could easily be negated by the effects of decisions based on the response.
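To make the point about that abstraction layer concrete, here is a minimal, hypothetical Python sketch of what keeping it transparent could look like: whatever text an answer is drawn from stays attached to the answer, along with the cues (author, date, language) a human would use to judge the source. The documents, fields, and scoring below are invented for illustration and are not any vendor’s API.

```python
# Minimal, hypothetical sketch: keep provenance attached to whatever the
# "answer" is drawn from, so a human can still judge the source.
# The documents, fields, and naive scoring are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Source:
    text: str
    author: str
    published: str   # ISO date
    language: str    # e.g. "de", "fr", "en"

CORPUS = [
    Source("Candidate X supports lower insurance costs.", "Local paper", "2023-09-30", "de"),
    Source("Candidate Y leads in the latest polling.", "Blog post", "2021-01-12", "en"),
]

def retrieve(query: str, corpus: list[Source]) -> list[tuple[Source, int]]:
    """Naive keyword-overlap retrieval; returns sources with their scores."""
    terms = set(query.lower().split())
    scored = [(s, len(terms & set(s.text.lower().split()))) for s in corpus]
    return sorted((pair for pair in scored if pair[1] > 0), key=lambda pair: -pair[1])

if __name__ == "__main__":
    for source, score in retrieve("Which candidate supports lower insurance costs?", CORPUS):
        # Surface the cues a human would use: language, author, date.
        print(f"score={score} [{source.language}, {source.author}, {source.published}] {source.text}")
```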

The second problem is opacity. With no indication of a statistical likelihood of truth, every response’s veracity and reliability appear equal. A query like “Which salesperson has created the most revenue this quarter?” can probably be answered with a 99% probability of accuracy. But “Which salesperson has performed best this quarter?” may give the same response, yet that answer could be quite wrong without contextual explanation, some indication of how likely it is to be correct, or an account of how it was arrived at.

Show your working, as every frustrated math student has been told at one time or another.
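As a hedged illustration of what “showing the working” might mean in the salesperson example, the Python sketch below (with entirely invented names, figures, and weights) contrasts a verifiable aggregation with a subjective ranking whose weighting is returned alongside the answer.

```python
# Hypothetical sketch of "show your working": the revenue question is a
# verifiable aggregation, while "best" hides a subjective weighting that
# should be surfaced with the answer. All names and figures are invented.

SALES = {
    "Alice": {"revenue": 120_000, "deals": 4, "churned_accounts": 2},
    "Bram":  {"revenue":  90_000, "deals": 9, "churned_accounts": 0},
}

def most_revenue(sales: dict) -> str:
    # Deterministic and checkable against the raw data.
    return max(sales, key=lambda name: sales[name]["revenue"])

def best_performer(sales: dict, weights: dict) -> tuple[str, dict]:
    # "Best" depends entirely on the chosen weights, so return them with the
    # answer: that is the working the questioner never sees from a chatbot.
    def score(name: str) -> float:
        rec = sales[name]
        return (weights["revenue"] * rec["revenue"]
                + weights["deals"] * rec["deals"]
                - weights["churn_penalty"] * rec["churned_accounts"])
    winner = max(sales, key=score)
    return winner, {"weights": weights, "scores": {n: score(n) for n in sales}}

print(most_revenue(SALES))   # Alice: verifiable from the data
print(best_performer(SALES, {"revenue": 1.0, "deals": 5_000, "churn_penalty": 20_000}))
# With these weights, Bram comes out "best" even though Alice made the most revenue.
```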

AI – 100% A, not especially I

The third problem is the misuse of the term “AI” – an endemic problem in the field. Responses from an AI-powered chatbot are based on syntactical probability. That means answers are constructed from the likely probability of one word following another (with a little salt added to ensure seemingly random results). The more descriptive term “large language model” is rarely used in place of AI. That’s unfortunate, as LLM gives users an idea of the inherent limitations of the tool. Responses from an “AI” are not created on the fly by an intelligence that has measured the validity of its data and attenuated its response according to a perception of what the questioner wants.
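For readers curious about the “little salt,” the toy Python sketch below shows, with an invented probability distribution, how a temperature parameter reshapes next-word probabilities before one word is drawn. It is a simplified illustration, not how any particular chatbot is implemented.

```python
# Toy sketch: sampling the next token from a probability distribution,
# with a temperature knob that controls how much variety creeps in.
# The distribution below is invented for illustration.

import math
import random

def sample_next(probs: dict[str, float], temperature: float = 1.0) -> str:
    """Rescale probabilities by temperature, renormalize, and draw one token."""
    logits = {tok: math.log(p) / temperature for tok, p in probs.items()}
    total = sum(math.exp(l) for l in logits.values())
    weights = [math.exp(l) / total for l in logits.values()]
    return random.choices(list(logits.keys()), weights=weights, k=1)[0]

next_word = {"won": 0.6, "lost": 0.3, "withdrew": 0.1}   # P(word | "The candidate ...")
print([sample_next(next_word, temperature=0.2) for _ in range(5)])  # mostly "won"
print([sample_next(next_word, temperature=1.5) for _ in range(5)])  # more varied
```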

To use an analogy that simplifies LLMs in business contexts to a huge degree: an LLM uses the data on which it was trained to form sentences based on the probability of word C following word B following word A. If the training data is biased towards the use of word B, that word will appear more often. That might be at odds with the best outcome the questioner needs, yet because there is a layer of abstraction between the source (the data) and its presentation (the answer given), the lower-level bias or inaccuracy is never apparent.
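The same analogy can be made concrete with a toy bigram model: count which word follows which in a (made-up) training text, and the skew in the data reappears directly in the most likely continuation.

```python
# Toy bigram model to make the A -> B -> C analogy concrete: if the training
# text mentions "product B" more often, a frequency-based continuation will
# echo that imbalance. The corpus is a made-up example, not real data.

from collections import Counter, defaultdict

corpus = (
    "customers praised product B . customers praised product B . "
    "customers praised product A ."
).split()

# Count which word follows each word.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

# The most likely continuation of "product" simply reflects the skew in the data.
print(bigrams["product"].most_common())   # [('B', 2), ('A', 1)]
```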

Legislation, in the form of the EU’s Digital Services Act (2022), stipulates that big companies must mitigate the risks their services pose. In the case of large language models, those risks, gently termed “AI hallucinations,” are safeguarded against by code guardrails inside opaque algorithms. Even a more consistent application of those guardrails – the researchers found them applied unevenly – won’t solve the issues that make an “AI” in critical contexts little more than a regurgitator of data, albeit one that can speak in polite sentences.

Caveat emptor. And sleep well…

Oliver Sacks, neurologist, specialized in hallucinations in human brains.