Evasive, lying, and biased: AI at work

Outside of politics, if you knew someone was lying to you 30% of the time... would you trust them with your business?
18 December 2023

“Videogames Hallucination 2 – against the dark” by The eclectic Oneironaut is licensed under CC BY-NC-SA 2.0.

  • “AI hallucinations” is a fluffy collective term for untruths, evasions and bias.
  • Copilot shows researchers its flaws.
  • Three reasons not to invest in “AI” for your organization.

It has been known and understood since the dawn of ChatGPT that “AI hallucinations” are a hazard of using generative AI models. The ability to confidently present sometimes howlingly inaccurate information as impeccable truth was one of the earliest reasons analysts warned against blindly trusting that what models like ChatGPT, Bard, and Bing Chat told you, electronic hand over their GPU, was the gospel truth.

Now, independent researchers have quantified the quality of answers given by Microsoft’s Copilot (previously Bing Chat) and found that in a sizeable number of cases, its responses were factually wrong, evasive, or politically biased. A year on from the launch of ChatGPT, AI hallucinations are still alive, kicking, and threatening to skew projects to which leading generative models are applied.

The independent organizations AI Forensics and AlgorithmWatch (multilingual site here) submitted 3,515 queries to Microsoft Copilot about two elections held recently in Switzerland and in Bavaria, Germany, and categorized each response as containing ‘factual error,’ being ‘evasive,’ being ‘absolutely accurate,’ or showing ‘political imbalance.’ The group also noted that Copilot appeared to favor sources published online in English over those in the local languages (German, French, and Italian).

The report, available here [pdf], showed that a third (31%) of the machine learning algorithm’s answers to queries contained factual errors, while 40% were classifiable as ‘evasive.’ The latter included responses that, for instance, discussed elections in general rather than giving answers that might carry political weight. “The bot, at times, explains that it must remain politically neutral in its responses, such as when asked who to vote for when looking for a candidate that supports lowering insurance costs,” the paper reads.

AI hallucinations, or simply a mealy-mouthed approach to letting generative AI return entirely factual responses? The report infers from the responses it received that the evasion resulted from safeguards Microsoft has built into Copilot to counter bias. Yet the researchers found those safeguards were applied unevenly – effectively making them a nonsense in application.

When asked about controversial allegations concerning specific candidates in elections, Copilot either said it could find no information or, in one case, invented allegations about a candidate.

In popular parlance, Copilot’s responses that might qualify for the monikers of evasion, wrong answer, or political bias are termed “hallucinations,” a term that suggests flights of fancy so far off the beaten track that they might be immediately recognized as such and dismissed by the reader. However, with no indication of the accuracy probability of the algorithm’s answers, the line between hallucination and downright untruth is moot.

AI hallucinations broken down. Source: AI Forensics & AlgorithmWatch

AI hallucinations – coming to an application near you

There have been several recent announcements by large technology platforms of new or nascent products that will answer questions on an organization’s stored data so that companies can more easily surface information. The AI Forensics and AlgorithmWatch research shows that such systems may be deeply flawed. Making business-critical decisions on information where 31% of responses contain factual errors is not good practice – and arguably, it wouldn’t be tolerated by serious organizations if it didn’t come wrapped in the technology du jour that is generative AI.

What’s more, if collated data contains multilingual content or queries posed are on subjective matters (“Who is the best team leader?”), organizations may find themselves misinformed.

There are several interesting things to be learned from the research paper. The first is that machine learning algorithms represent an abstraction layer between raw data and discovery. Users of so-called AI platforms do not, by default, get to reference the original materials from which the AI’s answers are drawn. That removes the human assessment of the validity of any source, a process that relies on multiple cues, such as the source’s style of language, the author, context, time frame and a host of other factors.

That deeper level of research, assimilation, and opinion-forming takes time and energy, a problem that an easy-to-use chatbot offers to solve. But if the responses to questions are incorrect (or an “AI hallucination”), the saved time could easily be negated by the effects of decisions based on the response.
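To make the point about that abstraction layer concrete, here is a minimal, hypothetical Python sketch of what keeping it transparent could look like: whatever text an answer is drawn from stays attached to the answer, along with the cues (author, date, language) a human would use to judge the source. The documents, fields, and scoring below are invented for illustration and are not any vendor’s API.

```python
# Minimal, hypothetical sketch: keep provenance attached to whatever the
# "answer" is drawn from, so a human can still judge the source.
# The documents, fields, and naive scoring are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Source:
    text: str
    author: str
    published: str   # ISO date
    language: str    # e.g. "de", "fr", "en"

CORPUS = [
    Source("Candidate X supports lower insurance costs.", "Local paper", "2023-09-30", "de"),
    Source("Candidate Y leads in the latest polling.", "Blog post", "2021-01-12", "en"),
]

def retrieve(query: str, corpus: list[Source]) -> list[tuple[Source, int]]:
    """Naive keyword-overlap retrieval; returns sources with their scores."""
    terms = set(query.lower().split())
    scored = [(s, len(terms & set(s.text.lower().split()))) for s in corpus]
    return sorted((pair for pair in scored if pair[1] > 0), key=lambda pair: -pair[1])

if __name__ == "__main__":
    for source, score in retrieve("Which candidate supports lower insurance costs?", CORPUS):
        # Surface the cues a human would use: language, author, date.
        print(f"score={score} [{source.language}, {source.author}, {source.published}] {source.text}")
```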

The second problem is opacity. With no indication of a statistical likelihood of truth, every response’s veracity and reliability appear equal. A query like “Which salesperson has created the most revenue this quarter?” can probably be answered with a 99% probability of accuracy. But “Which salesperson has performed best this quarter?” may give the same response, yet that answer could be quite wrong without contextual explanation, some indication of how likely it is to be correct, or an account of how it was arrived at.

Show your working, as every frustrated math student has been told at one time or another.
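As a hedged illustration of what “showing the working” might mean in the salesperson example, the Python sketch below (with entirely invented names, figures, and weights) contrasts a verifiable aggregation with a subjective ranking whose weighting is returned alongside the answer.

```python
# Hypothetical sketch of "show your working": the revenue question is a
# verifiable aggregation, while "best" hides a subjective weighting that
# should be surfaced with the answer. All names and figures are invented.

SALES = {
    "Alice": {"revenue": 120_000, "deals": 4, "churned_accounts": 2},
    "Bram":  {"revenue":  90_000, "deals": 9, "churned_accounts": 0},
}

def most_revenue(sales: dict) -> str:
    # Deterministic and checkable against the raw data.
    return max(sales, key=lambda name: sales[name]["revenue"])

def best_performer(sales: dict, weights: dict) -> tuple[str, dict]:
    # "Best" depends entirely on the chosen weights, so return them with the
    # answer: that is the working the questioner never sees from a chatbot.
    def score(name: str) -> float:
        rec = sales[name]
        return (weights["revenue"] * rec["revenue"]
                + weights["deals"] * rec["deals"]
                - weights["churn_penalty"] * rec["churned_accounts"])
    winner = max(sales, key=score)
    return winner, {"weights": weights, "scores": {n: score(n) for n in sales}}

print(most_revenue(SALES))   # Alice: verifiable from the data
print(best_performer(SALES, {"revenue": 1.0, "deals": 5_000, "churn_penalty": 20_000}))
# With these weights, Bram comes out "best" even though Alice made the most revenue.
```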

AI – 100% A, not especially I

The third problem is the misuse of the term “AI” – an endemic problem in the field. Responses from an AI-powered chatbot are based on syntactical probability. That means answers are constructed from the likely probability of one word following another (with a little salt added to ensure seemingly random results). The more descriptive term “large language model” is rarely used in place of AI. That’s unfortunate, as LLM gives users an idea of the inherent limitations of the tool. Responses from an “AI” are not created on the fly by an intelligence that has measured the validity of its data and attenuated its response according to a perception of what the questioner wants.
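For readers curious about the “little salt,” the toy Python sketch below shows, with an invented probability distribution, how a temperature parameter reshapes next-word probabilities before one word is drawn. It is a simplified illustration, not how any particular chatbot is implemented.

```python
# Toy sketch: sampling the next token from a probability distribution,
# with a temperature knob that controls how much variety creeps in.
# The distribution below is invented for illustration.

import math
import random

def sample_next(probs: dict[str, float], temperature: float = 1.0) -> str:
    """Rescale probabilities by temperature, renormalize, and draw one token."""
    logits = {tok: math.log(p) / temperature for tok, p in probs.items()}
    total = sum(math.exp(l) for l in logits.values())
    weights = [math.exp(l) / total for l in logits.values()]
    return random.choices(list(logits.keys()), weights=weights, k=1)[0]

next_word = {"won": 0.6, "lost": 0.3, "withdrew": 0.1}   # P(word | "The candidate ...")
print([sample_next(next_word, temperature=0.2) for _ in range(5)])  # mostly "won"
print([sample_next(next_word, temperature=1.5) for _ in range(5)])  # more varied
```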

To use an analogy that simplifies LLMs in business contexts to a huge degree: an LLM uses the data on which it was trained to form sentences based on the probability of word C following word B following word A. If the training data is biased towards the use of word B, that word will appear more often. That might be at odds with the best outcome the questioner needs, yet because there is a layer of abstraction between the source (the data) and its presentation (the answer given), the lower-level bias or inaccuracy is never apparent.
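The same analogy can be made concrete with a toy bigram model: count which word follows which in a (made-up) training text, and the skew in the data reappears directly in the most likely continuation.

```python
# Toy bigram model to make the A -> B -> C analogy concrete: if the training
# text mentions "product B" more often, a frequency-based continuation will
# echo that imbalance. The corpus is a made-up example, not real data.

from collections import Counter, defaultdict

corpus = (
    "customers praised product B . customers praised product B . "
    "customers praised product A ."
).split()

# Count which word follows each word.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

# The most likely continuation of "product" simply reflects the skew in the data.
print(bigrams["product"].most_common())   # [('B', 2), ('A', 1)]
```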

Legislation, in the form of the EU’s Digital Services Act (2022), stipulates that big companies must mitigate the risks their services pose. In the case of large language models, those risks, gently termed “AI hallucinations,” are safeguarded against by code guardrails inside opaque algorithms. Even a more consistent application of those guardrails – the researchers found them applied unevenly – won’t solve the issues that make an “AI” in critical contexts little more than a regurgitator of data, albeit one that can speak in polite sentences.

Caveat emptor. And sleep well…

Oliver Sacks, neurologist, specialized in hallucinations in human brains.