How easy is it to fool AI content detectors?

Typos, splitting sentences, and light editing are all it takes to fool some AI detection tools. Tread carefully with text classifier results.

21 February 2023

James Tyrrell

@JT_bluebird1

james.tyrrell@hybrid.co

Can the algorithms be trusted? If you include a typo, maybe not. Image credit: Shutterstock Generate.

Running a head-to-head test of AI text screening tools, in which we fed a variety of articles through free-to-use online classifiers to gauge their capabilities, got us thinking. So we’re back with a sequel, this time looking at how easy it is to fool AI content detectors. The debate over whether AI detection tools are a good or bad thing is still running, and part of that discussion rests on how much confidence we can have in the ability of algorithms to distinguish human-written from machine-made text. The value of AI content detectors soon drops away if machine-generated documents are wrongly attributed to human authors. The same is true of false positives, where original content is stamped with a warning for containing parts written by AI.

Generating a base case

Our second round of AI testing began, as is the trend, with ChatGPT. To provide a base case, we prompted OpenAI’s advanced chatbot to generate ‘a 500-word news story on how organizations can protect themselves from phishing email scams’. ChatGPT’s response fell 36 words short of our request, but the 100% AI-generated text was sufficient for testing. Next, we ran the base case through five free-to-use online AI content detectors: Copyleaks, Crossplag, GPTZero, the OpenAI Classifier, and Writer.

Four of the tools were used in the previous round of AI text screening, and we added a new one, Crossplag, to the list based on reader feedback. It’s worth adding, too, that GPTZero (created by Edward Tian) has received an update to its AI detection model (we tried the original version in our first comparison test in early February). In fact, it sounds like GPTZero users can look forward to further improvements as Tian and his machine learning team integrate several large-scale datasets from ed-tech partners over the coming weeks.

Replacing words and shortening sentences

Our first attempt to fool AI content detectors involved replacing first one and then two words per paragraph with a human-selected alternative. Crossplag, the OpenAI Classifier, and GPTZero did not fall for our AI detection trick. All three refused to budge from their initial base case assessments, although GPTZero did register a slight bump in ‘perplexity’ (shifting from 24.571 to 26.571). Perplexity, according to notes that accompany the AI text classifier, is a measurement of the randomness of the sample text. By comparison, in our first round of analysis, a 100% guaranteed human-written news story registered a perplexity score in excess of 500.
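For readers who want to see what a perplexity score measures, the textbook definition is the exponential of the average negative log-probability that a language model assigns to each token. The sketch below follows that general definition only. The per-token probabilities are invented for illustration, and nothing here should be read as GPTZero’s actual scoring method, which the tool’s notes do not spell out.

```python
import math


def perplexity(token_probs):
    """Perplexity of a sequence, given the probability a model gave each token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token, then exponentiate.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# Hypothetical per-token probabilities, made up for this example.
predictable = [0.5, 0.5, 0.5, 0.5]    # every token is fairly likely
surprising = [0.5, 0.5, 0.001, 0.5]   # one token the model did not expect

print(perplexity(predictable))
print(perplexity(surprising))
```

With these made-up numbers, a single unexpected token lifts the score from roughly 2 to roughly 9.5, which is one way to picture why swapping a word or two nudged the reading upwards without changing the verdict.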

Copyleaks and Writer shifted their probabilities slightly, but not enough for us to claim that we’d fooled the AI content detectors with our simple word swapping. More success did come our way, however, when we took the base case and shortened the sentences. Or so we thought, until we noticed that a typo had occurred during our sentence shortening process.

Computers don’t make mistakes

It turns out that one of the easiest ways to fool AI content detectors is to include a typo. Simply misspelling the word ‘include’ as ‘inlcude’ was sufficient to convince Crossplag that the text now had less than 50% probability of being AI-generated (down from 100%). Splitting up the sentences was enough for Writer to badge the AI chatbot output as being 99% human written. And, interestingly, adding the typo lifted that value to 100%.
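The typo in question is a simple transposition of two neighbouring letters. As a minimal sketch of how that kind of slip could be reproduced for repeat testing, the snippet below swaps two adjacent characters in a chosen word. The function names and the sample sentence are our own inventions for illustration, not part of any detector or of the ChatGPT output used in the test.

```python
def swap_adjacent(word, i):
    """Return word with the characters at positions i and i + 1 swapped."""
    if not 0 <= i < len(word) - 1:
        raise ValueError("i must point at a character with a right-hand neighbour")
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def add_typo(text, target, i):
    """Replace the first occurrence of target in text with a transposed copy."""
    return text.replace(target, swap_adjacent(target, i), 1)


# Hypothetical sample sentence, not taken from the base case text.
sample = "Training should include simulated phishing emails."
print(add_typo(sample, "include", 2))
# Training should inlcude simulated phishing emails.
```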

OpenAI’s Classifier was more resilient. But with only five scoring levels, ranging from very unlikely (being the most human) to likely AI-generated, OpenAI’s Classifier was the vaguest of the AI detection tools in the test. Between the two extremes, documents are classified as unlikely, unclear, or possibly AI-generated. All of the other AI text classifiers provided some kind of numerical output, typically a percentage score.
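To illustrate why a five-level scale reads as vaguer than a percentage, here is a small sketch that maps a numerical score onto the five labels described above. The cut-off values are hypothetical, picked purely for illustration, and should not be taken as the thresholds that OpenAI’s Classifier uses.

```python
from bisect import bisect_right

# Labels from most human-like to most AI-like, as described in the article.
LABELS = ["very unlikely", "unlikely", "unclear", "possibly", "likely"]

# Hypothetical cut-offs between labels, chosen only for this example.
CUTOFFS = [0.2, 0.4, 0.6, 0.8]


def bucket(score):
    """Map a score between 0 and 1 onto one of the five coarse labels."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    # bisect_right counts how many cut-offs the score has reached or passed.
    return LABELS[bisect_right(CUTOFFS, score)]


print(bucket(0.61))  # possibly
print(bucket(0.79))  # possibly
```

Two scores that differ by 18 percentage points land on the same label, so a shift that would be visible in a percentage-based tool can go unreported on a banded scale.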

Our final stage of document manipulation was to apply a very light edit to the base case version that had been split into shorter sentences. And this was easily sufficient to fool three out of the five AI content detectors. After the light edit, Copyleaks proclaimed that, “This is human text.” The Writer AI classifier agreed. And the previously sceptical Crossplag had now dropped its probability of the text being AI-generated to just 1%.

GPTZero’s perplexity and burstiness (how much perplexity varies across the whole document) had both risen due to the edits. But the values were still some way off the levels registered for 100% guaranteed human-written text, which suggests that the AI detection tool still performed well despite the attempt to fool it with a light edit. However, when we examined which portions of the text had been highlighted (to alert the user to suspected AI-generated content), the selection was hit and miss.
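One plausible way to picture burstiness is as the spread of perplexity from sentence to sentence. The sketch below uses the population standard deviation for that spread. This is an assumption on our part rather than GPTZero’s published formula, and the per-sentence perplexity values are invented for illustration.

```python
import statistics


def burstiness(sentence_perplexities):
    """Spread of per-sentence perplexity, taken here as the population standard deviation."""
    if len(sentence_perplexities) < 2:
        raise ValueError("need at least two sentences to measure variation")
    return statistics.pstdev(sentence_perplexities)


# Hypothetical per-sentence perplexities, made up for this example.
uniform = [24.0, 25.0, 26.0, 25.0]     # evenly predictable from start to finish
varied = [12.0, 310.0, 45.0, 880.0]    # swings between plain and surprising sentences

print(burstiness(uniform))
print(burstiness(varied))
```

On this reading, text that stays evenly predictable throughout scores low, while text that lurches between plain and surprising sentences scores high.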

There are certainly calls for AI content detectors to be used in sectors such as education but, based on these tests, it still feels like early days. And will AI classifiers ever be able to say for sure whether text has been written by a human or generated by a machine?