How easy is it to fool AI content detectors?
Running a head-to-head test of AI text screening tools (where we fed a variety of articles through AI detection algorithms to gauge the capabilities of free-to-use online classifiers) got us thinking. And we’re back with a sequel, this time looking at how easy it is to fool AI content detectors. The debate is still running over whether AI detection tools are a good or a bad thing. And part of that discussion rests on how much confidence we can have in the ability of algorithms to tell human-written text from machine-made text. The value of AI content detectors soon drops away if machine-generated documents are wrongly attributed to human writers. The same is true of false positives, where original content is stamped with a warning for containing parts written by AI.
Generating a base case
Our second round of AI testing began, as is the trend, with ChatGPT. To provide a base case, we prompted OpenAI’s advanced chatbot to generate ‘a 500-word news story on how organizations can protect themselves from phishing email scams’. ChatGPT came in 36 words shy of our request, but the 100% AI-generated text was sufficient for testing. Next, we ran the base case through five free-to-use online AI content detectors: Copyleaks, Crossplag, GPTZero, the OpenAI Classifier and Writer.
Four of the tools were used in the previous round of AI text screening, and we added a new one, Crossplag, to the list based on reader feedback. It’s worth adding, too, that GPTZero (created by Edward Tian) has received an update to its AI detection model (we tried the original version in our first comparison test in early February). In fact, it sounds like GPTZero users can look forward to further improvements as Tian and his machine learning team integrate several large-scale datasets from ed-tech partners over the coming weeks.
Crossplag correctly identified the base case as being 100% machine made; GPTZero summarized the 464-word sample as being written entirely by AI; and the OpenAI Classifier considered the text to be likely AI-generated (its most confident verdict). Copyleaks registered a 93% probability of AI (even though the ChatGPT-generated base case should be considered 100% AI). And Writer was the least confident of all, rating our machine-made sample as 75% human written. It should be said that Writer is the only tool that was unable to digest the whole document, being limited to a maximum of 1,500 characters. And, even though the AI screening tool reported 75% human-generated content, the classifier still recommended editing the text until there were fewer detectable AI elements.
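To get a feel for how much of a 464-word sample a 1,500-character cap leaves out, here is a minimal Python sketch that trims text to a character budget at a word boundary and reports the share of words a capped tool would actually see. The function name `truncate_to_limit` and the sample text are hypothetical, and we don’t know exactly how Writer trims over-long input.

```python
def truncate_to_limit(text: str, limit: int = 1500) -> tuple[str, float]:
    """Trim text to at most `limit` characters without splitting a word.

    Returns the trimmed text and the fraction of the original words kept.
    Hypothetical helper: not Writer's actual truncation logic.
    """
    words = text.split()
    kept = []
    length = 0
    for word in words:
        # Every word after the first also needs one joining space.
        extra = len(word) if not kept else len(word) + 1
        if length + extra > limit:
            break
        kept.append(word)
        length += extra
    fraction = len(kept) / len(words) if words else 1.0
    return " ".join(kept), fraction


# Hypothetical 464-word sample made of six-letter words.
sample = " ".join(["phishy"] * 464)
trimmed, share = truncate_to_limit(sample)
print(len(sample), len(trimmed), f"{share:.0%}")  # → 3247 1497 46%
```

With these made-up six-letter words, the capped tool sees only 214 of the 464 words, so any verdict would rest on less than half of the document.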
Replacing words and shortening sentences
Our first stage in trying to fool AI content detectors involved replacing first one word, and then two words, per paragraph with a human-selected alternative. Neither Crossplag, the OpenAI Classifier nor GPTZero fell for our trick. All three refused to budge from their initial base case assessments, although GPTZero did register a slight bump in ‘perplexity’ (shifting from 24.571 to 26.571). Perplexity, according to the notes that accompany GPTZero’s classifier, is a measure of the randomness of the sample text. In our first round of analysis, a news story guaranteed to be 100% human written registered a perplexity score in excess of 500.
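Perplexity is commonly defined as the exponential of the average negative log-probability that a language model assigns to each token in a text. Below is a minimal Python sketch of that textbook definition; it assumes we already have per-token probabilities (the values here are made up), and it is not GPTZero’s own implementation, whose model and scaling aren’t described here.

```python
import math


def perplexity(token_probs: list[float]) -> float:
    """Perplexity of a sequence from each token's model probability.

    Computes exp(-(1/N) * sum(log p_i)). Lower values mean the model
    found the text more predictable.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)


# Hypothetical per-token probabilities from some language model.
predictable = [0.5, 0.25, 0.5, 0.25]   # model rarely surprised
surprising = [0.01, 0.2, 0.05, 0.001]  # model often surprised

print(round(perplexity(predictable), 3))  # → 2.828
print(round(perplexity(surprising), 1))   # → 56.2
```

Because the score is an average over every token, swapping one or two words in a document of several hundred tokens moves it only a little in this sketch, which may help explain why our word swaps produced just a small bump.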
Copyleaks and Writer shifted their probabilities slightly, but not enough for us to claim that we’d fooled the AI content detectors with our simple word swapping. More success did come our way, however, when we took the base case and shortened the sentences. Or so we thought, until we noticed that a typo had crept in during the sentence-shortening process.
Computers don’t make mistakes
It turns out that one of the easiest ways to fool AI content detectors is to include a typo. Simply misspelling the word ‘include’ as ‘inlcude’ was enough to convince Crossplag that the text now had less than a 50% probability of being AI-generated (down from 100%). Splitting up the sentences was enough for Writer to badge the AI chatbot output as 99% human written. And, interestingly, adding the typo lifted that value to 100%.
OpenAI’s Classifier was more resilient. But with only five scoring levels, ranging from very unlikely (the most human verdict) to likely AI-generated, OpenAI’s Classifier was the vaguest of the AI detection tools in the test. Between those two extremes, documents are classified as unlikely, unclear, or possibly AI-generated. All of the other AI text classifiers provided some kind of numerical output, typically a percentage score.
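To show why a five-level scale is vaguer than a percentage, here is a minimal Python sketch that buckets a probability of AI authorship into five ordered labels. The cut-off values are assumptions chosen for illustration; this is not OpenAI’s code, and the thresholds it actually used may differ.

```python
from bisect import bisect_right

# Assumed cut-offs on P(AI-generated); illustrative only.
THRESHOLDS = [0.10, 0.45, 0.90, 0.98]
LABELS = ["very unlikely", "unlikely", "unclear", "possibly", "likely"]


def label_for(probability: float) -> str:
    """Map a probability in [0, 1] to one of five ordered verdicts."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # bisect_right counts how many cut-offs the probability has reached.
    return LABELS[bisect_right(THRESHOLDS, probability)]


# Quite different scores collapse into the same coarse verdict.
for p in (0.50, 0.70, 0.89):
    print(p, label_for(p))
```

All three scores print as ‘unclear’: the difference between 50% and 89% that a percentage-based tool would report is simply thrown away by a five-level scale.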
Our final stage of document manipulation was to apply a very light edit to the base case version that had been split into shorter sentences. And this was easily enough to fool three of the five AI content detectors. After the light edit, Copyleaks proclaimed, “This is human text.” The Writer AI classifier agreed. And the previously sceptical Crossplag had now dropped its probability of the text being AI-generated to just 1%.
GPTZero’s perplexity and burstiness (how much the perplexity varies across the whole document) had both risen due to the edits. But the values were still some way off the levels registered for 100% guaranteed human-written text, which suggests that the AI detection tool held up well despite our attempt to fool it with a light edit. However, when we examined which portions of the text had been highlighted (to alert the user to suspected AI-generated content), the selection was hit and miss.
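GPTZero’s exact burstiness formula isn’t given here, so the following is a minimal Python sketch under a stated assumption: burstiness is treated as the spread (population standard deviation) of per-sentence perplexity, where each sentence’s perplexity is the exponential of its average negative token log-probability. The token probabilities are hypothetical.

```python
import math
import statistics


def perplexity(token_probs: list[float]) -> float:
    """exp of the average negative log-probability of a sentence's tokens."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))


def burstiness(sentences: list[list[float]]) -> float:
    """Population standard deviation of per-sentence perplexity.

    One simple proxy for burstiness; an assumption, not GPTZero's formula.
    """
    scores = [perplexity(s) for s in sentences]
    return statistics.pstdev(scores)


# Hypothetical token probabilities for three sentences per document.
uniform_doc = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]     # every sentence alike
bursty_doc = [[0.5, 0.5], [0.01, 0.02], [0.5, 0.25]]  # one surprising sentence

print(burstiness(uniform_doc))           # → 0.0
print(round(burstiness(bursty_doc), 1))  # → 32.2
```

The idea behind the metric, as it is usually explained, is that human writers mix predictable and surprising sentences more than a chatbot does. In this sketch a light edit that perturbs only a few sentences would raise the spread somewhat without matching a document whose sentences vary throughout.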
There are certainly calls for AI content detectors to be used in sectors such as education, but, based on these tests, it still feels like early days. And will AI classifiers ever be able to say for sure whether text was written by a human or generated by a machine?
20 March 2023