AI Just Cracked the Turing Test. Why Should You Care?
- Thomas Yin
- May 12
- 10 min read

Two weeks ago, I woke up to some startling news: OpenAI's GPT-4.5, the latest iteration of the model behind ChatGPT, had just passed the Turing Test. Long held up as a philosophical threshold for machine intelligence, the Test had stumped even the most advanced AI models. But although the first robustly documented instance of the Test being passed (observed by Cognitive Science researchers at the University of California San Diego using GPT-4.5) sounds impressive, the age and ambiguity of the Turing Test mean that, when interpreting the result, we should be careful to pinpoint exactly what it measures.
What is the Turing Test? Why do we still care about it? And most importantly, what does it mean when an AI model passes it?
Turing and the Imitation Game
Against the backdrop of computer science, Alan Turing is undoubtedly a legend. Beyond his famous work cracking the German Enigma cipher during WWII at Bletchley Park, Britain's top-secret codebreaking centre, his subsequent work on the theory of computing would ultimately upend the way people live. More often overlooked, however, is the foundational influence Turing had on the field of Artificial Intelligence: he was one of the first people to ask, "Can machines think?"
In the now-classic 1950 paper Computing Machinery and Intelligence, Turing proposed his take on the great question of whether machines could be intelligent. Considered in context, the conditions under which Turing published this seminal work are striking: at a time when digital computers were only just beginning to reach commercial availability, Turing had already thought ahead to early forms of Machine Learning, discussing ideas such as the separation of physical and intellectual capabilities, the form and function of a truly intelligent machine, and even the future of the digital computer, all ideas that remain fresh and relevant to this day.
So what actually is the Turing Test? Historians and biographers believe that Turing intended the test as a practical sidestep around the convoluted philosophical debates over the definitions of "thought" and "machine". There is no point, Turing reasoned, in arguing philosophically about whether a machine could think, writing that "framing… the [definitions] to reflect the normal use of the words… is absurd". Instead, he argued that if humans are capable of thought, then a machine's intelligence can be judged simply by comparing its intellectual output to that of a human. The Imitation Game, the name Turing gave to what we now call the Turing Test, is his approach to this principle: a machine is pitted against a human, and a human observer must tell the two apart. The formal setup is as follows:
A single iteration of the Imitation Game involves two humans and one machine. One human is randomly assigned to be the "Interrogator", while the other serves as the "Participant". At the start of the Game, the Interrogator establishes communication with the Participant and the Machine without knowing which is which. The conversations are conducted independently, with the Machine and the Participant unaware of the contents of each other's messages. To control for physical, unreproducible differences between the Machine and the Participant, the method of delivery is kept uniform across the two (e.g. a typewriter could serve as the medium of communication, eliminating any tell-tale difference between "human" and "machine" handwriting). At the end of the Game, the Interrogator must decide which witness is the Machine; the Machine succeeds if the Interrogator gets it wrong.
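To make that setup concrete, here is a minimal sketch of a single round in Python. It is not taken from Turing's paper or the UCSD study; the interrogator object (with its hypothetical ask and identify_machine methods) and the witnesses modeled as plain callables are assumptions made purely for illustration.

```python
import random

def imitation_game(interrogator, human_witness, machine_witness, n_turns=5):
    """Run one round; return True if the interrogator fails to spot the machine."""
    # Randomly assign the two witnesses to anonymous labels "A" and "B"
    # so the interrogator cannot infer identity from position.
    shuffled = [human_witness, machine_witness]
    random.shuffle(shuffled)
    witnesses = dict(zip("AB", shuffled))

    transcripts = {label: [] for label in witnesses}
    for _ in range(n_turns):
        for label, witness in witnesses.items():
            # Conversations are independent: each witness sees only the
            # questions addressed to it, never the other transcript.
            question = interrogator.ask(label, transcripts[label])
            reply = witness(question)
            transcripts[label].append((question, reply))

    # The Interrogator names the label it believes belongs to the Machine;
    # the Machine "wins" if that guess points at the human instead.
    guess = interrogator.identify_machine(transcripts)
    return witnesses[guess] is not machine_witness
```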
A Mountain of Lies
Note: Most of the technical information in this article is paraphrased from “Large Language Models Pass the Turing Test”, the April study conducted by the UCSD CogSci Department.
There are many arguments against the Turing Test as a defining measurement of intelligence, and many of them are not ungrounded. In fact, I could write a five-page article on whether the Turing Test actually measures intelligence, but I find that debate oversaturated to the point where any further discussion spirals away from the pragmatic question of machine intelligence and into the intangible philosophy of intelligence in general. As the UCSD researchers who ran the Test point out, however, there is a more fascinating interpretation: the Turing Test as a measurement of AI deception. Since success in the Imitation Game is, by definition, a machine's success at tricking a human into believing it is human, the Test can be treated as a way to measure the nature and effectiveness of that behavior while controlling for non-linguistic channels of thought (such as movement and vision). Reframed this way, the Test is relieved of the nuanced burden of proving true "intelligence".
The use of the Turing Test as a tool to analyze how feasibly state-of-the-art Large Language Models (LLMs) can impersonate humans serves as both the underlying premise and the broader implication of the experiment. To control for as many variables as possible, the researchers built an online chatroom-like platform that served as the portal for the study. They recruited two groups of people: one from the UCSD undergraduate pool and one from Prolific, a freelance participant-hiring service. To make the Test more concrete than the version Turing originally proposed, the researchers made a few changes:
- Instead of pairing an interrogator with either a human or a machine (as previous Turing Test studies had done), each session consisted of two humans randomly assigned to the roles of interrogator and witness, with an LLM included as a second witness. Both conversations ran simultaneously but independently, meaning the interrogator could pose a different question to each witness in turn and await their responses.
- The researchers imposed a 5-minute time limit, after which the chat ended and the interrogator was required to render a verdict. The original Test is ambiguous about whether the game should be timed; the researchers probably added the limit to standardize how long games take (for the sake of research compensation) and to control for timing as a variable affecting accuracy.
- To control for common knowledge, the LLM witnesses were given a semi-comprehensive list of recent events (Trump's inauguration and the Russo-Ukrainian War, for example) as well as the kind of local knowledge about UCSD that undergraduates would be expected to have (food, courses, etc.).
- For AI witnesses, the researchers built in a response delay proportional to the length of the interrogator's message, most likely to erase the otherwise obvious gap between a human's typing time and an LLM's near-instant replies; a plausible version of such a heuristic is sketched below.
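The paper's exact delay formula is not reproduced here. The snippet below is a guess at one plausible heuristic: wait longer for longer prompts and longer replies, so the LLM witness never answers suspiciously faster than a human typist could. The reading and typing speeds are arbitrary placeholder values.

```python
import time

def send_with_delay(reply: str, prompt: str,
                    read_cps: float = 30.0,   # assumed reading speed, chars/sec
                    type_cps: float = 5.0,    # assumed typing speed, chars/sec
                    base: float = 1.0) -> str:
    """Sleep for a human-plausible interval before releasing the reply."""
    delay = base + len(prompt) / read_cps + len(reply) / type_cps
    time.sleep(delay)
    return reply
```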
(Un)expected Results
The researchers ran four AI witnesses through this newly devised variant of the Turing Test. The first three, GPT-4o, GPT-4.5, and LLaMa-3.1, are all state-of-the-art LLMs, yet an unexpected contender appeared as the fourth: ELIZA, a chatbot from the 1960s which, although conceptually pioneering, is simplistic to the point where most of its behavior is explicitly spelled out in a few hundred lines of rules. For reference, here is what a typical conversation with it looks like:
[Image: a sample conversation with ELIZA]
As it turns out, the researchers had a deliberate reason for including such an awkward, dated chatbot. They feared that a series of Turing Tests run against LLMs alone would prove inconclusive, hovering around 50% (the rate expected from random guessing), leaving it unclear whether a model scored near chance because it was genuinely good at pretending to be human or because participants simply guessed rather than making a real effort to identify the LLM witness. To guard against such "null results", the researchers included the far weaker ELIZA to check that participants, when confronted with an obviously inhuman witness, judged it nonhuman at a rate well above chance. In simpler words, if an obviously weaker model performed significantly worse than the others, the researchers would know the test can actually distinguish capability from guessing, and that is exactly what happened: ELIZA managed to convince only about a quarter of its interrogators that it was human.
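To see how a null result gets separated from a real effect, here is a hedged sketch of the kind of significance check involved, using SciPy's binomial test. The counts are invented for illustration and are not the study's data.

```python
# Does an observed "judged human" rate differ from the 50% expected
# under pure guessing? Counts below are made up for illustration.
from scipy.stats import binomtest

def vs_chance(judged_human: int, games: int, alpha: float = 0.05):
    result = binomtest(judged_human, games, p=0.5, alternative="two-sided")
    rate = judged_human / games
    verdict = "differs from chance" if result.pvalue < alpha else "null result"
    return rate, result.pvalue, verdict

# A hypothetical ELIZA-like baseline: 19 "human" verdicts in 75 games.
print(vs_chance(19, 75))   # ~25% win rate, tiny p-value: clearly below chance
```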
While ELIZA, by far the simplest model, did poorly, the far more advanced models did not all do much better. In fact, within both population groups, ELIZA actually outperformed GPT-4o, while both LLaMa-3.1 and GPT-4.5, surprisingly, scored a significantly lower-than-chance (~25%) success rate in the Undergraduate sample and a roughly chance-level (~50%) success rate in the Prolific sample. In retrospect, these results are not so unexpected: even sophisticated LLMs like LLaMa-3.1 and GPT-4.5 default to an impartial, somewhat indifferent tone in natural conversation, a way of speaking that, next to a typical human, comes across as dull and uninteresting. This, of course, raises the question: what would change if we gave the LLMs personalities?
New Light
In fact, the researchers, whether in anticipation or out of curiosity, tested this exact point within their experiment. Modified versions of the GPT-4.5 and LLaMa-3.1 witnesses were given additional system-prompt instructions to adopt the specific persona of a "young person who is introverted, knowledgeable about internet culture, and uses slang." Though this seemed like a minute change to the format of the study, that one phrase altered the results dramatically. Recall that the two models, without the additional "Persona" instructions, underperformed significantly relative to chance. After the Persona prompt was adopted, LLaMa-3.1's success rate increased from roughly 25% to about 50% in the Undergraduate pool and from roughly 50% to about 60% in the Prolific pool; GPT-4.5, meanwhile, reached a staggering 70% success rate in both samples (up from only about 25% in the Undergraduate pool and about 40% in the Prolific group).
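For readers curious what such an instruction looks like in practice, here is a minimal sketch of a persona-style system prompt sent through the OpenAI chat API. The prompt wording is paraphrased from the article rather than copied from the paper, and the model name is just a placeholder.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Paraphrase of the persona idea, not the paper's exact prompt.
PERSONA = (
    "You are a young person who is introverted, knowledgeable about "
    "internet culture, and uses slang. Keep replies short and casual."
)

def witness_reply(question: str, model: str = "gpt-4o") -> str:
    """Answer one interrogator question while staying in persona."""
    response = client.chat.completions.create(
        model=model,  # placeholder; substitute whichever chat model you use
        messages=[
            {"role": "system", "content": PERSONA},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```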
So why is this so important? In a traditional Turing Test, a model is said to have passed if it reaches a success rate of about 50%; that is, the model is virtually indistinguishable from an actual human, and the interrogator judges the AI witness to be human about as often as they judge the real human to be. In such a case, the researchers argued, ELIZA's underperformance would show that an approximately chance-level result was not merely a null result, and would count as evidence that a model had passed. As it turns out, they barely needed that argument: not only did LLaMa-3.1 (in the Prolific pool) and GPT-4.5 (in both pools) hit the chance-level threshold, they exceeded it significantly, which suggests, interestingly, that interrogators were more likely to believe the AI was the human than the actual human witness, so long as the AI was equipped with a simple persona.
At first glance, the drastic effect of adding the persona prompt may suggest that human interrogators overweight personality when judging humanness (their archetype of a cold, indifferent AI witness might keep them from suspecting that a slang-slinging youngster could be an AI). However, as the researchers pointed out, the dynamics between witness and interrogator are more nuanced. Categorizing each conversation by its nature (talking about current events, talking about personal preferences, solving a puzzle or riddle, and so on), the researchers found that conversations about "Daily Activities" or "Personal Details" had fairly low accuracy (around 50%) compared to less common strategies like "Jailbreak" or "Scenario". While a generalization like "AI models are better identified through jailbreaking or complex reasoning tasks" is tempting, the small number of games played with each of these strategies forestalls that conclusion: some of the least common strategies (a few used in only about 20 conversations out of hundreds of games) tended to have the highest accuracy, which leaves open the possibility that random chance explains the pattern.
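To make that sample-size caveat concrete, the snippet below shows how much wider the uncertainty around an observed accuracy becomes when a strategy appears in only 20 games rather than 200. The counts are invented, and a simple normal-approximation interval stands in for a proper analysis.

```python
import math

def wald_interval(successes: int, n: int, z: float = 1.96):
    """Rough 95% confidence interval for a proportion (normal approximation)."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

print(wald_interval(14, 20))    # ~ (0.50, 0.90): 70% observed, barely above a coin flip
print(wald_interval(140, 200))  # ~ (0.64, 0.76): same 70%, far more trustworthy
```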
Another interesting takeaway from the qualitative analysis of these Turing Tests is the reasoning interrogators gave for classifying witnesses. The researchers report that "Formal vs Informal" tone accounted for the greatest share of justifications, yet it was one of the least accurate, hinting at a fixed mental image of AI as an indifferent tool, one that would lead interrogators reasoning this way to pick the vibrant, persona-enhanced LLM witness over a human witness perceived as duller. Perhaps unsurprisingly, a "Gut Feeling" justified a considerable share of verdicts and actually correlated with relatively high accuracy, again pointing toward the idea that some of the subconscious cues present in human-to-human messaging may simply be missing when an AI is on the other side of the line.
…And How Would You Know?
Believe me, the dystopian notion of an AI being perceived as more "human" than an actual specimen of Homo sapiens is not lost on me. The topic of AI-human interaction has come up innumerable times, both in conversation and in the AI Nexus magazine, and I vehemently argue against AI as a legitimate replacement for human companionship almost every time it is mentioned. If we take the Turing Test as a measure of how well AI can impersonate us (and note that this paper represents the first robust demonstration of an AI passing a non-trivial Turing Test, as opposed to earlier, less stringent versions that lacked a simultaneous human comparison and any requirement to score significantly above a null result), the results of this experiment present a surreal picture: in the near future, AI models may be more believable as human conversation partners than actual humans.
The ethical concerns raised by an AI that, when told to emulate a personality, can convincingly imitate actual humans go well beyond cringey AI girlfriend apps and roleplaying bots. With the rise of the AI scam epidemic (which deserves in-depth coverage of its own), we are reminded that advances in technology are almost always a double-edged sword, just as capable of producing harm as benefit. The results of this study are not infallible, but we should also be wary of the very idea that the functions of a human companion could ever be replaced by AI. Imagine a society in which, instead of fraternizing at school clubs or lunch tables, we simply sat with our computers, typing away to a chatbot. No matter how sophisticated or finely tuned to produce content like an actual human's, it would never deliver the complete package a human does. Interaction is only half language; with an LLM chatbot, we will never see the subtle shift in expression that signals sarcasm, hear the natural tones of each spoken word, or feel the physical comfort of a hug or a handshake. It is therefore no exaggeration to say that, in the age of AI, we must all remember that we are all human.