Chatbots and the Turing Test: Are AI Chatbots Getting Smarter Than Us?

A Google study showed its medical LLM pilot “AMIE” outperforming primary care doctors on 28 out of 32 characteristics in a test somewhat akin to the Turing Test. 

Following Google’s publication, Ethan Mollick, an Associate Professor at Wharton, shared a post detailing the study, beginning: “a provocative study from Google where LLMs passed a Turing Test, of a sort, for doctors.”

Let’s find out exactly what the Turing Test is and what Google’s study of AMIE discovered. 

What is the Turing Test?

The Turing test, developed by Alan Turing in 1950, is an assessment of a machine’s ability to behave indistinguishably from a human. Originally called the imitation game, the test involves an interrogator trying to determine which of two players is a computer and which is a human by comparing the players’ responses to questions. 

Turing published the first paper focusing entirely on machine intelligence, “Computing Machinery and Intelligence” (1950).

The Turing test has been both influential and criticized, and it remains a pivotal concept in AI. “ELIZA,” a program built by Joseph Weizenbaum in 1966, has been argued by some to have passed the Turing test. In 2014, a chatbot called “Eugene Goostman” was reported to have convinced a third of judges in a Turing Test competition that it was a 13-year-old boy; some also consider this a pass. Generally, there is not yet a universal consensus that the Turing test has been passed, despite recent advances in generative AI technologies. 

Did “AMIE” Pass the Turing Test?

Mollick shared his post on LinkedIn after Google published its study on the Google Research blog on January 12, 2024:

“149 actors playing patients texted live with one of 20 primary care doctors or else Google’s new medical LLM, AMIE. Specialist human doctors & the “patients” rated the quality of care. AMIE beat the primary care doctors on 28 out of 32 characteristics, and tied on the other four, as rated by human doctors. From the perspective of the “patients,” the AI won on 24 of 26 scales.” (Ethan Mollick, Associate Professor at The Wharton School).

(Image Source: Ethan Mollick/Google Research) 

In its study, Google describes AMIE as “A research AI system for diagnostic medical reasoning and conversations.” It says:

“Recent progress in large language models (LLMs) outside the medical domain has shown that they can plan, reason, and use relevant context to hold rich conversations. However, there are many aspects of good diagnostic dialogue that are unique to the medical domain.”

Clinicians take a complete clinical history, ask “intelligent questions,” and “wield considerable skill” to make diagnoses, foster patient relationships, and make decisions with the patient. The tech giant says that AMIE (Articulate Medical Intelligence Explorer) was developed because there’s “been little work specifically aimed towards developing these kinds of conversational diagnostic capabilities.”

To test AMIE, Google developed its pilot evaluation rubric and a “randomized, double-blind crossover study of text-based consultations with validated patient actors interacting either with board-certified primary care physicians (PCPs) or the AI system optimized for diagnostic dialogue.”

AMIE reportedly performed “at least as well as PCPs when both were evaluated along multiple clinically-meaningful axes of consultation quality.” But Google qualifies AMIE’s limitations as a “first exploratory step,” saying its evaluation technique likely underestimates the real-world value of human conversations. 
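Google’s setup is essentially a randomized, blinded pairwise comparison: each scenario is assigned to an arm (human doctor or AI), and raters score both arms on the same axes without knowing which is which. The sketch below illustrates that general protocol in miniature; the function names, arm labels, and rating axes are hypothetical illustrations, not taken from the study.

```python
import random

def assign_arms(scenario_ids, seed=0):
    """Randomly split scenarios between arms for the first round of a
    crossover design (each scenario would later run in the other arm too)."""
    rng = random.Random(seed)
    order = list(scenario_ids)
    rng.shuffle(order)
    half = len(order) // 2
    # First half of the shuffled list starts in the AI arm, the rest with PCPs.
    return {sid: ("AI" if i < half else "PCP") for i, sid in enumerate(order)}

def compare_ratings(ratings_ai, ratings_pcp):
    """Count the axes on which the AI arm scored higher, lower, or tied."""
    wins = losses = ties = 0
    for axis in ratings_ai:
        if ratings_ai[axis] > ratings_pcp[axis]:
            wins += 1
        elif ratings_ai[axis] < ratings_pcp[axis]:
            losses += 1
        else:
            ties += 1
    return wins, losses, ties

# Example with made-up scores: the AI wins one axis, loses one, ties one.
arms = assign_arms([1, 2, 3, 4])
result = compare_ratings({"empathy": 4, "clarity": 5, "thoroughness": 3},
                         {"empathy": 3, "clarity": 5, "thoroughness": 4})
```

A per-axis tally like this is how a headline such as “AMIE beat the PCPs on 28 of 32 characteristics” is produced, though the real study adds blinding of raters and statistical testing on top.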

The tech giant isn’t boasting that AMIE passed the Turing test, but the results are certainly interesting, further illustrating the rapid progress of generative AI and LLMs. 

When Will the Turing Test AI Milestone Be Passed?

Mustafa Suleyman is a co-founder of DeepMind, now a division of Google, and also a co-founder of Inflection.ai. Of the Turing test, he says:

“It’s totally unclear whether this is a meaningful milestone or not. It doesn’t tell us anything about what the system can do or understand, anything about whether it has established complex inner monologues or can engage in planning over abstract time horizons, which is key to human intelligence.”

Suleyman argues the 70-year-old Turing test should be replaced. He suggests a test for “artificial capable intelligence” (ACI), aimed at programs that can set goals and achieve complex tasks with minimal intervention. The AI expert expects AI to pass this threshold within the next two years, with consequences for the world economy that he calls “seismic.”

DeepMind co-founder and chief AGI scientist for Google, Shane Legg, predicts there’s a 50% chance that AGI (Artificial General Intelligence) will be developed by 2028, per a Time article discussing “When Might AI Outsmart Us?”

Anthropic co-founder and CEO Dario Amodei expects “human-level” AI in two to three years. OpenAI CEO Sam Altman says AGI is achievable in the next four to five years. 

An AI Impacts survey of 1,712 AI experts asked when they thought AI would be able to accomplish tasks better and more cheaply than humans, and the results were not as optimistic as some AI leaders. 

(Image Source: AI Impacts survey, via Time: https://time.com/6556168/when-ai-outsmart-humans/)

Of course, it’s not clear when the Turing test will be universally considered passed, or when ACI or AGI will be achieved. However, with AI spending climbing into the hundreds of billions of dollars, the race between developers is certainly underway, and LLMs are likely to evolve at least as quickly in 2024 as they did in 2022 and 2023.

For more about the future of AI, try our 2024 Prediction Series Wrap-Up: Our Top 7 AI Predictions for 2024. Or read The Future Unveiled: 5 AI-Driven Employment Opportunities Soon to Emerge.

Frequently Asked Questions

Has any AI actually passed the Turing Test?

There is still no universal consensus that any AI has definitively passed the classical Turing Test. The test asks whether a human interrogator can distinguish a machine from a human through conversation alone. ELIZA and Eugene Goostman are two systems that some observers have argued passed, but those claims remain disputed.

Did Google’s AMIE pass the Turing Test?

Not in the strict classical sense. In Google’s study, 149 actors playing patients texted with either one of 20 primary care doctors or AMIE. Ethan Mollick summarized the result by noting that human doctors rated AMIE higher on 28 of 32 characteristics and tied the other four, while patients rated it higher on 24 of 26 scales. That makes AMIE a strong Turing-like result for medical dialogue, but not definitive proof that a machine is indistinguishable from a human in open-ended conversation.

Is the Turing Test still the best way to judge an AI chatbot?

Not by itself. The Turing Test measures how human a chatbot seems, but many users care more about whether answers are accurate, grounded, and verifiable. A reported benchmark showed CustomGPT.ai outperforming OpenAI in RAG accuracy, which illustrates the difference between sounding fluent and answering correctly from known sources. In practice, a stronger evaluation asks: is the answer grounded in source material, does it stay accurate on follow-up questions, and can you verify it?

Why do some advanced chatbots still sound non-human?

Human-like style and task performance are different skills. Dlubal Software’s assistant Mia supports 130,000+ structural engineering users across 132 countries in 10 languages. George Dlubal said, “The assistant has enabled us to offer 24/7 support while improving accuracy and speed of response. This has led to a noticeable increase in customer satisfaction and even faster support.” That shows a chatbot can be highly useful even if its pacing, repetition, or phrasing still feels less natural than a person’s.

Do human-like chatbots still need citations and human oversight?

Yes, especially in high-stakes domains. The Tokenizer built its regulatory assistant on a database developed over three years, and Michael Juul Rugaard said, “Based on our huge database, which we have built up over the past three years, and in close cooperation with CustomGPT, we have launched this amazing regulatory service, which both law firms and a wide range of industry professionals in our space will benefit greatly from.” A chatbot can sound persuasive and still be wrong, so source grounding, citations, and human review matter more than human-like phrasing alone.

Can a chatbot be useful even if it never passes the Turing Test?

Yes. Many teams use chatbots to automate proposals, answer customer inquiries, and surface internal knowledge rather than to imitate human conversation perfectly. Stephanie Warlick said, “Check out CustomGPT.ai where you can dump all your knowledge to automate proposals, customer inquiries and the knowledge base that exists in your head so your team can execute without you.” In real deployments, usefulness, accuracy, and speed often matter more than whether a chatbot can fool a human judge.

Related Resources

If you’re thinking about how chatbot reliability affects human-like performance, this guide adds useful context.

  • Reducing AI Hallucinations — Explore how CustomGPT.ai helps minimize fabricated answers so chatbot responses stay more accurate, trustworthy, and useful.
