LLMs can’t reason – The Reversal Curse, the Alice in Wonderland test, and the ARC-AGI Challenge


There is a lot of magical thinking claiming that Large Language Models are on the verge of achieving Artificial General Intelligence, or AGI. While increasing training data and compute has produced impressive performance gains in these systems, there is a more sobering reality: LLMs can’t actually reason, and benchmarks largely measure how well an LLM has memorized programs rather than its ability to synthesize new and novel solutions to problems.

In the following, we will present several examples of why the broader AI research community doubts Transformer-based Large Language Models’ ability to truly reason, examine the problem with benchmarks, and argue that we should focus less on the next huge model and more on how generative AI can enhance productivity in its current form. AGI is an exciting idea, but it isn’t here yet, and it isn’t even necessary for what the majority of users need.

Down the Rabbit Hole we go

A paper by Marianna Nezhurina et al. proposes that LLMs, while touted as quite capable, actually struggle with reasoning. The authors demonstrate this struggle by testing several state-of-the-art LLMs (both closed and open-source) on a simple, common-sense reasoning task they call the “Alice in Wonderland” (AIW) problem. Here’s the prompt, which you can test yourself:

[Prompt] “Alice has N brothers and she also has M sisters. How many sisters does Alice’s brother have?”

[Correct Answer] Alice’s brother has M+1 sisters (Alice’s M sisters, plus Alice herself).

We tested Claude 3 Opus, Llama 3 70B, Gemini Advanced, Mixtral 8x22B, and GPT-4o. On this prompt, only Gemini Advanced produced the correct answer. However, in the paper, the authors show that Claude 3 Opus was, at times, able to solve this simple prompt.
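One easy way to reduce the chance of hitting a memorized phrasing is to randomize the numbers in the prompt. Below is a minimal Python sketch (ours, not from the paper) that generates fresh AIW-style prompts together with the expected answer, which you can paste into any chatbot you want to test.

```python
import random

def make_aiw_prompt(seed=None):
    """Build a randomized AIW-style prompt and its expected answer.

    The correct answer is always M + 1: Alice's brother has Alice's M sisters
    plus Alice herself as sisters.
    """
    rng = random.Random(seed)
    n_brothers = rng.randint(1, 6)  # N
    m_sisters = rng.randint(1, 6)   # M
    prompt = (
        f"Alice has {n_brothers} brothers and she also has {m_sisters} sisters. "
        "How many sisters does Alice's brother have?"
    )
    return prompt, m_sisters + 1

if __name__ == "__main__":
    for i in range(3):
        prompt, expected = make_aiw_prompt(seed=i)
        print(f"{prompt}  (expected answer: {expected})")
```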

So why does this happen? The paper highlights a dramatic breakdown in the reasoning capabilities of the LLMs tested. Even when encouraged to think carefully and double-check their work, most models consistently provided wrong answers. This suggests that their “reasoning” processes are not reliable, and they often fall back on memorized patterns or flawed logic. 

Curiouser and Curiouser

But what about the high benchmark scores of state-of-the-art LLMs like GPT-4o and Gemini Ultra? These models have boasted near human-level capabilities on tests like the MMLU, or Massive Multitask Language Understanding, benchmark. Moreover, OpenAI claimed that GPT-4 passed the Bar Exam with a score in the 90th percentile! It turns out that when PhD candidate Eric Martinez from MIT re-evaluated these results, he found GPT-4’s score was actually closer to the 48th percentile overall and the 15th percentile on the Essay portion. OpenAI’s comparison pool consisted largely of people who had already taken the Bar Exam and failed, so GPT-4 was measured against their repeat test scores. The sample group that GPT-4’s score was calculated against was therefore already less likely to score well on the exam, skewing the results in GPT-4’s favor.

The Reversal Curse

LLMs Trained on “A is B” fail to learn “B is A”.

Who is Tom Cruise’s mother? While you wouldn’t be expected to know this off the top of your head (or at all, really), GPT-4o certainly knows that the answer is Mary Lee Pfeiffer. In LLM circles, many people know her name because it is the canonical example of how Transformer-based LLMs can’t really reason. What does Tom Cruise’s mother have to do with language model reasoning? Well, if you ask ChatGPT who Mary Lee Pfeiffer’s son is, it doesn’t know the answer. Or… at least it didn’t. When we ran this test again, it was able to tell us that, indeed, her son’s name is Tom Cruise!

So maybe OpenAI has solved this problem and truly improved the reasoning of its latest model? To check, we asked the same question but switched the celebrity to Tom Hanks. ChatGPT answered “Tom Hanks’s mother is Janet Marylyn Frager. She was a hospital worker and of Portuguese descent.” Then we asked, “Who is Janet Marylyn Frager’s son?” to which it answered “Janet Marylyn Frager’s son is Tom Cruise, the famous American actor and producer. Tom Cruise was born Thomas Cruise Mapother IV on July 3, 1962.”

Whoops! It looks like GPT-4o was trained on the “Tom Cruise’s mother” test. Not only has it not suddenly become a better reasoner, it shows clear overfitting on this question: its answer gives away its assumption that it is being asked the familiar trick question.
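If you want to reproduce this kind of check yourself, here is a rough sketch using the OpenAI Python SDK. It assumes the `openai` package is installed and an OPENAI_API_KEY is set in your environment; the model name is an assumption you should swap for whichever model you want to probe, and the parent/child pairs are the two from above.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PAIRS = [
    ("Tom Cruise", "Mary Lee Pfeiffer"),
    ("Tom Hanks", "Janet Marylyn Frager"),
]

def ask(question: str) -> str:
    """Send a single question and return the model's text reply."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: replace with the model you are testing
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

for child, mother in PAIRS:
    forward = ask(f"Who is {child}'s mother?")
    reverse = ask(f"Who is {mother}'s son?")
    print(f"Forward: Who is {child}'s mother? -> {forward}")
    print(f"Reverse: Who is {mother}'s son? -> {reverse}")
```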

It Gets Worse

François Chollet from Google recently appeared on the Dwarkesh Patel podcast to discuss the benchmark he created, the Abstraction and Reasoning Corpus for Artificial General Intelligence, or ARC-AGI. The ARC-AGI Prize website states the following: “Most AI benchmarks measure skill. But skill is not intelligence. General intelligence is the ability to efficiently acquire new skills. Chollet’s unbeaten 2019 Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) is the only formal benchmark of AGI. It’s easy for humans, but hard for AI.”
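For readers who haven’t seen an ARC task: each one in the public repository (github.com/fchollet/ARC) is a small JSON file containing a few demonstration input/output grid pairs plus one or more test grids, where a grid is a list of lists of integers 0-9. A minimal sketch of loading a task and scoring a candidate solver against its demonstration pairs might look like this (the filename below is a placeholder):

```python
import json

# Placeholder path to a task file from the public ARC repository;
# each task is a JSON object with "train" and "test" lists of
# {"input": grid, "output": grid} pairs.
with open("data/training/0a1b2c3d.json") as f:  # hypothetical filename
    task = json.load(f)

def solve(grid):
    """Stand-in 'solver' that returns the grid unchanged.

    A real solver must infer the transformation from the few train pairs
    and apply it to the test inputs.
    """
    return grid

# Check the candidate solver against the demonstration pairs.
for i, pair in enumerate(task["train"]):
    ok = solve(pair["input"]) == pair["output"]
    print(f"train pair {i}: {'correct' if ok else 'wrong'}")
```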

While it was created back in 2019, no LLM has been able to crack 35% (the average human can solve 85% of these questions). The reason, he argues, is that LLMs are essentially overfit on the benchmarks they are tested on, and while there is something impressive about that phenomenon, it isn’t reasoning (not even close). Overfitting is when a model is overtrained on data so that, instead of generalizing at inference, it reproduces what it remembers from training. This isn’t necessarily a bad thing, but when AI companies claim that their models are reasoning, it’s tough to prove, since these claims are always tied to a specific benchmark whose questions very likely leaked into the training data during pre-training.
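As a toy illustration of overfitting (deliberately far simpler than an LLM), a high-degree polynomial can pass almost exactly through a handful of noisy training points while doing much worse on held-out points. A minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten noisy training points from a simple underlying function.
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.1, x_train.shape)

# Held-out points from the same (noise-free) function.
x_test = np.linspace(0.05, 0.95, 10)
y_test = np.sin(2 * np.pi * x_test)

# A degree-9 polynomial can thread through all ten training points ("memorization").
coeffs = np.polyfit(x_train, y_train, deg=9)
train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)

print(f"train MSE: {train_err:.4f}")  # close to zero
print(f"test MSE:  {test_err:.4f}")   # typically much larger
```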

The conversation with the host, Dwarkesh, is at times pretty tough to watch. He asks over and over why LLMs aren’t doing something similar to what happens in the human brain, and why that shouldn’t be considered reasoning in the traditional sense of the term.

In reply, François argues that LLMs are fundamentally limited in their ability to reason because they primarily rely on memorization and interpolation. They are essentially large, complex databases of patterns and information that can be used to generate outputs based on what they’ve seen before. He contrasts this with true reasoning, which he defines as the ability to synthesize new programs or solutions on the fly, based on existing knowledge and understanding. LLMs struggle with this because they lack the ability to adapt and learn from novel situations.

Dwarkesh Patel, while acknowledging François’s skepticism, argued that LLMs might be on a path to AGI by highlighting their ability to generalize and learn new tasks, even from limited data. Here are some of the key points he brought up and how François countered them:

1. “In-context learning” with Gemini 1.5 learning a new language from a dictionary:

  • Dwarkesh: He pointed out how Gemini 1.5, given a dictionary and grammar book of a low resource language with few speakers, could speak and translate that language, suggesting efficient learning.
  • François: He countered that this was likely still based on the model’s massive pre-training data, where it had learned patterns and templates that it could apply to the new language. He argued that LLMs struggle with true novelty, and this task might not have been truly out of distribution for the model.

2. Scaling of models and learning capacity:

  • Dwarkesh: He noted how larger LLMs seem to pick up more complex reasoning patterns and skills that smaller models couldn’t handle, suggesting a path to greater generalization through scaling.
  • François: He admitted that larger models might be able to learn more complex patterns, but he argued that this was simply an increase in skill, not intelligence. He emphasized that true intelligence is not about scaling up specific skills, but about the ability to learn and adapt to any novel situation with minimal data.

3. The human spectrum of intelligence and LLMs’ place on it:

  • Dwarkesh: He pointed out the variability in human intelligence, noting how some humans struggle with basic reasoning tasks while others excel. He suggested that LLMs might be on this same spectrum and that their current limitations might be overcome with further development.
  • François: He countered that LLMs, even with their large size, are still vastly under-parameterized compared to the human brain. He argued that their current “reasoning” is more akin to memorization and applying pre-learned patterns, and that they haven’t shown the ability to adapt to truly novel situations like the ARC benchmark demands.

4. The possibility of program synthesis within LLMs:

  • Dwarkesh: He proposed that perhaps the inner workings of LLMs might be performing a form of program synthesis, where they combine representations learned from their training data into new solutions.
  • François: He countered that if LLMs were truly capable of program synthesis, they would perform much better on ARC. He argued that their current capabilities still heavily rely on memorized patterns and templates, and that they haven’t demonstrated the ability to create truly novel programs.

5. The role of “training” and “memorization” in human learning:

  • Dwarkesh: He argued that humans, despite their innate abilities, also require significant “training” to learn complex skills, suggesting that LLMs might be following a similar learning trajectory.
  • François: He countered that the type of learning humans undergo is not simply memorization. While humans do learn by practice and repetition, they also possess a unique ability to generalize and adapt to new situations, something LLMs haven’t yet achieved.

Throughout the conversation, François consistently emphasized that the current capabilities of LLMs are impressive but limited, and that true AGI requires a different paradigm that goes beyond memorization and interpolation. He believes that a focus on discrete program search and synthesis, combined with elements of deep learning, holds the potential to achieve AI systems that can truly reason and adapt to novel situations.
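To make the contrast concrete, here is a minimal, made-up sketch of discrete program search: enumerate short compositions of primitives from a tiny domain-specific language and keep any program that reproduces every demonstration pair. The primitives and example task below are purely illustrative and are not Chollet’s actual proposal.

```python
from itertools import product

# A tiny DSL of grid transforms (grids are tuples of tuples of ints).
def identity(g): return g
def rotate90(g): return tuple(zip(*g[::-1]))
def flip_h(g):   return tuple(row[::-1] for row in g)
def flip_v(g):   return tuple(g[::-1])

PRIMITIVES = [identity, rotate90, flip_h, flip_v]

def search(examples, max_depth=3):
    """Brute-force search for a composition of primitives consistent with all examples."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            def run(g, program=program):
                for step in program:
                    g = step(g)
                return g
            if all(run(inp) == out for inp, out in examples):
                return [f.__name__ for f in program]
    return None

# Made-up demonstration pair: the hidden rule is a 180-degree rotation.
examples = [
    (((1, 2), (3, 4)), ((4, 3), (2, 1))),
]
print(search(examples))  # e.g. ['rotate90', 'rotate90'] or another equivalent program
```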

Conclusion

While this post may be disappointing to some, it’s important to understand where the state of the art in language models actually stands. They are undoubtedly amazing tools, and it’s difficult to imagine life without them. When generative AI burst onto the scene in 2022, nobody could have predicted the impact it would have on our lives. These models can summarize text, perform sentiment analysis, do classification, write code, and create images, audio, video, and more. However, we’re still in the very early days of Artificial Intelligence and nowhere near AGI, let alone Artificial Super Intelligence (ASI). So let’s enjoy these tools for what they are and not oversell them for what they aren’t. At the end of the day, the vast majority of businesses are getting value from a good RAG system. Nobody is offloading pure reasoning tasks to AI-powered chatbots, and that’s probably a good thing since, as we’ve just demonstrated, they aren’t any good at that anyway…


