Defining a Reasoning Model
Before diving into Apple’s research, it’s important to understand what reasoning models actually are and how they differ from conventional large language models.
At their core, all large language models work by predicting the next token in a sequence. You feed the model a prompt, and it completes the text token by token, one word or symbol fragment at a time. That makes these models inherently reactive rather than deliberative: they extend the text one step at a time rather than planning a solution in advance.
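For intuition, here is a toy sketch of greedy, one-token-at-a-time generation; the tiny lookup table stands in for a real model and is purely illustrative.

```python
# Toy stand-in for a language model: map the last token to a "most likely" next token.
NEXT = {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}

def generate(prompt_tokens, max_new_tokens=6):
    """Greedy autoregressive generation: repeatedly append the single most
    likely next token given the text produced so far, with no lookahead."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = NEXT.get(tokens[-1])
        if nxt is None:  # no known continuation; stop
            break
        tokens.append(nxt)
    return tokens

print(generate(["the"]))  # ['the', 'cat', 'sat', 'on', 'the', 'cat', 'sat']
```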
Reasoning models introduce a layer of structured thinking. Through techniques like chain-of-thought prompting and reinforcement learning, they are trained not just to predict the next word but to pause and reflect before answering. One such method, group relative policy optimization (GRPO), rewards the model for generating intermediate steps that lead to a correct solution.
The result is a model that mimics a kind of internal monologue before responding, simulating the way humans work through complex problems.
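As a rough illustration of the group-relative idea (not DeepSeek's actual training code), the sketch below normalizes each sampled answer's reward against the average of its group, so completions that beat the model's own typical attempt receive a positive learning signal; the function name, group size, and binary correctness reward are assumptions made for the example.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style signal: score each sampled answer relative to the mean
    reward of its own group, nudging the model toward completions that
    beat its typical attempt."""
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean_r) / std_r for r in rewards]

# Hypothetical group of four sampled solutions to one prompt, rewarded
# 1.0 if the final answer is correct and 0.0 otherwise.
rewards = [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))  # [1.0, -1.0, 1.0, -1.0]
```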

Apple’s Paper: The Illusion of Thinking
In The Illusion of Thinking, researchers from Apple set out to quantify how well reasoning models perform as complexity scales. Their tests included a classic recursion puzzle: the Tower of Hanoi.
The experiment was structured in stages. They started with basic instances of the puzzle and gradually increased the number of disks, which makes the minimum number of moves grow exponentially: a puzzle with n disks requires 2^n − 1 moves.
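To make that growth concrete, here is the standard recursive Tower of Hanoi solution in Python; the 2^n − 1 move count is a property of the puzzle itself, and the peg names are just labels for this sketch.

```python
def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Standard recursive solution: record the moves that transfer
    n disks from the source peg to the target peg."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)  # park the top n-1 disks on the spare peg
    moves.append((source, target))              # move the largest disk
    hanoi(n - 1, spare, target, source, moves)  # stack the n-1 disks back on top
    return moves

assert len(hanoi(8)) == 2**8 - 1  # 255 moves for the 8-disk case

for n in (3, 8, 20):
    print(f"{n} disks -> {2**n - 1} minimum moves")  # 7, 255, 1048575
```

Each added disk roughly doubles the length of the solution the model has to produce without a single misstep.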
The models tested included:
- GPT-4 and GPT-4 Turbo (reasoning variant)
- Claude 3 and Claude 3 Opus
- DeepSeek-V2 (open-source reasoning model)
The findings were surprising. In simple problems, both standard and reasoning models performed similarly. But as complexity increased, the models diverged before eventually collapsing entirely around the eight-disk mark.
Some reasoning models even underperformed compared to standard large language models on basic problems, likely due to overthinking and introducing errors.
Why Did They Fail?
The most controversial claim in Apple’s paper is that reasoning models give up once a problem becomes too complex. But is that a failure of reasoning or a reasonable limitation?
Modern large language models operate under strict constraints. Most are restricted in how many tokens they can process, how long they can spend thinking, and how expensive a computation can become. These are not technical flaws; they are design choices.
As a result, many models may choose to truncate their thought process once they detect an infeasible path forward. Critics argue that this is not failure; it is actually evidence of rational behavior under limited resources.
In fact, some models explicitly said as much during testing. When asked to solve a 20-disk Tower of Hanoi problem, DeepSeek-V2 responded that the problem would require over 10,000 steps (the optimal solution in fact takes 2^20 − 1, or 1,048,575, moves) and was not feasible within its current constraints.
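A crude way to picture that kind of budget reasoning, purely as an illustration rather than how any of these models actually decide, is to compare the number of moves a full solution would require against a fixed output budget; the token budget and tokens-per-move figures below are arbitrary assumptions.

```python
def feasible_hanoi(n_disks: int, token_budget: int, tokens_per_move: int = 20):
    """Illustrative feasibility check: estimate whether writing out a full
    Tower of Hanoi solution fits within a fixed output-token budget."""
    required_moves = 2**n_disks - 1
    estimated_tokens = required_moves * tokens_per_move
    return estimated_tokens <= token_budget, required_moves, estimated_tokens

ok, moves, tokens = feasible_hanoi(n_disks=20, token_budget=100_000)
print(ok, moves, tokens)  # False, 1048575, 20971500: decline rather than attempt
```

Under this framing, refusing the 20-disk instance looks less like a reasoning failure and more like a budgeting decision.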
This leads to a philosophical question. If a model determines that a task is too costly and chooses not to proceed, is that a lack of reasoning or a form of it?

Critics Respond: “The Illusion of the Illusion”
Shortly after Apple’s paper began circulating, critics emerged questioning the study’s methodology and conclusions. A common point of contention was the choice of test problems.
The Tower of Hanoi is a well-known puzzle in computer science, with its rules and solutions widely available across blogs, research papers, and training datasets. This raises the question: if a model solves simpler versions, is it truly reasoning, or just recalling what it has seen before?
Critics also noted that when models fail on more complex instances, it may not reflect a reasoning flaw but rather a lack of prior exposure or intentional limits. In some cases, models appeared to stop short not because they couldn’t continue, but because solving the problem would exceed practical compute limits, suggesting resource management rather than failure.
What Is Reasoning, Anyway?
At the heart of the debate is a bigger issue: we do not have a shared definition of reasoning in AI.
Is it:
- A process of step-by-step token generation
- An internal simulation of logical rules
- The ability to plan across unseen problems
Depending on who you ask, reasoning could mean all or none of the above. What is clear is that current benchmarks for reasoning are deeply entangled with how we design, constrain, and evaluate these systems.
Even DeepSeek’s own research shifts in its definitions. One early paper described reasoning improvements as boosting the rate of correct responses under top-k sampling, a statistical trick rather than a cognitive skill. A later paper claimed its models had improved reasoning ability through reinforcement learning.
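To see why sampling alone can masquerade as improved reasoning, consider a toy calculation (the 30% single-sample accuracy is made up for illustration): if one sample is correct with probability p, the chance that at least one of k samples is correct is 1 − (1 − p)^k, which climbs quickly with k even though the underlying model has not changed at all.

```python
def pass_at_k(p_correct: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct.
    Pure statistics: nothing about the model's ability has improved."""
    return 1 - (1 - p_correct) ** k

# Hypothetical model that answers a question correctly 30% of the time per sample.
for k in (1, 4, 16, 64):
    print(k, round(pass_at_k(0.30, k), 3))  # 0.3, 0.76, 0.997, 1.0
```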
So which is it? Are these models truly thinking or just better guessers?
Conclusion: The Debate Isn’t Over
In the battle between Apple’s skepticism and the AI world’s optimism, the answer for now is inconclusive.
Apple’s Illusion of Thinking paper raises important questions about how we define, train, and evaluate AI models that claim to reason. But it may also oversimplify complex behaviors that do not fit cleanly into a binary of what works and what does not.
What is clear is that this debate is far from over. As reasoning models evolve and as researchers better define what reasoning actually means, we will likely see new benchmarks, new architectures, and entirely new approaches emerge.
For now, the only certainty is that reasoning in AI is still very much a work in progress.