On Saturday, Apple released a research paper that investigates the capabilities and limitations of newly developed reasoning models, also termed large reasoning models (LRMs). These models are engineered to tackle complex problems by allocating additional computational resources. However, the findings reveal that even the most sophisticated models break down on highly complex tasks, with accuracy collapsing outright rather than improving as the extra compute is applied.
Apple Highlights the Limits of Reasoning Models
The paper, titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity,” is accessible on Apple’s website. In it, the researchers argue that LRMs and large language models (LLMs), which lack dedicated reasoning capabilities, behave differently from one another depending on problem complexity, which the study divides into three levels.
The study categorizes tasks into three complexity regimes—low, medium, and high. To evaluate the performance of LLMs and LRMs across these varying complexities, the researchers employed several puzzles that increase in difficulty, including the Tower of Hanoi.
The Tower of Hanoi is a well-known mathematical puzzle involving three pegs and several disks stacked in decreasing order of size. The goal is to transfer the disks from the leftmost peg to the rightmost peg, moving one disk at a time and never placing a larger disk on a smaller one. The basic puzzle is not considered particularly difficult; versions of it are sold as toys for children aged six to fifteen.
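The difficulty, however, grows steeply with the number of disks: solving an n-disk puzzle optimally takes 2^n - 1 moves. A minimal recursive solver in Python (an illustrative sketch, not code from Apple's paper) makes the growth concrete:

```python
# Minimal recursive Tower of Hanoi solver (illustrative sketch, not from Apple's paper).
# Moving n disks between pegs takes at least 2**n - 1 moves.

def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Return the list of (disk, from_peg, to_peg) moves that solves an n-disk puzzle."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)  # park the n-1 smaller disks on the spare peg
    moves.append((n, source, target))           # move the largest remaining disk to the target
    hanoi(n - 1, spare, target, source, moves)  # stack the smaller disks back on top of it
    return moves

print(len(hanoi(3)))   # 7 moves (2**3 - 1)
print(len(hanoi(10)))  # 1023 moves (2**10 - 1)
```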
Mathematical puzzles tackled by reasoning models (Photo Credit: Apple)
For their experiment, Apple researchers selected two reasoning models alongside their non-reasoning counterparts. The LLMs were Claude 3.7 Sonnet and DeepSeek-V3, while the LRMs were Claude 3.7 Sonnet with Thinking and DeepSeek-R1. Each model was allocated a maximum thinking budget of 64,000 tokens. The study assessed not only the accuracy of the final answer but also the logical validity of each intermediate step taken towards solving the puzzle.
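Checking the steps, rather than just the final answer, amounts to simulating the puzzle and verifying each proposed move against its rules. A simple validator along these lines (a sketch for illustration, not the researchers' actual evaluation harness) shows the idea for the Tower of Hanoi:

```python
# Illustrative step-level checker for Tower of Hanoi solutions
# (a sketch of the idea, not Apple's actual evaluation code).

def is_valid_solution(n, moves):
    """Check that a sequence of (disk, from_peg, to_peg) moves legally solves an n-disk puzzle."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # disk n sits at the bottom of peg A
    for disk, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disk:
            return False  # the disk being moved must be on top of its source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # a larger disk can never be placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))  # every disk must end up on the target peg

# A correct two-disk solution passes the check:
print(is_valid_solution(2, [(1, "A", "B"), (2, "A", "C"), (1, "B", "C")]))  # True
```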
The low complexity task involved up to three disks, the medium complexity task ranged from four to ten disks, and the high complexity scenario required the models to handle between 11 and 20 disks.
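Because an n-disk Tower of Hanoi requires at least 2^n - 1 moves, these regimes translate into optimal solutions of at most 7 moves at low complexity, up to 1,023 moves at medium complexity, and as many as 1,048,575 moves in the 20-disk high complexity case.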
Researchers observed that both LLMs and LRMs performed comparably on the low complexity task. At medium complexity, the reasoning models pulled ahead, with the additional computation translating into higher accuracy. In the high complexity scenarios, however, both categories of models suffered a complete collapse in accuracy.
The experiments were also replicated with additional models and on other puzzles, such as Checker Jumping, River Crossing, and Blocks World.
This research underlines concerns already voiced by others in the artificial intelligence (AI) field. Although reasoning models generalize well within the distribution of their training data, they struggle with problems that fall outside it, often resorting to shortcuts or abandoning the task entirely.
“Current evaluation standards primarily focus on established mathematical and coding benchmarks, emphasizing the accuracy of final answers. This approach suffers from data contamination issues and fails to provide insights into the structure and quality of reasoning processes,” the company noted in a post.