On Saturday, Apple released a research paper examining the capabilities and limitations of recently developed reasoning models, also referred to as large reasoning models (LRMs). These models are designed to tackle complex problems by spending additional compute on step-by-step reasoning before producing an answer. However, the study indicates that even the most sophisticated models struggle as problem complexity grows, often failing to solve the most difficult tasks at all.
Apple Claims Reasoning Models Have Limited Thinking Abilities
Titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity,” the paper, available on Apple’s website, asserts that both LRMs and large language models (LLMs) exhibit different behaviors across three categories of complexity.
The study categorizes tasks into low, medium, and high complexity. To analyze how LRMs and LLMs perform with varying levels of complexity, the researchers employed a series of puzzles designed with increasing difficulty. One notable example used was the Tower of Hanoi.
The Tower of Hanoi is a classic mathematical puzzle involving three pegs and a number of disks stacked in decreasing size to form a pyramid. The task requires moving the entire stack from the leftmost peg to the rightmost one, moving one disk at a time and never placing a larger disk on top of a smaller one. Considered a fairly straightforward challenge, it is frequently presented to children aged six to 15.
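For readers unfamiliar with the puzzle, its standard recursive solution can be sketched in a few lines of Python. This sketch is purely illustrative and is not taken from Apple's paper; the function name solve_hanoi and the peg labels are our own.

```python
def solve_hanoi(n, source="left", target="right", spare="middle"):
    """Return the list of moves that transfers n disks from source to target."""
    if n == 0:
        return []
    # Move the top n-1 disks to the spare peg, move the largest disk to the
    # target, then move the n-1 disks from the spare peg onto it.
    return (solve_hanoi(n - 1, source, spare, target)
            + [(source, target)]
            + solve_hanoi(n - 1, spare, target, source))

print(solve_hanoi(3))
# [('left', 'right'), ('left', 'middle'), ('right', 'middle'),
#  ('left', 'right'), ('middle', 'left'), ('middle', 'right'), ('left', 'right')]
print(len(solve_hanoi(3)))  # 7 moves, the minimum for three disks
```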
Mathematical puzzles evaluated by reasoning models (Photo Credit: Apple)
In this investigation, Apple researchers selected two reasoning models along with their non-reasoning counterparts for a comparative analysis. The LLMs were Claude 3.7 Sonnet and DeepSeek-V3, while the LRMs were Claude 3.7 Sonnet with Thinking and DeepSeek-R1, each given a maximum “thinking budget” of 64,000 tokens. The experiment aimed not solely at determining final-answer accuracy but also at assessing the logical soundness of the intermediate steps taken to solve the puzzles.
Low-complexity tasks featured up to three disks, medium-complexity tasks involved four to ten disks, and high-complexity tasks included between eleven and twenty disks.
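For context (this arithmetic is ours, not the paper's), the shortest solution to an n-disk Tower of Hanoi takes 2^n - 1 moves, so the three complexity tiers correspond to vastly different solution lengths:

```python
# The optimal n-disk Tower of Hanoi solution takes 2**n - 1 moves, so the
# complexity tiers differ by orders of magnitude in how many steps a model
# must get right.
for n in (3, 10, 20):
    print(f"{n} disks -> {2**n - 1:,} moves")
# 3 disks -> 7 moves
# 10 disks -> 1,023 moves
# 20 disks -> 1,048,575 moves
```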
Findings revealed that both LLMs and LRMs performed comparably well on low-complexity tasks. As the difficulty increased, the reasoning models pulled ahead in accuracy, helped by the extra compute they devote to intermediate reasoning. Yet once they encountered high-complexity tasks, both types of models showed a complete collapse in accuracy.
The research team indicated that they replicated the study with additional models and various puzzles, including Checkers Jumping, River Crossing, and Blocks World.
The conclusions drawn in Apple’s paper echo concerns voiced by others in the artificial intelligence sector. Despite their ability to generalize within their learned datasets, reasoning models often falter when faced with problems outside of their training scope, resorting to shortcuts or abandoning the problem altogether.
“Current evaluations mainly focus on established mathematical and coding benchmarks, emphasizing the accuracy of final answers. However, this evaluation approach is frequently hindered by data contamination and falls short of providing insights into the structure and quality of reasoning processes,” the company stated in a post accompanying the research.