Apple: “AI Can’t Think” — Is AI Reasoning Fake?

The video explains that Apple’s recent publication challenges the idea that large language models genuinely reason or think, suggesting their impressive performance may be inflated by data contamination and pattern matching. It highlights the use of controlled puzzle environments to better evaluate reasoning, revealing that current models struggle with complex tasks and that true generalizable reasoning in AI remains an ongoing challenge.

The video discusses Apple’s recent publication, “The Illusion of Thinking,” which has gained significant attention online. In the paper, Apple challenges the notion that large language models (LLMs) with reasoning capabilities are truly thinking or reasoning in a human-like way. Apple argues that their abilities are often overestimated, partly because of data contamination: the models have been trained on the very benchmarks they are tested against, which inflates their measured performance. The paper asks whether these models genuinely understand and reason, or whether they are simply pattern matching and memorizing data.

Apple proposes a new approach to evaluating reasoning models using controllable puzzle environments: Tower of Hanoi, river crossing, checker jumping, and blocks world. Each puzzle’s complexity can be scaled systematically by adding elements (more disks, checkers, blocks, or people to ferry), giving a controlled, contamination-free way to assess reasoning. Unlike traditional benchmarks, these puzzles emphasize algorithmic reasoning and allow precise evaluation of the models’ problem-solving process, including intermediate reasoning steps rather than only final answers.
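To make the setup concrete, here is a minimal sketch (not the paper’s actual harness) of what such a controlled evaluation can look like: a Tower of Hanoi instance whose difficulty scales with the number of disks, a reference solver, and a checker that replays a candidate move sequence step by step instead of only grading the final answer.

```python
# Illustrative sketch of a controlled puzzle evaluation: Tower of Hanoi scales
# in complexity with the disk count, and a proposed move sequence can be
# checked move by move rather than only by its final answer.

def hanoi_moves(n, src=0, aux=1, dst=2):
    """Reference solution: list of (from_peg, to_peg) moves for n disks."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, dst, aux)
            + [(src, dst)]
            + hanoi_moves(n - 1, aux, src, dst))

def validate_moves(n, moves):
    """Replay a candidate move sequence and report the first illegal step."""
    pegs = [list(range(n, 0, -1)), [], []]  # peg 0 holds disks n..1, largest at the bottom
    for step, (src, dst) in enumerate(moves):
        if not pegs[src]:
            return False, f"step {step}: peg {src} is empty"
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False, f"step {step}: cannot place disk {disk} on a smaller disk"
        pegs[dst].append(pegs[src].pop())
    solved = len(pegs[2]) == n
    return solved, "solved" if solved else "legal moves, but puzzle not solved"

if __name__ == "__main__":
    for n in range(1, 6):                  # complexity grows with the disk count
        reference = hanoi_moves(n)         # optimal solution has 2**n - 1 moves
        ok, msg = validate_moves(n, reference)
        print(f"{n} disks: {len(reference)} moves, valid={ok} ({msg})")
```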

The core finding is that current state-of-the-art reasoning models perform well on simple puzzles but struggle as complexity increases; past a certain complexity threshold, both “thinking” and “non-thinking” models fail, with accuracy collapsing sharply. Interestingly, when given the same inference token budget, non-thinking models can match thinking models by generating multiple candidate solutions and selecting the best, suggesting that the perceived advantage of reasoning models may come partly from spending more tokens and from data contamination.
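The equal-budget comparison can be illustrated with a small sketch, roughly in the spirit of a pass@k comparison at matched inference compute. The `generate` and `score` callables below are hypothetical stand-ins (a real setup would call a model API and a puzzle validator); the point is only the selection logic: split one token budget across several short candidates and keep the best-scoring one.

```python
# Sketch of best-of-n sampling under a fixed token budget. generate() and
# score() are placeholders, not any specific model API.

from typing import Callable, List, Tuple

def best_of_n(generate: Callable[[str, int], str],
              score: Callable[[str], float],
              prompt: str,
              token_budget: int,
              n_candidates: int) -> Tuple[str, float]:
    """Split the token budget across n candidates and keep the highest-scoring one."""
    per_candidate = token_budget // n_candidates
    candidates: List[str] = [generate(prompt, per_candidate) for _ in range(n_candidates)]
    scored = [(c, score(c)) for c in candidates]
    return max(scored, key=lambda pair: pair[1])

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    import random
    fake_generate = lambda prompt, budget: f"candidate using {budget} tokens, quality={random.random():.2f}"
    fake_score = lambda text: float(text.split("quality=")[1])
    best, s = best_of_n(fake_generate, fake_score, "solve the puzzle",
                        token_budget=8000, n_candidates=8)
    print(best, s)
```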

The analysis also uncovers limitations in the models’ reasoning effort, such as overthinking simple problems and exploring solutions inefficiently on more complex ones. Even when given an explicit algorithm to follow, models often fail to improve significantly, pointing to fundamental issues in their reasoning and verification capabilities. The video notes, however, that these models can sometimes solve the puzzles by writing code, a form of problem-solving the paper’s authors did not fully consider, which raises questions about how we evaluate intelligence and reasoning in AI systems.
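As an illustration of that code-writing route (not the specific program shown in the video), a short brute-force search settles the classic wolf-goat-cabbage river crossing exactly, which is a different kind of competence than reasoning through the moves in prose. Whether emitting such a program should count as “reasoning” is exactly the question the video raises.

```python
# Breadth-first search over river-crossing states: each state records which
# items are on the left bank and where the boat is; unsafe states (wolf with
# goat, or goat with cabbage, left unattended) are pruned.

from collections import deque

ITEMS = ("wolf", "goat", "cabbage")

def unsafe(left, boat_on_left):
    """Items on the unattended bank must not be able to eat each other."""
    unattended = left if not boat_on_left else frozenset(ITEMS) - left
    return ({"wolf", "goat"} <= unattended) or ({"goat", "cabbage"} <= unattended)

def solve():
    start = (frozenset(ITEMS), True)   # everything and the boat on the left bank
    goal = (frozenset(), False)        # everything ferried to the right bank
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (left, boat_on_left), path = queue.popleft()
        if (left, boat_on_left) == goal:
            return path
        bank = left if boat_on_left else frozenset(ITEMS) - left
        for cargo in list(bank) + [None]:          # ferry one item, or cross empty
            new_left = set(left)
            if cargo is not None:
                (new_left.remove if boat_on_left else new_left.add)(cargo)
            state = (frozenset(new_left), not boat_on_left)
            if state not in seen and not unsafe(*state):
                seen.add(state)
                queue.append((state, path + [cargo or "nothing"]))
    return None

if __name__ == "__main__":
    # e.g. ['goat', 'nothing', 'wolf', 'goat', 'cabbage', 'nothing', 'goat']
    print(solve())
```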

Finally, the video concludes with reflections on the broader implications of these findings. While current models show impressive capabilities in many areas, their limitations in systematic reasoning at higher complexities suggest that true generalizable reasoning remains a challenge. The speaker also critiques the optimistic view that AI will soon achieve human-level intelligence, emphasizing that models excel in some tasks but still struggle with fundamental reasoning and verification. The discussion underscores the importance of developing more rigorous, contamination-free benchmarks and considering alternative measures of AI reasoning, such as code generation, to better understand and evaluate their true capabilities.