Apple's STUNNING Discovery "LLMs CAN'T REASON" | AGI cancelled

The video explores Apple’s critique of large language models’ reasoning abilities: the models handle simple tasks well but break down on complex problems, and their giving up can look more like recognizing their own limits than failing to reason. It questions whether these limitations reflect true reasoning deficits or are artifacts of training data and model design, ultimately debating what constitutes reasoning in AI.

The video discusses Apple’s recent research paper, “The Illusion of Thinking,” which examines the reasoning capabilities of large language models (LLMs). The presenter highlights the irony of Apple’s critical stance toward current AI models: Apple has no publicly known reasoning model of its own, and its AI products are regarded as among the weakest, yet it is publishing papers arguing that existing models are merely guessing machines rather than genuinely reasoning entities.

The core finding of the paper is that large reasoning models perform well on low- to medium-complexity tasks but fail completely on high-complexity problems. On simple tasks, non-reasoning models match or even outperform reasoning models, because the answers are straightforward or effectively memorized and extra “thinking” adds little. On more complex tasks that require step-by-step work, reasoning models show an advantage when given time to think through the problem. Beyond a certain complexity threshold, however, such as solving the Tower of Hanoi with many disks, both types of models collapse and cannot produce correct solutions, which the paper takes as evidence that their reasoning capabilities are limited.
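For context (this arithmetic is not spelled out in the video, but it is the standard result for the puzzle): an n-disk Tower of Hanoi requires at least 2^n − 1 moves, so a fully enumerated solution grows exponentially with the disk count. A minimal Python sketch of that growth:

```python
# Minimum number of moves for an n-disk Tower of Hanoi is 2**n - 1,
# so listing every move quickly exceeds any practical output length.
for n in (3, 10, 15, 20):
    print(f"{n} disks -> {2**n - 1:,} moves")
```

With 20 disks that is already over a million moves, which makes it clear why writing out the full move sequence stops being feasible within a model’s output budget.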

The presenter critically examines the methodology of the Apple paper, pointing out that many of the puzzles used, like the Tower of Hanoi, are well-known and likely present in the training data of these models. This raises questions about whether the models are genuinely reasoning or simply recalling solutions from memory. Additionally, experiments show that when models are asked to generate solutions for highly complex problems, they often decide that the problem is too difficult and give up, rather than reasoning through all the steps. This behavior is interpreted as the models recognizing their own limitations rather than lacking reasoning ability.

Counterpoints are discussed, especially from critics like Sean Goudie, who argue that the tests used, such as puzzles and mathematical problems, may not be good indicators of reasoning. They suggest that models might already know the algorithms or solutions from training data, making the tests less meaningful. For example, when asked to solve the Tower of Hanoi with many disks, models often recognize that enumerating every move (2^n − 1 moves for n disks) is impractically long and instead look for shortcuts, such as describing the algorithm rather than listing each step. This behavior is read as a sign of practical reasoning rather than a failure.

In conclusion, the video questions whether the limitations observed in these models truly reflect an absence of reasoning or are simply a result of the models’ design and training data. The presenter emphasizes that models like GPT can generate solutions to complex problems, such as coding a Tower of Hanoi solver, which contradicts the idea that they only guess. Ultimately, the discussion raises broader questions about what reasoning and thinking mean in AI, whether current models imitate human reasoning effectively, and if Apple’s critique is a sign of progress or stagnation in AI development.
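To make the presenter’s point concrete, here is a sketch of the kind of program being referred to: a standard recursive Tower of Hanoi solver (generic textbook code, not code shown in the video).

```python
def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Collect the move list that transfers n disks from source to target."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)  # park the top n-1 disks on the spare peg
    moves.append((source, target))              # move the largest disk to the target
    hanoi(n - 1, spare, target, source, moves)  # stack the n-1 disks back on top of it
    return moves

print(len(hanoi(10)))  # 1023, i.e. 2**10 - 1 moves
```

A model that outputs a program like this is, in effect, compressing the exponentially long answer into a constant-size description, which is exactly the behavior the counterpoint interprets as practical reasoning rather than failure.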