Contrary to the prevailing belief that larger models are superior, tiny recursive models like the Hierarchical Reasoning Model (HRM) and the Tiny Recursion Model (TRM) have outperformed much bigger language models on the ARC AGI benchmark by iteratively refining their answers through multi-step reasoning. These models demonstrate that efficient, small-scale architectures with recursive thinking can achieve better logical reasoning and generalization, challenging the notion that scaling up is the only path to Artificial General Intelligence.
The prevailing belief in AI development has been that scaling up models—making them bigger and more complex—is the key to achieving Artificial General Intelligence (AGI). Massive investments have been poured into building language models with billions or even trillions of parameters, like GPT-4, in hopes of reaching this milestone. However, despite their impressive capabilities, these large language models (LLMs) still struggle with certain logical puzzles, such as those posed by the ARC AGI benchmark, which tests pattern recognition and reasoning in a highly structured domain. Challenging this scaling-centric approach, a new line of research has emerged focusing on smaller models that use recursive reasoning to iteratively refine their answers until confident, rather than attempting to solve problems in a single forward pass.
This new approach began with the Hierarchical Reasoning Model (HRM), a surprisingly small model with only 27 million parameters—about 35,000 times smaller than GPT-4. Despite its size, HRM achieved a remarkable 32% score on the ARC AGI benchmark and solved Sudoku puzzles beyond the reach of current large language models. HRM operates with two transformer networks running at different speeds: a fast low-level network that makes small iterative updates to a latent “scratch pad,” and a slower high-level network that makes strategic adjustments and decides when to halt the reasoning process. This design mimics the brain’s multi-timescale processing and allows the model to allocate computational effort dynamically, focusing more on the complex parts of a problem.
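The two-timescale loop described above can be sketched in a few lines. This is a toy stand-in, not the authors’ code: the states are plain floats rather than latent tensors, the update functions (`low_level_step`, `high_level_step`) are invented placeholders for the two transformer networks, and the learned halting head is replaced by a simple convergence check.

```python
# Toy sketch of HRM-style two-timescale recursion. All update rules
# and names here are illustrative placeholders, not HRM's actual code.

def low_level_step(z_low, z_high, x):
    """Fast module: small iterative refinement of the latent scratch pad."""
    return 0.5 * z_low + 0.25 * z_high + 0.25 * x

def high_level_step(z_high, z_low):
    """Slow module: strategic update after a burst of fast steps."""
    return 0.5 * z_high + 0.5 * z_low

def hrm_forward(x, n_fast=4, max_cycles=3, halt_threshold=1e-3):
    z_low, z_high = 0.0, 0.0
    for cycle in range(max_cycles):
        for _ in range(n_fast):              # fast inner loop
            z_low = low_level_step(z_low, z_high, x)
        new_high = high_level_step(z_high, z_low)
        # Stand-in for learned halting: stop when the slow state settles.
        if abs(new_high - z_high) < halt_threshold:
            return new_high, cycle + 1
        z_high = new_high
    return z_high, max_cycles

answer, cycles_used = hrm_forward(1.0)
```

The key structural idea survives even in this caricature: the slow state changes only once per burst of fast updates, and the number of bursts spent on an input is decided at run time, so harder inputs can receive more computation.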
While HRM’s results were impressive, some researchers criticized its reliance on heuristics and biological analogies that were not rigorously justified. Its training procedure assumed that the fast inner loop settled into a stable equilibrium, which often did not happen in practice, so the gradient estimates built on that fixed-point assumption could be inaccurate. To address these issues, a follow-up model called the Tiny Recursion Model (TRM) was developed. TRM discards the equilibrium assumption and instead backpropagates through the recursive steps it actually performs, refining its answer over a fixed number of iterations. TRM is even smaller than HRM, with only 7 million parameters, yet it outperformed HRM, scoring 40% on ARC AGI 1 and 6.2% on ARC AGI 2 and surpassing several much larger models, including Gemini 2.5 Pro.
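Why the equilibrium assumption matters can be seen with a toy scalar recursion, z ← a·z + b, which converges to z* = b/(1 − a) when |a| < 1. This analogy is ours, not either paper’s code: it just contrasts the gradient obtained by backpropagating through the steps actually run (TRM-style) with the implicit gradient taken at the assumed fixed point (HRM-style), which is biased whenever the recursion has not yet converged.

```python
# Toy scalar recursion z <- a*z + b, converging to z* = b / (1 - a).
# Compare two estimates of the gradient d z / d b.

def unrolled_grad(a, n):
    """Backprop through the n steps actually run (TRM-style):
    d z_n / d b = 1 + a + ... + a^(n-1)."""
    return sum(a ** k for k in range(n))

def equilibrium_grad(a):
    """Implicit gradient at the assumed fixed point (HRM-style):
    d z* / d b = 1 / (1 - a)."""
    return 1.0 / (1.0 - a)

a, n = 0.5, 3
print(unrolled_grad(a, n))    # 1.75
print(equilibrium_grad(a))    # 2.0
```

For a short recursion (n = 3) the two disagree (1.75 vs. 2.0); they only match in the limit of many steps. Training on the gradient of the steps actually taken, as TRM does, avoids this mismatch entirely.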
TRM’s design emphasizes a clear functional separation between a “scratch pad” for iterative thinking and an “answer placeholder,” avoiding interference between these two memory states. Interestingly, TRM performs better with fewer layers and more recursion, contrary to the trend in large language models where bigger and deeper networks generally perform better. This is likely because the training data for these logical tasks is limited, and smaller models are less prone to overfitting. The recursive process breaks down complex reasoning into manageable steps, allowing the model to generalize better despite its small size. Additionally, TRM adapts its architecture depending on the task complexity, using simpler MLPs for smaller puzzles like Sudoku and attention mechanisms for larger, more complex grids.
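The separation between scratch pad and answer placeholder can be sketched as a two-state loop. Again this is a hedged caricature: the states are floats rather than embeddings, both update functions are invented stand-ins for TRM’s single small network, and the step counts are arbitrary.

```python
# Toy sketch of TRM-style two-state recursion. Update rules and names
# are illustrative placeholders, not TRM's actual architecture.

def think(x, y, z):
    """Inner loop: refine the scratch pad z given input x and answer y."""
    return 0.5 * z + 0.3 * y + 0.2 * x

def revise(y, z):
    """Outer step: rewrite the answer y from the scratch pad z."""
    return 0.5 * y + 0.5 * z

def trm_forward(x, n_think=6, n_revise=3):
    y, z = 0.0, 0.0                  # answer placeholder, scratch pad
    for _ in range(n_revise):        # fixed number of improvement cycles
        for _ in range(n_think):     # iterative "thinking" on the scratch pad
            z = think(x, y, z)
        y = revise(y, z)             # commit a revised answer
    return y

y = trm_forward(1.0)
```

The point of the split is that the scratch pad is free to churn through intermediate reasoning without disturbing the current answer, which is only overwritten once per outer cycle.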
Overall, these recursive models challenge the dominant paradigm that bigger is always better in AI. They demonstrate that small, specialized models with iterative reasoning capabilities can outperform much larger language models on specific logical reasoning benchmarks. This opens up exciting possibilities for more efficient and interpretable AI systems that do not rely solely on brute-force scaling. The success of HRM and TRM suggests a promising new direction for AI research, focusing on recursive thinking and refinement rather than sheer size. The future may hold even more powerful recursive models that combine the best of both worlds: efficient reasoning and generalization without the massive computational costs of today’s largest LLMs.