This Test Was Built to Block AI — GPT-5 Finally Passed It

The video explains how Poetic’s GPT-5-based system has surpassed human-level performance on the challenging ARC AGI 2 benchmark by employing advanced reasoning strategies and a managerial layer that orchestrates problem-solving. While this marks a major leap in AI capabilities, true artificial general intelligence remains out of reach, as current models still lack autonomous adaptability and exploration.

The video discusses a significant milestone in AI development: Poetic’s system, built on GPT-5, has surpassed human-level performance on the ARC AGI 2 benchmark. The benchmark is designed to test abstract reasoning, generalization, pattern discovery, and compositional reasoning, skills that go beyond memorized knowledge or dataset familiarity. Large language models (LLMs) have traditionally struggled to exceed the average human score of about 60% on this test, but Poetic’s version of GPT-5 achieved around 75%, a notable leap in AI capabilities.
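
To make the kind of problem concrete, here is a toy, made-up task in the spirit of ARC, written in Python. The public ARC task files use a similar train/test structure of small integer grids, but this particular puzzle and the code around it are purely illustrative and come neither from the benchmark nor from the video.

```python
# A toy ARC-style task: a few input/output grid pairs demonstrate a hidden rule,
# and the solver must infer the rule and apply it to a new input.
# This puzzle (reverse each row) is invented for illustration only.
task = {
    "train": [
        {"input": [[1, 0, 0]], "output": [[0, 0, 1]]},
        {"input": [[2, 3, 0]], "output": [[0, 3, 2]]},
    ],
    "test": [
        {"input": [[4, 0, 5]]},  # a solver should produce [[5, 0, 4]]
    ],
}

def apply_rule(grid: list[list[int]]) -> list[list[int]]:
    """The hidden transformation a solver would have to discover: reverse each row."""
    return [list(reversed(row)) for row in grid]

# Check that the inferred rule explains every training pair.
for pair in task["train"]:
    assert apply_rule(pair["input"]) == pair["output"]

print(apply_rule(task["test"][0]["input"]))  # [[5, 0, 4]]
```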

A key concept explored in the video is “unhobbling,” which refers to removing artificial limitations from AI models to unlock their full potential. The idea, highlighted in Leopold Aschenbrenner’s 2024 essay series Situational Awareness, is that many AI models are held back by constraints on how they are allowed to approach and solve problems. For example, earlier LLMs would answer complex questions in a single step, whereas humans typically work through problems methodically. Techniques like chain-of-thought prompting and system-level scaffolding have allowed models to reason more like humans, leading to substantial performance improvements.
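
As a rough illustration of the difference, the sketch below contrasts a single-shot prompt with a simple two-step chain-of-thought pattern. The `call_llm` helper is a hypothetical placeholder for whatever completion API is in use; none of these names come from Poetic’s system or the video.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; wire this up to the completion API you use."""
    raise NotImplementedError

def answer_single_shot(question: str) -> str:
    # The "hobbled" pattern: demand the final answer in one pass.
    return call_llm(f"Answer this question directly:\n{question}")

def answer_with_chain_of_thought(question: str) -> str:
    # The "unhobbled" pattern: let the model reason step by step first,
    # then extract the final answer from its own working.
    reasoning = call_llm(
        f"Work through this problem step by step, showing intermediate reasoning:\n{question}"
    )
    return call_llm(f"Given this reasoning:\n{reasoning}\n\nState only the final answer.")
```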

Poetic’s approach involves adding a “manager” layer on top of the base LLM. This manager decides which model to use, how to break problems into steps, when to write code, and when to stop if a solution is good enough. This system-level intelligence enables more efficient and reliable problem-solving, avoiding the wastefulness of single-shot answers from large models. The video emphasizes that these gains are not just about bigger models, but about smarter systems that orchestrate the reasoning process.
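
The video does not show how Poetic built this layer, but the control flow it describes can be sketched roughly as follows. Every name here (`decompose`, `solve_step`, `score`, the model labels, the `good_enough` threshold) is hypothetical and stands in for components such a manager would coordinate.

```python
from dataclasses import dataclass

@dataclass
class Task:
    description: str

def decompose(task: Task) -> list[Task]:
    """Break the problem into smaller steps (this could itself be an LLM call)."""
    return [task]  # placeholder: treat the whole problem as a single step

def solve_step(step: Task, model: str) -> str:
    """Ask the chosen model to solve one step, possibly by writing and running code."""
    raise NotImplementedError

def score(candidate: str) -> float:
    """Estimate solution quality (a verifier, unit tests, self-consistency checks, ...)."""
    raise NotImplementedError

def manage(task: Task, models=("cheap-model", "strong-model"), good_enough=0.9) -> str:
    best, best_score = "", 0.0
    for model in models:  # escalate to a stronger model only if needed
        parts = [solve_step(step, model) for step in decompose(task)]
        candidate = "\n".join(parts)  # combine the per-step solutions
        quality = score(candidate)
        if quality > best_score:
            best, best_score = candidate, quality
        if best_score >= good_enough:  # stop as soon as the answer is good enough
            return best
    return best
```

The appeal of a loop like this is economic as much as cognitive: cheaper models and early stopping handle the easy cases, so the expensive model is consulted only when the current answer is not yet good enough.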

The creator of the ARC AGI benchmark, François Chollet, is quoted clarifying that passing ARC AGI 2 does not mean true artificial general intelligence (AGI) has been achieved. The benchmark is a minimal test of fluid intelligence, requiring models to reason through novel problems rather than rely on memorized patterns. While current AI systems can now match or exceed average human performance on these tests, they still lack the ability to autonomously explore, model, and adapt to entirely new environments, capabilities that future benchmarks such as ARC AGI 3 and beyond will aim to test.

In conclusion, the video highlights that AI progress is accelerating faster than many realize, largely due to “unhobbling” strategies that unlock latent capabilities in existing models. While reaching human-level performance on ARC AGI 2 is impressive, true AGI will require further advances in autonomous reasoning and adaptability. The field is rapidly evolving, and future benchmarks will continue to push the boundaries of what AI systems can achieve.