OpenAI’s o3 Pro: Crushing The AI Game Test! 🎮

The video showcases OpenAI’s o3 Pro outperforming other advanced AI models in classic games like Tetris, Super Mario, and Sokoban by demonstrating superior planning, strategic thinking, and problem-solving abilities. It highlights that games offer a richer benchmark for AI evaluation, revealing emerging capabilities such as transferable spatial reasoning and adaptability beyond traditional tests.

The video explores the performance of various advanced AI models, including OpenAI’s o3 Pro, in classic games like Tetris, Super Mario, Sokoban, and Candy Crush, moving beyond traditional AI benchmarks. While models such as Llama 4 and DeepSeek R1 show some initial promise, they struggle to consistently clear lines or complete levels, often collapsing or failing after a short time. The scoring system rewards each piece placed before the game is lost, which makes clear that many models merely delay failure rather than achieve genuine success. OpenAI’s o3 Pro, however, stands out by demonstrating genuine planning and strategic thinking, clearing multiple lines in Tetris and maintaining its performance throughout the tests.
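As a rough sketch of how that kind of survival-weighted scoring might look (the video does not give the exact formula, so the function name and weights below are assumptions for illustration only):

```python
def score_tetris_run(pieces_placed: int, lines_cleared: int) -> int:
    # Hypothetical scoring: each piece placed before game over earns a small
    # reward, so a model can accumulate points just by surviving, while
    # clearing lines is weighted far more heavily and separates real play
    # from mere stalling.
    return pieces_placed * 1 + lines_cleared * 10
```

Under a rule like this, a model that stacks dozens of pieces without ever clearing a line still scores something, which is exactly why the video stresses that many models only delay failure.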

In Super Mario, performance varies widely. GPT-4o underperforms, while Claude 3.5 shows some clever moves, such as finding hidden blocks, but ultimately fails by falling into pits. Claude 3.7 performs better, navigating obstacles and enemies with more skill and even showing moments that resemble human gameplay: making impressive runs and then unexpectedly fumbling simple tasks. Despite these ups and downs, o3 Pro again proves superior, often outperforming the other models by a significant margin and progressing through levels more consistently.

Sokoban, a logic-based puzzle game requiring spatial reasoning and planning, further highlights the differences between AI models. While Gemini 2.5 manages to complete the first level, it quickly falters on subsequent ones due to poor planning. In contrast, OpenAI’s o3 demonstrates advanced foresight by avoiding common traps and successfully pushing boxes in the correct order, solving multiple levels before eventually stalling. The o3 Pro version improves even further, completing all six tested levels and showcasing a leap in problem-solving capabilities.

The video also discusses the challenges these models face, particularly their slow decision-making, which stems from these tasks falling outside the models’ original design scope. The researchers built a textual game harness that feeds game states to the AIs step by step and asks them what to do next; this approach also lets the models engage with more complex games such as Ace Attorney. Despite the slow pace, the method reveals important insights into AI capabilities, including emerging strategic planning and adaptability that traditional benchmarks fail to capture.
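In spirit, that harness can be pictured as a simple loop: render the game state as text, ask the model for a move, apply it, and repeat. The sketch below is only an illustration under assumed names (`game`, `query_model`, and the prompt wording are hypothetical, not the researchers’ actual code):

```python
def query_model(prompt: str) -> str:
    """Placeholder for a call to a language-model API; returns the model's reply."""
    raise NotImplementedError  # wire up your LLM client of choice here

def play(game, max_steps: int = 200) -> int:
    """Drive one game session by feeding textual states to the model step by step."""
    steps = 0
    while not game.is_over() and steps < max_steps:
        state_text = game.render_as_text()   # serialize the board/level as text
        legal = game.legal_moves()           # e.g. ["left", "right", "rotate", "drop"]
        prompt = (
            "You are playing a grid-based game.\n"
            f"Current state:\n{state_text}\n"
            f"Legal moves: {', '.join(legal)}\n"
            "Reply with exactly one move."
        )
        move = query_model(prompt).strip().lower()
        if move in legal:                    # ignore malformed replies instead of crashing
            game.apply(move)
        steps += 1
    return game.score()
```

Because every step requires a full model call, runs are slow, which matches the sluggish decision-making the video describes.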

Three key lessons emerge from the experiments: first, large AI models are beginning to exhibit genuine planning and strategic thinking; second, games provide a richer, more challenging environment for evaluating AI than conventional benchmarks; and third, training on one game like Sokoban can improve performance on a different, unseen game like Tetris by up to 8%, indicating transferable spatial reasoning skills. These findings suggest a significant step forward in AI intelligence, with OpenAI’s o3 Pro exemplifying these advances by successfully completing complex game levels and demonstrating a new level of AI capability.