The video reviews Gemini 3.1 Pro’s release and explains how AI model performance now varies widely across domains due to specialized post-training, making traditional benchmarks less universally meaningful. It highlights that while models like Gemini 3.1 Pro are reaching or surpassing average human performance in some areas, issues like hallucinations persist, and future AI evaluation will likely focus more on speed, realism, and user-specific benchmarks.
The video discusses the release of Gemini 3.1 Pro, a new AI model, and explores why opinions about the “best” AI model are so contradictory across platforms. The creator explains that the confusion stems from how AI models are now trained: while pre-training on internet-scale data is still important, most of the compute is now spent on post-training, where models are fine-tuned on specific domains using internal benchmarks and industry data. This means that a model’s performance can vary greatly depending on the domain, making traditional benchmarks less universally meaningful than before.
The presenter illustrates this shift with examples such as chess benchmarks and coding tests, showing that models like Claude Opus 4.6 and Gemini 3.1 Pro each outperform the other in different areas. For instance, Gemini 3.1 Pro excels in certain reasoning and pattern-recognition benchmarks but may lag behind in broader expert-task evaluations such as GDPval. The video also highlights how, even within a single benchmark, small changes in question format can significantly affect model performance, revealing that models often exploit shortcuts rather than demonstrating true understanding.
A significant milestone is noted: on the creator's private SimpleBench test, Gemini 3.1 Pro achieved a score close to the average human baseline in English text-based reasoning. This suggests that, at least in some domains, current frontier models are now matching or surpassing average human performance. However, the video cautions that models are still prone to taking shortcuts and that performance can drop when questions are made more open-ended, though the overall trend is one of clear improvement.
The issue of hallucinations, where models confidently provide incorrect information, is addressed: Gemini 3.1 Pro performs well overall but still hallucinates in about half of its incorrect answers. The presenter notes that hallucinations remain an unsolved problem and that model providers are less eager to highlight this metric. The discussion also touches on the limitations of current benchmarks, the potential for gaming prediction markets with AI agents, and the challenges of creating truly objective measures of general intelligence.
Finally, the video looks ahead to the future of AI evaluation, suggesting that speed, realism, and user-specific benchmarks will become increasingly important. The presenter demonstrates the impressive speed of Gemini 3.1 Pro and compares advances in video generation models, noting clear improvements in realism. The video concludes by emphasizing the ongoing debate about how to best measure general intelligence in AI and invites viewers to reflect on the implications of these rapid advancements.