The video reviews Anthropic’s Opus 4.5 model, which the reviewer names as their new favorite for coding tasks, praising its efficiency, reliability, and capabilities that surpass competitors like Gemini 3 Pro and GPT 5.1 in practical use. Despite some limitations and safety concerns, the model’s coding performance, token efficiency, and improved UI generation mark a significant advance, prompting the reviewer to reconsider Anthropic’s standing in the AI landscape.
The video presents a detailed review of Anthropic’s new model, Opus 4.5, which the creator, to their own surprise, names as their new favorite for coding tasks, ahead of other recent groundbreaking models like Gemini 3 Pro and GPT 5.1. Despite initial skepticism toward Anthropic, the reviewer praises Opus 4.5 for its exceptional coding capabilities, efficiency, and reliability, noting that it has significantly improved token utilization and reduced costs compared to previous versions. The model excels in developer use cases, although its writing quality outside coding remains less impressive. The reviewer also highlights the sponsor, Kilo Code, an open-source tool for integrating various language models into coding workflows and boosting productivity through orchestrated multi-model setups.
Benchmark results for Opus 4.5 are impressive, with the model achieving state-of-the-art scores in several coding and problem-solving tests, including agentic tool use and novel problem-solving challenges. It performs comparably to GPT 5.1 and slightly below Gemini 3 Pro on some intelligence indexes but stands out for its token efficiency, using significantly fewer tokens than its predecessor Sonnet 4.5 while maintaining or exceeding accuracy. However, the reviewer is skeptical of benchmarks as sole indicators of model quality, emphasizing real-world usage instead. Opus 4.5 also demonstrates improved image understanding and tool-calling capabilities, making it a versatile choice for complex coding tasks.
The video also critically examines Anthropic’s approach to benchmarking and safety claims, pointing out inconsistencies in how they handle novel problem-solving behaviors across different models. The reviewer shares results from their own open-source SnitchBench tests, which show that while Opus 4.5 exhibits fewer concerning behaviors than previous versions and competitors, safety assessments are nuanced and depend heavily on testing conditions. Despite some issues, such as occasional timeouts and weaker UI capabilities than some rival models, Opus 4.5 shows marked improvements in reliability and tool usage, often working around broken toolchains to complete tasks effectively.
User interface (UI) generation capabilities of Opus 4.5 are notably better than previous Anthropic models, with the reviewer impressed by the quality and readability of the UI code it produces. While other models like Gemini and GPT 5.1 have made significant strides in UI generation, Opus 4.5 has caught up, delivering tasteful designs and functional layouts. The reviewer contrasts this with the often problematic UI experiences on Anthropic’s own platform, praising third-party tools like T3 Chat for providing more stable and flexible environments. Despite some quirks, Opus 4.5’s ability to handle complex coding projects and generate usable UI components marks a significant step forward.
In conclusion, the reviewer ranks Opus 4.5 as their top choice for coding due to its consistency, reliability, and output quality, placing it above Gemini 3 Pro and Sonnet 4.5 in practical use despite its higher cost. They acknowledge ongoing issues and argue that Anthropic should open-source Claude Code to foster greater trust and collaboration. The video ends with a call for community feedback on whether Opus 4.5 represents a new era for Anthropic or a modest improvement, emphasizing the advances in coding performance and efficiency that have shifted the reviewer’s perspective from skeptic to admirer.