Opus 4.8 Scored 81. Your Workflow Doesn't Care

artesia · 3 June 2026 14:00

The video critiques Anthropic’s Opus 4.8 as an incremental update rather than a breakthrough, highlighting its inconsistencies and its less mature product infrastructure compared with OpenAI’s more reliable and efficient offerings, such as GPT-5.5 with Codex. It argues that effective AI use depends more on robust harnesses and workflow design than on model capability alone, and urges organizations to focus on system integration and flexibility amid ongoing competition in the AI landscape.

artesia · 3 June 2026 14:21

The video challenges the common narrative around Anthropic’s Opus 4.8 release, arguing that it is not the groundbreaking model many expected, especially next to the much-anticipated Mythos model. The release was largely timed to coincide with a funding announcement, serving as a checkpoint to demonstrate ongoing progress rather than a leap forward in AI capabilities. While Opus 4.8 handles longer tasks and stays on task better than its predecessor, Opus 4.7, it falls short of being a reliable daily driver because of inconsistencies, particularly when scaling reasoning effort: higher reasoning modes sometimes produce worse results.

A key issue with Opus 4.8 is its tendency to “overthink,” especially in max reasoning mode, where it excessively focuses on alignment and constitutional constraints, which can reduce its effectiveness and predictability. This overthinking is linked to Anthropic’s philosophical approach to AI alignment, which, while admirable, can hinder practical performance. In contrast, OpenAI’s models, particularly GPT-5.5 with Codex, demonstrate more predictable improvements when scaling reasoning effort, making them more dependable for complex, long-running tasks.

The video highlights the critical role of “harnesses”: the surrounding product infrastructure that enables models to perform effectively in real-world tasks. Despite Opus 4.8’s strengths in writing and front-end design, its harness (Claude Code) is less mature than OpenAI’s Codex harness, which offers superior ergonomics, multitasking, and integration capabilities. For example, in practical tests involving building websites, GPT-5.5 with Codex outperformed Opus 4.8 by completing tasks faster, handling multiple threads simultaneously, and better managing file access and automation.

An important innovation in Opus 4.8 is the introduction of the /workflows command in Claude Code, which allows dynamic composition and transparent management of multi-agent workflows. This feature is seen as a promising direction for agent-based productivity tools in 2026, though it also underscores the complexity of managing agentic pipelines at scale. The video stresses the need for organizations to design agentic workflows that minimize human bottlenecks and enable agents to autonomously handle tasks like code reviews and production monitoring, with humans overseeing rather than micromanaging.
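
To make the "oversee, don't micromanage" idea concrete, here is a minimal sketch in Python. It is not the /workflows command or any real Claude Code or Codex API; the `Result` type, the `run_pipeline` function, the stub agents, and the confidence threshold are all hypothetical names invented for illustration. The only point it shows is the routing pattern: agents handle every task, and a human sees only the results that fall below a confidence bar.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    task: str
    output: str
    confidence: float  # hypothetical self-reported score in [0, 1]


# An "agent" here is any callable that takes a task description and
# returns a Result. In practice this would wrap a model call.
Agent = Callable[[str], Result]


def run_pipeline(tasks, agents, threshold=0.8):
    """Send each (kind, task) pair to its agent.

    Results at or above the threshold are accepted automatically;
    the rest are queued for human review.
    """
    done, needs_review = [], []
    for kind, task in tasks:
        result = agents[kind](task)
        if result.confidence >= threshold:
            done.append(result)
        else:
            needs_review.append(result)
    return done, needs_review


# Stub agents standing in for real model calls (assumed behavior).
def review_agent(task: str) -> Result:
    return Result(task, "approved", 0.95)


def monitor_agent(task: str) -> Result:
    return Result(task, "anomaly detected", 0.55)


tasks = [
    ("code_review", "PR: fix login redirect"),
    ("monitoring", "check error rate"),
]
agents = {"code_review": review_agent, "monitoring": monitor_agent}

done, needs_review = run_pipeline(tasks, agents)
```

In this sketch the code review is accepted without human input, while the monitoring result lands in `needs_review`. Where the threshold sits, and whether a self-reported confidence score is trustworthy at all, are design choices the sketch leaves open.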

In conclusion, the video advises users and organizations to prioritize the harness and overall system design over simply choosing the “best” model. While Opus 4.8 is strong in certain areas, its unpredictability and less mature harness limit its utility as a daily driver compared with OpenAI’s offerings. The video describes the 2026 AI landscape as a two-horse race in which flexibility in tooling and model choice is crucial. It encourages its audience to explore detailed guides and tests on Substack to determine the best fit for their specific needs, and to prepare for the arrival of even more powerful open-source models later in the year.