Best AI coding Agents with some interesting results | Sonnet 4.5, GLM 4.6, Grok Code, GPT 5 Codex

The video provides an in-depth evaluation of AI coding agents such as Sonnet 4.5, GLM 4.6, Grok Code, and GPT-5 Codex, focusing on their instruction-following ability in complex multi-file coding tasks and highlighting their strengths, quirks, and best use cases. The creator plans to keep refining the assessments with community input and will favor Warp, Claude Code, and Codex CLI for their own work, emphasizing how fast the AI coding tool landscape is evolving.

The video discusses the rapidly growing landscape of AI coding agents, highlighting the challenge of keeping track of the many options available. The creator mentions their plan to involve the community through monthly polls to identify which agents are most relevant, as interest shifts frequently. They emphasize that their evaluation focuses primarily on instruction-following ability—how well an agent can complete a coding task based on a detailed specification and test suite—rather than communication style or interaction quality. The tests involve complex multi-file projects, often requiring edits across 8 to 20 files, to better simulate real-world coding scenarios.
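
The creator's actual benchmark is not published in the summary, but a minimal sketch of that kind of instruction-following harness might look like the following. The CLI invocation, file names (`SPEC.md`, `sample_project`), and pass-rate scoring are illustrative assumptions, not the video's real setup.

```python
import re
import subprocess
from pathlib import Path


def run_agent(agent_cmd: list[str], spec_path: Path, workdir: Path) -> None:
    """Hand the written spec to the agent under test and let it edit files in workdir."""
    spec = spec_path.read_text()
    # Assumes the agent CLI accepts the task text as a trailing argument; adjust per tool.
    subprocess.run([*agent_cmd, spec], cwd=workdir, check=False)


def score_run(workdir: Path) -> float:
    """Score a run as the fraction of the project's test suite that passes afterwards."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q", "--tb=no"],
        cwd=workdir, capture_output=True, text=True,
    )
    # pytest ends its output with a summary such as "12 passed, 3 failed in 0.42s"
    passed = sum(int(n) for n in re.findall(r"(\d+) passed", result.stdout))
    failed = sum(int(n) for n in re.findall(r"(\d+) (?:failed|error)", result.stdout))
    total = passed + failed
    return passed / total if total else 0.0


if __name__ == "__main__":
    project = Path("sample_project")  # hypothetical multi-file project under test
    run_agent(["some-agent-cli", "--task"], project / "SPEC.md", project)
    print(f"instruction-following score: {score_run(project):.0%}")
```

A harness along these lines would reward agents for completing the spec as written across many files, which matches the summary's emphasis on instruction following over interaction quality.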

Several new models were introduced during the testing period, including GLM 4.6 and Sonnet 4.5, which required adjustments to the testing framework. The video also covers GPT-5 Codex and other agents, noting that while GPT-5 Codex shows promise, it has some inconsistencies and is best suited to specific ecosystems like Factory CLI (Droid) or Codex environments. The speaker points out that some agents, like Zed, tend to over-deliver by generating more than requested, which can be either a strength or a drawback depending on user preference. They also mention that reasoning or "thinking" modes in models like Sonnet 4.5 did not significantly improve performance in their tests, which focus on straightforward instruction following rather than complex planning or debugging.

Claude Sonnet 4.5 scored very well, though it showed some quirks, such as oscillating between solutions during iterative feedback. Despite this, its overall performance was strong, and some users report it feels faster, possibly due to parallel tool calls. Warp.dev, which integrates well with Claude models, also performed excellently and is considered a top choice. GLM 4.6 impressed as a cost-effective option, especially for students, and worked well with the Roo Code and Crush agents. Grok Code Fast remains popular due to its free availability, though it has some tool-calling issues that affect reliability.

The video also compares how various agents perform when driving Grok Code Fast, noting that Cursor, GitHub Copilot, and Roo Code provide solid experiences with this model. The speaker highlights that score differences among top agents often come down to how much they self-iterate and refine their output rather than raw capability. They caution that some newer agents, like Droid, while visually appealing and functional, do not yet offer significant advantages over established options. The speaker prefers open-source agents like Roo Code and OpenCode for their configurability and flexibility, especially if subscription services become less viable.

In conclusion, the creator plans to focus on Warp, Claude Code, and Codex CLI for their own work moving forward, appreciating the strengths and weaknesses of each. They acknowledge the difficulty of thoroughly testing every agent given time constraints and the fast pace of development in this space. The video serves as a snapshot of the current state of AI coding agents, offering insights into their relative strengths, quirks, and best use cases, while inviting community feedback and ongoing evaluation. The creator encourages viewers to subscribe and engage if they found the content helpful.