Top AI Coding Assistants with Gemini 2.5 Pro... Wild variance

The video evaluates a range of AI coding assistants paired with Gemini 2.5 Pro, revealing large run-to-run performance variance and underscoring how much model-specific tuning matters for consistent, high-quality code generation and bug fixing. Top performers such as Z AI, Cursor, and Windsurf showed strong synergy with Gemini 2.5 Pro, while others struggled with reliability and task completion, suggesting the model only reaches its potential when the assistant is tuned for it.

The video presents an in-depth evaluation of various AI coding assistants tested with the Gemini 2.5 Pro model, focusing primarily on their ability to follow detailed instructions, fix bugs, and generate functional code. Because performance varied significantly between runs, the testing process required multiple runs per assistant to establish reliable baselines. Scoring combined static code analysis, unit testing, and an LLM judge (Claude 4 Sonnet), whose own scores typically varied by less than 5% across repeated gradings. The presenter also introduced bug-fixing tasks on existing codebases to assess the assistants' editing capabilities.
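
The presenter's exact grading harness is not shown, but the LLM-judge step can be pictured as a minimal sketch along these lines, using the Anthropic Python SDK. The rubric wording, the judge_score helper, and the model identifier are illustrative assumptions rather than the video's actual setup.

```python
import re
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

RUBRIC = (
    "You are grading code produced by an AI coding assistant. "
    "Score it from 0 to 100 for correctness, completeness, and adherence "
    "to the task instructions. Reply with the number only."
)

def judge_score(task: str, code: str, model: str = "claude-sonnet-4-20250514") -> int:
    """Ask the judge model for a 0-100 score and parse the first integer it returns."""
    reply = client.messages.create(
        model=model,  # assumed model id; substitute whichever judge model you use
        max_tokens=16,
        system=RUBRIC,
        messages=[{"role": "user", "content": f"Task:\n{task}\n\nSubmitted code:\n{code}"}],
    )
    match = re.search(r"\d+", reply.content[0].text)
    return int(match.group()) if match else 0
```

Running the same judge prompt several times and checking that the returned scores stay within a few points of each other is one way to verify the sub-5% judge variance mentioned above.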

Among the AI coding assistants tested, Windsurf ranked third but showed highly inconsistent results, with scores swinging widely across runs. Cursor secured second place with a significant jump in points over Windsurf, indicating strong tuning for Gemini 2.5 Pro. The top performer was Z AI, which delivered consistently high scores, though still below the Claude 4 Sonnet benchmark. Other assistants such as Kilo Code, Roo Code, and Cline performed similarly, with differences falling within the margin of error. Aider stood out for its cost-effectiveness and solid performance, despite some limitations in self-testing capabilities.

The video highlighted the wide variance in performance across runs for many assistants, which made it difficult to judge their true effectiveness from any single run. Some, like Windsurf, exhibited extreme fluctuations, while others, like Cursor, maintained more stable scores. Several assistants, including OpenCode and Void, struggled with issues such as infinite loops or failure to complete tasks, rendering them unreliable with Gemini 2.5 Pro at the time of testing. The presenter emphasized the importance of tuning AI coding assistants specifically for the model they are paired with, as this greatly affects their performance and consistency.
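
To make the run-to-run variance concrete, a small self-contained sketch like the one below could aggregate repeated scores per assistant and flag the unstable ones; the numbers and the stability threshold are purely illustrative, not figures from the video.

```python
import statistics

def summarize_runs(name: str, scores: list[float]) -> str:
    """Summarize repeated benchmark runs for one assistant (scores on a 0-100 scale)."""
    mean = statistics.mean(scores)
    spread = max(scores) - min(scores)
    stdev = statistics.stdev(scores)
    # Arbitrary threshold: a range wider than 15 points suggests a single
    # run is not enough to rank this assistant reliably.
    flag = "unstable" if spread > 15 else "stable"
    return f"{name}: mean={mean:.1f} stdev={stdev:.1f} range={spread:.1f} ({flag})"

# Illustrative run scores only, not results from the video.
print(summarize_runs("assistant_a", [62.0, 84.5, 70.0]))  # wide swings between runs
print(summarize_runs("assistant_b", [78.0, 80.5, 79.0]))  # tight cluster
```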

A key insight from the testing was that some assistants excelled by performing self-testing and iterative improvements on their code, which correlated with higher scores. Conversely, others were described as “lazy,” often producing incomplete or stubbed-out code that was difficult to evaluate. The presenter also noted that temperature adjustments in Gemini 2.5 Pro had less impact on performance than in previous versions. Overall, the results suggest that while Gemini 2.5 Pro is powerful, its effectiveness heavily depends on the AI coding assistant’s integration and tuning.
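
For readers who want to probe the temperature observation themselves, a minimal sketch with the google-genai Python SDK might look like the following; the prompt and temperature values are arbitrary examples and not the presenter's test setup.

```python
from google import genai
from google.genai import types

client = genai.Client()  # picks up the API key (e.g. GEMINI_API_KEY) from the environment

# Send the same prompt at two temperatures; per the video, Gemini 2.5 Pro
# is less sensitive to this setting than earlier versions were.
for temperature in (0.0, 1.0):
    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents="Write a Python function that parses ISO 8601 date strings.",
        config=types.GenerateContentConfig(temperature=temperature),
    )
    print(f"--- temperature={temperature} ---")
    print(response.text)
```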

In conclusion, the presenter expressed enthusiasm for Gemini 2.5 Pro, especially for large refactoring tasks where its large context window and instruction-following capabilities shine. However, the significant variance in results and the dependency on the coding assistant’s tuning make it a challenging model to evaluate and use consistently. The top-performing assistants—Cursor, Windsurf, and Z AI—demonstrated strong synergy with Gemini 2.5 Pro, while others lagged behind or failed to complete tasks. The video ends with an invitation for viewers to share their thoughts and encourages subscriptions for more content on AI coding tools.