OpenAI FINALLY releases open weight models, but can they actually code?

merefield · 7 August 2025 03:15

The video reviews OpenAI’s newly released open-weight models, praising their large context windows and their speed, which comes from a mixture-of-experts architecture, but finds their coding capabilities lacking compared to established AI coding models like Qwen3 Coder and Claude. Despite potential for chat-based idea generation, the models struggle with reliability and completeness in practical coding tasks, making them currently unsuitable as primary coding assistants.

merefield · 7 August 2025 03:36

In this video, the creator discusses OpenAI’s recent release of two open-weight models: a 117-billion-parameter model (referred to as 120B) and a 21-billion-parameter model (referred to as 20B). The creator has tested multiple versions of these models locally and appreciates their speed and large context windows of up to 100,000 tokens. Despite these technical strengths, the creator finds the models underwhelming for agentic coding tasks and ranks their coding capabilities significantly below models like Qwen3 Coder and Claude, which remain superior for daily coding use.

The video delves into the technical aspects of these models, highlighting that they use a mixture-of-experts (MoE) architecture. Instead of activating all parameters for every token, a router selects a small subset of experts per token, so only a fraction of the weights is used at each step and tokens are processed faster. The creator finds this technology fascinating and praises the models’ speed, reporting over 150 tokens per second locally. However, despite reasonable quality in chat-based interactions and idea generation, the models fall short when integrated with existing AI coding tools, often producing broken or incomplete code.
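To make the routing idea concrete, here is a minimal toy sketch of top-k expert selection. It is not OpenAI's implementation: the router scores are hard-coded rather than produced by a learned layer, the "experts" are plain scalar functions, and the expert count and `top_k` value are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_scores, top_k):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    chosen = ranked[:top_k]
    # Renormalise the scores of the chosen experts so their weights sum to 1.
    weights = softmax([router_scores[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_layer(x, experts, router_scores, top_k=2):
    """Weighted sum of the selected experts' outputs; the rest are never called."""
    return sum(w * experts[i](x) for i, w in route_token(router_scores, top_k))


# Hypothetical setup: 8 toy experts, each scaling its input by a different factor.
experts = [lambda x, k=k: x * (k + 1) for k in range(8)]
# Hypothetical router output for one token (a real router computes this per token).
router_scores = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]

selected = route_token(router_scores, top_k=2)
active_fraction = len(selected) / len(experts)
y = moe_layer(1.0, experts, router_scores, top_k=2)
```

With these made-up numbers only 2 of the 8 experts run for the token, which is the basic reason an MoE model can be large in total parameters yet fast per token.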

A significant portion of the video focuses on the challenges faced when using these models through OpenRouter providers. The creator notes high variability in performance depending on the provider and temperature settings, with some providers like Groq running “hotter” (higher temperature) by default, which affects output quality. Additionally, frequent tool call errors and failures to complete tasks make the models unreliable for consistent coding assistance. The creator is skeptical that OpenRouter can overcome these issues, given how much quantization and inference setups differ between providers.
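For anyone wondering why a “hotter” default matters, here is a small sketch of how temperature reshapes the next-token distribution. The logits are invented for illustration and the temperature values are arbitrary; they are not the defaults of any provider mentioned in the video.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate tokens; index 0 is the "best" one.
logits = [4.0, 2.0, 1.0]

cool = softmax_with_temperature(logits, 0.2)     # sharply peaked on the top token
default = softmax_with_temperature(logits, 1.0)  # unscaled distribution
hot = softmax_with_temperature(logits, 1.5)      # flatter, more random sampling
```

A higher temperature moves probability away from the top token toward less likely ones. That can help brainstorming, but in code generation or tool calls a single unlikely token can break the output, which fits the variability described above.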

Through practical tests, including a personal portfolio project and a simple Flappy Bird game, the creator demonstrates the models’ limitations. The 20B model often produces white screens or incomplete HTML, while the 120B model struggles with missing data, styling issues, and API errors during edits. These experiences reinforce the creator’s view that these open-weight models are not yet suitable for daily coding workflows, especially when compared to more mature AI coding models that produce reliable, working output.

In conclusion, the creator appreciates the release of these fast, large-context open-weight models and acknowledges their potential for chat-based idea generation, but remains unconvinced of their practical utility for agentic coding tasks. The creator suggests that the models might be best used for experimentation or as supplementary tools rather than as primary coding assistants. The creator is open to revisiting them in the future but currently favors other solutions for daily coding, and encourages viewers to share their opinions and experiences in the comments.