The video highlights the impressive advancements of the local Qwen3 Coder 30B LLM, showcasing its superior coding capability, speed, and tool-calling accuracy relative to other local models and making it a strong contender against cloud-based solutions like OpenAI's Codex. Despite some minor limitations, the presenter demonstrates practical applications and optimizations that establish local LLMs as viable, powerful tools for developers seeking efficient, offline AI coding assistance.
The video provides an in-depth exploration of the current state of local large language models (LLMs) for coding, focusing primarily on Qwen3 Coder 30B A3B, a mixture-of-experts model that the presenter finds exceptionally powerful. The presenter begins by showcasing UI examples generated by different local models, noting that Qwen3 Coder at Q5 produces the best results by a significant margin compared to others like GPT OSS 20B and Devstral Small 2507. Early quantizations of Qwen3 Coder, such as Q4_0 and Q5_0, were disappointing, but improved quantization methods, especially the newer XL quants, have greatly enhanced performance and reliability.
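To give a feel for why the quantization level matters on consumer hardware, here is a rough back-of-the-envelope estimate of weight storage for a 30B-parameter model. The bits-per-weight figures are approximate averages (these schemes mix precisions across tensors), used purely for illustration rather than taken from any model card:

```python
# Approximate average bits-per-weight (bpw) for common GGUF quant levels.
# These are illustrative ballpark figures, not exact format specifications.
APPROX_BPW = {"Q4_0": 4.55, "Q5_0": 5.54, "Q8_0": 8.5, "F16": 16.0}

def weight_gb(n_params: float, bpw: float) -> float:
    """Weight storage in gigabytes: params * bits-per-weight / 8 bits per byte."""
    return n_params * bpw / 8 / 1e9

n_params = 30e9  # 30B parameters
for name, bpw in APPROX_BPW.items():
    print(f"{name:>5}: ~{weight_gb(n_params, bpw):.1f} GB")
```

At roughly 17 GB for a Q4-class quant versus 60 GB at F16, it is clear why the 4- and 5-bit quants are the practical choice for a single 32GB GPU.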
The presenter discusses the technical details of running these models locally, highlighting the importance of managing GPU resources effectively. Using an RTX 5090 with 32GB of VRAM, they explain how enabling flash attention and quantizing the KV cache (choosing quantized K-cache and V-cache types) significantly reduces video memory usage with minimal performance loss. Speed is a critical factor, and the presenter shares benchmark results showing Qwen3 Coder 30B reaching roughly 177 tokens per second with very low latency, outperforming Devstral Small and GPT OSS 20B in both speed and token generation volume.
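The KV-cache savings the presenter describes can be sketched with simple arithmetic. The architecture values below (layer count, KV heads, head dimension) are assumed for illustration, not taken from the Qwen3 Coder model card, and q8_0 is approximated at 8.5 bits per element:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: float) -> float:
    """KV cache size: K and V tensors (factor 2) per layer, per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# Illustrative architecture values (assumed, not from the model card).
layers, kv_heads, hdim, ctx = 48, 4, 128, 32768

fp16 = kv_cache_gb(layers, kv_heads, hdim, ctx, 2.0)     # default f16 cache
q8   = kv_cache_gb(layers, kv_heads, hdim, ctx, 1.0625)  # ~q8_0 (8.5 bits/elem)
print(f"f16 KV cache: {fp16:.2f} GB, q8_0: {q8:.2f} GB ({1 - q8/fp16:.0%} saved)")
```

Under these assumptions the cache shrinks from about 3.2 GB to about 1.7 GB at a 32K context, which is exactly the kind of headroom that lets the full model stay resident on a 32GB card.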
A significant portion of the video is dedicated to evaluating the models' tool-calling capabilities, which are essential for coding tasks. The presenter differentiates between prompt-based tool calls (used by Roo Code) and native function calls (used by OpenCode), noting that Qwen3 Coder excels in tool-calling accuracy and frequency, although it tends to be "tool call happy," sometimes calling more tools than expected. Devstral Small performs well in simpler scenarios but struggles with longer chains. GPT OSS 20B shows near-perfect structural accuracy but lower semantic accuracy, and it handles prompt-based tool calls poorly while performing better with native calls.
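The distinction between the two tool-calling styles can be sketched concretely. In the native style, the tool schema travels in the request's `tools` field and the server returns structured tool calls; in the prompt-based style, the client describes the tools in the system prompt and parses invocations out of plain model text. The tool name, model name, and tag format below are illustrative assumptions, not the actual schemas used by Roo Code or OpenCode:

```python
import re

# One hypothetical tool definition in OpenAI-compatible function-call format.
READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from the workspace",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}

# Native tool calling: the schema is part of the structured request, and the
# server is responsible for emitting well-formed `tool_calls` in its reply.
native_request = {
    "model": "qwen3-coder-30b",  # illustrative model identifier
    "messages": [{"role": "user", "content": "Open src/main.py"}],
    "tools": [READ_FILE_TOOL],
}

# Prompt-based tool calling: the client parses invocations out of raw text,
# e.g. XML-style tags, so malformed output breaks the call entirely.
response_text = "<read_file><path>src/main.py</path></read_file>"
match = re.search(r"<read_file><path>(.*?)</path></read_file>", response_text)
parsed_path = match.group(1) if match else None
print(parsed_path)  # src/main.py
```

This also suggests why a model can score differently on the two styles: native calling offloads formatting to the inference server, while prompt-based calling depends entirely on the model reproducing the expected text format.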
The presenter also shares insights into customizing LM Studio settings to optimize model performance, such as overriding templates to fix errors and tuning sampling parameters like temperature, top-k, and repeat penalty for better results. They reveal a custom-built tool for stress-testing tool-call reliability across hundreds of chained calls, which yields useful data on model behavior and accuracy. The evaluation scores for Qwen3 Coder and GPT OSS 20B are particularly impressive, reaching levels comparable to earlier versions of OpenAI's Codex models, suggesting that local models are rapidly closing the gap with cloud-based solutions.
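The presenter's stress-test tool is not shown in detail, but the core idea of scoring chained tool calls on two axes (structural accuracy: does the call parse with the right shape; semantic accuracy: is it the right tool with the right arguments) can be sketched as follows. The scoring scheme and call format here are assumptions, not the presenter's actual implementation:

```python
import json

def score_chain(expected: list, actual: list) -> tuple:
    """Score a chain of tool calls against an expected sequence.

    Structural: the emitted call parses as JSON with 'name'/'args' keys.
    Semantic: the tool name and arguments match the expected step exactly.
    Returns (structural_accuracy, semantic_accuracy) over the expected length.
    """
    structural = semantic = 0
    for i, exp in enumerate(expected):
        if i >= len(actual):
            break  # model stopped calling tools early
        try:
            call = json.loads(actual[i])
        except json.JSONDecodeError:
            continue  # structurally broken call
        if isinstance(call, dict) and "name" in call and "args" in call:
            structural += 1
            if call["name"] == exp["name"] and call["args"] == exp["args"]:
                semantic += 1
    n = len(expected)
    return structural / n, semantic / n

expected = [{"name": "read_file",  "args": {"path": "a.py"}},
            {"name": "write_file", "args": {"path": "b.py"}}]
actual = ['{"name": "read_file", "args": {"path": "a.py"}}',
          '{"name": "write_file", "args": {"path": "WRONG.py"}}']
print(score_chain(expected, actual))  # (1.0, 0.5)
```

Separating the two metrics explains results like GPT OSS 20B's profile in the video: a model can emit perfectly well-formed calls (high structural score) while still choosing the wrong tool or arguments (lower semantic score).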
In conclusion, the presenter is highly enthusiastic about the capabilities of Qwen3 Coder 30B, especially for real-world coding tasks. They demonstrate practical use cases, such as analyzing and refactoring code files with high accuracy and speed, and even showcase interactive demos like a first-person shooter and a music visualizer built entirely with local models. While acknowledging some limitations, such as context window size and occasional UI glitches, the overall impression is that local LLMs for coding have reached a new level of maturity, making them viable tools for developers who want powerful AI assistance without relying on cloud services. The presenter invites viewers to share their experiences and suggestions, expressing excitement about further exploring and tuning these models.