Can a Local LLM REALLY be your daily coder? Framework Desktop with GLM 4.5 Air and Qwen 3 Coder

The creator tests local large language models such as GLM 4.5 Air and Qwen 3 Coder for daily coding on a powerful Framework Desktop, finding that while local LLMs can be effective, performance bottlenecks and tooling limitations push toward a hybrid approach: larger, slower models for planning and smaller, faster ones for coding. Despite slow prompt processing and occasional errors, the experiment suggests that fully local AI-assisted coding will become more practical as hardware and software improve.

In this video, the creator takes on the challenge of using only local large language models (LLMs) for daily coding tasks, aiming to avoid any cloud-based services and associated costs. They demonstrate their setup using a Framework Desktop with 128 GB of unified memory, allocating 96 GB to VRAM, and an RTX 5090-equipped Windows machine. The goal was to test whether local models like GLM 4.5 Air, Qwen 3 Coder 30B, and GPT OSS 120B could provide a pleasant and efficient coding experience, especially focusing on handling large context windows and longer-running tasks.
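The video doesn't spell out the exact serving command, but a setup like this is commonly run through llama.cpp's `llama-server`. A hypothetical launch for the Framework box might look like the following; the model file, context size, and port are illustrative, not taken from the video:

```shell
# Hypothetical llama.cpp (Vulkan build) launch for the Framework Desktop.
# -ngl 99 offloads all layers into the 96 GB VRAM allocation;
# -c sets a large context window for long coding sessions.
llama-server \
  -m ./GLM-4.5-Air-Q5_K_M.gguf \
  -ngl 99 \
  -c 65536 \
  --port 8080
```

This exposes an OpenAI-compatible endpoint that coding tools can point at instead of a cloud API.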

Throughout the day, the creator experimented with various models and configurations, discovering that prompt processing speed and memory bandwidth were the main bottlenecks. Tuning parameters such as the evaluation batch size had a notable impact, with a batch size of 2048 offering a good balance between speed and memory usage. Techniques like flash attention and different quantization levels (Q8, Q5) also influenced latency and VRAM consumption. The choice of runtime mattered as well: on the Framework Desktop, Vulkan outperformed AMD's ROCm runtime, which suffered from memory allocation issues.
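The memory-bandwidth bottleneck has a simple back-of-the-envelope form: each generated token must stream the model's active weights through memory once, so decode speed is roughly bandwidth divided by bytes per token. The figures below (≈256 GB/s for the Framework Desktop's unified memory, ~12B active parameters for GLM 4.5 Air's mixture-of-experts, ~0.625 bytes/weight at Q5) are assumptions for illustration, not measurements from the video:

```python
def est_decode_tps(bandwidth_gb_s: float, active_params_b: float,
                   bytes_per_param: float) -> float:
    """Rough upper bound on decode tokens/s: every token streams the active weights once."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Assumed figures: ~256 GB/s bandwidth, ~12B active params, Q5 ~ 0.625 bytes/param.
print(round(est_decode_tps(256, 12, 0.625), 1))  # → 34.1
```

This is why a dense 120B model feels sluggish on unified memory while a 30B (or a sparse MoE) stays usable: fewer bytes per token.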

Despite some promising results, the creator encountered several challenges. GPT OSS 20B was slow and sometimes produced duplicated code, while GPT OSS 120B struggled with timeouts and errors during agentic coding tasks. GLM 4.5 Air, although capable of decent tokens-per-second throughput, did not support native tool calling, limiting its usability in certain coding workflows. Initial prompt processing often took minutes, which made interactive coding with the large models frustrating and inefficient.

To overcome these issues, the creator shifted to a “back to basics” approach, using simpler tools like Jan AI for more straightforward code editing and relying on smaller, faster models like Qwen 3 Coder 30B on the RTX 5090 for grunt work. They found that combining slower, larger models for planning and understanding code with faster, smaller models for direct coding tasks was the most practical setup. Although this approach required more manual coding and iteration, it allowed them to work entirely locally without cloud dependencies.
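The hybrid split described above amounts to a routing decision: planning and code-understanding requests go to the big, slow model on the Framework Desktop, while edit/grunt-work requests go to the fast 30B on the RTX 5090. A hypothetical sketch (hostnames, ports, and model IDs are illustrative, not from the video):

```python
# Two OpenAI-compatible local endpoints; hosts and model names are assumptions.
PLANNER = {"model": "glm-4.5-air", "endpoint": "http://framework:8080/v1"}
CODER = {"model": "qwen3-coder-30b", "endpoint": "http://rtx5090:8080/v1"}

def pick_backend(task_kind: str) -> dict:
    """Route slow, high-context work to the planner model and
    quick code edits to the faster coder model."""
    if task_kind in {"plan", "review", "explain"}:
        return PLANNER
    return CODER
```

A coding tool that lets you switch endpoints per request can then use the planner's output as context for the coder, keeping the whole loop local.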

In conclusion, the video highlights both the potential and current limitations of using local LLMs for daily coding. While it is possible to run large models locally and get meaningful work done, performance constraints, prompt processing delays, and tooling limitations mean that a hybrid approach using multiple models and workflows is currently the most effective. The creator invites viewers to share their thoughts and suggestions for improving local AI coding setups, emphasizing that with continued experimentation and hardware improvements, fully local AI-assisted coding could become more viable in the future.