Advances like Mixture of Experts (MoE) offloading, together with tools such as Llama.cpp, now make it practical to run a large 35B-parameter model like Qwen 3.6 on a GPU with only 8GB of VRAM, such as the Nvidia RTX 4060 Ti, by moving less critical components to system RAM. Combined with optimizations like a Q8 KV cache and flash attention, this enables real AI workloads with large context windows on affordable hardware, opening agentic AI development to people without expensive high-end GPUs.
In 2026, running a large agentic model like the 35-billion-parameter Qwen 3.6 on an 8GB GPU such as the Nvidia RTX 4060 Ti has become feasible. It was long assumed that serious local large language model (LLM) work required at least 24GB of VRAM or multiple GPUs, but new techniques and optimizations have overturned that assumption, making it possible to run these models efficiently on far more affordable and accessible hardware.
The key enabler is Mixture of Experts (MoE) offloading, which keeps only the most critical parts of the model on the GPU and pushes the rest to system RAM. Because Qwen 3.6 is an MoE model, only about 3 billion of its 35 billion parameters are active for any given token; the attention mechanisms and shared weights, which are used on every token, stay on the GPU, while the cold expert feed-forward networks move to system memory. Tools like Llama.cpp, with its high flexibility and tunability, make this split straightforward to configure on limited-VRAM setups, as the sketch below shows.
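To make this concrete, here is a minimal sketch of what such a split can look like with Llama.cpp's llama-server. The model filename is a placeholder, and the tensor-override regex is the pattern commonly used for expert feed-forward tensors; verify both against your own build and the tensor names llama.cpp prints at load time.

```bash
# Minimal sketch: MoE offloading with Llama.cpp (model path is hypothetical).
llama-server \
  -m ./qwen3.6-35b-q4_k_m.gguf \
  -ngl 99 \
  -ot '\.ffn_.*_exps\.=CPU'
# -ngl 99 offloads every layer to the GPU first.
# -ot (--override-tensor) then pins tensors whose names match the regex,
#    i.e. the per-expert feed-forward weights, to CPU buffers in system RAM,
#    so only attention and shared weights occupy VRAM.
```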
Performance benchmarks shared by Above Spec show the 8GB RTX 4060 Ti reaching around 41 tokens per second at a 16K context window and 24 tokens per second at a 200K context window. Those speeds are sufficient for practical use, especially for running AI agents like Hermes. Quantizing the KV cache to Q8 and enabling flash attention further trims memory use, letting even large context windows fit within a couple of gigabytes of VRAM, which matters for agentic tasks that depend on extensive context.
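The same command can be extended with the cache and context settings these benchmarks rely on. This is again a sketch: flag spellings vary across Llama.cpp builds (older builds take a bare -fa, newer ones accept --flash-attn with a value), and the model path remains a placeholder.

```bash
# Sketch: Q8 KV cache, flash attention, and a large context window.
llama-server \
  -m ./qwen3.6-35b-q4_k_m.gguf \
  -ngl 99 \
  -ot '\.ffn_.*_exps\.=CPU' \
  -c 200000 \
  -fa \
  -ctk q8_0 \
  -ctv q8_0
# -c sets the context window (200K here, matching the benchmark above).
# -fa enables flash attention, which llama.cpp has required for a
#    quantized V cache.
# -ctk/-ctv store the K and V caches as 8-bit q8_0, roughly halving
#    KV-cache memory versus the default f16.
```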
While the 4060 Ti comes out as the strongest option, the video also mentions the 8GB RTX 3070 as a viable alternative thanks to its higher memory bandwidth, though it is generally slower and less cost-effective. Either GPU should be paired with a decent CPU and ample system RAM (around 64GB) so the offloaded experts can be served efficiently. Both cards are also affordable and easy to find, especially on platforms like eBay, which makes them attractive for running large AI models locally without investing in expensive high-end hardware.
In conclusion, more VRAM still buys a better experience, but advances in model offloading and optimization have democratized running large AI models locally on budget-friendly GPUs with only 8GB of VRAM. This opens up new possibilities for AI enthusiasts and developers who want to experiment with agentic AI without breaking the bank. The video encourages viewers to consider these options and share which GPUs they plan to use for their AI projects.