How I Got a 32GB Local LLM to Run on My 28GB System Memory PC

The creator demonstrates running a 32GB Mixture of Experts (MOE) large language model on a PC with only 28GB of total memory by streaming model components from an NVMe SSD with memory mapping, so that only the small subset of experts active at any moment occupies RAM. Leveraging fast SSD read speeds and careful allocation across GPU, RAM, and SSD, this approach delivers smooth inference on mid-range hardware without loading the entire model into memory, a practical way to run large local LLMs on consumer-grade systems.

In this video, the creator demonstrates how they managed to run a 32GB local large language model (LLM) on a PC with only 28GB of total system memory, a feat that initially seems impossible. Inspired by a technique shared by Daniel Isaac on Twitter, the key idea is to stream parts of the AI model from the SSD rather than loading the entire model into memory at once. This approach leverages the fast read speeds of modern SSDs to load only the necessary model components on demand, significantly reducing memory requirements.
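
To make the memory-mapping idea concrete, here is a minimal Python sketch (not the creator's code) of how a process can address a file far larger than RAM while the operating system only pages in the bytes that are actually touched; the file name and offsets are placeholders:

```python
import mmap

# Map a large weights file without reading it into RAM up front.
# "model.bin" and the offsets below are placeholders for illustration.
with open("model.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ)

    # Slice out just the bytes for one expert; only these pages get
    # faulted in from the SSD by the operating system.
    expert_offset, expert_size = 512 * 1024 * 1024, 4 * 1024 * 1024
    expert_bytes = mm[expert_offset:expert_offset + expert_size]

    print(f"Touched {len(expert_bytes)} bytes of a file much larger than RAM")
    mm.close()
```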

The experiment focuses on Mixture of Experts (MOE) models, which differ from traditional dense models by activating only a subset of specialized “expert” components for each token. Instead of running every parameter on every pass, MOE models selectively engage a few experts out of many, drastically shrinking the active memory footprint. The creator uses a 30 billion parameter MOE model with 128 experts, of which only eight are active for any given token, so the bulk of the model can remain on slower storage and be streamed in as needed.
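
The routing step is what makes streaming practical: for each token, a small router scores every expert and only the top few are actually evaluated, so only their weights need to be resident at that moment. Below is a rough numpy sketch of top-k gating; the dimensions and the toy router are invented for illustration and are not the model's actual implementation:

```python
import numpy as np

NUM_EXPERTS, TOP_K, HIDDEN = 128, 8, 512   # 128 experts, 8 active, toy hidden size

rng = np.random.default_rng(0)
router = rng.standard_normal((HIDDEN, NUM_EXPERTS))   # toy router weights
token = rng.standard_normal(HIDDEN)                   # one token's hidden state

scores = token @ router                     # one score per expert
active = np.argsort(scores)[-TOP_K:]        # indices of the 8 highest-scoring experts
exp_scores = np.exp(scores[active] - scores[active].max())
gates = exp_scores / exp_scores.sum()       # softmax weights over the winners

print("Active experts this token:", sorted(active.tolist()))
print(f"Only {TOP_K} of {NUM_EXPERTS} experts need their weights in memory right now")
```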

The hardware setup includes an Nvidia RTX 3060 GPU with 12GB VRAM, 16GB of system RAM, and a 2TB NVMe SSD. The GPU handles the core attention and routing logic, while the CPU RAM acts as a cache for frequently used experts. The SSD stores all expert weights, which are memory-mapped using the mmap technique, enabling the operating system to load only the required parts into RAM dynamically. Despite some bottlenecks due to a suboptimal SSD adapter limiting read speeds to 1.5GB/s, the system successfully streams model components from the SSD during inference.
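
One way to picture that three-tier split is an LRU cache sitting between the memory-mapped expert file on the SSD and the compute device. The sketch below is purely illustrative (llama.cpp handles this differently and far more efficiently); the file name, per-expert size, and cache capacity are all assumed values:

```python
import mmap
from collections import OrderedDict

EXPERT_BYTES = 4 * 1024 * 1024   # assume each expert occupies 4 MB on disk (made up)
CACHE_CAPACITY = 16              # how many experts are allowed to stay "hot" in RAM

class ExpertStore:
    """SSD holds every expert (memory-mapped); RAM keeps a small LRU cache."""

    def __init__(self, path):
        self._file = open(path, "rb")
        # Map the whole expert file; pages are pulled from the SSD only when touched.
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._cache = OrderedDict()

    def get(self, expert_id):
        if expert_id in self._cache:              # hot expert: served from RAM
            self._cache.move_to_end(expert_id)
            return self._cache[expert_id]
        start = expert_id * EXPERT_BYTES          # cold expert: stream in from the SSD
        data = bytes(self._mm[start:start + EXPERT_BYTES])
        self._cache[expert_id] = data
        if len(self._cache) > CACHE_CAPACITY:     # evict the least recently used expert
            self._cache.popitem(last=False)
        return data

# Usage: store = ExpertStore("experts.bin"); weights = store.get(42)
```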

Using the open-source llama.cpp inference engine, the creator configures the model to place attention layers on the GPU and expert weights on the CPU, enabling efficient streaming. Real-time monitoring shows how the system reads from the SSD when new experts are needed and caches frequently used ones in RAM. Although the model runs slower than smaller models fully loaded in memory, it operates smoothly without crashing the system, demonstrating that large MOE models can be run on mid-range hardware with clever streaming techniques.
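
For reference, the Python bindings for llama.cpp expose the same mmap-backed loading and GPU offload, although the per-tensor placement of expert weights described above is handled through llama.cpp's own configuration rather than anything shown here. A hedged sketch, with the model path as a placeholder:

```python
from llama_cpp import Llama   # pip install llama-cpp-python (built with GPU support)

llm = Llama(
    model_path="moe-30b.gguf",   # placeholder path to a GGUF model file
    n_gpu_layers=-1,             # offload as many layers as possible to the GPU
    use_mmap=True,               # memory-map the file instead of reading it all into RAM
    n_ctx=4096,                  # context window size
)

out = llm("Explain memory mapping in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```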

In conclusion, this experiment proves that SSD expert streaming can enable running large local LLMs on hardware with less memory than the model size, extending Daniel Isaac’s Apple-based method to Windows PCs. The creator shares their custom monitoring tools and benchmarks on GitHub for others to try. This approach highlights a future where smarter streaming and memory management, rather than just bigger GPUs, will allow broader access to powerful AI models on consumer-grade machines.
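
The creator's monitoring tools live on GitHub; as a rough stand-in, a few lines of psutil are enough to watch SSD read throughput and resident memory while the model generates, which is how the streaming behaviour shows up in practice. This is only an assumed illustration, not the published tooling:

```python
import time
import psutil

prev = psutil.disk_io_counters().read_bytes
proc = psutil.Process()          # this process; pass a PID to watch llama.cpp instead

for _ in range(10):              # sample once per second for ten seconds
    time.sleep(1)
    now = psutil.disk_io_counters().read_bytes
    read_mb_s = (now - prev) / 1e6
    rss_gb = proc.memory_info().rss / 1e9
    print(f"SSD reads: {read_mb_s:8.1f} MB/s | resident memory: {rss_gb:5.2f} GB")
    prev = now
```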