The video demonstrates running the GLM 5.1 large language model fully locally on a powerful 8-GPU AMD EPYC 7702 system, achieving token generation speeds of around 12 tokens per second through careful thread balancing and a llama.cpp build compiled from source. Despite the model's large size, the presenter highlights effective hardware utilization, provides setup guidance, and encourages experimentation with local LLMs on high-end multi-GPU builds.
The video discusses running the GLM 5.1 large language model fully locally, with decent performance, on a powerful 8-GPU build. The presenter walks through the setup process, referencing a recent 8-GPU build video and an updated llama.cpp tutorial that covers compiling from source for optimal performance. They use the Unsloth UD-Q2_K_XL quantization, which trades away some model intelligence but still delivers surprisingly good results. The system is an AMD EPYC 7702 with 512 GB of RAM, 64 cores, and 128 threads, running inside a Proxmox container with eight GPUs and 120 CPU cores passed through.
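As a rough sketch of that setup, the commands below build llama.cpp from source with CUDA and launch llama-server against a split GGUF. The model filename, paths, and layer count are placeholder assumptions for illustration, not values taken from the video.

```bash
# Build llama.cpp from source with CUDA support (assumes the CUDA toolkit is installed).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Launch the server. The GGUF filename is hypothetical; -ngl 999 asks llama.cpp
# to offload as many layers as fit across the GPUs, -t sets CPU threads for the
# layers that stay on the EPYC, and -c matches the 4096-token context used later.
./build/bin/llama-server \
  -m /models/GLM-UD-Q2_K_XL-00001-of-00003.gguf \
  -ngl 999 -t 64 -c 4096
```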
The presenter highlights the importance of thread balancing on NUMA-heavy systems like the AMD EPYC, finding that 60 to 64 threads give the best performance, while going above or below that range hurts speed. They also emphasize the role of the "fit on" feature in llama.cpp for efficient hardware utilization. Initial expectations for token generation speed were low, but the system achieved around 12 tokens per second, which is impressive for such a massive model. The presenter notes that CPU core count significantly impacts performance, with higher core counts yielding better results.
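One way to reproduce that thread sweep is with llama-bench, which accepts comma-separated parameter lists. The model path below is a placeholder, and the NUMA flag is a general llama.cpp option worth testing on chiplet-heavy parts rather than a setting shown in the video.

```bash
# Benchmark several thread counts in one run to find the sweet spot
# (the presenter landed on roughly 60-64 threads on this EPYC 7702).
./build/bin/llama-bench \
  -m /models/GLM-UD-Q2_K_XL-00001-of-00003.gguf \
  -t 32,48,60,64,96,128

# On NUMA-heavy systems, llama.cpp's --numa option can also matter:
./build/bin/llama-cli \
  -m /models/GLM-UD-Q2_K_XL-00001-of-00003.gguf \
  --numa distribute -t 64 -p "Hello"
```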
During testing, the model held a steady generation speed of about 11.5 to 12 tokens per second and prompt processing speeds close to 60 tokens per second with a 4096-token context window. The presenter ran various tests, including asking the model to count letters in a word and generate decimals of pi, confirming the model's accuracy and responsiveness. GPU and CPU utilization were monitored throughout, showing efficient cooling and resource usage, and the presenter plans to add more GPUs so the entire model fits in VRAM for potentially better performance.
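A minimal sketch of that kind of smoke test against a running llama-server (default port 8080), with utilization monitoring in a second terminal; the prompt is illustrative, and NVIDIA GPUs are assumed for the monitoring command.

```bash
# Send a pi-decimals prompt to llama-server's completion endpoint; the JSON
# reply includes timing stats alongside the generated text.
curl -s http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write the first 20 decimals of pi.", "n_predict": 128}'

# In another terminal, watch GPU load, VRAM, and temperatures:
watch -n 1 nvidia-smi
```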
The video also touches on the challenges of running such a large model locally: the 8-bit version is still very large at 85 GB, and the 16-bit version exceeds a terabyte, making it difficult to run on less powerful hardware. Despite these challenges, the presenter praises the GLM team for releasing this open-source model and expresses curiosity about other projects like DeepSeek. They caution that hooking this model into certain harnesses may not be feasible due to its size and resource demands, but encourage viewers with powerful setups to experiment with it.
In conclusion, the presenter recommends using the auto-fitting features in the latest llama.cpp for optimal performance and provides links to guides and tutorials on their website, digitalspaceport.com. They encourage viewers to explore local LLMs, experiment with thread settings, and follow their ongoing work with Hermes Agent and Proxmox setups. Overall, the video serves as a practical walkthrough and performance report for running GLM 5.1 locally on a high-end multi-GPU system, demonstrating that with the right hardware and configuration, large models can run effectively outside of cloud environments.