EPYC vs Threadripper Local AI Benchmarks on Ollama

The video compares the performance of AMD EPYC and Threadripper CPUs in CPU inference tasks, highlighting how higher system bandwidth (200-400 GB/s) significantly improves token generation speeds during AI model inference. It emphasizes that system bandwidth is a crucial factor for inference performance, often more impactful than core count, and demonstrates that modern hardware with increased bandwidth can greatly enhance AI workloads without GPU acceleration.

The video compares two high-end AMD CPUs, the EPYC 7702 and the Threadripper Pro 7995WX, in CPU-only inference, focusing on how system bandwidth affects token generation speed. The host explains that system bandwidth is a key factor in inference performance: the EPYC 7702 has roughly 200 GB/s of system bandwidth, while the 7995WX offers about double that at around 400 GB/s, and these differences are reflected in the token generation rates measured during testing.
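The ~200 GB/s figure matches the EPYC 7702's theoretical peak from eight channels of DDR4-3200; the Threadripper's figure depends on the installed DIMM speed. A minimal sketch of that back-of-envelope calculation (the channel counts and memory speeds here are assumptions about the test systems, not stated in the video):

```python
# Theoretical peak memory bandwidth:
#   channels * transfer rate (MT/s) * bus width (bytes per transfer)
def peak_bandwidth_gbs(channels: int, mts: int, bus_bytes: int = 8) -> float:
    """Return theoretical peak bandwidth in GB/s for a memory configuration."""
    return channels * mts * bus_bytes / 1000  # MB/s -> GB/s

# Assumed configs: EPYC 7702 with 8-channel DDR4-3200,
# Threadripper Pro 7995WX with 8-channel DDR5-5200 (its rated speed)
epyc = peak_bandwidth_gbs(channels=8, mts=3200)    # 204.8 GB/s
tr_pro = peak_bandwidth_gbs(channels=8, mts=5200)  # 332.8 GB/s

print(f"EPYC 7702: {epyc:.1f} GB/s")
print(f"7995WX:    {tr_pro:.1f} GB/s")
```

The ~400 GB/s quoted in the video would imply memory running above the 7995WX's rated 5200 MT/s; eight channels of DDR5-6400, for example, work out to 409.6 GB/s.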

The host runs several models, including Gemma 3 12B, Mistral 3.1, and Gemma 3 4B, on both systems and observes how their performance varies. On Gemma 3 12B, for instance, the Threadripper generated response tokens at around 16.28 tokens/sec while the EPYC managed about 10.3 tokens/sec, with prompt-processing rates showing a similar disparity. The tests use identical settings and seed values to ensure fair comparisons, and both systems run purely on CPU inference without GPU acceleration.
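A fixed-seed, CPU-only run of the kind described can be reproduced against Ollama's HTTP API; this is an illustrative sketch, not the host's exact invocation (the prompt and seed value are placeholders, `num_gpu: 0` disables GPU offload, and the CLI's `--verbose` flag prints prompt and response token rates):

```shell
# Pin the seed and force CPU-only inference so runs are comparable across machines
curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma3:12b",
  "prompt": "Explain memory bandwidth in one paragraph.",
  "stream": false,
  "options": { "seed": 42, "temperature": 0, "num_gpu": 0 }
}'

# Alternatively, the CLI reports eval rates (tokens/sec) directly:
ollama run gemma3:12b --verbose "Explain memory bandwidth in one paragraph."
```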

Further tests with Mistral 3.1 and Gemma 3 show a consistent trend: the higher-bandwidth system generally produces better token throughput. The host notes some anomalies, such as Gemma 3 performing unexpectedly poorly at the 4B size, suggesting potential issues with how certain models or configurations are processed. The tests also show that the Threadripper Pro, despite its many cores, exhibits occasional hiccups and inconsistent performance, possibly due to tuning or system-configuration factors.

The host emphasizes that system bandwidth remains a crucial performance metric for inference, especially with smaller models. Much upcoming hardware from AMD and Nvidia targets bandwidth in the 200-400 GB/s range, which can serve as a benchmark for expected performance. The video also compares these figures to older systems like the HP Z440, which peaked at around 75 GB/s, illustrating how much modern hardware can improve inference speed through increased bandwidth alone.
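The reason bandwidth dominates is that generating each token requires streaming the full set of model weights from memory, so bandwidth sets a hard ceiling on tokens/sec. A rough sketch of that ceiling (the ~0.5 bytes/parameter figure assumes 4-bit quantization and is an assumption, not a number from the video):

```python
# Upper bound on token generation: each token reads all model weights once,
# so tokens/sec <= memory bandwidth / model size in bytes.
def max_tokens_per_sec(bandwidth_gbs: float, params_b: float,
                       bytes_per_param: float = 0.5) -> float:
    model_gb = params_b * bytes_per_param  # e.g. 12B params at ~4-bit -> ~6 GB
    return bandwidth_gbs / model_gb

for name, bw in [("EPYC (~200 GB/s)", 200),
                 ("Threadripper (~400 GB/s)", 400),
                 ("Z440 (~75 GB/s)", 75)]:
    print(f"{name}: <= {max_tokens_per_sec(bw, 12):.1f} tokens/sec on a 12B model")
```

The measured 10.3 and 16.28 tokens/sec sit well below these theoretical ceilings (roughly 33 and 67), which is typical: real runs lose efficiency to compute, NUMA placement, and cache effects, which is where the tuning the host mentions comes in.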

In conclusion, the host suggests that system bandwidth is a key determinant of inference performance, more so than raw core counts or other specs. For users with limited GPU resources, understanding their system’s bandwidth can help determine which models they can run effectively. The video encourages viewers to share their own results and experiences in the comments, highlighting that tuning and system configuration play vital roles in optimizing inference performance.