DGX Spark vs AMD EPYC CPU Local AI Benchmarks

The video compares local AI inference performance between an AMD EPYC 7702 CPU system and NVIDIA DGX Spark, revealing that the EPYC CPU surprisingly outperforms the DGX Spark on larger models like GPT-OSS 120B and GPT-3, while GPUs still lead in raw speed on smaller models. It highlights the viability of CPU-based inference for certain AI workloads and encourages further exploration of newer CPUs and budget-friendly GPU options for local AI setups.

The body of the video pits an AMD EPYC 7702 CPU-only system against the NVIDIA DGX Spark and other GPU setups. The EPYC 7702 system, equipped with eight 32 GB DIMMs and running in performance mode, idles at around 85 to 90 watts, with additional power draw from two Intel U.2 drives. The presenter runs CPU-only inference through llama.cpp to keep the results comparable with previous tests of the DGX Spark and a quad RTX 3090 rig; the goal is to fill in performance data for pure CPU inference with no GPUs attached.

Using the GPT-OSS 120B model, the AMD EPYC system achieved about 15.75 tokens per second during decoding and 20 tokens per second for prompt processing. This performance was surprisingly faster than the DGX Spark, which managed around 11.66 decode tokens per second and 9.4 prompt tokens per second on the same model. The presenter notes that these tests are informal and non-scientific but highlight that the older AMD EPYC CPU, despite being from around 2018 with eight memory channels, delivers impressive inference speeds that challenge the DGX Spark’s capabilities.
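One way to sanity-check decode numbers like these is against theoretical memory bandwidth, since token-by-token decoding is typically bandwidth-bound: every generated token must stream the model's active weights from DRAM once. The sketch below uses the EPYC 7702 platform's public specs (eight channels of DDR4-3200); the ~5 GB of active weights per token is a hypothetical illustration for a sparse MoE model, not a figure from the video.

```python
def peak_bandwidth_gb_s(channels: int, mt_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s for a 64-bit-per-channel bus."""
    return channels * mt_s * bus_bytes / 1000


def decode_tps_upper_bound(bandwidth_gb_s: float, active_gb_per_token: float) -> float:
    """Bandwidth-bound ceiling: each decoded token streams the active weights once."""
    return bandwidth_gb_s / active_gb_per_token


# EPYC 7702 platform: 8 channels of DDR4-3200 -> 204.8 GB/s peak.
epyc_bw = peak_bandwidth_gb_s(channels=8, mt_s=3200)
print(f"EPYC 7702 peak bandwidth: {epyc_bw:.1f} GB/s")

# Hypothetical: ~5 GB of weights touched per token for a sparse MoE model.
print(f"Decode ceiling at 5 GB/token: {decode_tps_upper_bound(epyc_bw, 5.0):.1f} tok/s")
```

The observed ~15.75 tokens per second sitting comfortably below such a ceiling is consistent with a bandwidth-bound CPU run.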

When testing the smaller GPT-OSS 20B model, the DGX Spark outperformed the EPYC system, achieving about 50 tokens per second compared to the EPYC’s 21 tokens per second. However, the quad 3090 GPU rig significantly outperformed both, reaching approximately 124 tokens per second. The presenter also tested the GPT-3 model, where the EPYC 7702 showed a remarkable 27.8 tokens per second compared to the DGX Spark’s 6.2 tokens per second, suggesting that the DGX Spark may have some optimization issues or bandwidth limitations despite its higher theoretical memory bandwidth.
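For quick comparison, the decode figures quoted above can be normalized against the DGX Spark. This is a minimal sketch; the numbers are the informal results reported in the video, not independent measurements.

```python
# Decode throughput (tokens/s) as reported informally in the video.
results = {
    "GPT-OSS 120B": {"EPYC 7702": 15.75, "DGX Spark": 11.66},
    "GPT-OSS 20B":  {"EPYC 7702": 21.0, "DGX Spark": 50.0, "4x RTX 3090": 124.0},
    "GPT-3":        {"EPYC 7702": 27.8, "DGX Spark": 6.2},
}


def speedup_vs_spark(model: str, system: str) -> float:
    """Ratio of a system's decode rate to the DGX Spark's on the same model."""
    runs = results[model]
    return runs[system] / runs["DGX Spark"]


for model, runs in results.items():
    for system in runs:
        print(f"{model:13s} {system:12s} {speedup_vs_spark(model, system):5.2f}x vs DGX Spark")
```

The normalization makes the pattern explicit: the EPYC trails on the small dense-friendly 20B model but leads by a wide margin on the two larger tests.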

The video emphasizes that newer, sparser AI models tend to run well on CPU-based inference, with models like GPT-OSS 120B performing surprisingly efficiently on CPUs. The presenter expresses interest in further testing with newer AMD systems, such as the Ryzen Threadripper PRO 7995WX, to explore CPU inference performance more broadly. While GPUs still dominate in raw performance, especially on smaller models, the results demonstrate that CPUs like the EPYC 7702 can offer competitive inference speeds for certain workloads, making them viable options for local AI tasks.

Finally, the presenter encourages viewers to explore additional resources, including build guides for various AI rigs and budget-friendly GPU options like the NVIDIA RTX 5060 Ti and AMD Radeon RX 9060 XT, which offer solid entry points for AI enthusiasts. The video concludes with thanks to channel members and supporters and invites viewers to share their thoughts and experiences in the comments, fostering a community discussion around CPU versus GPU inference performance in local AI setups.