I Made a DGX Spark and Mac Studio Run One LLM Together

The video demonstrates running a single large language model by splitting its prefill and decode phases across an Nvidia DGX Spark and a Mac Studio connected over a high-speed network link, pairing the Spark's superior prefill speed with the Mac's faster decode. Despite network bottlenecks and a fiddly setup, the hybrid system generated tokens faster overall than either machine alone, showcasing a novel approach to heterogeneous LLM inference.

The video explores running a large language model (LLM) by splitting its two main computational phases, prefill (prompt processing) and decode (token generation), across two different machines to leverage their respective strengths. The DGX Spark, equipped with an Nvidia Blackwell GPU and 128 GB of unified memory, excels at the compute-heavy prefill phase but is slower at token generation. Conversely, the Mac Mini with an M4 Pro chip and 64 GB of unified memory is slower at prefill but much faster at decode due to its superior memory bandwidth. The creator attempts to combine these advantages by running prefill on the Spark and decode on the Mac Mini, a technique known as disaggregated prefill and decode, which is already used in production by some companies but rarely seen on consumer hardware.
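
To make the split concrete, here is a minimal, runnable sketch of disaggregated prefill and decode. Everything in it is a toy stand-in rather than the video's actual stack, but it captures the shape of the technique: one machine builds the KV cache in a single compute-bound pass, the cache crosses the network, and the other machine generates tokens from it.

```python
from dataclasses import dataclass, field

@dataclass
class KVCache:
    """Stand-in for the per-layer key/value tensors a real model keeps."""
    tokens: list = field(default_factory=list)

class ToyModel:
    def prefill(self, prompt: list) -> KVCache:
        # Compute-bound: one parallel pass over the whole prompt,
        # which is where the Spark's Blackwell GPU shines.
        return KVCache(tokens=list(prompt))

    def decode_step(self, cache: KVCache) -> int:
        # Bandwidth-bound: a real model re-reads its weights for every
        # token, which is where the Mac's unified memory shines.
        nxt = cache.tokens[-1] + 1   # toy "next token prediction"
        cache.tokens.append(nxt)
        return nxt

# On the Spark: build the KV cache from the prompt.
cache = ToyModel().prefill([1, 2, 3])
# ... serialize `cache` and ship it to the Mac (the costly step) ...
# On the Mac: inject the remote cache and decode from it.
mac = ToyModel()
print([mac.decode_step(cache) for _ in range(4)])   # [4, 5, 6, 7]
```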

Setting up this hybrid system proved challenging, with the networking problems stemming primarily from libp2p's mDNS discovery failing on macOS. After extensive troubleshooting and hardware adjustments, including falling back to directly dialing the peer's address instead of relying on discovery, the two machines were successfully linked. The creator ran models such as Qwen 3.5B in different quantizations optimized for each machine, BF16 on the Spark and 4-bit on the Mac Mini, to maximize performance. Initial tests showed that while the Spark processed prefill tokens much faster, transferring the KV cache between machines over the network became the dominant bottleneck, consuming up to 96% of the total processing time.
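
A quick back-of-envelope calculation shows why the transfer dominates. The numbers below assume a Llama-style 8B architecture with grouped-query attention (my figures, not the video's), but the shape of the result holds generally: a long prompt produces a KV cache of around a gigabyte, which takes on the order of a second to move over an ordinary 10 Gbps link.

```python
# Rough KV cache size for a Llama-3-8B-style model (assumed figures:
# 32 layers, 8 KV heads, head dim 128, BF16) and its transfer cost.
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_value = 2                  # BF16
prompt_len = 8192                    # tokens in the prompt

# Keys and values per layer: prompt_len * kv_heads * head_dim each.
kv_bytes = 2 * layers * prompt_len * kv_heads * head_dim * bytes_per_value
print(f"KV cache: {kv_bytes / 2**30:.2f} GiB")        # ~1.00 GiB

link_gbps = 10                       # a common 10 GbE link
link_bytes_per_s = link_gbps * 1e9 / 8
print(f"Transfer at {link_gbps} Gbps: "
      f"{kv_bytes / link_bytes_per_s:.2f} s")         # ~0.86 s
```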

To address the network bottleneck, the creator upgraded to a 50 Gbps Mellanox ConnectX-4 network card and a compatible switch, which improved KV cache transfer speeds by about 30%. Benchmarks with the Llama 3 8B model showed that the disaggregated setup matched the Spark's prefill speed while retaining most of the Mac Mini's decode speed, with the overhead of injecting the remote KV cache costing only a small amount of decode performance. Time to first token in the hybrid setup was comparable to the Spark alone, indicating that the approach could deliver the combined benefits without a significant latency penalty.
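
Put together, the economics of the hybrid reduce to a simple latency model: total time is prefill time plus transfer time plus decode time, and the split wins whenever the transfer term is smaller than the time saved by using the faster machine for each phase. The sketch below uses made-up round-number rates, not the video's measurements, purely to show the trade-off.

```python
def total_time(prompt_toks: int, new_toks: int,
               prefill_tps: float, decode_tps: float,
               transfer_s: float = 0.0) -> float:
    """Seconds to prefill a prompt and then generate new tokens."""
    return prompt_toks / prefill_tps + transfer_s + new_toks / decode_tps

prompt, new = 8192, 512
# Placeholder rates: the Spark is fast at prefill, the Mac at decode.
spark  = total_time(prompt, new, prefill_tps=4000, decode_tps=20)
mac    = total_time(prompt, new, prefill_tps=800,  decode_tps=50)
hybrid = total_time(prompt, new, prefill_tps=4000, decode_tps=50,
                    transfer_s=0.5)   # KV cache over the fast link

for name, t in (("Spark", spark), ("Mac", mac), ("Hybrid", hybrid)):
    print(f"{name:>6}: {t:5.1f} s")   # the hybrid comes out well ahead
```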

Further experiments with a more powerful Mac Studio M3 Ultra, featuring 819 GB/s of memory bandwidth and 512 GB of unified memory, demonstrated even better decode performance, nearly doubling the Mac Mini's speed. Testing larger models, including a 32B model and Gemma 27B, showed that the Spark's prefill advantage grew with model size, while the Mac Studio's decode advantage shrank on larger models due to architectural factors like sliding window attention and kernel fusion. Overall, the disaggregated approach consistently recovered the Spark's prefill speed and leveraged the Mac Studio's decode capabilities, generating tokens faster than either machine alone.
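
The decode numbers track a simple bandwidth bound: generating one token streams the active weights through memory once, so tokens per second cannot exceed memory bandwidth divided by model size in bytes. The 819 GB/s figure is the M3 Ultra spec quoted above; the roughly 273 GB/s for the DGX Spark is its published LPDDR5X bandwidth; the BF16 precision is my assumption for illustration.

```python
# Upper bound on decode speed: tok/s <= memory bandwidth / model bytes.
GB = 1e9
machines = {"DGX Spark": 273 * GB, "Mac Studio M3 Ultra": 819 * GB}

for name, bw in machines.items():
    for params_b in (8, 32):                 # model sizes in billions
        model_bytes = params_b * GB * 2      # 2 bytes per BF16 weight
        print(f"{name}: {params_b}B BF16 -> "
              f"<= {bw / model_bytes:.0f} tok/s")
# Real throughput lands below these ceilings, and techniques like
# sliding window attention and kernel fusion shift how close each
# machine gets, which is why the Mac's edge narrows on larger models.
```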

In conclusion, the experiment successfully demonstrated heterogeneous inference by combining the strengths of two different machines over a high-speed network link. While the setup is complex and the hardware expensive, it offers a proof of concept for optimizing LLM inference by splitting workloads. However, for most users investing in new hardware, a single powerful GPU like the RTX Pro 6000 might be a more practical and cost-effective solution. Nonetheless, for those who already own both machines, this approach can extract additional performance by intelligently distributing the LLM workload.