3 MacBooks Did What One Never Could

The video demonstrates how clustering multiple M5 Max MacBook Pros over Thunderbolt 5 speeds up inference on large machine learning models by 22 to 27% on two nodes, and how it lets models too large for a single machine run across two or three nodes using tensor and pipeline parallelism. Despite limits on total model size and on which parallelism method suits which node count, the setup performs impressively: two MacBook Pros even outperform two M3 Ultra Mac Studios by 29% on the same model.

The video explores the performance and capabilities of clustering three M5 Max MacBook Pros with Thunderbolt 5 cables to run large machine learning models in a distributed manner. The presenter begins with a small-scale test, using a 4-billion-parameter model to verify that the cluster works correctly. With RDMA-enabled Thunderbolt mesh networking, the model runs 22% faster on two machines than on one, demonstrating that distributed inference is functional and beneficial even for smaller models.

Next, the presenter tests a much larger 122-billion-parameter mixture-of-experts model, which still fits on a single 128 GB MacBook Pro. Running this model on two nodes yields a 27% speed increase over one node, showing that the benefit of clustering holds as the model gets larger. More importantly, the cluster can run models that do not fit on a single machine at all. For example, a 122 GB quantized version of the same model fails to load on one node but runs at 51 tokens per second on two, so for models of this size clustering is a requirement rather than an optimization.
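The fit-or-not reasoning above can be sketched as a back-of-the-envelope calculation. This is a minimal sketch, not the presenter's method: the function name `min_nodes` and the `usable_fraction` of 0.75 (the share of each node's unified memory assumed to be available for model weights) are assumptions for illustration, and the estimate ignores KV cache and activation memory.

```python
import math

def min_nodes(model_gb: float, node_gb: float = 128.0,
              usable_fraction: float = 0.75) -> int:
    """Smallest node count whose combined usable memory holds the weights.

    usable_fraction is an assumed value, not a figure from the video.
    """
    usable_per_node = node_gb * usable_fraction
    return math.ceil(model_gb / usable_per_node)

# The 122 GB model from the text, on 128 GB nodes
print(min_nodes(122))
```

Under these assumptions the 122 GB model needs two nodes, which matches the observed behavior; a different usable fraction would shift the threshold.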

The presenter pushes the limits further with a 185 GB GLM 4.7 model, which also cannot fit on one machine but runs at 29 tokens per second on two nodes, fast enough for real coding tasks. However, a 215 GB Llama 3 model on two nodes causes excessive memory swapping and unusable performance, which puts the practical ceiling for a two-node cluster at about 200 GB. Going beyond this requires more nodes. But tensor parallelism, the method used for distributing the work, requires layer dimensions to be divisible by the number of nodes, and many common model dimensions (4096 or 8192, for example) are not divisible by three, which complicates a three-node setup.
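The divisibility constraint is easy to check by hand, and the following sketch shows the idea. The helper name `tensor_parallel_splits` and the dimension values are illustrative assumptions; they are not taken from any model shown in the video.

```python
def tensor_parallel_splits(dims: dict[str, int], n_nodes: int) -> dict[str, bool]:
    """Map each dimension name to whether it divides evenly across n_nodes."""
    return {name: size % n_nodes == 0 for name, size in dims.items()}

# Illustrative dimensions only; not taken from any model in the video.
example_dims = {"hidden_size": 4096, "attention_heads": 32, "kv_heads": 8}

for n in (2, 3):
    ok = all(tensor_parallel_splits(example_dims, n).values())
    print(f"{n} nodes: {'splits evenly' if ok else 'does not split evenly'}")
```

With power-of-two dimensions like these, two nodes split every dimension evenly while three nodes split none of them.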

To overcome this, the presenter experiments with pipeline parallelism using the EXO framework. Instead of splitting each layer across nodes as tensor parallelism does, pipeline parallelism gives each node a contiguous group of layers, so a three-node cluster works regardless of layer dimensions. Although EXO has some quirks and bugs, it runs a 397-billion-parameter mixture-of-experts model across three MacBook Pros at 28 tokens per second. So while tensor parallelism is faster, pipeline parallelism widens the range of models that can run on clusters with more nodes. Finally, a head-to-head comparison shows two M5 Max MacBook Pros outperforming two M3 Ultra Mac Studios by 29% on the same model, underscoring the power and portability of the MacBook Pro cluster.
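To make the pipeline-parallel idea concrete, here is a minimal sketch of dividing a model's layers into contiguous stages, one per node. It is not EXO's actual partitioning logic, which may differ (for example, by weighting stages by each node's memory); the function name `partition_layers` and the 62-layer model are hypothetical.

```python
def partition_layers(n_layers: int, n_nodes: int) -> list[range]:
    """Split layer indices 0..n_layers-1 into n_nodes contiguous stages.

    Earlier stages absorb the remainder, so stage sizes differ by at most one.
    """
    base, extra = divmod(n_layers, n_nodes)
    stages, start = [], 0
    for node in range(n_nodes):
        size = base + (1 if node < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages

# Hypothetical 62-layer model on three nodes
for node, stage in enumerate(partition_layers(62, 3)):
    print(f"node {node}: layers {stage.start}..{stage.stop - 1} ({len(stage)} layers)")
```

Because the split is by whole layers, the layer count does not need to be divisible by the node count, which is why three nodes pose no problem here. The trade-off is that each token passes through the stages in sequence, which is consistent with pipeline parallelism being slower than tensor parallelism.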