Private AI Framework Cluster… FIXED

The video details upgrades to a private AI cluster of four Framework desktop boards, focusing on improved network hardware (up to 50 GbE) to better support running massive language models and high-concurrency workloads. While network enhancements yield modest speed gains for single large models, the real benefits are seen in enabling larger models and concurrent AI tasks, making the cluster valuable for advanced and multi-user scenarios.

The video covers recent upgrades to a private AI cluster built from four Framework desktop boards, each equipped with an AMD Ryzen AI Max+ 395 (Strix Halo) and 128 GB of RAM, for a total of 512 GB across the cluster. The main motivation for clustering these machines is to pool their memory, enabling very large language models (LLMs) such as the one-trillion-parameter Kimi K2 to run at all. The presenter references previous videos by himself and others, noting that while the basic cluster setup has been covered before, this video focuses on significant network and hardware improvements that enhance performance and usability.

A key bottleneck identified in earlier cluster setups was network speed. The presenter explains that while small models run faster on a single node, distributing larger models across the cluster is limited by network throughput. To address this, he experimented with various network cards, starting with affordable 25 GbE cards and eventually settling on Mellanox ConnectX-4 cards capable of 50 GbE. He also invested in a high-end MikroTik switch to support these speeds, though he notes that the breakout and optical cables can add substantial cost. The Framework boards’ PCIe expansion slots made these upgrades possible, an option many other mini PCs lack.
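To see why link speed matters for distributed inference, a back-of-the-envelope calculation helps: each generated token requires shipping activations between nodes, and that transfer time is pure overhead on top of compute. The sketch below uses illustrative numbers (hidden size, fp16 width) that are assumptions, not measurements from the video.

```python
# Back-of-the-envelope: time to move one token's activations between nodes.
# Hidden size and dtype width below are illustrative assumptions.

def transfer_time_us(hidden_dim: int, dtype_bytes: int, link_gbps: float) -> float:
    """Microseconds to send one token's activation vector over the link."""
    payload_bits = hidden_dim * dtype_bytes * 8
    return payload_bits / (link_gbps * 1e9) * 1e6

# Example: a 7168-wide hidden state in fp16 (2 bytes per element)
for gbps in (10, 25, 50):
    t = transfer_time_us(7168, 2, gbps)
    print(f"{gbps:>2} GbE: {t:.2f} us per token per hop")
```

The per-hop cost shrinks linearly with link speed, but at these payload sizes it is already small; in practice, per-message latency and the number of hops per token often dominate, which is consistent with the modest single-session gains reported in the video.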

The video demonstrates that upgrading the network does improve performance, but the gains are modest unless running very large models or high-concurrency workloads. For example, running a 34B-parameter model on a single node is faster than on the cluster due to network overhead, but the cluster becomes essential for models too large to fit in a single machine’s memory. The presenter shares benchmark results showing incremental improvements in tokens per second as network speed increases, validating the advice of other experts in the community.
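Tokens-per-second comparisons like these can be reproduced with a small timing harness. The sketch below is a generic template, not the presenter's benchmark: `fake_generate` is a hypothetical stand-in for whatever backend call actually produces tokens (a llama.cpp RPC call, an HTTP request to a node, etc.), and its latency and token count are made up for illustration.

```python
import time

def measure_tps(generate, prompt: str, n_runs: int = 3) -> float:
    """Average tokens/sec across n_runs calls; `generate` returns a token count."""
    total_tokens, total_time = 0, 0.0
    for _ in range(n_runs):
        start = time.perf_counter()
        total_tokens += generate(prompt)
        total_time += time.perf_counter() - start
    return total_tokens / total_time

# Hypothetical stand-in for a real inference call against one node or the cluster.
def fake_generate(prompt: str) -> int:
    time.sleep(0.01)   # pretend decode latency
    return 32          # pretend 32 tokens were produced

print(f"{measure_tps(fake_generate, 'hello'):.0f} tok/s")
```

Averaging over several runs matters here, since first-request warm-up (model load, cache fill) can skew a single measurement badly.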

Beyond raw speed, the presenter discusses more practical uses for the cluster, such as orchestrating multiple AI agents or running concurrent workloads. He highlights the value of software like vLLM, which excels at handling high concurrency and can dramatically increase throughput when multiple requests are processed in parallel. Although tensor parallelism in vLLM currently has limitations on AMD hardware, running a separate instance on each node and using a load balancer to distribute requests yields significant performance gains, especially for agent-based or code-assistant scenarios.
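The "one instance per node plus a load balancer" pattern described above can be as simple as round-robin dispatch over the nodes' OpenAI-compatible endpoints. The sketch below shows the core idea; the IP addresses and port are hypothetical placeholders, and the video does not specify which load balancer the presenter actually used.

```python
import itertools

# Hypothetical endpoints: one vLLM server per Framework board.
NODES = [
    "http://10.0.0.1:8000/v1/completions",
    "http://10.0.0.2:8000/v1/completions",
    "http://10.0.0.3:8000/v1/completions",
    "http://10.0.0.4:8000/v1/completions",
]

_ring = itertools.cycle(NODES)

def next_node() -> str:
    """Round-robin: each incoming request goes to the next node in turn."""
    return next(_ring)

# A request handler would then POST the completion request to next_node().
```

In practice an off-the-shelf reverse proxy (e.g. nginx or HAProxy with a round-robin upstream) does the same job; the point is that because each node serves its own full copy of the model, requests are independent and throughput scales with node count for concurrent workloads.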

In conclusion, the presenter emphasizes that while clustering is not always about maximizing speed for single chat sessions, it is invaluable for running large models and supporting concurrent AI workloads. He encourages viewers to consider their specific use cases—whether running massive models, orchestrating agents, or supporting multiple users—when deciding on cluster investments. The video ends with a nod to future improvements, including exploring different network topologies and further software optimizations, and invites viewers to check out related content for deeper technical details.