The video examines the challenges of running large AI models on compact Intel-powered mini PCs, revealing that while clustering multiple machines by splitting models incurs significant network overhead and slows token generation, replicating entire models on each device and distributing requests yields better performance. It concludes that despite impressive hardware, software limitations and network bottlenecks make single powerful machines more practical for most AI workloads, though clustering can benefit specific scenarios like serving multiple users or handling very large models.
The video explores the capabilities and limitations of running large AI models on compact mini PCs built on Intel’s latest hardware, which combines a new CPU, a GPU, and a dedicated AI accelerator called the NPU. Because a 70-billion-parameter model is too large to fit on any one of these mini PCs, the creator attempts to run it across three of them at once. The machines, such as the ASUS NUC Pro, are highlighted for their compact size, upgradeability, and professional-grade features, which make them attractive to developers who want local AI processing without relying on cloud APIs.
A key technical insight from the video is the distinction between prompt processing and token generation in large language models (LLMs). The GPU significantly speeds up prompt processing, doubling the rate at which prompt tokens are read compared to the CPU. Token generation, however, is bottlenecked by memory bandwidth, so the GPU offers no speed advantage there. The NPU, while more power-efficient, is slower and less effective than the GPU for inference. Additionally, Intel’s own software stack (OpenVINO) struggles with bleeding-edge models and is outperformed by open-source tools like llama.cpp, which also supports clustering.
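The memory-bandwidth ceiling on token generation can be estimated with back-of-the-envelope arithmetic. The sketch below assumes that generating one token requires reading every model weight from memory once, a common rule of thumb rather than anything measured in the video. The function name and the 40 GB and 100 GB/s figures are hypothetical and chosen only for illustration.

```python
def max_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Rough upper bound on generation speed.

    Assumes each generated token needs one full pass over the weights,
    so speed cannot exceed memory bandwidth divided by model size.
    """
    if model_bytes <= 0 or bandwidth_bytes_per_s <= 0:
        raise ValueError("model size and bandwidth must be positive")
    return bandwidth_bytes_per_s / model_bytes


GB = 1e9

# Hypothetical figures for illustration only, not taken from the video.
model_size = 40 * GB   # e.g. a quantized model occupying 40 GB
bandwidth = 100 * GB   # e.g. 100 GB/s of memory bandwidth

ceiling = max_tokens_per_second(model_size, bandwidth)
print(f"At most {ceiling:.1f} tokens/s")
```

Under this assumption, adding compute does not raise the ceiling; only more bandwidth or a smaller model does, which is consistent with the GPU helping prompt processing but not generation.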
When attempting to cluster the three mini PCs to run a model larger than any single machine’s memory, the video reveals that splitting the model across machines introduces significant network overhead. Instead of speeding up inference, clustering in this manner actually halves the token generation speed due to the latency and bandwidth limitations of Ethernet connections. Even upgrading to faster Thunderbolt connections does not alleviate this bottleneck, as the fundamental limitation lies in memory speed and the overhead of frequent inter-machine communication.
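A toy model can show why splitting slows a single request down. It assumes that the machines process their share of the model one after another for each token, so the total local work per token is not reduced, and that every hand-off between machines adds a fixed cost. The function name and all timings are hypothetical; they are not measurements from the video.

```python
def split_tokens_per_second(local_seconds_per_token: float,
                            hops_per_token: int,
                            seconds_per_hop: float) -> float:
    """Generation speed for one request when the model is split across machines.

    Toy assumption: the stages run sequentially, so the local work per token
    stays the same and each network hop adds a fixed overhead on top of it.
    """
    total = local_seconds_per_token + hops_per_token * seconds_per_hop
    if total <= 0:
        raise ValueError("total time per token must be positive")
    return 1.0 / total


# Hypothetical timings for illustration only.
local_time = 0.4   # seconds of memory-bound work per token, summed over all machines
hop_cost = 0.1     # seconds of overhead per inter-machine hand-off

no_network = split_tokens_per_second(local_time, hops_per_token=0, seconds_per_hop=hop_cost)
three_hops = split_tokens_per_second(local_time, hops_per_token=3, seconds_per_hop=hop_cost)
print(f"without network hops: {no_network:.2f} tokens/s")
print(f"with three hops:      {three_hops:.2f} tokens/s")
```

In this model a faster link lowers the per-hop cost but leaves the local, memory-bound term untouched, so it can never make the split faster than the no-network case.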
The video then demonstrates a more effective clustering approach: replicating the entire model on each machine and distributing inference requests among them. This method scales well for models that fit into a single machine’s memory, resulting in a near-linear increase in throughput as more machines handle requests in parallel. This approach contrasts with model splitting, which is primarily useful for running models too large for any one machine but comes at the cost of speed and efficiency.
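The replicate-and-distribute approach can be sketched as a simple round-robin assignment of requests to machines. This is a minimal illustration, not the setup used in the video: the host names are hypothetical, and a real deployment would forward each request over the network to an inference server on the chosen machine.

```python
from itertools import cycle


def assign_round_robin(requests, replicas):
    """Spread requests evenly across replicas that each hold the full model."""
    if not replicas:
        raise ValueError("at least one replica is required")
    assignment = {replica: [] for replica in replicas}
    # cycle() repeats the replica list; zip() stops when requests run out.
    for request, replica in zip(requests, cycle(replicas)):
        assignment[replica].append(request)
    return assignment


# Hypothetical host names and prompts for illustration only.
replicas = ["mini-pc-1", "mini-pc-2", "mini-pc-3"]
prompts = [f"prompt {i}" for i in range(6)]

plan = assign_round_robin(prompts, replicas)
for replica, assigned in plan.items():
    print(replica, assigned)
```

Because each request is served entirely by one machine, no per-token traffic crosses the network, which is why throughput can grow almost linearly with the number of replicas.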
In conclusion, the video emphasizes that while Intel’s new hardware is impressive, the software ecosystem is still catching up, especially for bleeding-edge AI workloads. Clustering mini PCs can be beneficial for handling very large models or serving multiple users, but for most use cases, a single powerful machine is preferable. The creator invites viewers to consider whether they would use such clusters for AI workloads or for other purposes like virtual machines, highlighting the practical trade-offs between hardware capabilities, software maturity, and network limitations.