LLM on the WORST Storage vs the BEST

The video demonstrates how storage speed dramatically affects loading times for large language models: a slow thumb drive takes nearly four minutes, while high-end internal NVMe SSDs bring load times down to roughly 9 to 10 seconds. It concludes that once storage is fast enough, memory and GPU transfer speeds become the limit, and it advises designing applications to avoid frequent model loading.

The video explores the impact of storage speed on loading times for large language models (LLMs) using a high-end 5090 rig. The presenter begins by testing the slowest storage option, a basic thumb drive with a sequential read speed of just 73 MB/s. Loading an 18 GB model from this drive took 228 seconds (nearly four minutes), showing how badly slow storage can bottleneck the process. The presenter scripted the loading process in LM Studio to measure times accurately and confirmed that even with USB 3.2 connectivity, the slow thumb drive itself severely limits performance.
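A sequential read speed like the 73 MB/s quoted above can be approximated with a short script. The following is a minimal sketch, not the presenter's method (the video times loads through LM Studio): the function name `measure_read_throughput` and the 16 MiB chunk size are assumptions chosen for illustration, and megabytes are treated as decimal (1 MB = 1,000,000 bytes).

```python
import time


def measure_read_throughput(path, chunk_size=16 * 1024 * 1024):
    """Read a file front to back and return (bytes_read, seconds, MB/s).

    Hypothetical helper for illustration. It measures raw file reads only,
    not the full model load into RAM and VRAM.
    """
    total_bytes = 0
    start = time.perf_counter()
    # buffering=0 bypasses Python's own buffer; the OS page cache still applies.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes means end of file
                break
            total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    mb_per_s = (total_bytes / 1_000_000) / elapsed if elapsed > 0 else float("inf")
    return total_bytes, elapsed, mb_per_s
```

Note that a second run over the same file is often served from the operating system's page cache and will report a much higher figure than the drive can deliver, so only the first (cold) read reflects the storage device.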

Next, the presenter upgrades to a faster thumb drive with a read speed of 390 MB/s connected via a 40 Gbps port, which significantly reduces the load time to 52 seconds. Further improvement is seen with a Thunderbolt 4 external SSD enclosure, delivering read speeds around 3,160 MB/s and cutting the load time down to 13 seconds. This demonstrates how faster external storage solutions can drastically improve model loading times, though still not matching internal drive speeds.

The video then compares network-attached storage (NAS) and direct-attached storage (DAS) solutions. The NAS, equipped with all SSDs and connected via a 2.5 Gbps Ethernet switch, loaded the model in 71 seconds. That is slower than the faster thumb drive's 52 seconds, since a 2.5 Gbps link tops out at roughly 312 MB/s regardless of the SSDs behind it, but the NAS offers large storage capacity and flexibility for offloading models. The DAS, connected via a 40 Gbps interface and housing multiple SSDs, performed better with an 18-second load time, showing that direct connections generally outperform networked storage for this task.
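The gap between the NAS and the DAS follows largely from link bandwidth, which can be checked with simple arithmetic. The sketch below converts a link rate in Gbps to a best-case MB/s ceiling and a minimum transfer time. It assumes decimal units (18 GB = 18,000 MB) and ignores protocol overhead, so real transfers will be slower; the function names are illustrative.

```python
def link_ceiling_mb_per_s(gbps):
    """Best-case throughput of a link in MB/s (8 bits per byte, decimal units)."""
    return gbps * 1000 / 8


def min_transfer_seconds(size_gb, gbps):
    """Lower bound on the time to move size_gb over the link, ignoring overhead."""
    return size_gb * 1000 / link_ceiling_mb_per_s(gbps)


for label, gbps in [("2.5 Gbps Ethernet (NAS)", 2.5), ("40 Gbps interface (DAS)", 40)]:
    ceiling = link_ceiling_mb_per_s(gbps)
    floor = min_transfer_seconds(18, gbps)
    print(f"{label}: ceiling {ceiling:.1f} MB/s, at least {floor:.1f} s for 18 GB")
```

Under these assumptions the 2.5 Gbps link cannot deliver the 18 GB model in less than about 58 seconds, which is consistent with the 71 seconds measured, while the 40 Gbps link leaves the drives and the rest of the load pipeline as the limiting factors.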

Internal NVMe SSDs provide the best performance, with a Samsung 990 Pro delivering read speeds of about 7,375 MB/s and loading the model in just 10 seconds. The presenter then tests an even faster SSD capable of nearly 15,000 MB/s read speeds, which transfers large files in mere seconds. However, the actual model load time only improved slightly to around 8.9 seconds, indicating that beyond a certain point, other factors such as memory transfer and GPU VRAM loading become the limiting steps.
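One way to see where the time goes is to subtract the storage-only floor (model size divided by read speed) from each measured load time. The sketch below does this with the figures reported in the video. It is an estimate under stated assumptions: the model is taken as exactly 18,000 MB, the quoted read speeds are treated as sustained rates, and the helper names are hypothetical.

```python
MODEL_MB = 18_000  # assumption: "18 GB" treated as 18,000 decimal megabytes

# (device, quoted read speed in MB/s, measured load time in seconds)
MEASUREMENTS = [
    ("slow thumb drive", 73, 228),
    ("fast thumb drive", 390, 52),
    ("Thunderbolt 4 SSD", 3160, 13),
    ("Samsung 990 Pro", 7375, 10),
    ("fastest NVMe SSD", 15000, 8.9),
]


def storage_floor_seconds(model_mb, read_mb_per_s):
    """Time needed just to read the file at the quoted speed."""
    return model_mb / read_mb_per_s


def residual_seconds(model_mb, read_mb_per_s, measured_s):
    """Measured load time not explained by the storage read alone."""
    return measured_s - storage_floor_seconds(model_mb, read_mb_per_s)


for name, speed, measured in MEASUREMENTS:
    floor = storage_floor_seconds(MODEL_MB, speed)
    rest = residual_seconds(MODEL_MB, speed, measured)
    print(f"{name:18s} read floor {floor:6.1f} s  measured {measured:6.1f} s  other {rest:6.1f} s")
```

For the three fastest devices the residual comes out at roughly 7 to 8 seconds each, which fits the video's point that a fixed cost outside storage dominates once the drive is fast. The slow thumb drive yields a negative residual, a sign that the quoted 73 MB/s or the 18 GB size is only approximate.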

In conclusion, the video emphasizes the importance of considering storage speed when working with LLMs, especially for programmatic loading in scripts. While ultra-fast SSDs can reduce load times significantly, the overall process also depends on system memory and GPU transfer speeds. The presenter advises designing applications to minimize frequent model loading to avoid bottlenecks and suggests watching a follow-up video for insights on running LLMs efficiently on the same hardware.
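The advice to avoid frequent model loading can be sketched as a load-once pattern: pay the load cost on the first request and reuse the in-memory object afterwards. In this minimal illustration, `load_model_from_disk` is a hypothetical stand-in that sleeps briefly instead of calling a real runtime, and the model path is made up; this is not LM Studio's API.

```python
import functools
import time


def load_model_from_disk(path):
    """Hypothetical stand-in for an expensive load (disk -> RAM -> VRAM)."""
    time.sleep(0.05)  # simulated load cost
    return {"path": path, "loaded_at": time.perf_counter()}


@functools.lru_cache(maxsize=1)
def get_model(path):
    """Return the cached model for this path, loading it only on first use."""
    return load_model_from_disk(path)


start = time.perf_counter()
first = get_model("models/example.gguf")   # slow: triggers the load
second = get_model("models/example.gguf")  # fast: served from the cache
print(f"two requests took {time.perf_counter() - start:.3f} s, same object: {first is second}")
```

With `maxsize=1`, requesting a different path drops the cached entry and triggers a fresh load, so this pattern helps most when a script sticks to one model. A real runtime may also need an explicit unload step to release GPU memory, which this sketch does not model.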