RTX 5090 laptop power limits hit hard for LLMs vs. desktop GPUs

The video compares Nvidia’s RTX 5090 laptop GPU to its desktop counterpart and other desktop GPUs running large language models, revealing that the laptop version suffers significant performance drops due to power and thermal limits despite better power efficiency. While desktop GPUs deliver higher raw speed, especially for larger models, the laptop RTX 5090 offers competitive efficiency, highlighting the trade-offs between performance and power consumption in LLM workloads.

The video compares the performance of Nvidia’s RTX 5090 laptop GPU against its desktop counterpart and other desktop GPUs when running large language models (LLMs). Using the Qwen 2.5 Coder 32-billion-parameter model in LM Studio, the desktop RTX 5090 achieves 62 tokens per second with 22 GB of VRAM used out of 32 GB, thanks to 4-bit quantization reducing the model size to about 19.85 GB. The laptop RTX 5090, with 24 GB of VRAM, runs the same model at half the speed—31 tokens per second—while using about 21 GB of VRAM. This highlights the significant performance gap between mobile and desktop GPUs despite their sharing the same model name.
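The ~19.85 GB figure follows from straightforward quantization arithmetic. A minimal sketch (the exact bits-per-weight of the video's quantization format is an assumption; nominal 4-bit formats typically average closer to ~5 bits per weight once scales, zero-points, and higher-precision layers are included):

```python
def quantized_model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate VRAM footprint of a quantized model's weights, in decimal GB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

# 32B parameters at a nominal 4 bits/weight: weights alone come to 16 GB.
base = quantized_model_size_gb(32, 4.0)       # 16.0

# At an effective ~5 bits/weight (typical for mixed 4-bit formats), the
# estimate lands near the ~19.85 GB reported in the video.
realistic = quantized_model_size_gb(32, 5.0)  # 20.0
```

The gap between the 4.0 and 5.0 bits-per-weight estimates is why a "4-bit" 32B model does not come out to exactly 16 GB in practice.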

A key factor behind the laptop GPU’s lower performance is its power limit and thermal constraints. The mobile RTX 5090 has a maximum power limit of 175 watts, compared to the desktop card’s 600 watts. When unplugged, the laptop GPU throttles down to around 55 watts, drastically reducing performance to about 10 tokens per second. In contrast, the desktop GPU runs at around 570 watts during heavy workloads, enabling much higher throughput. The video also demonstrates that reducing the desktop GPU’s power limit to 400 watts only slightly decreases performance, showing the desktop card’s efficiency at lower power levels.
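The power and throughput figures above can be tabulated to show how non-linearly throughput scales with power. All wattage and tokens-per-second numbers are taken from the video; the tokens-per-watt column here is a derived ratio for the 32B model run, and differs from the efficiency figures quoted later, which come from a different test:

```python
# (configuration, watts, tokens/sec) for the 32B model, as reported in the video
runs = [
    ("desktop 5090 @ full power", 570, 62),
    ("laptop 5090 plugged in",    175, 31),
    ("laptop 5090 on battery",     55, 10),
]

for name, watts, tps in runs:
    # Derived ratio: the laptop is slower in absolute terms but extracts
    # more tokens from each watt at these power levels.
    print(f"{name:27s} {watts:4d} W  {tps:3d} tok/s  {tps / watts:.3f} tok/W")
```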

The presenter then compares the laptop RTX 5090 to desktop RTX 5080 and 5060 Ti cards. The RTX 5080, with 16 GB VRAM, cannot fit the 32 billion parameter model entirely in GPU memory, causing some processing to spill over to the CPU and drastically reducing tokens per second to 5.6. However, for smaller 8 billion parameter models, the 5080 outperforms the laptop 5090, achieving 132 tokens per second versus 104. The RTX 5060 Ti, also with 16 GB VRAM, performs worse than the laptop 5090 on the 8 billion parameter model, with 73.8 tokens per second, and struggles even more with the larger model.
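The 5080's collapse to 5.6 tokens per second is the classic fits-vs-spills cliff: once weights no longer fit in VRAM, layers offload to the CPU and throughput drops by an order of magnitude. A minimal sketch of the check involved (the ~2 GB overhead allowance for KV cache, activations, and CUDA context is an assumption, not a figure from the video):

```python
def fits_in_vram(model_gb: float, vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if quantized weights plus runtime overhead (KV cache, activations,
    CUDA context -- assumed ~2 GB here) fit entirely on the GPU."""
    return model_gb + overhead_gb <= vram_gb

# ~19.85 GB quantized 32B model:
fits_in_vram(19.85, 24)  # laptop 5090, 24 GB -> True: runs at full GPU speed
fits_in_vram(19.85, 16)  # RTX 5080, 16 GB -> False: layers spill to the CPU
```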

In terms of efficiency measured by tokens per watt, the laptop RTX 5090 surprisingly outperforms the desktop 5090, achieving 0.87 tokens per watt compared to 0.68. The desktop RTX 5080 also shows good efficiency at 0.94 tokens per watt, while the 5060 Ti lags behind at 0.63. The video notes that while raw speed favors desktop GPUs, efficiency metrics may become more relevant as power-efficient alternatives such as Apple’s M4 and M5 series chips arrive.
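Tokens per watt is simply throughput divided by measured board power, and from the video's numbers you can back out the implied draw during the 8B-model runs. Pairing the 104 tok/s throughput with the 0.87 tok/W efficiency figure is an inference here, not something the video states outright:

```python
def tokens_per_watt(tps: float, watts: float) -> float:
    """Efficiency metric: generated tokens per joule-per-second of board power."""
    return tps / watts

def implied_watts(tps: float, tok_per_watt: float) -> float:
    """Back out the board power implied by a throughput/efficiency pair."""
    return tps / tok_per_watt

# Laptop 5090 on the 8B model: 104 tok/s at 0.87 tok/W implies ~120 W of draw,
# well under its 175 W cap.
print(round(implied_watts(104, 0.87)))
```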

Finally, the presenter mentions plans to test AMD GPUs and high-end professional cards like the RTX Pro 6000 in future videos. They also highlight the importance of considering both speed and power consumption when choosing hardware for running LLMs, especially for users concerned about electricity costs and thermal management. The video concludes with an invitation to join the channel’s community for more in-depth content and thanks a sponsor for supporting the production.