Jetson Thor LLM Performance Gains - Up to 3.3x Faster!

In the video, Gary Sims highlights Nvidia’s recent software update for the Jetson Thor platform, which significantly enhances large language model inference performance: up to 3.3x speed improvements with the vLLM engine, especially when running large models with multiple concurrent users. He emphasizes that the combination of powerful hardware and ongoing software optimization is key to maximizing generative AI capabilities on the Jetson Thor.

Gary opens by discussing recent performance improvements to vLLM, the large language model inference engine, as optimized for the Jetson Thor platform. Nvidia has released a software update that enhances generative AI performance by incorporating features such as FlashAttention support, xFormers integration, and other optimizations. These improvements demonstrate Nvidia’s commitment to delivering not only powerful hardware but also frequent software enhancements that maximize the capabilities of its devices. The update arrives just weeks after the initial release of the Jetson Thor and its software stack, underscoring Nvidia’s ongoing optimization efforts.

Gary provides a brief overview of the Jetson Thor development kit, which includes the Nvidia Jetson T5000 module built on the Blackwell GPU architecture. The system has 2,560 CUDA cores, 96 fifth-generation Tensor Cores, 14 Arm Neoverse V3 CPU cores, 128 GB of RAM, and 1 TB of storage. It also offers extensive connectivity, including Wi-Fi 6, 5 Gb Ethernet, USB, DisplayPort, and HDMI. Power consumption ranges from 40 to 130 watts depending on the configuration. This hardware makes the Jetson Thor well suited to running large language models and serving multiple users simultaneously.

vLLM stands out as an open-source, high-throughput inference server designed to optimize large language model serving. Unlike single-user engines, vLLM handles multiple users concurrently, employing techniques such as PagedAttention and continuous batching to manage GPU memory efficiently (see the sketch below). The result is higher aggregate throughput for generative AI applications, making it well suited for deployment on the Jetson Thor platform. Gary emphasizes that this multi-user capability is a key advantage of vLLM.
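Gary doesn’t show any code in the video, but vLLM’s Python API makes the batching behaviour easy to see. Here is a minimal sketch, assuming a Hugging Face model that fits in the Jetson Thor’s memory (the model name is illustrative, not the one used in the video):

```python
from vllm import LLM, SamplingParams

# Load a model; vLLM allocates GPU memory in pages (PagedAttention)
# rather than reserving one large contiguous KV cache per request.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

sampling = SamplingParams(temperature=0.7, max_tokens=128)

# These prompts are scheduled together via continuous batching,
# which is why aggregate throughput scales with concurrent requests.
prompts = [
    "Explain what an inference server does.",
    "Summarize the Blackwell GPU architecture in one sentence.",
    "What is PagedAttention?",
]

outputs = llm.generate(prompts, sampling)
for out in outputs:
    print(out.outputs[0].text)
```

For serving many users over HTTP, as in Gary’s tests, vLLM also ships an OpenAI-compatible server, typically started with `vllm serve <model>`.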

Gary shares benchmark results from his own testing on the Jetson Thor. For the Llama 3 70-billion parameter model, which requires 43 GB of RAM (implying a quantized build, since the FP16 weights alone would occupy roughly 140 GB), a single connection achieved 4.7 tokens per second with the original software. After the update, this improved to 6.2 tokens per second, a 1.3x speed-up. For the smaller Llama 3 8-billion parameter model running with eight concurrent connections, aggregate throughput rose from 145 to 210 tokens per second, a 1.4x improvement. The most significant gain came from the 70-billion parameter model with eight simultaneous connections, where performance jumped from 12.5 to 41.2 tokens per second, a 3.3x increase.
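The video doesn’t include the benchmark script, but aggregate tokens-per-second across concurrent connections can be measured against a vLLM OpenAI-compatible endpoint roughly as follows (the URL, model name, prompt, and token budget here are all assumptions, not values from the video):

```python
import time
import concurrent.futures
import requests

# Assumed local vLLM OpenAI-compatible endpoint; adjust to your setup.
URL = "http://localhost:8000/v1/completions"
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"
CONNECTIONS = 8       # simultaneous clients, as in the video's multi-user test
MAX_TOKENS = 256

def one_request(prompt: str) -> int:
    """Send one completion request and return the number of tokens generated."""
    resp = requests.post(URL, json={
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": MAX_TOKENS,
    }, timeout=600)
    resp.raise_for_status()
    return resp.json()["usage"]["completion_tokens"]

start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as pool:
    futures = [pool.submit(one_request, "Write a short story about a robot.")
               for _ in range(CONNECTIONS)]
    total_tokens = sum(f.result() for f in futures)
elapsed = time.time() - start

print(f"{total_tokens} tokens in {elapsed:.1f}s "
      f"= {total_tokens / elapsed:.1f} tokens/s aggregate")
```

Dividing the total generated tokens by wall-clock time gives the aggregate throughput figure that makes the 12.5 versus 41.2 tokens-per-second comparison meaningful.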

In conclusion, the video highlights how Nvidia’s software optimizations for the Jetson Thor platform significantly boost the performance of large language model inference, especially when handling large models and multiple concurrent users. These improvements underscore the importance of software in complementing powerful hardware to deliver the best AI performance. Gary encourages viewers to check out his previous review of the Jetson Thor development kit and invites them to like the video if they found the information useful.