Introduction to Primus

AMD’s Primus is a software solution within the ROCm ecosystem that optimizes large-scale AI model training through hardware–software co-design, advanced parallelism strategies, and high-performance kernel enhancements. Its components — Primus LM, Primus Turbo, and SAFE — target efficient, stable, and scalable training across thousands of GPUs, supporting popular models and providing tools for performance monitoring, fault tolerance, and a smooth developer experience.

In this presentation, Zenu introduces Primus, AMD’s software solution for training large-scale AI models efficiently and at scale. AMD powers some of the world’s top supercomputers — 60% of the top 10 superclusters use AMD GPUs — with an emphasis on both performance and energy efficiency. Primus aims to optimize the entire training pipeline, from pre-training to reinforcement learning, by co-designing hardware and software across communication, storage, and compute to deliver high performance and stability in large-scale AI training.

Primus consists of three main components within the ROCm software ecosystem: Primus LM for efficient large language model training, Primus Turbo, a collection of high-performance kernels, and SAFE, which ensures training stability and fault tolerance through observability and error recovery. The solution supports popular open-source model families such as LLaMA and DeepSeek, and provides tools for profiling, debugging, and performance monitoring to help developers optimize memory utilization and scaling across multiple GPUs and clusters.

The software stack offers comprehensive parallelism strategies including tensor parallelism, pipeline parallelism, context parallelism, and expert parallelism to handle both dense and mixture-of-experts (MoE) models. Primus also introduces advanced techniques such as asynchronous tensor parallelism and efficient pipeline balancing to reduce communication overhead and improve GPU utilization. Experiments demonstrate near-linear strong scaling up to tens of thousands of GPUs, showcasing Primus’s ability to handle extremely large models with minimal performance degradation.
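The core idea behind tensor parallelism, one of the strategies listed above, can be sketched in a few lines. This is a conceptual illustration using NumPy, not Primus’s actual API: a linear layer’s weight matrix is split column-wise across devices, each device computes a partial output, and the partial outputs are combined (an all-gather in a real multi-GPU system).

```python
import numpy as np

# Conceptual sketch of column-wise tensor parallelism (illustrative only).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))         # a batch of activations
W = rng.standard_normal((8, 16))        # full weight matrix of a linear layer

n_shards = 4                            # stand-in for 4 GPUs
shards = np.split(W, n_shards, axis=1)  # column-parallel split of the weights

partials = [x @ w for w in shards]      # each "GPU" computes its output slice
y_parallel = np.concatenate(partials, axis=1)  # all-gather of partial outputs

y_full = x @ W                          # single-device reference
assert np.allclose(y_parallel, y_full)  # sharded result matches the full matmul
```

The trade-off this sketch hides is the communication step: the concatenation is free here, but on real hardware it is a collective operation, which is why techniques like the asynchronous tensor parallelism mentioned above focus on overlapping that communication with compute.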

Primus Turbo centralizes AMD’s high-performance kernel optimizations, supporting PyTorch and expanding to JAX, with customizations for operators such as DeepSeek-style MoE layers and low-precision floating-point training. It addresses communication bottlenecks in MoE models through operator-level aggregation and efficient communication algorithms, enabling strong scaling on large clusters of GPUs such as the MI300. This kernel-level work complements the overall goal of maximizing training throughput and efficiency.
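To make the operator-level aggregation idea concrete, here is a pure-Python sketch (illustrative, not Primus Turbo’s implementation) of why it reduces MoE communication overhead: instead of dispatching one message per token, tokens routed to the same expert are bucketed together, so each source rank sends at most one buffer per destination expert — the pattern a fused all-to-all exploits.

```python
from collections import defaultdict

def naive_dispatch(tokens, expert_of):
    # One "message" per token: fine-grained and communication-heavy.
    return [(expert_of[i], [t]) for i, t in enumerate(tokens)]

def aggregated_dispatch(tokens, expert_of):
    # Group tokens by destination expert: one buffer per expert,
    # which a single all-to-all collective can then move in bulk.
    buckets = defaultdict(list)
    for i, t in enumerate(tokens):
        buckets[expert_of[i]].append(t)
    return sorted(buckets.items())

tokens = ["t0", "t1", "t2", "t3", "t4", "t5"]
expert_of = [1, 0, 1, 2, 0, 1]  # hypothetical router output: token -> expert id

naive = naive_dispatch(tokens, expert_of)
agg = aggregated_dispatch(tokens, expert_of)
assert len(naive) == 6   # six per-token messages
assert len(agg) == 3     # collapsed to three per-expert buffers
```

The same total data moves in both cases; what aggregation buys is fewer, larger transfers, which is what keeps interconnect utilization high at large expert counts.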

Finally, the SAFE component provides a three-phase stability architecture: pre-flight checks to validate cluster health, in-flight real-time monitoring to detect anomalies during training, and post-flight root cause analysis to identify hardware or software failures. AMD also trains and open-sources its own model family, including language and vision-language models, ensuring full compatibility and performance on AMD hardware and software. Primus offers a comprehensive, out-of-the-box training experience with continuous updates and support for a wide range of models and frameworks, empowering developers to train AI models at scale efficiently and reliably.
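The three-phase flow described above can be sketched as a small harness. All names here are illustrative assumptions, not SAFE’s real API: a pre-flight stage that gates training on cluster health checks, and an in-flight stage that flags anomalies such as sudden loss spikes for later root-cause analysis.

```python
def pre_flight(checks):
    """Return the names of failed health checks; training starts only if empty."""
    return [name for name, ok in checks.items() if not ok]

def in_flight(losses, spike_factor=10.0):
    """Flag step indices where the loss spikes relative to the previous step."""
    anomalies = []
    for i in range(1, len(losses)):
        if losses[i] > losses[i - 1] * spike_factor:
            anomalies.append(i)
    return anomalies

# Hypothetical pre-flight results for one node (names are made up).
checks = {"gpu_visible": True, "nic_bandwidth": True, "disk_space": False}
assert pre_flight(checks) == ["disk_space"]  # would block the launch

# In-flight monitoring over a per-step loss curve: step 3 spikes.
losses = [2.1, 2.0, 1.9, 45.0, 1.8]
assert in_flight(losses) == [3]
```

In a real system each flagged anomaly would carry enough context (rank, step, kernel, node) to drive the post-flight phase, where logs and hardware telemetry are correlated to attribute the failure to a specific GPU, link, or software component.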