Why AI needs a new kind of supercomputer network — the OpenAI Podcast Ep. 18

In this episode of the OpenAI Podcast, experts discuss the development of Multipath Reliable Connection (MRC), a new supercomputer network technology designed to enhance the reliability, efficiency, and scalability of AI training clusters by rapidly adapting to network failures and balancing traffic across multiple paths. Built on open Ethernet standards and supported by major industry partners, MRC significantly improves AI training speed and stability while reducing costs and power consumption, representing a major advancement in AI infrastructure.

In this episode of the OpenAI Podcast, Andrew Mayne discusses with Mark Handley and Greg Steinbrecher the challenges and innovations involved in building supercomputer networks optimized for AI model training. They explain that traditional data center networks, originally designed for internet traffic with many independent conversations, are ill-suited for the highly synchronized and bandwidth-intensive workloads of large-scale GPU clusters used in AI training. The synchronous nature of these workloads means that any slowdown or failure in one GPU or network link can bottleneck the entire system, making network reliability and efficiency critical.

Mark and Greg describe their backgrounds in physics, quantum computing, and networking research, which led them to focus on optimizing data center networks specifically for AI workloads. They highlight that as AI training clusters scale up to thousands of GPUs, the complexity and number of network components increase dramatically, raising the likelihood of failures. Traditional network protocols, which rely on distributed routing and convergence, are too slow and fragile for these environments. This motivated the development of a new networking approach called Multipath Reliable Connection (MRC), designed to handle congestion and failures more effectively.

MRC works by spreading network traffic across multiple paths to balance load and avoid hotspots, while using a technique called packet trimming to quickly detect and recover from packet loss without ambiguity. This approach allows the network to rapidly adapt to link failures independently at each endpoint, eliminating the delays caused by traditional routing protocol convergence. As a result, the network becomes self-healing and highly resilient, enabling continuous, efficient GPU communication even in the face of frequent hardware issues. This innovation has significantly improved the stability and speed of AI training at OpenAI.

The team emphasizes that MRC is built on open standards, specifically Ethernet, and they are collaborating with major industry partners like Microsoft, NVIDIA, Broadcom, AMD, and Intel to standardize and deploy this technology. By simplifying network design and reducing the number of switches needed, MRC also lowers power consumption and cost, making AI supercomputing more efficient and scalable. OpenAI is committed to making MRC an open standard to benefit the broader AI and technology community, fostering collaboration and accelerating progress.

Finally, the discussion touches on the future of AI infrastructure, noting that while MRC addresses many current challenges, ongoing innovation will be necessary as models grow larger and more complex. They also consider the impracticality of deploying such large-scale training networks in space due to latency and maintenance difficulties, underscoring the importance of terrestrial data centers. Overall, MRC represents a significant breakthrough in supercomputer networking, enabling faster, more reliable AI model training and paving the way for continued advancements in artificial intelligence.