Building Production Inference for Trillion-Parameter Models on AMD Instinct MI355X

merefield · 14 May 2026 16:53

Zyra, whose AI engineering is led by Quinton Anthony, has built an AI inference cloud on AMD Instinct MI355X GPUs that serves trillion-parameter models from a single node, which improves throughput and reduces latency for agentic AI workflows. The team combines deep hardware expertise with software techniques, including novel parallelism and quantization methods, to offer scalable, high-accuracy inference services integrated with reinforcement learning, and the talk positions Zyra as a leader in large-scale AI solutions.

merefield · 14 May 2026 17:13

Quinton Anthony, who leads AI engineering at Zyra, introduces the company’s new inference cloud product built on AMD hardware. Zyra is a full-stack open superintelligence company that offers bare-metal GPUs and its own proprietary models, trained entirely on AMD compute, alongside popular open-source models like DeepSeek and GLM. These models back the Maya agent, which handles diverse AI tasks by drawing on a variety of models served through the Zephyr cloud platform. The focus is on delivering optimized inference services, compute resources, and reinforcement learning environments, all powered by AMD technology.

Quinton emphasizes Zyra’s deep expertise with AMD hardware, which comes from pre-training an 8-billion-parameter open-source model on a large AMD GPU cluster. That experience let the team optimize kernels and parallelism schemes specifically for AMD GPUs, particularly the MI355X. Their approach is to serve a select few large models with high efficiency rather than many models with average performance. This strategy plays to AMD’s strengths, namely large VRAM capacity, high memory bandwidth, and flexible quantization options, which matter for models with massive memory requirements and long context lengths, both of which are critical for agentic AI workflows.

A key advantage of AMD GPUs, according to the talk, is that a large model like Kimi K2.6 (around one trillion parameters) fits within a single node. This avoids multi-node communication, which adds latency and reduces throughput, and it lets each node serve more users, improving cost efficiency and user experience. In the synthetic benchmarks presented, MI355X GPUs support nearly double the number of agents compared to competing hardware, which makes them well suited to long-running, multi-turn inference workloads that demand low latency and large KV caches.
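
To make the single-node argument concrete, here is a rough back-of-the-envelope sketch in Python. It is not from the talk. The 288 GB of HBM per GPU and the 8 GPUs per node are assumptions based on AMD's published MI355X platform specs, the 10% runtime reserve is an arbitrary placeholder, and `node_budget` is a hypothetical helper name. It ignores quantization scale factors and activation memory, so treat the numbers as an upper bound on what is left for KV cache.

```python
# Rough memory budget for serving a large model on one node.
# All hardware figures are assumptions for illustration, not measurements.

GB = 10**9  # decimal gigabytes, as used on vendor spec sheets


def weight_bytes(num_params: float, bits_per_param: float) -> float:
    """Bytes needed to store the weights alone at a given precision."""
    return num_params * bits_per_param / 8


def node_budget(num_params: float, bits_per_param: float,
                gpus_per_node: int = 8,        # assumed node size
                hbm_gb_per_gpu: float = 288,   # assumed HBM capacity per GPU
                reserve_fraction: float = 0.10):  # assumed runtime overhead
    """Return how much memory the weights take and what is left for KV cache."""
    total = gpus_per_node * hbm_gb_per_gpu * GB
    usable = total * (1 - reserve_fraction)
    weights = weight_bytes(num_params, bits_per_param)
    return {
        "total_gb": total / GB,
        "weights_gb": weights / GB,
        "kv_budget_gb": (usable - weights) / GB,
        "fits": weights <= usable,
    }


if __name__ == "__main__":
    for bits in (16, 8, 4):
        b = node_budget(1e12, bits)
        print(f"{bits:>2}-bit: weights {b['weights_gb']:.0f} GB, "
              f"KV budget {b['kv_budget_gb']:.0f} GB, fits={b['fits']}")
```

Under these assumptions, lower-precision weights leave most of the node's memory free for KV cache, which is the resource that bounds how many long-context agent sessions a node can hold at once.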

Zyra’s strength lies not only in hardware but also in software innovations informed by their training experience. They have developed novel parallelism schemes such as tree attention and folded tensor-sequence parallelism to optimize communication and computation across GPUs. Their kernel development benefits both training and inference, creating a feedback loop that improves model quality and efficiency. Their expertise in quantization and speculative decoding also lets them deploy these advanced inference techniques while preserving accuracy and model fidelity.
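
The talk summary does not describe how tree attention is implemented, so the following is only a minimal sketch of the general idea that sequence-sharded attention schemes rely on: each device computes attention over its own slice of keys and values, and the partial results are combined with an associative merge, which is what allows a tree-shaped reduction. The function names are hypothetical, scores are unscaled dot products for brevity, and plain Python lists stand in for tensors on separate GPUs.

```python
import math


def partial_attention(q, keys, values):
    """Attention of one query over one shard of keys/values.

    Returns (m, s, acc): the max score in the shard, the sum of
    exp(score - m), and the exp-weighted sum of value vectors.
    """
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(sc - m) for sc in scores]
    s = sum(weights)
    dim = len(values[0])
    acc = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, s, acc


def merge(a, b):
    """Associative combine of two partial results (rescale to a shared max)."""
    m_a, s_a, acc_a = a
    m_b, s_b, acc_b = b
    m = max(m_a, m_b)
    c_a, c_b = math.exp(m_a - m), math.exp(m_b - m)
    return m, s_a * c_a + s_b * c_b, [x * c_a + y * c_b for x, y in zip(acc_a, acc_b)]


def tree_reduce(states):
    """Combine shard results pairwise, halving the list each round."""
    while len(states) > 1:
        nxt = [merge(states[i], states[i + 1]) for i in range(0, len(states) - 1, 2)]
        if len(states) % 2:  # odd shard out is carried to the next round
            nxt.append(states[-1])
        states = nxt
    return states[0]


def finalize(state):
    """Normalize the accumulated values by the softmax denominator."""
    _, s, acc = state
    return [x / s for x in acc]


if __name__ == "__main__":
    q = [0.5, -1.0, 2.0]
    keys = [[0.1 * i, (i % 3) - 1.0, 0.3 * i - 1.0] for i in range(8)]
    values = [[float(i), float(-i)] for i in range(8)]
    shards = [partial_attention(q, keys[i:i + 2], values[i:i + 2]) for i in range(0, 8, 2)]
    print(finalize(tree_reduce(shards)))
    print(finalize(partial_attention(q, keys, values)))  # single-device reference
```

Because `merge` is associative, the reduction depth grows with the logarithm of the shard count rather than linearly, which is the property that makes this pattern attractive when the KV cache for a long context is spread across many GPUs.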

Looking ahead, Zyra plans to continue enhancing their inference cloud with newer models like DeepSeek V4, larger and more accurate speculative decoding models, and novel quantization methods. These improvements will support their broader vision of integrating inference and reinforcement learning products, providing scalable, high-performance AI services powered by AMD GPUs. Quinton concludes by emphasizing the synergy between their training and inference efforts, positioning Zyra as a leader in delivering efficient, large-scale AI inference solutions.
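
For readers unfamiliar with speculative decoding, the sketch below shows the greedy variant in Python. It is a toy under stated assumptions, not Zyra's implementation: `target` and `draft` are hypothetical stand-ins for a large model and a small draft model, each mapping a token prefix to its next greedy token. It also shows why draft accuracy matters: the output always matches what the target model alone would produce, and a better draft only changes how many tokens are accepted per verification step.

```python
from typing import Callable, List

# A "model" here is any function from a token prefix to the next token.
Model = Callable[[List[int]], int]


def greedy_decode(model: Model, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # The draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # The target checks each proposed position. A real system scores all
        # positions in one batched forward pass; here it is a simple loop.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if expected != tok:
                # Keep the agreed prefix, then take the target's own token.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            out.extend(proposal)  # every proposed token was accepted
    return out


if __name__ == "__main__":
    def target(prefix: List[int]) -> int:
        return (sum(prefix) * 31 + len(prefix)) % 50

    def draft(prefix: List[int]) -> int:
        # Agrees with the target except at every third position.
        return target(prefix) if len(prefix) % 3 else (target(prefix) + 1) % 50

    same = speculative_decode(target, draft, [1, 2, 3], 20) == greedy_decode(target, [1, 2, 3], 20)
    print("matches target-only decoding:", same)
```

Production systems typically also take a bonus token from the target when a whole proposal is accepted and use a probabilistic acceptance rule for sampled decoding. Both are omitted here to keep the sketch short.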