Why are diffusion LLMs so fast?

Diffusion LLMs are much faster than traditional autoregressive models because they generate and refine entire responses in parallel, eliminating the sequential bottleneck of token-by-token generation. Advances in training techniques and inference algorithms further reduce the number of refinement steps needed, making diffusion LLMs highly efficient and likely to become the new standard for text and code generation.

Diffusion large language models (LLMs) are significantly faster than traditional autoregressive models because they generate an entire response draft in parallel and then iteratively refine it, rather than producing text one token at a time. This parallelism removes the sequential bottleneck inherent in autoregressive models, where each new token depends on all the tokens before it and therefore requires its own forward pass. Because every refinement step updates many tokens at once, diffusion LLMs make much better use of the GPU's parallel compute, which leads to much lower inference latency. For example, Mercury Coder, a diffusion LLM by Inception, is already five times faster than similarly sized autoregressive models like Claude Haiku or Gemini Flash, and the performance gap is expected to widen as the technology matures.

The speed advantage of diffusion LLMs depends on two main factors: keeping the number of refinement steps low and ensuring each step is as computationally efficient as a single autoregressive pass. Early diffusion models, such as LLaDA, required up to a thousand refinement steps. That is more forward passes than an autoregressive model needs for a response of a few hundred tokens, which negated much of the parallelism benefit. To address this, researchers have developed training techniques like self-distillation and curriculum learning. Self-distillation trains a model to reproduce the result of its own longer refinement paths using shorter ones, effectively halving the number of steps without sacrificing quality. Curriculum learning, on the other hand, gradually increases task difficulty during training, making the model more robust and efficient at denoising in fewer steps.
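The trade-off between step count and per-step cost can be made concrete with a back-of-the-envelope model. The sketch below is only an illustration: the function names, the `step_cost_ratio` parameter, and the idea of repeating the halving over several distillation rounds are assumptions introduced here, not measurements or a description of any particular model.

```python
def speedup(num_tokens: int, num_steps: int, step_cost_ratio: float = 1.0) -> float:
    """Estimated speedup of diffusion decoding over autoregressive decoding.

    Assumptions (illustrative only):
    - autoregressive decoding needs one forward pass per generated token;
    - diffusion decoding needs one forward pass per refinement step;
    - step_cost_ratio = cost of one diffusion step / cost of one
      autoregressive pass (1.0 means they cost the same).
    """
    if num_tokens <= 0 or num_steps <= 0 or step_cost_ratio <= 0:
        raise ValueError("all arguments must be positive")
    return num_tokens / (num_steps * step_cost_ratio)


def halving_rounds(start_steps: int, target_steps: int) -> int:
    """Rounds needed to reach target_steps if each round halves the step count.

    Assumes the halving can be applied repeatedly, rounding up each time.
    """
    if start_steps <= 0 or target_steps <= 0:
        raise ValueError("step counts must be positive")
    rounds = 0
    steps = start_steps
    while steps > target_steps:
        steps = (steps + 1) // 2  # halve, rounding up
        rounds += 1
    return rounds
```

Under these assumptions, a 512-token response refined in 1000 steps is slower than autoregressive decoding (`speedup(512, 1000)` is below 1), while the same response refined in 32 steps gives `speedup(512, 32) == 16.0`. Getting from 1000 steps to 32 by repeated halving would take `halving_rounds(1000, 32) == 5` rounds.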

Inference algorithms also play a crucial role in diffusion LLM efficiency. Unlike autoregressive models, diffusion models can use various sampling strategies to decide which tokens to refine at each step. Simple random remasking matches the training corruption process but is suboptimal, because it is as likely to discard a token the model predicted well as one it got wrong. More advanced methods use confidence scores to selectively remask only the uncertain tokens, and guided diffusion introduces a lightweight autoregressive supervisor to catch global inconsistencies, such as repeated words, that parallel decoding can miss. These smarter sampling techniques further reduce the number of necessary refinement steps, boosting overall speed.
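A minimal sketch of confidence-based remasking follows. It is a toy, not any specific model's sampler: `confidence_decode`, `toy_predict`, the `MASK` sentinel, and the even per-step schedule are all hypothetical names and choices made for illustration. At each step the denoiser proposes a token and a confidence for every position; the most confident proposals are committed and the rest stay masked for the next step.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # hypothetical sentinel for a masked position

Prediction = Tuple[str, float]  # (proposed token, confidence in [0, 1])
Tokens = List[Optional[str]]


def confidence_decode(
    predict: Callable[[Tokens], List[Prediction]],
    length: int,
    num_steps: int,
) -> Tokens:
    """Fill `length` masked positions in at most `num_steps` refinement steps."""
    tokens: Tokens = [MASK] * length
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(tokens) if tok is MASK]
        if not masked:
            break
        preds = predict(tokens)  # one parallel pass over all positions
        # Commit an even share of the still-masked positions per remaining
        # step (ceil division), so the final step always fills everything.
        remaining_steps = num_steps - step
        keep = -(-len(masked) // remaining_steps)
        # Rank masked positions from most to least confident.
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:keep]:
            tokens[i] = preds[i][0]
        # Positions not committed stay masked, i.e. they are "remasked".
    return tokens


def toy_predict(tokens: Tokens) -> List[Prediction]:
    """Stand-in denoiser: more confident when a neighbour is already known."""
    target = ["the", "quick", "brown", "fox"]
    preds = []
    for i, tok in enumerate(target):
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(target)]
        known = sum(tokens[j] is not MASK for j in neighbours)
        preds.append((tok, 0.5 + 0.25 * known))
    return preds
```

Calling `confidence_decode(toy_predict, 4, 2)` fills all four positions in two passes instead of four. A real sampler would take the confidence from the model's output distribution rather than from a hand-written rule.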

Optimizing each diffusion step is another area of active research. Traditional transformer optimizations such as fast attention kernels can be reused. Key-value (KV) caching, a major speedup for autoregressive models, is harder to apply: diffusion LLMs use bidirectional attention and the response tokens change from step to step, so cached keys and values quickly go stale. However, approximate caching strategies, such as caching prompt embeddings or using delayed response caching, can still provide some speedup. The most precise solution is block diffusion, a hybrid approach where the context window is divided into blocks that are generated sequentially, allowing exact KV caching of completed blocks and enabling variable-length generation.
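The control flow of block-wise generation can be sketched as follows. This is a toy under stated assumptions, not the block diffusion implementation: `generate_blockwise`, the `MASK` and `EOS` strings, and the `denoise` and `encode_kv` callables are hypothetical stand-ins. The point it shows is that once a block is finished its tokens never change, so their cache entries are computed exactly once, and generation can stop as soon as a block contains an end-of-sequence token.

```python
from typing import Callable, List, Tuple

MASK = "<mask>"  # hypothetical mask token
EOS = "<eos>"    # hypothetical end-of-sequence token

KV = Tuple[str]  # stand-in for a cached key/value entry


def generate_blockwise(
    prompt: List[str],
    block_size: int,
    steps_per_block: int,
    max_blocks: int,
    denoise: Callable[[List[KV], List[str], int], List[str]],
    encode_kv: Callable[[str], KV],
) -> List[str]:
    """Generate blocks left to right; refine each block with diffusion steps."""
    # Prompt entries are computed once and never invalidated.
    cache: List[KV] = [encode_kv(tok) for tok in prompt]
    output: List[str] = []
    for _ in range(max_blocks):
        block = [MASK] * block_size
        # Only the current block changes during these steps.
        for step in range(steps_per_block):
            block = denoise(cache, block, step)
        # The block is final now, so its cache entries are exact.
        cache.extend(encode_kv(tok) for tok in block)
        output.extend(block)
        if EOS in block:  # variable-length generation: stop early
            break
    return output


def make_toy_denoiser(prompt_len: int, script: List[str]):
    """Toy denoiser that reveals one scripted token per refinement step."""
    def denoise(cache: List[KV], block: List[str], step: int) -> List[str]:
        start = len(cache) - prompt_len  # tokens generated in earlier blocks
        out = list(block)
        if step < len(out):
            idx = start + step
            out[step] = script[idx] if idx < len(script) else EOS
        return out
    return denoise


def make_counting_encoder():
    """Toy cache encoder that records every token it is asked to encode."""
    calls: List[str] = []
    def encode_kv(tok: str) -> KV:
        calls.append(tok)
        return (tok,)
    return encode_kv, calls
```

With a two-token prompt, `block_size=2`, `steps_per_block=2`, and the script `["a", "b", "c", EOS]`, the loop stops after the second block and the encoder is called six times in total: once per prompt token and once per generated token.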

Overall, diffusion LLMs are poised to replace autoregressive models for many applications due to their superior scalability and efficiency at inference time. The technology is still evolving, with ongoing research into better training and inference techniques, as well as custom inference engines tailored for diffusion models. Open-source diffusion LLMs are available on platforms like Hugging Face, and commercial APIs such as Inception’s Mercury models offer production-ready endpoints. As these models continue to improve, they are expected to become the dominant paradigm for both text and code generation tasks.