The video introduces diffusion language models (diffusion LMs) as a groundbreaking architecture that enables much faster token generation—exceeding 1,000 tokens per second—by iteratively denoising masked sequences in parallel, offering significant advantages in coding and multimodal AI tasks. While still facing challenges in reasoning and optimization, diffusion LMs like Mercury Coder and Google’s Gemini Diffusion show strong potential to revolutionize AI generation speed and capabilities, marking a promising shift away from traditional autoregressive models.
The video discusses a significant challenge in AI applications today: the long wait times for generating responses, especially when using large language models (LLMs) in agentic AI systems. Traditional solutions like using smaller models or better hardware are limited by accessibility and cost. However, a promising new architecture called diffusion language models (diffusion LMs) is emerging, potentially revolutionizing generation speed. Unlike conventional autoregressive models that generate tokens sequentially from left to right, diffusion LMs start with a fully masked sequence and iteratively denoise it, predicting multiple tokens in parallel. This approach allows for much faster token generation, with some models achieving speeds exceeding 1,000 tokens per second, far surpassing current autoregressive models.
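To make the mask-and-denoise loop concrete, here is a minimal toy sketch of the generation process described above. It is not from the video and not a real model: `toy_denoise_step` is a hypothetical stand-in for a transformer forward pass, returning random predictions with confidence scores. The point is the control flow, which is where the speedup comes from: instead of one token per forward pass, each denoising step commits several tokens at once.

```python
import random

MASK = "<mask>"

def toy_denoise_step(tokens):
    """Stand-in for a diffusion LM's denoiser: for every masked position,
    return a (prediction, confidence) pair. A real model would run a full
    transformer forward pass over the whole sequence here."""
    vocab = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "+"]
    return {
        i: (random.choice(vocab), random.random())
        for i, tok in enumerate(tokens)
        if tok == MASK
    }

def generate(seq_len=10, tokens_per_step=3):
    """Start from a fully masked sequence, then repeatedly commit the
    most confident predictions in parallel until nothing is masked."""
    tokens = [MASK] * seq_len
    steps = 0
    while MASK in tokens:
        predictions = toy_denoise_step(tokens)
        # Keep only the top-k most confident predictions this round;
        # the rest stay masked and are re-predicted next step.
        best = sorted(predictions.items(), key=lambda kv: -kv[1][1])
        for i, (tok, _conf) in best[:tokens_per_step]:
            tokens[i] = tok
        steps += 1
    return tokens, steps

tokens, steps = generate()
print(f"Generated {len(tokens)} tokens in {steps} denoising steps")
```

With 10 positions and 3 commits per step, the whole sequence resolves in 4 denoising steps rather than 10 sequential ones; real diffusion LMs tune how many tokens to commit per step, trading speed against quality.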
The video highlights two key commercial diffusion LMs: Inception Labs’ Mercury Coder and Google DeepMind’s Gemini Diffusion. Mercury Coder, released earlier, already achieves impressive speeds and excels in coding tasks, particularly in fill-in-the-middle benchmarks. However, it ranks only moderately in broader code completion benchmarks, partly due to the nature of the datasets and the strength of the competition. Google’s Gemini Diffusion, announced more recently, doubles Mercury’s speed and maintains a 32K-token context window. While Gemini Diffusion performs slightly worse than its autoregressive counterpart on language and reasoning tasks, it matches or exceeds it on coding benchmarks, indicating diffusion LMs’ strong potential in specialized domains.
The video also explores the technical underpinnings of diffusion LMs, explaining that they rely on transformer architectures but differ fundamentally in their generation process. Instead of generating tokens sequentially, diffusion LMs use a mask-and-denoise strategy that allows bidirectional or global context awareness. This makes them particularly suited for structured and logical tasks, such as coding or infilling text. Despite their speed advantages, diffusion LMs still face challenges in reasoning tasks and require further optimization, especially in handling key-value caching and balancing compute resources.
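The bidirectional-context point can be illustrated with a small sketch (my own illustration, not code from the video) comparing the attention masks of the two architectures. In an autoregressive model, position i may only attend to positions at or before i; in a diffusion LM, every position attends to the whole sequence, which is why a masked token in the middle of a function can use context on both sides when infilling.

```python
def causal_mask(n):
    """Autoregressive attention: position i may only attend to j <= i,
    so generation must proceed strictly left to right."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Diffusion-style attention: every position attends to every other,
    so a masked token in the middle sees context on both sides."""
    return [[True] * n for _ in range(n)]

n = 4
causal_pairs = sum(sum(row) for row in causal_mask(n))
bidi_pairs = sum(sum(row) for row in bidirectional_mask(n))
print(f"causal: {causal_pairs} allowed pairs, bidirectional: {bidi_pairs}")
```

This global view is also why key-value caching, which autoregressive decoders exploit because past positions never change, does not carry over directly: in a diffusion LM every position's representation can change at each denoising step.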
A particularly exciting frontier for diffusion LMs is multimodal AI, where models process both text and images. Recent research papers like Lavida and Mada demonstrate how diffusion LMs can unify vision and language understanding more naturally than autoregressive models. By attending to both image and text tokens bidirectionally and using a shared mask prediction process, these models can perform complex tasks like structured information extraction and text-to-image generation more efficiently and coherently. This native multimodal diffusion approach offers advantages in global context integration, faster generation, and better structural control, potentially outperforming existing multimodal autoregressive models.
In conclusion, diffusion language models represent a promising new era in AI model architecture, offering significant speed improvements and new capabilities, especially in coding and multimodal tasks. While they are not yet fully competitive with autoregressive models in all areas, ongoing research and optimization could close this gap. The video suggests that diffusion LMs might redefine how AI systems generate and process information, making them a key area to watch in the near future. The creator also encourages viewers to follow their newsletter and upcoming videos for deeper technical insights into diffusion LMs and their evolving landscape.