Building Generative Image & Video Models at Scale - Sander Dieleman, Google DeepMind

Sander Dieleman from Google DeepMind discusses the development of scalable generative image and video diffusion models, emphasizing the importance of data curation, latent representations, advanced architectures like transformers, and sophisticated training and sampling techniques including guidance and distillation. He also highlights the integration of diverse control signals beyond text prompts to enhance model versatility and practical applicability in generating high-quality audiovisual content.

Sander Dieleman, a research scientist at Google DeepMind, presents an in-depth overview of building generative image and video models at scale, focusing primarily on diffusion models. He begins by emphasizing the critical role of data curation in training high-quality models: although research incentives traditionally favor standardized benchmark datasets, time invested in improving and understanding the data often yields larger gains than tweaking model architectures or optimizers. He then turns to data representation, explaining that while early diffusion models operated directly on pixels, modern approaches compress images and videos into learned latent representations via autoencoders. This compression reduces memory requirements substantially while preserving essential spatial structure, enabling scalable training on high-resolution and long-duration audiovisual data.
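
The compression arithmetic is easy to make concrete. Below is a minimal JAX sketch of latent encoding; the 8x spatial downsampling factor, the 8 latent channels, and the `encode` stand-in (simple average pooling) are all illustrative assumptions rather than details from the talk.

```python
# Sketch: working in a learned latent space instead of pixel space.
# The encoder here is a fake stand-in (average pooling) that only
# illustrates the shape bookkeeping; a real system would use a trained
# autoencoder. Downsampling factor and channel count are assumptions.
import jax
import jax.numpy as jnp

def encode(pixels):
    """Hypothetical encoder: 8x spatial downsampling, 8 latent channels."""
    b, h, w, c = pixels.shape
    pooled = pixels.reshape(b, h // 8, 8, w // 8, 8, c).mean(axis=(2, 4))
    # Pad channels from 3 up to 8 to mimic a typical latent channel count.
    return jnp.concatenate([pooled] * 3, axis=-1)[..., :8]

images = jax.random.uniform(jax.random.PRNGKey(0), (4, 1024, 1024, 3))
latents = encode(images)
print(latents.shape)                # (4, 128, 128, 8): spatial structure kept
print(images.size / latents.size)  # ~24x fewer values to denoise
```

Every denoising step then operates on a tensor roughly 24 times smaller (under these assumed factors), which is where most of the memory savings for high-resolution images and long videos come from.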

Dieleman explains the core mechanism of diffusion models as an iterative denoising process. Starting from pure noise, the model repeatedly predicts a cleaner version of the sample, gradually refining it until it resembles data from the target distribution. He illustrates this with a two-dimensional analogy and highlights the importance of adding a small amount of noise back after each denoising step to prevent errors from accumulating. He further elaborates on the spectral properties of images, showing that diffusion models effectively perform a form of spectral autoregression, reconstructing images from coarse to fine detail; this ordering aligns well with human perception and makes it natural to weight different frequency bands during training.
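
The denoising loop itself is compact. The following sketch uses toy stand-ins throughout: `denoise` is a placeholder for the trained network, and the geometric noise schedule and step count are arbitrary choices, not the talk's actual settings.

```python
# Minimal ancestral-sampling loop: denoise step by step, re-injecting a
# small amount of noise after each step so errors do not accumulate.
import jax
import jax.numpy as jnp

def denoise(x, sigma):
    """Stand-in for the learned denoiser (toy shrinkage, not a real model)."""
    return x / (1.0 + sigma ** 2)

def sample(key, shape, num_steps=50, sigma_max=80.0, sigma_min=0.01):
    sigmas = jnp.geomspace(sigma_max, sigma_min, num_steps)  # coarse -> fine
    key, sub = jax.random.split(key)
    x = sigma_max * jax.random.normal(sub, shape)  # start from pure noise
    for i in range(num_steps - 1):
        x_clean = denoise(x, sigmas[i])  # predict a cleaner sample
        key, sub = jax.random.split(key)
        # Step toward the prediction, then add back noise at the next
        # (lower) level instead of trusting the prediction completely.
        x = x_clean + sigmas[i + 1] * jax.random.normal(sub, shape)
    return denoise(x, sigmas[-1])

print(sample(jax.random.PRNGKey(0), (1, 128, 128, 8)).shape)
```

Because the early, high-noise steps only pin down coarse structure while later steps fill in fine detail, this loop exhibits exactly the coarse-to-fine, spectral-autoregression behavior described above.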

Regarding model architecture, Dieleman notes that early diffusion models used convolutional U-Nets, but transformers have since become the backbone of choice thanks to their scalability and expressiveness, and because they inherit the extensive scaling research done for language models. For video generation, he describes a hybrid approach that is autoregressive along the temporal dimension while generating individual frames with diffusion, balancing efficiency and quality. Training these large models requires sophisticated parallelism strategies, including data and model parallelism, often implemented in JAX, which is well suited to TPU hardware and large-scale distributed training.
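
As a concrete (and deliberately simplified) illustration of the data-parallel part, here is a sketch using `jax.pmap`; the linear model, loss, and learning rate are placeholders, and real systems combine this with model parallelism and more elaborate sharding.

```python
# Data-parallel training step in JAX: each device computes gradients on
# its batch shard, then gradients are averaged across devices.
import functools
import jax
import jax.numpy as jnp

def loss_fn(params, batch):
    # Placeholder regression loss; a diffusion loss would noise the
    # latents and regress the model's prediction against the target.
    pred = batch["x"] @ params["w"]
    return jnp.mean((pred - batch["y"]) ** 2)

@functools.partial(jax.pmap, axis_name="devices")
def train_step(params, batch):
    loss, grads = jax.value_and_grad(loss_fn)(params, batch)
    grads = jax.lax.pmean(grads, axis_name="devices")  # all-reduce gradients
    params = jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)
    return params, loss

n = jax.local_device_count()
# Replicate parameters onto every device; shard the batch along axis 0.
params = jax.tree_util.tree_map(lambda x: jnp.stack([x] * n),
                                {"w": jnp.zeros((16, 4))})
batch = {"x": jnp.ones((n, 32, 16)), "y": jnp.ones((n, 32, 4))}
params, loss = train_step(params, batch)
print(loss)  # one loss value per device shard
```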

Sampling from diffusion models involves a trade-off between stochastic and deterministic methods: stochastic samplers re-inject noise at each step and can correct accumulated errors, while deterministic samplers follow a fixed trajectory, which makes them easier to distill into few-step models. Dieleman highlights guidance, a powerful technique that steers the generative process using conditioning signals such as text prompts. By amplifying the difference between the model's conditional and unconditional predictions, guidance improves sample quality and prompt adherence at the cost of diversity, enabling models to produce highly relevant and detailed outputs. He also discusses distillation methods that reduce the number of sampling steps, such as consistency models, which learn to predict the final output in far fewer passes, speeding up generation with little loss in quality.
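
The guidance mechanism is simple to write down. Here is a hedged sketch of classifier-free guidance; `toy_model` and the guidance scale of 3.0 are illustrative assumptions, not details from the talk.

```python
# Classifier-free guidance: run the model with and without conditioning
# and amplify the difference between the two predictions.
import jax.numpy as jnp

def guided_denoise(model, x, sigma, cond, guidance_scale=3.0):
    uncond = model(x, sigma, cond=None)      # unconditional prediction
    cond_pred = model(x, sigma, cond=cond)   # e.g. text-conditioned prediction
    # scale > 1 pushes samples toward the conditioning signal, trading
    # diversity for relevance and detail.
    return uncond + guidance_scale * (cond_pred - uncond)

def toy_model(x, sigma, cond=None):
    """Stand-in denoiser: conditioning just shifts the prediction."""
    shift = 0.0 if cond is None else jnp.mean(cond)
    return x / (1.0 + sigma ** 2) + shift

x = jnp.zeros((1, 128, 128, 8))
out = guided_denoise(toy_model, x, sigma=1.0, cond=jnp.ones((77, 512)))
print(out.mean())  # 3.0: pushed past the conditional prediction (1.0)
```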

Finally, Dieleman addresses the importance of control signals beyond text prompts for making generative models more useful and versatile. These include reference-based conditioning, camera motion control, and event timing in videos, which often require specialized representations and post-training techniques to integrate effectively. He underscores that conditioning can enter a transformer at various points, for example via cross-attention, extra input tokens, or adaptive normalization layers, and that post-training methods like preference tuning further refine model behavior. Throughout the talk, Dieleman provides insight into the practical challenges and innovations in scaling generative audiovisual models, offering a comprehensive behind-the-scenes perspective on this rapidly evolving field.
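
As one concrete example of how conditioning can enter a transformer, here is a sketch of adaptive layer-norm modulation in the style of DiT-like models; the names, shapes, and projection are illustrative assumptions, not the architecture from the talk.

```python
# Adaptive layer norm: a conditioning embedding (pooled text features,
# timestep, camera parameters, ...) produces a per-channel scale and
# shift that modulate the normalized activations in a transformer block.
import jax
import jax.numpy as jnp

def ada_layernorm(x, cond_emb, params):
    # Standard layer norm over the channel dimension, no learned affine.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mean) / jnp.sqrt(var + 1e-6)
    # Project the conditioning embedding to a scale and shift per channel.
    scale_shift = cond_emb @ params["w"] + params["b"]  # (batch, 2 * dim)
    scale, shift = jnp.split(scale_shift, 2, axis=-1)
    return x_norm * (1.0 + scale[:, None, :]) + shift[:, None, :]

dim, cond_dim = 64, 32
params = {"w": jnp.zeros((cond_dim, 2 * dim)), "b": jnp.zeros((2 * dim,))}
x = jax.random.normal(jax.random.PRNGKey(0), (2, 16, dim))  # (batch, tokens, dim)
cond = jax.random.normal(jax.random.PRNGKey(1), (2, cond_dim))
print(ada_layernorm(x, cond, params).shape)  # (2, 16, 64)
```

With zero-initialized projection weights, as here, the modulation starts out as a plain layer norm, so the conditioning signal is introduced gradually as training proceeds.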