In this Stanford CS25 seminar, Andrew Brown from Meta’s Gen AI team presents MovieGen, a 30 billion parameter transformer-based foundation model that generates high-quality 1080p videos with synchronized audio, built on a temporal autoencoder for efficient video representation and a flow matching generative objective. Brown highlights the rapid progress in text-to-video generation over just two years, showing how modern models can produce complex, physically plausible videos from text prompts, including examples with reflections and realistic motion. He attributes this leap in quality largely to scaling data, compute, and model parameters, and to moving away from specialized architectures toward a unified transformer-based approach.
Brown explains the core technical components of MovieGen, starting with the representation of video data. Unlike text, which is discrete and semantically rich, video data is continuous and highly redundant, making direct pixel modeling computationally infeasible for high-resolution videos. To address this, they use a temporal autoencoder (TAE) to compress videos into a latent space, reducing sequence length drastically and enabling efficient modeling. This compressed latent representation is then modeled using a transformer architecture based on Meta’s LLaMA 3, adapted with cross-attention layers for text conditioning, adaptive layer normalization for timestep conditioning, and full bidirectional attention to suit the flow matching generative objective.
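As a rough illustration of that conditioning pattern, the PyTorch sketch below shows a single transformer block operating on TAE latent tokens: bidirectional self-attention over the video tokens, cross-attention to text embeddings, and adaptive layer normalization driven by a timestep embedding. All module names, dimensions, and shapes are illustrative assumptions, not MovieGen’s actual implementation.

```python
# Hypothetical sketch of the conditioning pattern described above; names and
# sizes are made up for illustration.
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    """LayerNorm whose scale/shift are predicted from a timestep embedding."""
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

class VideoBlock(nn.Module):
    """One block: bidirectional self-attn, cross-attn to text, MLP, all with adaLN."""
    def __init__(self, dim=1024, heads=16, text_dim=1024, cond_dim=256):
        super().__init__()
        self.norm1 = AdaLN(dim, cond_dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = AdaLN(dim, cond_dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=text_dim,
                                                vdim=text_dim, batch_first=True)
        self.norm3 = AdaLN(dim, cond_dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x, text_emb, t_emb):
        h = self.norm1(x, t_emb)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]   # no causal mask
        h = self.norm2(x, t_emb)
        x = x + self.cross_attn(h, text_emb, text_emb, need_weights=False)[0]
        return x + self.mlp(self.norm3(x, t_emb))

# Toy usage: 32 latent tokens from the TAE, 77 text tokens, one timestep embedding.
x = torch.randn(2, 32, 1024)      # video latent tokens (batch, seq, dim)
text = torch.randn(2, 77, 1024)   # text encoder outputs
t_emb = torch.randn(2, 256)       # timestep embedding
print(VideoBlock()(x, text, t_emb).shape)   # torch.Size([2, 32, 1024])
```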
The generative learning objective employed is flow matching, a recent generalization of diffusion models that offers more robust training and efficient sampling. Brown outlines the training process, in which the model learns to reverse a noise-adding forward process by predicting velocity vectors that guide noisy latent samples back toward the data distribution. At inference time, the model generates videos by starting from Gaussian noise and iteratively denoising it, conditioned on the text prompt. The architecture modifications to LLaMA 3 are minimal but crucial, enabling the model to handle video tokens and conditioning effectively while leveraging existing large-scale training infrastructure.
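The sketch below gives the flavor of that objective under a simplified linear (rectified-flow-style) interpolation path: the network regresses a velocity target during training, and sampling integrates the learned velocity field from noise to data with Euler steps. The `model` call signature and the exact interpolation path are assumptions for illustration, not MovieGen’s published formulation.

```python
# Hedged sketch of flow matching, assuming the linear path
# x_t = t * x1 + (1 - t) * x0 (x0 = Gaussian noise, x1 = clean TAE latent)
# with velocity target v = x1 - x0. `model(x_t, t, text_emb)` is a placeholder
# for the text-conditioned transformer.
import torch

def flow_matching_loss(model, x1, text_emb):
    """One training step: regress the model's output onto the target velocity."""
    x0 = torch.randn_like(x1)                        # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)    # uniform timestep per sample
    t_b = t.view(-1, *([1] * (x1.dim() - 1)))        # broadcast over latent dims
    x_t = t_b * x1 + (1 - t_b) * x0                  # point on the linear path
    v_target = x1 - x0                               # constant velocity along path
    v_pred = model(x_t, t, text_emb)
    return torch.mean((v_pred - v_target) ** 2)

@torch.no_grad()
def sample(model, text_emb, shape, steps=50, device="cpu"):
    """Generate by integrating dx/dt = v(x, t, text) from noise (t=0) to data (t=1)."""
    x = torch.randn(shape, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt, device=device)
        x = x + dt * model(x, t, text_emb)           # Euler step
    return x                                         # decode with the TAE afterwards
```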
Data quality and scale are underscored as critical factors for successful video generation. Brown details the extensive data curation pipeline used to assemble around 100 million high-quality videos with balanced concept distributions and accurate captions generated by a specialized LLaMA 3 captioning model. The training recipe is multi-stage, starting from lower-resolution text-to-image generation and progressively scaling up to high-resolution text-to-video generation on thousands of GPUs. Post-training fine-tuning, along with specialized models for editing, personalization, and audio generation, further extends MovieGen’s capabilities, with state-of-the-art performance validated through comprehensive human evaluations.
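Schematically, such a staged recipe can be laid out as a simple configuration like the one below; the stage names, resolutions, and ordering are illustrative placeholders, not MovieGen’s actual schedule.

```python
# Placeholder outline of a multi-stage training recipe of the kind described
# above; values are illustrative assumptions only.
STAGES = [
    {"name": "t2i_low_res",  "modality": "image", "resolution": 256, "notes": "warm-up on text-to-image"},
    {"name": "t2v_low_res",  "modality": "video", "resolution": 256, "notes": "joint text-to-video pretraining"},
    {"name": "t2v_high_res", "modality": "video", "resolution": 768, "notes": "scale to high resolution"},
    {"name": "finetune",     "modality": "video", "resolution": 768, "notes": "post-training on a small curated set"},
]

for stage in STAGES:
    print(f"{stage['name']}: {stage['modality']} at {stage['resolution']}px ({stage['notes']})")
```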
Looking ahead, Brown acknowledges that video generation is far from solved, with challenges remaining in generating longer videos, complex motions, and detailed sequential actions. He suggests that further scaling, incorporating reasoning capabilities akin to chain-of-thought in language models, and exploring native multimodal generation could drive future breakthroughs. The talk concludes with reflections on the importance of architecture unification, the potential for modality-independent transformer models, and open research questions around physics priors, data quality, and ethical considerations such as watermarking and misuse prevention. Overall, the seminar provides a thorough and insightful overview of the current state and future directions of transformer-based video generation.