World models are AI systems that predict how the world changes in response to actions, enabling planning, reasoning, and safe interaction across various domains, from autonomous vehicles to virtual environments. They come in generative and predictive forms, supporting applications like synthetic data generation, immersive simulations, and decision-making in complex settings, thereby bridging perception, prediction, and action in both real and virtual worlds.
World models are AI systems designed to predict how the state of the world changes over time in response to actions, enabling planning, reasoning, and safe interaction with real environments. The concept originates in cognitive science, where the human mind is thought to maintain an internal model of reality that it uses to simulate outcomes before making decisions. In machine learning, a world model takes the current world state and a hypothetical action as input and predicts the resulting future state; applying it repeatedly to its own predictions lets an agent evaluate a whole sequence of actions before committing to any of them. Proponents see this capability as essential for building AI that can understand and navigate the physical world with common sense.
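The state-plus-action interface described above can be sketched in a few lines. This is only an illustration: `State`, `toy_world_model`, and `rollout` are hypothetical names, and the hand-written point-mass physics stands in for what would normally be a learned neural network.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class State:
    """Toy world state: a point mass on a line (an assumption for illustration)."""
    position: float
    velocity: float


def toy_world_model(state: State, action: float, dt: float = 0.1) -> State:
    """Predict the next state given the current state and an action.

    The action is treated as an acceleration. A real world model would
    learn this mapping from data instead of hard-coding it.
    """
    velocity = state.velocity + action * dt
    position = state.position + velocity * dt
    return State(position, velocity)


def rollout(
    model: Callable[[State, float], State],
    state: State,
    actions: List[float],
) -> List[State]:
    """Feed the model its own predictions to imagine a whole trajectory."""
    trajectory = [state]
    for action in actions:
        state = model(state, action)
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    for imagined in rollout(toy_world_model, State(0.0, 0.0), [1.0, 1.0, -1.0]):
        print(imagined)
```

The `rollout` helper is the part that matters for planning: because the model consumes its own output, a sequence of hypothetical actions can be evaluated without touching the real environment.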
There are two main schools of thought on how to implement world models: generative and predictive. Generative models output future states in human-viewable forms such as video, which makes them useful for synthetic data generation and interactive environments. NVIDIA’s Cosmos Predict exemplifies this approach, using video diffusion models trained on curated, physics-focused data to generate realistic future frames. Predictive models, championed by researchers such as Yann LeCun, instead predict abstract latent representations that capture high-level patterns and underlying physical regularities, so that model capacity is not spent on irrelevant pixel details. An example is Meta’s V-JEPA 2-AC, which predicts future latent states conditioned on actions and is trained with self-supervised objectives that measure prediction error in embedding space rather than in pixel space.
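The difference between the two schools shows up most clearly in where the prediction error is measured. The sketch below is a toy illustration, not the training objective of Cosmos Predict or V-JEPA: observations are short lists of numbers, and the hypothetical `encode` function is a fixed stand-in for a learned encoder, assuming that only the first two entries carry meaningful information.

```python
from typing import List, Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def encode(observation: Sequence[float]) -> List[float]:
    """Hypothetical encoder: keep the object position, drop background detail.

    Assumption for illustration only; real systems learn the encoder.
    """
    return list(observation[:2])


def generative_loss(predicted_obs: Sequence[float], target_obs: Sequence[float]) -> float:
    """Generative style: the error is measured on every raw observation entry."""
    return mse(predicted_obs, target_obs)


def latent_loss(predicted_latent: Sequence[float], target_obs: Sequence[float]) -> float:
    """Predictive style: the error is measured against the encoded target."""
    return mse(predicted_latent, encode(target_obs))


if __name__ == "__main__":
    # Entries 0-1: object position. Entries 2-4: unpredictable background texture.
    target = [1.0, 2.0, 0.3, 0.9, 0.5]
    print(generative_loss([1.0, 2.0, 0.0, 0.0, 0.0], target))
    print(latent_loss([1.0, 2.0], target))
```

In this toy case a prediction that gets the object position exactly right still incurs a generative loss, because it cannot guess the background entries, while the latent loss is zero. That is the intuition behind avoiding irrelevant pixel details.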
World models have practical applications across industries, most visibly in autonomous vehicles (AVs). They generate synthetic data to augment real-world datasets, which addresses the difficulty of collecting diverse and rare driving scenarios on the road. Companies such as Wayve and Waymo use world models to create varied, realistic training data and rigorous safety benchmarks, with the aim of improving robustness and reliability. Beyond offline data generation, models such as Google’s Genie generate interactive 3D environments from text prompts, hinting at future disruption in gaming and filmmaking, although cost and limited session length currently constrain their use.
Interactive world models such as Marble, from Fei-Fei Li’s company World Labs, take a different technical approach: they represent the scene with Gaussian splats, which decouples geometry from appearance and allows large virtual worlds to be rendered and streamed efficiently. This contrasts with models that output pixels directly and aligns more closely with traditional game engine architectures. Such designs point toward more immersive and scalable virtual environments for entertainment and simulation. World models also serve as decision-making aids in embodied AI through model-based reinforcement learning (MBRL), in which agents train and plan inside a learned world model, so that risky exploration happens in imagination rather than in the real environment.
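Planning inside a world model, as in MBRL, can be illustrated with random shooting, one of the simplest planning methods: sample many candidate action sequences, roll each one out in the model, and keep the best. This is a minimal sketch under stated assumptions; `toy_model`, `plan_random_shooting`, the action range, and the distance-to-goal cost are all hypothetical choices, and practical systems use learned models and more sophisticated planners or policies.

```python
import random
from typing import Callable, List, Tuple


def toy_model(state: float, action: float) -> float:
    """Stand-in for a learned world model: the action shifts a 1-D position."""
    return state + action


def plan_random_shooting(
    model: Callable[[float, float], float],
    state: float,
    goal: float,
    horizon: int = 5,
    candidates: int = 200,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Return the sampled action sequence whose imagined end state is nearest the goal."""
    rng = random.Random(seed)  # local generator keeps the result reproducible
    best_plan: List[float] = []
    best_cost = float("inf")
    for _ in range(candidates):
        plan = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        imagined = state
        for action in plan:
            # Every step happens in the model, never in the real environment.
            imagined = model(imagined, action)
        cost = abs(imagined - goal)
        if cost < best_cost:
            best_plan, best_cost = plan, cost
    return best_plan, best_cost


if __name__ == "__main__":
    plan, cost = plan_random_shooting(toy_model, state=0.0, goal=2.0)
    print(plan, cost)
```

Only the chosen plan, or often just its first action, would then be executed for real, which is how the agent avoids trying out poor action sequences in the actual environment.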
Finally, world models are not limited to visual domains; they can model any environment with well-defined states and actions. For example, Meta’s Code World Model (CWM) predicts the state of a software environment in order to anticipate the outcome of executing code, which could catch bugs faster than running a full test suite. Overall, world models offer a versatile framework for AI to simulate, understand, and interact with diverse real and virtual worlds, connecting perception, prediction, and action across many fields.
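To make the state-and-action framing concrete for software, the sketch below treats the variable bindings as the state and each statement as an action. It is not Meta's implementation: `execute_step` and `trace` are hypothetical helpers, and the real Python interpreter plays the role of the ground-truth environment whose transitions a learned model would try to predict without running the code.

```python
from typing import Any, Dict, List


def execute_step(state: Dict[str, Any], statement: str) -> Dict[str, Any]:
    """Ground-truth transition: run one statement against a copy of the bindings.

    Uses exec, so it should only ever be given trusted statements.
    """
    new_state = dict(state)
    exec(statement, {}, new_state)
    return new_state


def trace(statements: List[str]) -> List[Dict[str, Any]]:
    """Record the variable bindings after each statement, starting from empty."""
    state: Dict[str, Any] = {}
    states = []
    for statement in statements:
        state = execute_step(state, statement)
        states.append(state)
    return states


if __name__ == "__main__":
    program = ["x = 3", "y = x * 2", "total = x + y"]
    for statement, bindings in zip(program, trace(program)):
        print(f"{statement!r:>18} -> {bindings}")
```

A model that predicts such traces accurately could flag a suspicious intermediate state, for example an unexpected value of `total`, without executing the program.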