I Reproduced LeCun's JEPA World Model That Doesn't Predict Tokens

artesia · 24 April 2026 14:00

The video demonstrates a successful reproduction of Le World Model, Yann LeCun’s Joint Embedding Predictive Architecture (JEPA) world model, which predicts latent future states instead of exact tokens or pixels and uses those predictions for planning in tasks like 2D navigation. Despite hardware limitations, the scaled-down reproduction reached a 92% success rate in the two-room environment, which the video presents as evidence for a shift in AI toward internal world modeling.

artesia · 24 April 2026 14:21

In this video, the creator explores and reproduces Le World Model, a recent paper from Yann LeCun built on the Joint Embedding Predictive Architecture (JEPA). Unlike models that predict the next token or pixel, JEPA predicts latent representations (internal summaries of meaningful future states) rather than exact outputs. The aim is to build internal models of the world that support planning and understanding across modalities such as text, images, video, and robotics, treating intelligence as the ability to predict meaningful future states rather than just continue sequences.

The video explains the core concept of JEPA by contrasting it with conventional token-based models. Instead of predicting the next word or pixel, JEPA predicts the latent state that represents the meaningful next step in a sequence or environment. This is particularly useful in robotics and control, where predicting exact pixels is impractical, but predicting the outcome of an action (like a robot’s gripper moving or a cup tipping) is feasible and relevant. Le World Model simplifies previous JEPA approaches by removing complex stabilizing tricks, relying instead on a minimal training objective that combines prediction loss with a single regularizer, making the model easier to reproduce and understand.
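The shape of such an objective, a prediction loss plus one regularizer, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the summary does not name the regularizer, so `isotropy_penalty` below is a stand-in that pushes a batch of latents toward zero mean and identity covariance, and all function names and `reg_weight` are hypothetical.

```python
import math

def prediction_loss(predicted, target):
    """Mean squared error between predicted and encoded next-step latents."""
    n, d = len(predicted), len(predicted[0])
    total = sum((p - t) ** 2
                for p_row, t_row in zip(predicted, target)
                for p, t in zip(p_row, t_row))
    return total / (n * d)

def isotropy_penalty(latents):
    """Stand-in regularizer (an assumption, not the paper's): zero when the
    batch has zero mean and identity covariance, positive otherwise."""
    n, d = len(latents), len(latents[0])
    mean = [sum(row[j] for row in latents) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in latents]
    penalty = sum(m ** 2 for m in mean)
    for i in range(d):
        for j in range(d):
            cov_ij = sum(r[i] * r[j] for r in centered) / (n - 1)
            target = 1.0 if i == j else 0.0
            penalty += (cov_ij - target) ** 2 / (d * d)
    return penalty

def training_loss(predicted, target, latents, reg_weight=1.0):
    """Prediction loss plus a single weighted regularizer."""
    return prediction_loss(predicted, target) + reg_weight * isotropy_penalty(latents)

# A collapsed encoder maps every input to the same latent, so prediction is
# trivially perfect; only the regularizer makes that solution costly.
collapsed = [[0.0, 0.0]] * 4
a = math.sqrt(1.5)  # chosen so the sample covariance of `spread` is the identity
spread = [[a, 0.0], [-a, 0.0], [0.0, a], [0.0, -a]]

collapsed_loss = training_loss(collapsed, collapsed, collapsed)
spread_loss = training_loss(spread, spread, spread)
```

In this toy setup both batches have zero prediction error, but the collapsed batch pays the regularizer penalty while the well-spread batch does not, which is the role a regularizer plays when no other stabilizing tricks are used.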

The creator then details their reproduction experiment using the two-room environment from the paper, a simple 2D navigation task where an agent must move through rooms to reach a target. Because the experiment ran on a single RTX 3060 GPU, it was scaled down, with the batch size and number of training steps adjusted to keep training stable and feasible. Despite these constraints, the model trained successfully over several epochs, with healthy training metrics and no signs of collapse, showing that the simplified Le World Model approach can be implemented on consumer-grade hardware.

After training, the model was evaluated using a planning algorithm that uses the learned world model to imagine future states and select actions that reach the goal. The reproduction achieved a 92% success rate in navigating the two-room environment, five points below the paper’s reported 97%, despite fewer training epochs and the hardware limitations. Visualizations of the agent’s navigation showed that the model could plan paths to the goal, outperforming or matching expert demonstrations in most cases, with only a few failures where the agent got stuck.
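Planning with a world model in this sense means rolling the predictor forward over candidate action sequences and acting on the best one. The sketch below is a toy illustration, not the planner used in the video or the paper: it searches exhaustively over a small discrete action set and replans after every step, and `toy_predictor`, `ACTIONS`, and the grid coordinates are hypothetical stand-ins for a learned latent predictor and real latents.

```python
from itertools import product

# Hypothetical action set for a 2D grid: right, left, up, down, stay.
ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1), (0, 0)]

def toy_predictor(z, action):
    """Stand-in for a learned latent predictor: the latent shifts by the action."""
    return (z[0] + action[0], z[1] + action[1])

def squared_distance(z, goal):
    return (z[0] - goal[0]) ** 2 + (z[1] - goal[1]) ** 2

def imagined_cost(z, goal, action_sequence, predictor):
    """Roll the predictor forward and sum the distance to the goal at every imagined step."""
    cost = 0
    for action in action_sequence:
        z = predictor(z, action)
        cost += squared_distance(z, goal)
    return cost

def plan_first_action(z, goal, predictor, horizon=3):
    """Score every action sequence of length `horizon`; return the first action of the best one."""
    best = min(product(ACTIONS, repeat=horizon),
               key=lambda seq: imagined_cost(z, goal, seq, predictor))
    return best[0]

def run_episode(start, goal, predictor, max_steps=20):
    """Replan after every step; return (reached_goal, steps_taken).
    For simplicity the predictor also plays the role of the environment here."""
    z = start
    for step in range(max_steps):
        if z == goal:
            return True, step
        z = predictor(z, plan_first_action(z, goal, predictor))
    return z == goal, max_steps

reached, steps = run_episode(start=(0, 0), goal=(2, 1), predictor=toy_predictor)
```

Summing the cost over every imagined step, rather than scoring only the final state, makes plans that reach the goal sooner score better, so the agent does not settle on a plan whose first action is to wait. A success rate like the one reported would then be the fraction of such episodes that end at the goal.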

In conclusion, the video presents Le World Model as a practical, compact implementation of JEPA and as an example of a shift in AI research from token-level prediction to world modeling. This approach aligns with Yann LeCun’s broader vision of intelligence as the ability to build internal models of the environment and predict meaningful future states, enabling planning and reasoning beyond simple sequence continuation. The successful reproduction on modest hardware suggests that this kind of world modeling is within reach of people without large compute budgets.