Robotics' End Game: Nvidia's Jim Fan

Jim Fan of NVIDIA outlined a vision for robotics’ future centered on AI models that jointly predict physical world states and robot actions. He described innovative data collection methods and scalable neural simulation, such as Dream Dojo, that enable robots to perform diverse tasks autonomously, and he highlighted key milestones ahead: passing the physical Turing test and achieving self-improving robots by 2040. Throughout, he emphasized the transformative potential of combining real-world data with cutting-edge simulation to accelerate robotics development.

Jim Fan, leader of NVIDIA’s embodied AI research group, opened his talk by reflecting on the rapid advances in AI and robotics since 2016, highlighting the transformative impact of deep learning. Drawing on large language models (LLMs), he introduced the concept of the “great parallel”: just as LLMs predict the next token, robotics models predict the next physical world state, with actions aligned through fine-tuning and reinforcement learning. This approach aims to bring robotics into the “endgame” of AI development that LLMs have already reached.

Fan discussed the limitations of current vision-language-action (VLA) models, which prioritize language and vision but struggle with physical interaction, particularly grounding action verbs in motion. He contrasted this with video-based world models that learn physics implicitly by predicting future frames, coining the term “physics slop” to describe their emergent but imperfect understanding of physical laws. Building on this, NVIDIA developed DreamerZero, a policy model that jointly predicts future world states and robot actions, enabling robots to perform tasks they were never explicitly trained on by “dreaming” a few seconds into the future.
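The joint prediction idea can be sketched in a toy form: a single model alternates between predicting the action for the current state and predicting the next world state, rolling out a short “dream” before acting. This is a minimal illustrative sketch, not DreamerZero's architecture; the weights are random placeholders and all names (`W_state`, `W_action`, `dream_rollout`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions for the sketch (real models use images and long horizons).
STATE_DIM, ACTION_DIM, HORIZON = 8, 3, 4

# Random linear maps stand in for a trained neural network.
W_state = rng.normal(scale=0.1, size=(STATE_DIM, STATE_DIM))    # state -> next state
W_action = rng.normal(scale=0.1, size=(STATE_DIM, ACTION_DIM))  # state -> action

def dream_rollout(state, horizon=HORIZON):
    """Jointly 'dream' a few steps ahead: predict the action taken in each
    state, then predict the next world state, and repeat."""
    states, actions = [], []
    for _ in range(horizon):
        action = np.tanh(state @ W_action)  # predicted action for this state
        state = np.tanh(state @ W_state)    # predicted next world state
        states.append(state)
        actions.append(action)
    return np.stack(states), np.stack(actions)

initial = rng.normal(size=STATE_DIM)
states, actions = dream_rollout(initial)
print(states.shape, actions.shape)  # (4, 8) (4, 3)
```

The key design point is that states and actions come from one rollout loop, so the action predictor is trained against the same imagined futures the world model produces.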

On the data front, Fan emphasized the challenges of scaling robot training data, traditionally limited by teleoperation hours. He introduced innovative data collection methods like the Universal Manipulation Interface (UMI), where humans wear robot hands to directly generate training data, and egocentric video datasets that capture human hand movements in natural settings. These approaches drastically reduce reliance on teleoperation and enable scalable, diverse data collection, leading to more generalizable and dexterous robot policies.
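One way to picture why wearable collection scales is the shape of the data it yields: each frame records the human's hand pose and an egocentric image, and consecutive frames convert directly into (observation, action) training pairs with no teleoperation in the loop. The record below is a hypothetical sketch; the field names are illustrative and not the UMI specification.

```python
from dataclasses import dataclass

import numpy as np

# Hypothetical demonstration record for wearable-gripper / egocentric
# collection. Field names are illustrative, not an actual UMI schema.
@dataclass
class DemoFrame:
    timestamp: float       # seconds since start of the demonstration
    wrist_pose: np.ndarray # 6-DoF pose (xyz + roll/pitch/yaw) of the gripper
    gripper_width: float   # metres between fingertips
    rgb: np.ndarray        # egocentric camera frame, HxWx3

def to_training_pair(prev: DemoFrame, curr: DemoFrame):
    """Turn two consecutive frames into an (observation, action) pair:
    the action is the relative wrist motion plus the target gripper width."""
    delta = curr.wrist_pose - prev.wrist_pose
    action = np.concatenate([delta, [curr.gripper_width]])
    return prev.rgb, action
```

Because the "robot" here is just a sensorized hand a person wears, many collectors can gather diverse data in parallel, which is the scaling argument Fan made.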

Fan also highlighted the importance of scalable simulation environments for reinforcement learning, presenting NVIDIA’s Dream Dojo, a neural simulator that generates realistic robot interactions without relying on classical physics engines. By combining real-world data with advanced simulation, Dream Dojo enables massive parallel training of robot policies, accelerating progress toward fully autonomous systems. This integration of real-to-sim-to-real workflows represents a new paradigm in robotics research.
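The payoff of a neural simulator is that stepping the world becomes a batched forward pass, so thousands of environments advance in one matrix multiply. The sketch below shows that pattern under stated assumptions: the dynamics are a random linear map standing in for a learned model, and the reward is a placeholder; none of this is Dream Dojo's actual interface.

```python
import numpy as np

rng = np.random.default_rng(1)

# Massively parallel rollouts: one batched step advances all environments.
N_ENVS, STATE_DIM, STEPS = 4096, 16, 10
A = rng.normal(scale=0.05, size=(STATE_DIM, STATE_DIM))  # stand-in dynamics

states = rng.normal(size=(N_ENVS, STATE_DIM))  # one row per environment
returns = np.zeros(N_ENVS)
for _ in range(STEPS):
    states = np.tanh(states @ A)                 # batched "neural" dynamics step
    returns += -np.linalg.norm(states, axis=1)   # placeholder reward: stay near origin

print(returns.shape)  # (4096,)
```

A classical physics engine would step each environment's rigid-body solver separately; a learned simulator replaces that with GPU-friendly tensor ops, which is what makes the real-to-sim-to-real loop cheap to scale.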

Concluding, Fan outlined three major milestones remaining in robotics: passing the physical Turing test (human and robot task performance become indistinguishable), establishing physical APIs through which robot fleets can be programmed, and achieving physical automated research, where robots improve themselves autonomously. He predicted these milestones could be reached by 2040, emphasizing the exponential nature of technological progress. Fan closed with an inspiring message: the current generation is uniquely positioned to solve robotics, bridging the gap between past exploration and future frontiers.