Chelsea Finn: Building Robots That Can Do Anything

Chelsea Finn discusses developing general-purpose robots using foundation models trained on large-scale, diverse real-world data, enabling them to perform a wide range of tasks and generalize to new environments. Her approach combines teleoperated data collection, two-stage training, and hierarchical vision-language-action models to improve robot adaptability, reliability, and responsiveness to complex, open-ended instructions.

In her talk, Chelsea Finn discusses the challenge of developing general-purpose robots capable of performing a wide range of tasks in diverse environments. She observes that in today's robotics landscape, each application typically requires building a dedicated company with custom hardware and software, which makes solutions difficult to scale and generalize. To address this, her company, Physical Intelligence, is building foundation models for robotics: generalist models trained on large-scale, diverse robot data that can be fine-tuned for specific tasks, much as foundation models have transformed language processing.

Finn emphasizes the importance of scale in training these models but notes that scale alone is insufficient without data diversity and quality. She describes her team's approach of collecting real robot data through teleoperation across many tasks, such as folding laundry, lighting candles, and tidying rooms. Starting with simpler tasks like folding a single type of shirt, they gradually increased complexity by introducing crumpled clothes, varied garments, and different starting positions. A key breakthrough was a two-stage training process: pre-training on all of the data, then fine-tuning on a curated, high-quality subset, which significantly improved the robot's performance and reliability.
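To make the two-stage recipe concrete, here is a minimal behavior-cloning sketch in PyTorch. Everything in it (the toy policy, the random stand-in data, the quality flag, and the learning rates) is a hypothetical illustration of the pre-train-then-fine-tune pattern, not Physical Intelligence's actual training code.

```python
# A minimal sketch of the two-stage recipe: pre-train a policy on the full
# mixed-quality dataset, then fine-tune on a curated high-quality subset.
# All names and data here are hypothetical placeholders for illustration.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

OBS_DIM, ACT_DIM = 32, 7  # toy observation/action sizes

policy = nn.Sequential(nn.Linear(OBS_DIM, 256), nn.ReLU(), nn.Linear(256, ACT_DIM))

def behavior_clone(policy, loader, epochs, lr):
    """Supervised imitation: regress demonstrated actions from observations."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for obs, act in loader:
            loss = nn.functional.mse_loss(policy(obs), act)
            opt.zero_grad()
            loss.backward()
            opt.step()

# Toy stand-ins for teleoperated demonstrations plus a per-episode quality flag.
obs, act = torch.randn(1000, OBS_DIM), torch.randn(1000, ACT_DIM)
quality = torch.rand(1000) > 0.7  # pretend ~30% of episodes are "high quality"

all_data = DataLoader(TensorDataset(obs, act), batch_size=64, shuffle=True)
curated = DataLoader(TensorDataset(obs[quality], act[quality]), batch_size=64, shuffle=True)

# Stage 1: pre-train on everything. Stage 2: fine-tune on the curated subset,
# typically at a smaller learning rate so the curated data sharpens behavior
# without overwriting what pre-training learned.
behavior_clone(policy, all_data, epochs=10, lr=1e-3)
behavior_clone(policy, curated, epochs=5, lr=1e-4)
```

The essential design choice is that the second stage reuses the pre-trained weights rather than training from scratch, so the curated data refines the generalist model instead of replacing it.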

The talk also covers the challenge of enabling robots to generalize to new, unseen environments. Physical Intelligence collected diverse data from over 100 unique homes and simulated environments, training models that could successfully perform tasks like cleaning kitchens and bedrooms in Airbnbs the robots had never encountered before. They found that including diverse data from multiple environments was crucial for generalization, and that excluding data from other robot domains reduced performance. Despite an 80% success rate, Finn acknowledges ongoing challenges such as speed, partial observability, and occasional errors in task execution.
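One way to picture the role of data diversity is as a weighted mixture over many training sources. The sketch below is purely illustrative: the source names, episode placeholders, and weights are invented, and the only point it encodes is the finding above, that training batches should draw from many homes and should not exclude cross-embodiment data entirely.

```python
# A hypothetical data-mixing sketch: sample training episodes from many homes
# plus data from other robot platforms, rather than a single environment.
import random

# Pretend each entry is a list of teleoperated episodes.
home_datasets = {f"home_{i:03d}": [f"ep_{i}_{j}" for j in range(5)] for i in range(100)}
cross_embodiment = {"other_robot_arm": [f"xe_{j}" for j in range(20)]}

sources = {**home_datasets, **cross_embodiment}
# Weight cross-embodiment data lower; the point from the talk is that removing
# it entirely hurt performance, not that it should dominate the mixture.
weights = {name: (0.2 if name in cross_embodiment else 1.0) for name in sources}

def sample_episode(rng: random.Random) -> str:
    """Pick a source in proportion to its weight, then an episode from it."""
    names = list(sources)
    name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
    return rng.choice(sources[name])

rng = random.Random(0)
batch = [sample_episode(rng) for _ in range(8)]
print(batch)
```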

Finn further explores how robots can respond to open-ended language prompts and mid-task interjections by leveraging hierarchical vision-language-action models. Her team augments robot data with synthetic human prompts generated by language models, enabling robots to break down complex instructions like making sandwiches into subtasks and to adapt to corrections or changes mid-task. This approach outperforms using large pre-trained language models alone, which often struggle with the visual and physical reasoning required for robotics.
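A rough sketch of that hierarchy is below: a high-level planner turns an open-ended prompt (plus any user correction) into the next subtask, and a low-level policy turns the current subtask into actions. Both classes are toy stand-ins for large learned models; the fixed sandwich plan and all method names are hypothetical.

```python
# A simplified sketch of a hierarchical vision-language-action setup: a
# high-level model decomposes an open-ended prompt into subtask commands,
# and a low-level policy maps the current subtask to motor actions.
from dataclasses import dataclass, field

@dataclass
class HighLevelPlanner:
    """Stand-in for a VLM that predicts the next subtask from image + prompt."""
    def next_subtask(self, image, prompt: str, history: list[str]) -> str:
        # A real planner is a learned model; this toy returns a fixed plan.
        plan = ["pick up bread", "add fillings", "close sandwich"]
        done = len(history)
        return plan[done] if done < len(plan) else "done"

@dataclass
class LowLevelPolicy:
    """Stand-in for a VLA policy mapping (image, subtask) -> action chunk."""
    def act(self, image, subtask: str):
        return f"<actions for: {subtask}>"

@dataclass
class Robot:
    planner: HighLevelPlanner
    policy: LowLevelPolicy
    history: list[str] = field(default_factory=list)

    def step(self, image, prompt: str, interjection: str | None = None):
        # Fold user corrections into the prompt so the planner can revise the
        # subtask, mirroring the interjection handling described above.
        if interjection:
            prompt = f"{prompt}. Correction: {interjection}"
        subtask = self.planner.next_subtask(image, prompt, self.history)
        if subtask == "done":
            return None
        self.history.append(subtask)
        return self.policy.act(image, subtask)

robot = Robot(HighLevelPlanner(), LowLevelPolicy())
print(robot.step(image=None, prompt="make me a sandwich"))
print(robot.step(image=None, prompt="make me a sandwich", interjection="no tomatoes"))
```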

In closing, Finn reflects on the broader implications of their work, emphasizing that general-purpose robots built on foundation models can leverage shared knowledge across tasks and hardware, reducing the need to start from scratch for each application. She stresses that while large-scale real-world data is necessary, it is not sufficient, and much research remains to improve reliability and robustness. The talk concludes with a Q&A covering topics such as the role of reinforcement learning, synthetic data, infrastructure challenges, and the interplay between academic and industry research in robotics.