In this bonus episode, Hannah Fry tours the Google DeepMind robotics lab, showcasing advanced robots that use large vision-language-action (VLA) models to understand instructions, adapt to their surroundings, and perform complex multi-step tasks with impressive dexterity and generalization. Despite current limitations in speed and data efficiency, these breakthroughs point toward a future where robots can reason autonomously, interact flexibly with their environment, and revolutionize everyday life.
In this bonus episode of the Google DeepMind podcast, host Hannah Fry takes us on a tour of the Google DeepMind robotics lab in California, guided by Kanishka Rao, director of robotics. The discussion highlights the significant advances in robotics over the past few years, emphasizing that these robots are not pre-programmed for specific tasks but can instead understand and adapt to a wide range of instructions. Unlike earlier systems, which had to work behind screens because changes in lighting or background would throw off their vision, the current robots benefit from robust visual backbones that let them operate effectively in the open lab environment.
A major breakthrough in robotics has been the integration of large vision-language models (VLMs) into multimodal models that combine vision, language, and action, known as VLAs (vision-language-action models). These models enable robots to generalize to new scenes, visual conditions, and instructions, allowing them to perform complex, long-horizon tasks rather than just short, simple actions. For example, a robot can now plan and execute a sequence of actions like checking the weather for a trip and packing a suitcase accordingly. This layered approach, built on top of foundation models, enhances the robot's ability to think through and carry out multi-step tasks autonomously.
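To make the layered idea concrete, here is a minimal, hypothetical sketch of a plan-then-act pipeline: a high-level planner (standing in for a large vision-language model) decomposes a long-horizon instruction into subtasks, and a low-level policy turns each subtask into a motor command. None of the names or rules below come from the episode; they are illustrative placeholders only.

```python
# Minimal sketch of a layered "plan then act" pipeline, loosely in the spirit of
# the vision-language-action (VLA) systems described in the episode. All class
# and function names here are hypothetical, not DeepMind APIs.

from dataclasses import dataclass


@dataclass
class Observation:
    """Placeholder for camera images and robot state."""
    description: str


def high_level_planner(instruction: str) -> list[str]:
    """Stand-in for a large vision-language model that decomposes a
    long-horizon instruction into short subtasks."""
    if "pack" in instruction and "trip" in instruction:
        return [
            "check the weather forecast for the destination",
            "pick clothes that match the forecast",
            "place the clothes in the suitcase",
            "zip the suitcase closed",
        ]
    return [instruction]  # fall back to treating the instruction as one step


def low_level_policy(subtask: str, obs: Observation) -> str:
    """Stand-in for a learned visuomotor policy that turns one subtask plus the
    current observation into a motor command."""
    return f"execute motor primitive for {subtask!r} given {obs.description!r}"


def run(instruction: str, obs: Observation) -> None:
    # The planner produces the sequence; the policy executes one step at a time.
    for subtask in high_level_planner(instruction):
        print(low_level_policy(subtask, obs))


if __name__ == "__main__":
    run("pack my suitcase for a trip to London",
        Observation("suitcase open on the table"))
```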
The tour showcases impressive demonstrations of robotic dexterity and generalization. One robot packs a lunchbox with millimeter precision, handling delicate steps like zipping a bag and placing items carefully. This dexterity is learned from teleoperation data, in which human operators demonstrate tasks that the robot then imitates. Another robot demonstrates generalization by responding to spoken commands to manipulate objects it has never seen before, such as opening a container and placing unfamiliar items inside. These examples illustrate the robots' ability to understand and interact with their environment flexibly and intelligently.
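As a rough illustration of how teleoperation demonstrations can be turned into a policy, the toy behavior-cloning sketch below fits a simple mapping from observations to demonstrated actions. The data, dimensions, and linear model are invented for illustration; production systems train large neural policies on far richer demonstration data.

```python
# Toy behavior cloning: fit a policy to (observation, action) pairs collected
# from human teleoperation. Everything here is synthetic and for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical teleoperation dataset: each observation row might encode object
# pose and gripper state; each action row is the operator's commanded
# end-effector displacement.
observations = rng.normal(size=(500, 8))
true_mapping = rng.normal(size=(8, 3))
actions = observations @ true_mapping + 0.01 * rng.normal(size=(500, 3))

# In this toy setting, behavior cloning reduces to least-squares regression
# from observations to demonstrated actions.
weights, *_ = np.linalg.lstsq(observations, actions, rcond=None)

# The learned "policy" replays the fitted mapping on a new observation.
new_observation = rng.normal(size=(1, 8))
predicted_action = new_observation @ weights
print("predicted end-effector displacement:", predicted_action.round(3))
```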
Further demonstrations include a humanoid robot sorting laundry by color, showcasing an end-to-end thinking-and-acting model that outputs its "thoughts" before taking actions. This approach mirrors techniques used in language models, where articulating reasoning improves performance. The robots can also handle completely new objects and adapt to changing scenarios, although they are still somewhat slow and far from perfect. The team acknowledges that while current models lay a strong foundation, breakthroughs in data efficiency and safety are still needed to make robots practical and reliable for everyday use.
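The "think before you act" pattern can be sketched in a few lines: the policy emits a short natural-language thought and only then an action, so its reasoning can be inspected alongside its behavior. The hand-written rules below are a stand-in for what an end-to-end model would generate; nothing here reflects the actual DeepMind system.

```python
# Minimal sketch of a policy that emits a "thought" before an action, in the
# spirit of the laundry-sorting demo described above. The rules are hand-coded
# placeholders for what an end-to-end model would produce.

from typing import NamedTuple


class Step(NamedTuple):
    thought: str
    action: str


def think_then_act(item_color: str) -> Step:
    """Toy laundry-sorting policy that articulates its reasoning first."""
    if item_color.lower() in {"white", "cream"}:
        bin_name = "whites bin"
    else:
        bin_name = "colors bin"
    thought = f"The item is {item_color}, so it belongs in the {bin_name}."
    action = f"place item in {bin_name}"
    return Step(thought=thought, action=action)


for color in ["white", "red", "navy"]:
    step = think_then_act(color)
    print(f"THOUGHT: {step.thought}")
    print(f"ACTION:  {step.action}")
```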
Overall, the episode underscores the remarkable progress in robotics driven by advances in AI and large multimodal models. The main bottleneck remains the limited amount of real-world physical interaction data available for training. However, with continued development, and by potentially leveraging unstructured human-generated data such as instructional videos, the field looks set for a robot revolution. Robots capable of understanding semantics, reasoning through complex tasks, and generalizing across diverse environments are no longer a distant dream but an emerging reality.