Rohin Shah, head of AGI safety at Google DeepMind, expresses cautious optimism that current alignment techniques and flexible, transparent safety practices can effectively mitigate catastrophic AI risks, emphasizing the importance of interpretability, expert oversight, and iterative problem-solving. He advocates practical, implementable research collaborations with AI companies, views AI progress as likely to be gradual rather than explosive, and encourages a focus on tangible impact within the evolving AI safety landscape.
Rohin Shah, head of AGI alignment and safety at Google DeepMind (GDM), shares his perspective on the likelihood of catastrophic AI misalignment. Although he was an early researcher in the field, he is cautiously optimistic that the ordinary alignment techniques AI companies currently employ will prevent catastrophic outcomes. He critiques common arguments that misalignment is inevitable, such as those based on deceptive AI behavior or reward hacking, suggesting that these failure modes are plausible but not the default expectation. Rohin emphasizes that many alignment challenges can be anticipated and addressed iteratively, and that current AI models have not yet exhibited the superhuman capabilities that would make oversight impossible.
On public safety commitments by AI companies, Rohin is skeptical of rigid, long-term pledges, because AI research evolves rapidly and such pledges risk locking companies into suboptimal strategies. He advocates instead for flexible approaches, such as third-party audits and detailed safety scorecards like AI Lab Watch, which can adapt over time and provide nuanced evaluations of company practices. Rohin stresses the importance of transparency and expert oversight rather than fixed commitments, arguing that well-informed, motivated individuals, whether inside companies or acting as external auditors, are crucial for effective safety governance.
On the topic of AI monitoring, Rohin highlights the value of chain-of-thought monitoring, where AI systems externalize their reasoning in human-readable language, enabling better interpretability and oversight. He explains that due to the architectural constraints of transformers and the nature of pretraining on human language, this form of monitoring is likely to remain effective for several years. While future developments like continuous chain-of-thought or AI-generated internal languages could complicate monitoring, current evidence suggests these challenges are manageable and that interpretability will persist as a useful tool for safety.
Rohin also discusses economic models of AI progress and the possibility of an intelligence explosion driven by recursive self-improvement. He argues that historical and economic data suggest AI progress is more likely to follow smooth, continuous growth than to make abrupt jumps. Key factors influencing growth include the difficulty of new discoveries, improvements in scientific tools, and the number of researchers. He cautions against conflating AI as a tool that enhances human researchers with AI systems acting as autonomous researchers, noting that only the latter scenario would likely trigger rapid, hyperbolic growth. Current benchmark analyses support a linear progression in AI capabilities rather than sudden accelerations.
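The difference between smooth and hyperbolic growth can be made concrete with a toy model. The sketch below is only an illustration, not a model taken from the discussion: it assumes a growth law of the form dx/dt = rate * x**exponent, and the function name, rate, and cap are hypothetical choices. An exponent of 1 gives ordinary exponential growth, which loosely corresponds to AI acting as a tool for a roughly fixed pool of human researchers. An exponent of 2 gives hyperbolic growth, which loosely corresponds to research output feeding back into the number of researchers, and which diverges in finite time.

```python
def simulate(exponent, rate=0.05, x0=1.0, dt=0.01, t_max=40.0, cap=1e9):
    """Euler-integrate dx/dt = rate * x**exponent (an assumed toy growth law).

    Stops when time reaches t_max or x reaches cap, whichever comes first,
    and returns the final (t, x). All parameter values are illustrative.
    """
    t, x = 0.0, x0
    while t < t_max and x < cap:
        x += rate * x**exponent * dt
        t += dt
    return t, x


# exponent = 1: exponential growth, steady and finite at every time
t_exp, x_exp = simulate(exponent=1.0)

# exponent = 2: hyperbolic growth; the exact solution x0 / (1 - rate * x0 * t)
# diverges at t = 1 / (rate * x0), which is t = 20 for these values
t_hyp, x_hyp = simulate(exponent=2.0)

print(f"exponential: x = {x_exp:.2f} at t = {t_exp:.1f}")
print(f"hyperbolic:  hit cap {x_hyp:.2e} at t = {t_hyp:.1f}")
```

Under these assumptions the exponential run ends the full 40 time units at a modest multiple of its starting value, while the hyperbolic run hits the cap shortly after t = 20. The qualitative gap, not the specific numbers, is the point of the sketch.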
Finally, Rohin offers practical advice for researchers aiming to influence AI companies like GDM. He recommends engaging with company insiders early to ensure research aligns with actual needs, focusing on solutions, metrics, or evaluations that can be implemented rather than purely theoretical work, and prioritizing simplicity and robustness over complex, novel methods. He highlights the importance of implementation and engineering skills within safety teams and notes that GDM is actively hiring for roles related to safety evaluation and AI control. Throughout, Rohin maintains a positive outlook by concentrating on areas where he can make a tangible difference and by trusting in the ongoing efforts within AI companies to responsibly manage AI development.