Stanford Webinar - Building Trustworthy AI: Navigating Security in the Agentic Era

The Stanford webinar on “Building Trustworthy AI: Navigating Security in the Agentic Era” featured experts Neil Dwani and Ben Raalo discussing the multifaceted challenges of creating secure and trustworthy AI systems, emphasizing collaboration between safety and security teams, and addressing emerging risks like prompt injections and data exfiltration in agentic AI. They concluded with cautious optimism, highlighting ongoing research, regulatory efforts, and the importance of vigilance and collaboration to harness AI’s benefits while mitigating its risks.

The Stanford webinar on “Building Trustworthy AI: Navigating Security in the Agentic Era” featured two distinguished guests, Neil Dwani and Ben Raalo, both experts in cybersecurity and AI safety. Neil Dwani, co-director of the Stanford Advanced Cybersecurity Program, has extensive experience in cybersecurity across various companies including Google and Twitter. Ben Raalo, CTO of Roost and former head of AI safeguards at Anthropic, brings deep expertise in trust, safety, and risk management from his work at YouTube, Stripe, Airbnb, and Google. The session began with the hosts highlighting the importance of trustworthy AI and the unique perspectives both guests bring to the discussion.

The conversation explored the concept of trustworthy AI, emphasizing that it is a multifaceted challenge. Ben Raalo outlined key properties of trustworthy AI, including technical robustness, functional safety, social appropriateness, explainability, and respect for user privacy. Neil Dwani added to this by referencing a recent collaborative paper that identified eight characteristics of trustworthy large language models (LLMs), including truthfulness, safety, fairness, robustness, privacy, machine ethics, transparency, and accountability. Both experts agreed that trustworthiness in AI is complex and requires ongoing research and development.

A significant portion of the discussion focused on the collaboration between security and safety teams within organizations to build secure AI systems. Ben Raalo described how the traditional separation between safety and security is blurring, especially with the rise of agentic AI systems that interact with external tools and data. Neil Dwani emphasized the importance of a paranoid and prepared mindset, advocating for early collaboration between teams to anticipate abuse cases and implement guardrails. Both stressed that security design reviews and continuous monitoring are essential to maintaining trustworthy AI systems.

The webinar also addressed emerging security risks in AI, particularly those introduced by agentic AI and protocols like Anthropic’s Model Context Protocol (MCP). Ben Raalo highlighted the lack of built-in security in MCP and the risks of prompt injections, data exfiltration, and untrusted code execution. Neil Dwani agreed that agentic AI presents both the biggest new risks and is also somewhat overhyped. They discussed specific attack classes such as prompt injections, jailbreaks, model extraction, and hallucinations, explaining their nature and potential defenses. Both experts underscored the need for cryptographic methods, access controls, and rigorous vetting of user inputs as foundational security practices.

In closing, both guests expressed cautious optimism about the future of AI security and safety. They acknowledged the rapid pace of AI development and the challenges it brings but highlighted ongoing research, government regulations, and the growing community of experts dedicated to addressing these issues. They encouraged more people to enter the fields of AI safety and security to help build trustworthy AI systems. The session ended on a hopeful note, emphasizing that with careful design, collaboration, and vigilance, society can harness the benefits of AI while mitigating its risks.