The Art & Science of Benchmarking Agents — Vincent Chen, Snorkel AI

Vincent Chen of Snorkel AI discusses the critical need for comprehensive, well-designed benchmarks that accurately evaluate AI agents in complex, real-world environments by incorporating diverse tasks, robust methodologies, and user-friendly design. He emphasizes that such benchmarks not only measure progress but also guide future AI development, inviting the community to contribute to Snorkel AI’s Open Benchmarks initiative to advance realism, autonomy, and complexity in AI evaluation.

Vincent Chen, a research fellow and co-founder at Snorkel AI, presents insights on the art and science of building effective benchmarks for evaluating AI agents. He highlights the current excitement around AI agents and the progress in capabilities, especially in coding, but points out a significant gap in the ability to measure these agents reliably in real-world, high-stakes environments such as finance, insurance, and healthcare. Chen emphasizes the importance of a comprehensive evaluation toolkit that includes field deployments, red teaming, human evaluations, and particularly open benchmarks, which not only measure progress but also define and shape the future direction of AI capabilities.

Chen describes Snorkel AI’s involvement in the Open Benchmarks grants program, which has committed $3 million to support the development of new benchmarks. After reviewing over 120 applications from academia and industry, Snorkel AI has identified four elements that make benchmarks useful and impactful. The first is rigorous task quality, with well-structured instructions and expert validation. The second is intentional distributional diversity, so that tasks represent real-world scenarios and failure modes. The third is remaining unsaturated: a benchmark that current models already solve almost entirely can no longer reveal meaningful headroom. The fourth is a robust evaluation methodology that goes beyond accuracy to include factors like cost, latency, and policy adherence.
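To make the idea of evaluating beyond accuracy concrete, here is a minimal Python sketch of a multi-metric summary over agent runs. It is an illustration only, not something shown in the talk: the `RunResult` fields, the `summarize` function, the sample data, and the 0.9 saturation threshold are all hypothetical choices.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    """One agent attempt at one benchmark task (hypothetical schema)."""
    task_id: str
    passed: bool            # did the agent's output pass the task's checker?
    cost_usd: float         # total model/API spend for the attempt
    latency_s: float        # wall-clock time for the attempt
    policy_violations: int  # number of policy rules the agent broke


def summarize(results, saturation_threshold=0.9):
    """Aggregate accuracy, cost, latency, and policy adherence.

    saturation_threshold is an assumed cutoff: at or above this pass
    rate the benchmark is flagged as leaving little headroom.
    """
    if not results:
        raise ValueError("no results to summarize")
    pass_rate = mean(1.0 if r.passed else 0.0 for r in results)
    compliant = [r for r in results if r.policy_violations == 0]
    return {
        "pass_rate": pass_rate,
        # share of runs with zero policy violations
        "policy_adherence": len(compliant) / len(results),
        # share of runs that passed AND broke no policy
        "compliant_pass_rate": sum(r.passed for r in compliant) / len(results),
        "mean_cost_usd": mean(r.cost_usd for r in results),
        "mean_latency_s": mean(r.latency_s for r in results),
        "headroom": 1.0 - pass_rate,
        "saturated": pass_rate >= saturation_threshold,
    }


# Made-up sample data, purely for illustration.
results = [
    RunResult("task-1", True, 0.25, 10.0, 0),
    RunResult("task-2", True, 0.50, 20.0, 1),
    RunResult("task-3", False, 0.25, 30.0, 0),
    RunResult("task-4", True, 1.00, 40.0, 0),
]
summary = summarize(results)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Reporting `compliant_pass_rate` next to `pass_rate` separates runs that succeeded within policy from runs that succeeded by breaking it, a distinction a single accuracy number would hide.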

On the artistic side of benchmark design, Chen stresses the importance of having a clear thesis that guides the benchmark’s focus and reflects where the field is heading. Benchmarks like Terminal-Bench and SWE-bench exemplify this: each made a strategic bet on a particular interface or workflow, and those bets have proven influential in shaping research and development. He also argues that prioritizing the researcher and builder experience, by making benchmarks easy to use, extend, and integrate into existing workflows, is crucial for widespread adoption and ongoing relevance.

Looking forward, Chen outlines three key dimensions for the next generation of benchmarks: environment complexity, autonomy horizon, and output complexity. Environment complexity involves capturing the nuanced, dynamic, and often messy real-world contexts in which agents operate, such as coding environments with organizational policies and human collaborators. Autonomy horizon refers to the length and continuity of agent operation, emphasizing the need for benchmarks that test agents’ ability to maintain reliability over extended interactions and evolving conditions. Output complexity focuses on producing nuanced, verifiable, and trustworthy outputs that reflect real-world tasks and can serve as meaningful signals for evaluation and training.

In closing, Chen invites the community to contribute to the ongoing development of benchmarks through Snorkel AI’s Open Benchmarks grants, encouraging innovation that pushes the boundaries of realism, autonomy, and complexity in AI evaluation. He underscores the role of benchmarks not just as retrospective measurement tools but as proactive instruments that shape the trajectory of AI research and deployment. The talk concludes with an open call for collaboration and engagement to help define the future of trustworthy and capable AI agents.