Evals in Action: From Frontier Research to Production Applications

The presentation introduces OpenAI’s “Evals,” a rigorous evaluation framework for measuring AI models’ performance on complex, real-world tasks using expert grading and automated tools, and presents results showing significant progress toward human-level capabilities. It also covers practical applications and best practices for integrating these evaluations into AI development, helping developers build more reliable and efficient AI systems across industries.

In this presentation, Tel, a researcher at OpenAI, introduces the concept of “Evals” — evaluations designed to measure the capabilities of AI models, particularly in reinforcement learning. Traditional academic benchmarks like the SAT or LSAT have been useful but limited: they do not fully capture a model’s ability to perform real-world work. To address this, OpenAI developed GDPval, an evaluation suite of economically valuable, real-world tasks drawn from the major sectors contributing to U.S. GDP. These tasks are multimodal, long-horizon, and created by experts with extensive industry experience, with the aim of assessing how well AI models can perform work comparable to that of human professionals.

GDPval uses pairwise expert grading: human experts compare AI-generated deliverables against human work without knowing which is which, yielding an unbiased win-rate metric. Early models like GPT-4 scored a win rate below 20%, meaning experts rarely preferred their output over human work, but newer models have improved significantly, approaching a 40% win rate. This progress suggests that AI models are rapidly closing in on human-level performance on complex, real-world tasks. The evaluation also quantifies the cost and time efficiency of AI assistance, showing potential for significant savings when models are used in the workflow.
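The blind pairwise protocol described above can be sketched in a few lines. This is a minimal illustration, not GDPval's actual implementation: the `judge` callable stands in for a human expert, presentation order is randomized so the judge cannot tell which side is the model, and ties are ignored for simplicity (a real protocol would count them).

```python
import random

def blind_pairwise_win_rate(pairs, judge):
    """Estimate a model's win rate from blinded pairwise comparisons.

    pairs: list of (model_output, human_output) tuples.
    judge: callable taking two outputs (a, b) and returning 0 if it
           prefers a, 1 if it prefers b. The judge never learns which
           side is the model because we shuffle presentation order.
    """
    wins = 0
    for model_out, human_out in pairs:
        if random.random() < 0.5:  # model shown first
            model_won = judge(model_out, human_out) == 0
        else:                      # model shown second
            model_won = judge(human_out, model_out) == 1
        wins += model_won
    return wins / len(pairs)
```

Randomizing the order matters: without it, any positional bias in the judge (human or LLM) would systematically skew the win rate.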

Henry, who leads OpenAI’s Evals product, then discusses the importance of rigorous evaluation for developers building AI applications and agents. Building high-performing AI systems remains challenging due to model nondeterminism, rapidly evolving models, and high user expectations, especially in regulated industries like finance and healthcare. OpenAI’s Evals product aims to simplify and automate the evaluation process, offering tools such as a visual eval builder, trace grading for multi-agent systems, automated prompt optimization, and support for third-party models. These features help developers identify errors, optimize prompts, and improve their AI applications systematically.
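At its core, the eval loop these tools automate is simple: run a model over a labeled dataset and score each output with a grader. The sketch below shows that skeleton with hypothetical names (`run_eval`, `model`, `grader` are illustrative stand-ins, not OpenAI API calls); in practice the grader might be an exact-match check, a rubric, or an LLM-based judge.

```python
def run_eval(dataset, model, grader):
    """Score a model over a labeled dataset.

    dataset: list of {"input": ..., "expected": ...} examples.
    model:   callable mapping an input to an output.
    grader:  callable mapping (output, expected) to a score in [0, 1].
    Returns the mean score and the per-example scores, so failing
    examples can be inspected individually.
    """
    scores = [grader(model(ex["input"]), ex["expected"]) for ex in dataset]
    return sum(scores) / len(scores), scores
```

Keeping per-example scores (rather than only the mean) is what makes systematic error analysis possible: the failing rows become the next round of annotation targets.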

A demo showcases how an investment fund uses OpenAI’s AgentKit and eval tools to build and evaluate a multi-agent system for financial analysis. The process involves creating datasets, running evaluations on individual agents, annotating outputs with expert feedback, and using LLM-based graders to automate scoring. The system also supports trace grading to analyze end-to-end agent performance and pinpoint specific failure points. Automated prompt optimization accelerates iteration by rewriting prompts based on grading feedback, improving the quality of AI-generated outputs efficiently.
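The trace-grading idea above can be illustrated with a small sketch. This is an assumption-laden toy, not OpenAI's trace grader: a trace is modeled as an ordered list of per-agent steps, each step is checked by its own grader, and grading stops at the first failure so the broken hand-off between agents is easy to locate.

```python
def grade_trace(trace, graders):
    """Grade each step of a multi-agent trace and report the first failure.

    trace:   ordered list of {"agent": name, "output": text} steps.
    graders: dict mapping an agent name to a callable that takes the
             step's output and returns (passed, note).
    Returns (all_passed, per_step_results); grading stops at the first
    failing step so the failure point in the pipeline is pinpointed.
    """
    results = []
    for step in trace:
        grader = graders.get(step["agent"])
        if grader is None:
            results.append((step["agent"], None, "no grader"))
            continue
        passed, note = grader(step["output"])
        results.append((step["agent"], passed, note))
        if not passed:
            break  # later steps depend on this one; stop here
    all_passed = all(p for _, p, _ in results if p is not None)
    return all_passed, results
```

Grading each step rather than only the final answer is what distinguishes trace grading from end-to-end evaluation: a correct final answer can still hide an agent that failed and was silently papered over downstream.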

The presentation concludes with best practices for building effective evals: start simple and early in the development process, use real user data rather than hypothetical examples, and involve subject matter experts for annotation and grading. OpenAI emphasizes the importance of integrating evaluation into the development lifecycle to build reliable, high-performing AI products. The new eval tools and frameworks are designed to democratize rigorous evaluation, enabling developers to build better AI applications with confidence and precision.