Greg Kamradt of the ARC Prize Foundation discusses the importance of interactive benchmarks, such as the ARC-AGI-3 video game suite, for measuring AI's ability to generalize and learn efficiently across diverse, novel tasks, highlighting that current AI still lags behind humans in action efficiency and adaptive skill acquisition. He emphasizes that while surpassing these benchmarks would indicate significant progress toward Artificial General Intelligence, it is not definitive proof of AGI, and he encourages the community to engage with and contribute to this evolving evaluation framework.
Greg Kamradt, president of the ARC Prize Foundation, presents an insightful overview of how frontier AI is measured through interactive benchmarks. He emphasizes that while AI has made significant progress, the critical question is what AI is progressing toward. Traditional benchmarks often focus on narrow, vertical domains and therefore do not adequately measure generalization, the ability of AI to learn and adapt across diverse tasks. To address this, the foundation adopts an opinionated definition of intelligence as skill-acquisition efficiency, the ability to learn new things efficiently, inspired by François Chollet's 2019 paper "On the Measure of Intelligence." The ARC Prize Foundation aims to guide open progress toward Artificial General Intelligence (AGI), defined here as machines learning as efficiently as humans.
Kamradt highlights the importance of interactive benchmarks in evaluating intelligence, since intelligence is inherently interactive and unfolds through perception, feedback, and action over time. Static benchmarks, which consist of one-shot questions and answers, cannot measure this dynamic process. He illustrates the point with GPT-5 playing Pokémon, a task requiring long-term planning, exploration, and goal management. Interactive benchmarks make it possible to test cognitive abilities such as exploration, memory management, goal acquisition, and cooperation, which static tests cannot capture.
The ARC Prize Foundation is developing ARC-AGI-3, a benchmark consisting of 150 open-source video game environments designed to test an agent's ability to adapt to novel situations without prior instructions. Each game features unique mechanics, so agents cannot simply overfit to a single game type. The games are split into a public test set for familiarization and a private evaluation set of unseen games used to rigorously assess generalization. Human players recruited from the general public are tested on these games to establish baseline performance and action efficiency, a measure of how directly and efficiently a player completes a task.
Action efficiency emerges as a critical new metric for evaluating AI performance. Unlike traditional accuracy metrics, action efficiency accounts for the number of actions or turns taken to complete a task, reflecting how effectively an agent converts environmental information into goal achievement. Kamradt presents data comparing human and AI performance, revealing a significant gap: humans complete tasks far more efficiently than current models such as GPT-5. This gap underscores that, despite its progress, AI has not yet achieved human-level general intelligence or learning efficiency.
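The talk does not spell out the exact scoring formula, but the idea can be sketched as a simple ratio of the human baseline action count to the agent's action count. The function name and the cap at 1.0 below are assumptions for illustration, not the foundation's official metric.

```python
# Minimal sketch of an action-efficiency score (assumed formula, not the
# official ARC Prize definition): the fewer actions an agent needs relative
# to the human baseline, the higher the score, capped at 1.0.

def action_efficiency(agent_actions: int, human_baseline_actions: int) -> float:
    """Return a score in (0, 1]; 1.0 means the agent matched or beat the human baseline."""
    if agent_actions <= 0 or human_baseline_actions <= 0:
        raise ValueError("action counts must be positive")
    return min(1.0, human_baseline_actions / agent_actions)

# Example: humans finish a level in 40 actions, an agent needs 400 -> 0.1
print(action_efficiency(agent_actions=400, human_baseline_actions=40))
```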
In conclusion, Kamradt clarifies that while surpassing the ARC-AGI-3 benchmark would demonstrate an AI's ability to generalize across novel environments and execute plans efficiently, it would not yet constitute proof of AGI; it would, however, represent the most authoritative evidence of generalization in AI to date. He invites researchers and developers to engage with the benchmark by playing the preview games available online and using the API to test their agents. The foundation aims to expand the benchmark to 175 games by early next year, fostering ongoing progress in measuring and advancing interactive AI intelligence.
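To make the agent-testing workflow concrete, here is a hypothetical sketch of an agent loop against an ARC-AGI-3-style game API. The base URL, endpoint paths, payload fields, and action space are all assumptions for illustration; the official API documentation defines the real interface.

```python
# Hypothetical agent loop against an ARC-AGI-3-style game API.
# All endpoint names, fields, and actions below are illustrative assumptions.
import random
import requests

BASE_URL = "https://example.arcprize.org/api"        # placeholder, not the real endpoint
ACTIONS = ["up", "down", "left", "right", "act"]      # assumed discrete action space


def play_one_game(game_id: str, max_actions: int = 500) -> int:
    """Act until the game reports completion; return the number of actions used."""
    state = requests.post(f"{BASE_URL}/games/{game_id}/reset").json()
    for step in range(1, max_actions + 1):
        # A real agent would choose an action from the observation in `state`;
        # a random policy stands in here to keep the sketch self-contained.
        action = random.choice(ACTIONS)
        state = requests.post(
            f"{BASE_URL}/games/{game_id}/step", json={"action": action}
        ).json()
        if state.get("done"):
            return step
    return max_actions


if __name__ == "__main__":
    used = play_one_game("preview-game-1")
    print(f"Finished (or gave up) after {used} actions")
```

Tracking the number of actions each episode uses is what allows an agent's score to be compared against the human action-efficiency baselines described above.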