Vincent Koc from Comet ML highlights the limitations of traditional static AI evaluation methods and advocates for adaptive, continuous evaluation frameworks that evolve alongside dynamic, agentic AI systems. He proposes integrating telemetry and chaos engineering principles to create real-time, intent-driven assessments that better reflect real-world AI behavior and user interactions.
Vincent Koc from Comet ML discusses the evolving landscape of AI evaluation, emphasizing the shift from static benchmarks to adaptive, intent-driven systems. He begins with the traditional approach to software and AI evaluation, which relies heavily on static benchmarks, unit tests, and offline evaluations. While these methods have served conventional software engineering well, they fall short for AI, where models and applications change continuously. Vincent points out that this static mindset leaves gaps in understanding and managing AI behavior, especially as AI systems become more complex and agentic.
He introduces the concept of chaos engineering and observability as crucial practices in software development that are often missing in AI evaluation. Chaos engineering involves deliberately breaking systems to understand their limits, a practice that could greatly benefit AI systems by revealing weaknesses and areas for improvement. Vincent notes that current AI evaluations focus too much on benchmarks that do not keep pace with the fast-changing nature of AI applications. He stresses the need for evaluations that evolve alongside AI systems, adapting to new behaviors and contexts rather than relying on fixed datasets and tests.
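Applied to an AI agent, the chaos-engineering idea described above might look like the following minimal sketch: randomly perturb an agent's input and record the cases where a quality score drops below a floor. Every name here (`perturb`, `chaos_probe`, the fault types, the scoring function) is illustrative, not part of any Comet ML or standard API.

```python
import random

def perturb(prompt: str, rng: random.Random) -> str:
    """Apply one random fault: drop a word, inject noise, or truncate."""
    words = prompt.split()
    fault = rng.choice(["drop", "noise", "truncate"])
    if fault == "drop" and len(words) > 1:
        words.pop(rng.randrange(len(words)))
    elif fault == "noise":
        words.insert(rng.randrange(len(words) + 1), "###")
    else:
        words = words[: max(1, len(words) // 2)]
    return " ".join(words)

def chaos_probe(agent, prompt, score, trials=20, floor=0.7, seed=0):
    """Run the agent on perturbed prompts; return inputs that scored below the floor."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        noisy = perturb(prompt, rng)
        s = score(agent(noisy))
        if s < floor:
            failures.append((noisy, s))
    return failures
```

The returned failure cases are the point of the exercise: each one is a concrete weakness the deterministic test suite would never have surfaced.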
Vincent then traces the evolution of AI interaction from prompt engineering to context engineering and now to intent engineering. Prompt engineering, which involved trial-and-error with input prompts, has given way to more sophisticated methods like tool calling and retrieval-augmented generation (RAG), allowing for more modular and steerable AI agents. The latest phase, intent engineering, leverages the ability of AI systems to self-optimize based on user intent, making the evaluation process more complex but also more aligned with real-world usage. This shift challenges traditional evaluation methods because different users may experience AI agents differently, requiring more personalized and adaptive testing frameworks.
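One way to picture why intent engineering complicates evaluation is that the test suite itself becomes a function of the user: instead of one fixed set of cases, the cases run for a session depend on the intent inferred from it. The sketch below assumes a toy keyword-based intent classifier and made-up suite names; a real system would use a learned classifier and richer intent taxonomy.

```python
# Hypothetical intent-conditioned eval selection. Suite names, cases,
# and the keyword "classifier" are all illustrative placeholders.
EVAL_SUITES = {
    "support": ["refund flow stays polite", "escalation path offered"],
    "research": ["citations included", "uncertainty stated"],
}

def infer_intent(session_messages: list[str]) -> str:
    """Toy intent classifier: keyword match standing in for a real model."""
    text = " ".join(session_messages).lower()
    return "support" if "refund" in text or "order" in text else "research"

def select_cases(session_messages: list[str]) -> list[str]:
    """Pick the eval cases matching the session's inferred intent."""
    return EVAL_SUITES[infer_intent(session_messages)]
```

Two users talking to the same agent would thus be judged against different cases, which is exactly the personalization pressure Vincent describes.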
To address these challenges, Vincent proposes a new paradigm where evaluations are continuous, adaptive, and integrated into the AI system itself. He envisions evaluations that use telemetry and trace data to detect changes in user behavior or system performance, enabling the AI to self-correct and optimize in real time. This approach treats evaluations not as static datasets but as living, evolving processes that respond to the environment and user needs. He warns of "eval calcification": the risk that evaluations harden into outdated fixtures unless they evolve alongside the AI systems they measure.
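A minimal sketch of the telemetry-driven loop described above, assuming each request emits a trace carrying a quality score: a rolling window of recent scores is compared against a baseline, and a drift beyond tolerance flags the eval suite for refresh. The class and its thresholds are illustrative, not any specific vendor's API.

```python
from collections import deque

class DriftMonitor:
    """Flag when rolling trace quality drifts from an established baseline."""

    def __init__(self, baseline: float, window: int = 50, tolerance: float = 0.1):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)  # rolling window of trace scores

    def record(self, score: float) -> bool:
        """Ingest one trace score; return True if drift is detected."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough telemetry yet to judge
        mean = sum(self.scores) / len(self.scores)
        return abs(mean - self.baseline) > self.tolerance
```

In the adaptive picture Vincent sketches, a `True` here would trigger regeneration or re-weighting of the eval set rather than merely paging an operator, so the evaluation keeps pace with the system it measures.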
In conclusion, Vincent urges the AI community to rethink evaluation as a dynamic, ongoing process that embraces the malleability of modern AI. He encourages viewing evaluations as code or living agents that continuously adapt, rather than fixed points in time. This mindset shift is essential for managing the unpredictable 20% of AI behavior that can cause significant issues in production. While his team at Comet ML is still developing these adaptive evaluation systems, Vincent’s insights provide a roadmap for future AI evaluation practices that are more resilient, responsive, and aligned with the realities of agentic AI.