LLM Observability, Evaluation, Experimentation Platform — Dat Ngo, Arize

Dat Ngo from Arize AI discusses the importance of observability, evaluation, and experimentation in building reliable AI systems, highlighting tools like OpenTelemetry for monitoring and the need for collaborative evals to align AI performance with business goals. He also presents Arize’s products, Arize Phoenix and Arize AX, which aim to automate the AI feedback loop and provide scalable insights for enterprises to continuously improve their AI applications.

Dat Ngo from Arize AI gives an overview of the challenges and solutions in building and maintaining AI systems, focusing on observability, evaluation, and experimentation. He introduces himself as an AI architect who works closely with major enterprises, and he emphasizes making AI systems work reliably in production. Dat points out that AI development shares many patterns with traditional software engineering, even though it often feels magical because of its complexity and non-deterministic nature.

The first key topic Dat covers is observability: understanding what is happening inside AI systems such as agents or harnesses. He explains the use of OpenTelemetry (OTel) for tracing and monitoring, which provides a detailed audit record of system behavior. Observability extends beyond individual traces to sessions, which capture the state and back-and-forth interactions within a conversation or workflow. This visibility helps teams identify performance bottlenecks, errors, and unusual behavior across the different execution paths or branches in their AI applications.
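To make the trace and session idea concrete, here is a minimal sketch in plain Python. It does not use the OpenTelemetry SDK or any Arize API. The Span class, the "session.id" attribute key, and the sample span names are assumptions made for illustration; they only mirror the general shape of trace data (spans with parents, timings, and a status) described in the talk.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Hypothetical, simplified span record (not the OpenTelemetry class)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: int
    end_ms: int
    status: str = "OK"  # "OK" or "ERROR"
    attributes: dict = field(default_factory=dict)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def group_by_session(spans):
    """Group spans by an assumed 'session.id' attribute."""
    sessions = defaultdict(list)
    for span in spans:
        sessions[span.attributes.get("session.id", "unknown")].append(span)
    return dict(sessions)


def slowest_child_span(spans):
    """Return the longest-running non-root span, a rough bottleneck signal."""
    children = [s for s in spans if s.parent_id is not None]
    return max(children, key=lambda s: s.duration_ms)


def error_spans(spans):
    """Return every span that finished with an error status."""
    return [s for s in spans if s.status == "ERROR"]


# Two traces (two turns) that belong to the same illustrative session.
SPANS = [
    Span("t1", "a", None, "agent.run", 0, 1200, attributes={"session.id": "s1"}),
    Span("t1", "b", "a", "retriever.search", 10, 210, attributes={"session.id": "s1"}),
    Span("t1", "c", "a", "llm.generate", 220, 1150, attributes={"session.id": "s1"}),
    Span("t2", "d", None, "agent.run", 0, 400, attributes={"session.id": "s1"}),
    Span("t2", "e", "d", "tool.call", 20, 380, status="ERROR",
         attributes={"session.id": "s1"}),
]

if __name__ == "__main__":
    print("bottleneck:", slowest_child_span(SPANS).name)
    print("errors:", [s.name for s in error_spans(SPANS)])
    print("sessions:", {k: len(v) for k, v in group_by_session(SPANS).items()})
```

In a real deployment the spans would come from instrumentation rather than a hand-written list; the point here is only that bottlenecks, errors, and sessions are all queries over the same span records.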

Next, Dat discusses evaluation (evals), which is about deriving meaningful signals from AI system outputs to assess performance and user satisfaction. He categorizes evals into different types based on scope and complexity, from simple single-span input-output checks to multi-span and trajectory evals that analyze interactions across multiple components or entire workflows. He stresses the importance of balancing technical and non-technical perspectives, enabling both AI engineers and domain experts to contribute to defining and running evals. This collaborative approach helps ensure that AI systems meet business goals such as increasing revenue, reducing costs, or saving time.
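The difference between a single-span eval and a trajectory eval can be sketched with two small deterministic checks. This is an illustrative sketch only: the function names, the label values, and the tool names are hypothetical, and real evals (including LLM-as-a-judge evals) are usually richer than these code-based rules.

```python
def single_span_eval(output: str, required_terms: list) -> dict:
    """Single-span check: does one output mention every required term?"""
    missing = [t for t in required_terms if t.lower() not in output.lower()]
    return {"label": "pass" if not missing else "fail", "missing": missing}


def steps_in_order(expected: list, actual: list) -> bool:
    """True if the expected steps appear in the actual steps in the same order.

    Extra steps in between are allowed (subsequence match).
    """
    remaining = iter(actual)
    # `step in remaining` advances the iterator until it finds the step.
    return all(step in remaining for step in expected)


def trajectory_eval(expected_tools: list, actual_tools: list) -> dict:
    """Trajectory check: did the agent call the expected tools in order?"""
    ok = steps_in_order(expected_tools, actual_tools)
    return {
        "label": "correct" if ok else "incorrect",
        "extra_steps": len(actual_tools) - len(expected_tools),
    }


if __name__ == "__main__":
    print(single_span_eval("Your refund was issued on Monday.", ["refund", "issued"]))
    print(trajectory_eval(["search", "summarize"], ["search", "rerank", "summarize"]))
    print(trajectory_eval(["search", "summarize"], ["summarize", "search"]))
```

The required terms and the expected tool order are the kind of inputs a domain expert could supply without writing code, while an engineer owns the checking logic; that split is one way to read the collaboration Dat describes.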

Dat then moves on to experimentation and continuous improvement, describing how teams can use collected data and eval results to test changes in prompts, models, or configurations. He notes that while manual dashboards and tools are useful, the future lies in automation. Arize aims to automate the entire feedback loop (observability, evaluation, experimentation, and improvement) using AI-driven systems that can detect issues, generate evals dynamically, and suggest fixes without heavy manual intervention. The vision is to make AI system maintenance feel seamless and almost magical, reducing the burden on human operators.
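An experiment in this sense can be sketched as running two variants over the same dataset and comparing their eval scores. The sketch below is a simplified illustration in plain Python, not Arize's API: the dataset, the two stub variants (lookup tables standing in for real prompt or model calls), and the exact-match evaluator are all hypothetical.

```python
# Hypothetical dataset of inputs with expected outputs.
DATASET = [
    {"input": "2+2", "expected": "4"},
    {"input": "3*3", "expected": "9"},
    {"input": "10-7", "expected": "3"},
    {"input": "8/2", "expected": "4"},
]

# Stubs standing in for two prompt or model variants. A real experiment
# would call a model here; fixed lookups keep the example deterministic.
BASELINE_ANSWERS = {"2+2": "4", "3*3": "6", "10-7": "2", "8/2": "4"}
CANDIDATE_ANSWERS = {"2+2": "4", "3*3": "9", "10-7": "2", "8/2": "4"}


def baseline_variant(text: str) -> str:
    return BASELINE_ANSWERS[text]


def candidate_variant(text: str) -> str:
    return CANDIDATE_ANSWERS[text]


def exact_match(expected: str, actual: str) -> float:
    """Simplest possible evaluator: 1.0 on exact match, else 0.0."""
    return 1.0 if expected.strip() == actual.strip() else 0.0


def run_experiment(dataset, task, evaluator) -> float:
    """Run the task on every example and return the mean eval score."""
    scores = [evaluator(ex["expected"], task(ex["input"])) for ex in dataset]
    return sum(scores) / len(scores)


def compare(dataset, baseline, candidate, evaluator) -> dict:
    """Score both variants on the same dataset and report the difference."""
    base = run_experiment(dataset, baseline, evaluator)
    cand = run_experiment(dataset, candidate, evaluator)
    return {"baseline": base, "candidate": cand, "improved": cand > base}


if __name__ == "__main__":
    print(compare(DATASET, baseline_variant, candidate_variant, exact_match))
```

Holding the dataset and evaluator fixed while changing only the variant is what makes the two scores comparable; automating the loop, as described above, would mean generating the dataset, the evaluator, or the candidate change with less manual work.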

Finally, Dat introduces Arize’s two main products: Arize Phoenix, an open-source, easy-to-deploy observability tool for engineering-first users, and Arize AX, a more comprehensive platform tailored for large enterprises like Uber, Booking, and Reddit. Both products support the company’s mission to provide deep insights and automation capabilities for AI systems at scale. Dat concludes by inviting further discussion and expressing enthusiasm for advancing AI observability and evaluation practices in the industry.