How agent o11y differs from traditional o11y — Phil Hetzel, Braintrust

Phil Hetzel explains that agent observability differs from traditional observability because agents powered by large language models are non-deterministic and complex, so it requires specialized tools to analyze qualitative aspects of agent behavior and large, mostly unstructured trace data. Braintrust’s approach combines a custom trace database, the involvement of both technical and domain experts, and LLM-driven analysis to enable real-time insights and iterative improvements beyond conventional monitoring methods.

Phil Hetzel, lead of solutions engineering at Braintrust, presents a theoretical discussion on how agent observability (o11y) differs from traditional observability. Traditional observability focuses on monitoring application uptime and technical performance metrics such as latency, error rates, and traces, using established tools like Grafana and Datadog. These tools help ensure that applications are operational and performing as expected from a technical standpoint. Braintrust, while an agent observability platform, also uses traditional observability tools for monitoring basic system health.

The key distinction highlighted is that agents, especially those powered by large language models (LLMs), are non-deterministic and highly variable, unlike traditional deterministic applications. This non-determinism means agent observability must capture a broader and more complex set of metrics than latency and error counts. For example, it needs to assess qualitative aspects such as whether the agent’s responses are grounded in context, whether they align with brand standards, and whether the agent used the appropriate tools during reasoning. Traditional observability tools cannot measure these factors effectively.
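To make the idea of qualitative scoring concrete, here is a minimal sketch of two scorers run over a single trace: one checks whether an expected tool was called, the other approximates grounding by word overlap between the answer and the retrieved context. The `Trace` and `Span` shapes, the function names, and the overlap heuristic are all assumptions made for illustration; they are not Braintrust's API, and a production grounding check would more plausibly use an LLM judge.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Span:
    kind: str   # "llm" or "tool" (hypothetical span types)
    name: str


@dataclass
class Trace:
    context: str        # retrieved text the answer should rely on
    final_answer: str
    spans: list = field(default_factory=list)


def _words(text):
    """Lowercased set of alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def used_expected_tool(trace, expected_tool):
    """Return 1.0 if any tool span has the expected name, else 0.0."""
    called = any(s.kind == "tool" and s.name == expected_tool for s in trace.spans)
    return 1.0 if called else 0.0


def grounding_overlap(trace):
    """Fraction of distinct answer words that also occur in the context.

    A crude stand-in for a real groundedness judge.
    """
    answer = _words(trace.final_answer)
    if not answer:
        return 0.0
    return len(answer & _words(trace.context)) / len(answer)


trace = Trace(
    context="The refund window is 30 days.",
    final_answer="Refund window is 30 days",
    spans=[Span("tool", "lookup_policy"), Span("llm", "answer")],
)
scores = {
    "tool_use": used_expected_tool(trace, "lookup_policy"),
    "grounding": grounding_overlap(trace),
}
```

Neither score is something a latency or error-rate dashboard would surface: a request can return quickly with no errors and still score poorly on both.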

Agent traces present unique challenges because they are semi-structured and voluminous, often containing large amounts of unstructured text. They can be extremely large, sometimes reaching gigabytes in size, and require specialized systems to ingest, process, and query them efficiently in real time. Braintrust has developed a custom database optimized for these workloads, incorporating features like write-ahead logs for instant trace visibility, advanced indexing (including full-text search using a fork of the Tantivy search library), and SQL-like querying capabilities to handle the complexity and scale of agent trace data.
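The combination of an append-only log and a full-text index can be illustrated with a toy in-memory store. This is only a sketch of the general idea: the class name `TinyTraceStore` and its methods are hypothetical, nothing here reflects Braintrust's actual database or Tantivy's API, and a real write-ahead log would be durable on disk rather than a Python list.

```python
import json
import re
from collections import defaultdict


class TinyTraceStore:
    """Toy store: traces are appended to a log and indexed at write time."""

    def __init__(self):
        self.log = []                   # append-only list of serialized traces
        self.index = defaultdict(set)   # token -> offsets into self.log

    def append(self, trace):
        """Append the trace to the log, then index its text; return its offset."""
        offset = len(self.log)
        self.log.append(json.dumps(trace))
        for token in self._tokens(self._strings(trace)):
            self.index[token].add(offset)
        return offset

    def search(self, query):
        """Return traces containing every query token, in write order."""
        tokens = self._tokens([query])
        if not tokens:
            return []
        hits = set.intersection(*(self.index.get(t, set()) for t in tokens))
        return [json.loads(self.log[i]) for i in sorted(hits)]

    @staticmethod
    def _strings(value):
        """Yield every string leaf in a nested dict/list structure."""
        if isinstance(value, str):
            yield value
        elif isinstance(value, dict):
            for v in value.values():
                yield from TinyTraceStore._strings(v)
        elif isinstance(value, list):
            for v in value:
                yield from TinyTraceStore._strings(v)

    @staticmethod
    def _tokens(texts):
        """Lowercased alphanumeric tokens across all given texts."""
        return {t for text in texts for t in re.findall(r"[a-z0-9]+", text.lower())}


store = TinyTraceStore()
store.append({"id": "t1", "spans": [{"name": "search_orders",
                                     "output": "upstream timeout after 30s"}]})
store.append({"id": "t2", "spans": [{"name": "answer",
                                     "output": "Your refund was issued"}]})
matches = [t["id"] for t in store.search("timeout")]
```

Because indexing happens as part of the write, a trace is searchable as soon as `append` returns, which is the property the talk attributes to the write-ahead log. Handling gigabyte-scale traces would additionally need on-disk segments and compaction, which this sketch leaves out.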

Another significant difference lies in the personas involved in observability. Traditional observability is typically managed by technical roles such as systems engineers or product engineers. In contrast, agent observability benefits from the involvement of both technical and non-technical experts, including domain specialists like clinicians, lawyers, or financial advisors. These experts can contribute valuable insights by reviewing traces and providing human annotations, which help improve agent quality and enable the creation of scalable automated scoring functions based on expert feedback.
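One way expert annotations can feed automated scoring is as a reference set: before an automated scorer is trusted at scale, its labels are compared with the experts' labels on the same traces. The sketch below computes raw agreement and Cohen's kappa for that comparison. The function names and the pass/fail labels are illustrative assumptions, and the talk does not say which agreement measure, if any, Braintrust uses.

```python
from collections import Counter


def agreement_rate(human, automated):
    """Fraction of traces where the automated label matches the human label."""
    if len(human) != len(automated) or not human:
        raise ValueError("label lists must be non-empty and equal in length")
    return sum(h == a for h, a in zip(human, automated)) / len(human)


def cohens_kappa(human, automated):
    """Agreement corrected for the agreement expected by chance."""
    observed = agreement_rate(human, automated)
    n = len(human)
    human_freq, auto_freq = Counter(human), Counter(automated)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(human_freq[label] * auto_freq[label] for label in human_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels: a clinician's review versus an automated scorer.
expert_labels = ["pass", "pass", "fail", "fail"]
scorer_labels = ["pass", "pass", "fail", "pass"]
rate = agreement_rate(expert_labels, scorer_labels)
kappa = cohens_kappa(expert_labels, scorer_labels)
```

A scorer that agrees with experts well above chance on the annotated sample is a reasonable candidate to run over the much larger volume of traces that no human will review.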

Looking ahead, Braintrust is advancing agent observability by integrating lightweight LLMs to analyze and cluster agent traces, extracting insights such as user intent, sentiment, and common failure modes. This approach aims to accelerate the feedback loop between production issues and experimental fixes, enhancing agent performance iteratively. The platform also supports both real-time observability and batch evaluation (evals), treating them as variations of the same underlying problem. Overall, agent observability represents a fundamentally different and more complex challenge than traditional observability, requiring new tools, workflows, and collaborative approaches.
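The trace-clustering step can be sketched as a labeling function applied to each trace, followed by grouping and counting. In the sketch below, `keyword_labeler` is a deliberately simple stand-in for the lightweight LLM call described in the talk, and every name, label, and sample trace is hypothetical.

```python
from collections import Counter, defaultdict


def keyword_labeler(trace_text):
    """Stand-in for an LLM call: map trace text to a coarse failure label."""
    text = trace_text.lower()
    if "timeout" in text or "timed out" in text:
        return "tool_timeout"
    if "i can't help" in text or "i cannot help" in text:
        return "refusal"
    return "other"


def cluster_traces(traces, label_fn=keyword_labeler):
    """Group trace ids by the label that label_fn assigns to their text."""
    clusters = defaultdict(list)
    for trace_id, text in traces.items():
        clusters[label_fn(text)].append(trace_id)
    return dict(clusters)


def summarize(clusters):
    """Return (label, count) pairs, most frequent first."""
    return Counter({label: len(ids) for label, ids in clusters.items()}).most_common()


traces = {
    "t1": "search_orders timed out after 30s",
    "t2": "I can't help with that request",
    "t3": "Tool timeout while calling billing API",
    "t4": "Order shipped",
}
clusters = cluster_traces(traces)
summary = summarize(clusters)
```

Swapping `label_fn` for a model-backed function leaves the grouping logic unchanged, and the largest clusters point to the failure modes most worth reproducing in an offline eval.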