Your Agent Failed in Prod. Good Luck Reproducing It. - Tisha Chawla & Susheem Koul, Microsoft

Tisha Chawla and Susheem Koul from Microsoft highlight the challenges of reproducing AI agent failures in production due to inherent non-determinism and propose focusing on replayability rather than exact determinism. They introduce Chronicle, a tool that records detailed execution traces and supports selective replay and testing, enabling effective debugging and improving reliability in AI agent deployments.

The video addresses a common challenge with production AI agents: failures that occur in real-world environments often cannot be reproduced. The speakers illustrate this with an agent that interprets a user request to sell $1,000 worth of stock as a request to sell 1,000 shares, a costly mistake. Because the system still returns a successful response, the error is hard to detect and debug. The usual engineering reflex of rerunning the same prompt locally often fails to reproduce the issue, because of inherent non-determinism in AI models and system-level factors such as batch processing and mixture-of-experts routing.
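One commonly cited reason that batching can change results is that floating-point addition is not associative, so a different summation order can produce a slightly different score. The toy sketch below is an illustration only, not something shown in the talk: the token names, the scores, and the `greedy_pick` helper are all hypothetical. It shows how a difference in the last bit can flip a greedy choice between two near-tied candidates.

```python
# Toy illustration (hypothetical values): the same three numbers summed in two orders.
score_order_a = (0.1 + 0.2) + 0.3   # one reduction order
score_order_b = 0.1 + (0.2 + 0.3)   # another reduction order

def greedy_pick(scores: dict[str, float]) -> str:
    """Return the highest-scoring candidate; ties go to the first key."""
    return max(scores, key=scores.get)

# A competing candidate sits at exactly 0.6, so the last bit decides the outcome.
pick_a = greedy_pick({"dollars": 0.6, "shares": score_order_a})
pick_b = greedy_pick({"dollars": 0.6, "shares": score_order_b})

print(score_order_a == score_order_b)
print(pick_a, pick_b)
```

Real models involve far larger reductions, and the effect is usually invisible, but this is why an identical prompt is not guaranteed to take an identical path.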

The speakers argue that pursuing bitwise determinism, meaning the exact same model output on every run, is misguided and practically impossible with current AI APIs. Instead, they propose focusing on replayability: capturing enough information about each run to recreate its state transitions and debug the system effectively, even if the model’s outputs vary. This shift from determinism to observability lets engineers understand and fix issues without needing the model to behave identically every time.

To implement this concept, they introduce Chronicle, a tool designed to record inputs and outputs at defined boundaries within an agent’s workflow. By annotating methods such as tool calls or LLM invocations with a boundary annotation, Chronicle captures detailed traces of each step, including metadata like model versions and sampling parameters. This comprehensive recording enables developers to visualize and analyze the exact sequence of events that led to a failure, facilitating precise debugging and root cause analysis.
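The boundary idea described above can be sketched with a plain Python decorator. This is a minimal illustration under stated assumptions, not Chronicle's actual API: the `boundary` decorator, the in-memory `TRACE` list, the event fields, and the two stand-in functions are all hypothetical names chosen for this example.

```python
import functools
import time
from typing import Any, Callable

# In-memory trace for the sketch; a real recorder would persist events somewhere durable.
TRACE: list[dict[str, Any]] = []

def boundary(name: str, **metadata: Any) -> Callable:
    """Record inputs, output (or error), and metadata for each call to the wrapped function."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            event: dict[str, Any] = {
                "boundary": name,
                "metadata": metadata,
                "inputs": {"args": list(args), "kwargs": kwargs},
                "started_at": time.time(),
            }
            try:
                event["output"] = fn(*args, **kwargs)
                return event["output"]
            except Exception as exc:
                event["error"] = repr(exc)
                raise
            finally:
                # Runs on both success and failure, so every call leaves an event.
                TRACE.append(event)
        return wrapper
    return decorator

@boundary("llm.parse_order", model="example-model-v1", temperature=0.7)
def parse_order(user_text: str) -> dict:
    # Stand-in for an LLM call; returns the misreading from the stock example.
    return {"action": "sell", "quantity": 1000, "unit": "shares"}

@boundary("tool.place_order")
def place_order(order: dict) -> dict:
    # Stand-in for a tool call that reports success regardless of intent.
    return {"status": "success", "order": order}

result = place_order(parse_order("Sell $1,000 worth of stock"))
for event in TRACE:
    print(event["boundary"], event["metadata"], event["output"])
```

Even though the final response says "success", the trace shows the step where dollars became shares, along with the model name and sampling parameters in effect at that step.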

Beyond recording, Chronicle supports replaying recorded runs with selective stubbing of nodes, allowing developers to isolate and test fixes in a controlled environment. For example, after identifying a faulty tool call, engineers can stub other parts of the agent while running the corrected tool live to verify the fix. This approach also enables automated assertions on outputs, integrating testing directly into the debugging workflow and ensuring that fixes are validated against real failure scenarios.
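Selective stubbing can be sketched as follows. This is a minimal sketch and not Chronicle's real interface: the recorded-run format, the `replay` function, the step names, and the assumption that the fault sat in an order-building tool are all hypothetical choices made for illustration. Every step not listed in `live` returns its recorded output, and only the corrected tool runs for real.

```python
from typing import Any, Callable

# Hypothetical recording of the failing run: boundary name -> recorded output.
RECORDED: dict[str, Any] = {
    "llm.parse_order": {"action": "sell", "amount": 1000, "unit": "usd"},
    "tool.get_price": 50.0,
    "tool.build_order": {"action": "sell", "shares": 1000},  # the recorded bug
}

Step = tuple[str, Callable[[dict[str, Any]], Any]]

def replay(steps: list[Step], recorded: dict[str, Any], live: set[str]) -> dict[str, Any]:
    """Re-run a workflow, executing only the steps in `live` and stubbing the rest."""
    outputs: dict[str, Any] = {}
    for name, fn in steps:
        if name in live:
            outputs[name] = fn(outputs)      # run the real (fixed) code
        else:
            outputs[name] = recorded[name]   # stub with the recorded output
    return outputs

def build_order_fixed(outputs: dict[str, Any]) -> dict[str, Any]:
    """Corrected tool: convert a dollar amount into a share count."""
    parsed = outputs["llm.parse_order"]
    price = outputs["tool.get_price"]
    shares = parsed["amount"] / price if parsed["unit"] == "usd" else parsed["amount"]
    return {"action": parsed["action"], "shares": shares}

def not_replayed(outputs: dict[str, Any]) -> Any:
    # Stubbed steps are never executed during this replay.
    raise RuntimeError("stubbed step should not execute")

steps: list[Step] = [
    ("llm.parse_order", not_replayed),
    ("tool.get_price", not_replayed),
    ("tool.build_order", build_order_fixed),
]

replayed = replay(steps, RECORDED, live={"tool.build_order"})

# Automated assertion on the replayed output: $1,000 at $50 per share is 20 shares.
assert replayed["tool.build_order"]["shares"] == 20.0
print(replayed["tool.build_order"])
```

Because the model call and the price lookup are stubbed with their recorded outputs, the check is repeatable and can be kept as a regression test for this specific failure.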

Finally, the presenters highlight key takeaways for productionizing AI agents: avoid chasing impossible determinism, log all relevant session variables, capture the full context beyond just prompts, use replay to debug and test, and preserve model randomness to maintain creativity and agency. They provide access to Chronicle’s code and resources, encouraging teams to adopt these practices to improve reliability and reduce on-call burdens in AI agent deployments.
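As a rough sketch of what logging session variables and context beyond the prompt might look like, the example below snapshots a run's context as one JSON line. The `RunContext` class and every field in it are assumptions for illustration; the talk does not prescribe a schema, and a real deployment would choose fields that match its own stack.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class RunContext:
    """Hypothetical per-run snapshot; field names are illustrative assumptions."""
    session_id: str
    model_version: str
    temperature: float
    top_p: float
    system_prompt: str
    tool_versions: dict[str, str] = field(default_factory=dict)

def to_log_line(ctx: RunContext) -> str:
    """Serialize the snapshot as a single JSON line with stable key order."""
    return json.dumps(asdict(ctx), sort_keys=True)

ctx = RunContext(
    session_id="session-001",
    model_version="example-model-v1",
    temperature=0.7,
    top_p=0.95,
    system_prompt="You are a trading assistant.",
    tool_versions={"place_order": "1.4.2"},
)

line = to_log_line(ctx)
restored = RunContext(**json.loads(line))  # the snapshot can be reloaded for a replay
print(line)
```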