Fighting AI with AI — Lawrence Jones, Incident

Lawrence Jones of incident.io explains how his company uses AI both in its incident response platform and in its internal tools to manage, debug, and improve complex AI systems through automated testing, analysis, and code refinement. By combining AI-driven evals, file system-based debugging, and parallel agent analysis, incident.io makes the performance of those systems easier to understand at scale. Jones closes by inviting engineers to join the company's AI SRE team.

Lawrence Jones, a founding engineer at incident.io, discusses how his company leverages AI to manage the complexity of its AI-driven incident response platform. The company helps customers such as Netflix and Etsy manage and communicate during production incidents, with the goal of fully automating production investigations. However, building such AI systems is challenging because they are complex and because humans find it difficult to debug them and understand their performance. To address this, incident.io uses AI not only in its products but also extensively in its internal tools to monitor and improve these systems.

A key part of their approach involves evals, which are essentially unit tests for AI: each one checks whether a prompt produces the desired output. These evals live alongside the code and help ensure that any change to a prompt maintains or improves performance. Managing evals is tricky, however, because realistic test data can be large and unwieldy, which makes it hard for both humans and coding agents to work with them efficiently. To solve this, incident.io built a CLI tool, referred to as the eval tool, that lets coding agents interact with eval suites more effectively, enabling automated prompt testing and refinement through a red-green cycle: run the suite to see a failure (red), change the prompt, and rerun until it passes (green).
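The idea of an eval as a unit test for a prompt, and the red-green cycle around it, can be sketched roughly as follows. This is a minimal illustration and not incident.io's eval tool: the `EvalCase` type, the `run_suite` helper, the severity task, and both `prompt_v1` and `prompt_v2` are hypothetical, and the "model" is replaced by plain Python stubs so that the example runs without any API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    incident_text: str
    expected_severity: str


# Hypothetical test data; real eval inputs would be far larger.
CASES = [
    EvalCase("db_outage", "Database outage, all customers affected", "critical"),
    EvalCase("ui_typo", "Typo on the settings page", "minor"),
]


def run_suite(prompt_fn: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case through prompt_fn and record pass/fail by case name."""
    return {c.name: prompt_fn(c.incident_text) == c.expected_severity for c in cases}


def prompt_v1(text: str) -> str:
    """Stub for a first prompt version that never commits to a severity."""
    return "unknown"


def prompt_v2(text: str) -> str:
    """Stub for a revised prompt; a keyword rule stands in for the model."""
    return "critical" if "outage" in text.lower() else "minor"


if __name__ == "__main__":
    print("v1:", run_suite(prompt_v1, CASES))  # red: failing cases
    print("v2:", run_suite(prompt_v2, CASES))  # green: all cases pass
```

Returning a small pass/fail mapping rather than the full test data is one way to keep the result compact enough for a coding agent to read, which is the problem the summary attributes to large eval inputs.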

Jones highlights the complexity of modern AI systems, which often involve many interconnected prompts, agents, and tools. Debugging such systems is difficult because errors can originate from any part of the hierarchy, and traditional UI tools are not easily usable by AI agents. Inspired by Anthropic’s Claude Code, incident.io converts their UI data into downloadable file systems that coding agents can navigate and analyze. This approach allows agents to understand the full context of an incident, identify the root cause of problems, and suggest precise code changes, which can then be tested and deployed seamlessly.
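One way to picture turning hierarchical UI data into a file system is to write each step of an investigation to its own directory, so that a coding agent can explore it with ordinary file tools. The sketch below assumes a made-up trace shape (nested dictionaries with `name` and `children` keys) and a hypothetical `export_node` function; the talk summary does not describe the real export format.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical trace: a parent prompt with two child steps, one of which failed.
EXAMPLE_TRACE = {
    "name": "investigation",
    "prompt": "Find the root cause",
    "children": [
        {"name": "search_logs", "output": "3 matching log lines"},
        {"name": "summarise", "error": "timeout"},
    ],
}


def export_node(node: dict, directory: Path) -> None:
    """Write one node's own fields to node.json, then recurse into children."""
    directory.mkdir(parents=True, exist_ok=True)
    fields = {key: value for key, value in node.items() if key != "children"}
    (directory / "node.json").write_text(json.dumps(fields, indent=2))
    for index, child in enumerate(node.get("children", [])):
        # Numeric prefix preserves execution order in directory listings.
        export_node(child, directory / f"{index:02d}_{child['name']}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "investigation"
        export_node(EXAMPLE_TRACE, root)
        for path in sorted(root.rglob("node.json")):
            print(path.relative_to(root))
```

With a layout like this, an agent looking for the origin of an error can search the tree for an `error` field and then read the parent directories to recover the surrounding context.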

To handle thousands of investigations across many customer accounts, incident.io runs daily backtests that aggregate performance metrics. Initially these backtests showed that performance had changed but not why. The team addressed this by building an AI-driven analysis pipeline in which parallel coding agents analyze investigations, cluster failure types, and generate detailed reports that explain system behavior and suggest improvements. The pipeline stores incremental analysis in files and integrates with the codebase, so diagnoses and fixes can be automated and then validated through the eval system.
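The fan-out and clustering steps of such a pipeline might look something like the sketch below. It is only an illustration under stated assumptions: `analyse` is a hypothetical rule-based stand-in for a coding agent, the investigation fields (`correct`, `logs_found`) and the failure labels are invented, and a thread pool stands in for whatever mechanism actually runs the agents in parallel.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical backtest results; real investigations carry much more data.
SAMPLE = [
    {"id": "inv-1", "correct": True, "logs_found": True},
    {"id": "inv-2", "correct": False, "logs_found": False},
    {"id": "inv-3", "correct": False, "logs_found": True},
    {"id": "inv-4", "correct": False, "logs_found": False},
]


def analyse(investigation: dict) -> str:
    """Stand-in for one agent run: label a single investigation."""
    if investigation["correct"]:
        return "ok"
    if not investigation["logs_found"]:
        return "missing_context"
    return "wrong_conclusion"


def cluster_failures(investigations: list[dict], max_workers: int = 4) -> Counter:
    """Analyse investigations in parallel and count failures by label."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        labels = list(pool.map(analyse, investigations))
    return Counter(label for label in labels if label != "ok")


if __name__ == "__main__":
    for label, count in cluster_failures(SAMPLE).most_common():
        print(f"{label}: {count}")
```

Counting failures by label gives the "why" that aggregate metrics alone do not: the largest cluster points at the part of the system most worth changing first.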

In conclusion, Jones emphasizes the importance of using AI not only in product features but also in internal tooling to manage, debug, and evolve complex AI systems effectively. He advocates for designing debugging tools that work well with coding agents, leveraging file systems for agent context, and creating AI runbooks for repeatable analysis. Finally, he invites interested engineers to join incident.io’s growing team in London to work on cutting-edge AI SRE products, highlighting the exciting opportunities in this emerging field.