Why (Senior) Engineers Struggle to Build AI Agents — Philipp Schmid, Google DeepMind

Philipp Schmid from Google DeepMind explains that building AI agents requires engineers to shift from traditional deterministic software development to an iterative, trust-based process focused on dynamic instruction refinement, error recovery, and evaluation over multiple runs. He emphasizes designing agent-ready APIs, embracing non-determinism, and adopting a mindset of continuous rebuilding to create reliable, resilient AI agents.

Schmid argues that building AI agents differs significantly from traditional software engineering. In conventional development, engineers write a detailed specification, write code, test, and deploy. Building an agent is instead an iterative loop: define instructions, run the agent, observe its behavior, and refine the prompts or tools. Engineers therefore act more like dispatchers who set goals than traffic controllers who dictate exact steps, and the agent finds its own path to the objective, sometimes in unexpected ways.

A key difference is that text and context have become the new state in AI agents, replacing rigid data structures and Boolean flags. Large language models (LLMs) understand semantic meaning, enabling more flexible and dynamic interactions. For example, agents can interpret nuanced instructions or preferences, such as adjusting temperature units based on user context, which traditional software struggled to handle. This shift requires trusting the model to manage complex, stateful workflows dynamically rather than relying on deterministic, predefined paths.
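The contrast between flag-based state and text-based state can be sketched as follows. This is a minimal illustration, not something shown in the talk: `FlagState`, `ContextState`, `remember`, and `to_prompt` are hypothetical names, and the prompt layout is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class FlagState:
    # Traditional approach: one rigid flag per preference.
    use_celsius: bool = True


@dataclass
class ContextState:
    # Agent approach (hypothetical sketch): state is free text the model reads.
    notes: list[str] = field(default_factory=list)

    def remember(self, note: str) -> None:
        """Store a preference or fact in the user's own words."""
        self.notes.append(note)

    def to_prompt(self, request: str) -> str:
        """Render the accumulated notes plus the new request as model input."""
        known = "\n".join(f"- {n}" for n in self.notes) or "- (nothing yet)"
        return f"Known about the user:\n{known}\n\nRequest: {request}"


state = ContextState()
state.remember("Prefers Celsius for weather, but Fahrenheit for oven temperatures.")
prompt = state.to_prompt("How hot should I bake the bread?")
print(prompt)
```

A single Boolean cannot express "Celsius here, Fahrenheit there", whereas the note carries that nuance directly and leaves the interpretation to the model.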

Error handling in AI agents also demands a new approach. Instead of treating errors as failures that require restarting processes, errors should be treated as inputs that the agent can incorporate and learn from to continue its task. This is crucial because agent workflows can be lengthy and computationally expensive, making it impractical to restart from scratch after every failure. Designing for recovery and resilience is essential to maintain progress and efficiency in agent operations.
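One way to treat an error as an input is to catch the exception from a tool call and append it to the agent's context instead of aborting the run. The sketch below assumes a very simple agent whose context is a list of strings. `run_tool_step` and `read_file` are hypothetical helpers, and the corrected second call stands in for what a model might do after reading the error.

```python
from typing import Callable


def run_tool_step(tool: Callable[..., str], args: dict, context: list[str]) -> bool:
    """Run one tool call; on failure, record the error as an observation instead of raising."""
    try:
        result = tool(**args)
    except Exception as exc:
        # The error text becomes part of the context the agent sees on its next turn.
        context.append(f"TOOL ERROR ({type(exc).__name__}): {exc}")
        return False
    context.append(f"TOOL RESULT: {result}")
    return True


def read_file(path: str) -> str:
    """Hypothetical tool backed by an in-memory file table."""
    files = {"notes.txt": "hello"}
    if path not in files:
        # An informative message gives the agent something to recover with.
        raise FileNotFoundError(f"{path} not found; available: {sorted(files)}")
    return files[path]


context: list[str] = ["GOAL: summarize the notes file"]
ok_first = run_tool_step(read_file, {"path": "note.txt"}, context)    # wrong name
ok_second = run_tool_step(read_file, {"path": "notes.txt"}, context)  # corrected retry
print("\n".join(context))
```

The earlier context entries survive the failure, so the run continues from where it was rather than restarting from scratch.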

Testing AI agents requires moving away from traditional unit tests toward evaluation-based methods. Since agents are inherently non-deterministic, the same input may not always produce the same output. Success is measured by reliability over multiple runs rather than absolute correctness in a single test. Evaluations often involve subjective assessments, sometimes using LLMs or human experts as judges, to determine if the agent’s output meets the desired goals. This approach helps ensure agents are dependable enough for real-world applications despite their variability.
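Measuring reliability over multiple runs might look like the sketch below. It is an assumption-laden illustration: `pass_rate`, `flaky_agent`, and `exact_judge` are hypothetical, the agent is a random stub rather than a real model call, and the 0.8 threshold is an arbitrary example. In practice the judge could be an LLM or a human reviewer instead of a string comparison.

```python
import random
from typing import Callable


def pass_rate(
    agent: Callable[[str, random.Random], str],
    judge: Callable[[str], bool],
    task: str,
    runs: int = 20,
    seed: int = 0,
) -> float:
    """Run the agent `runs` times on one task and return the fraction the judge accepts."""
    rng = random.Random(seed)  # seeded only so this stubbed example is repeatable
    passed = sum(judge(agent(task, rng)) for _ in range(runs))
    return passed / runs


def flaky_agent(task: str, rng: random.Random) -> str:
    """Stub for a non-deterministic agent: usually right, sometimes not."""
    return "42" if rng.random() < 0.9 else "unsure"


def exact_judge(output: str) -> bool:
    """Simplest possible judge; a rubric-based LLM judge would replace this."""
    return output.strip() == "42"


rate = pass_rate(flaky_agent, exact_judge, "What is 6 * 7?", runs=50)
meets_bar = rate >= 0.8  # example threshold, chosen arbitrarily
print(f"pass rate: {rate:.2f}, meets bar: {meets_bar}")
```

Unlike a unit test, the outcome is a rate compared against a threshold, so a single failed run does not fail the evaluation.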

Finally, Schmid emphasizes that APIs for AI agents must be designed differently from traditional software interfaces. An agent relies on the semantic content of function schemas and documentation, not on the accumulated background knowledge a human developer brings to an API. APIs therefore need to be self-explanatory and “agent-ready” for the agent to use them effectively. Overall, engineers must combine trust with verification, design for recovery, evaluate rigorously, and accept that software built with AI agents will be disposable and continuously rebuilt as models improve. This mindset shift is crucial for developing reliable AI agents.
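As a rough illustration of what "agent-ready" could mean in practice, the sketch below compares two tool descriptions written as JSON-Schema-style dictionaries and flags missing descriptions. The tool names, fields, and the `missing_descriptions` check are hypothetical and are not tied to any particular vendor's function-calling format.

```python
def missing_descriptions(tool_schema: dict) -> list[str]:
    """List the places where an agent would have to guess what the tool means."""
    problems = []
    if not tool_schema.get("description"):
        problems.append("tool: missing description")
    props = tool_schema.get("parameters", {}).get("properties", {})
    for name, spec in props.items():
        if not spec.get("description"):
            problems.append(f"{name}: missing description")
    return problems


# Terse schema: usable by a developer who already knows the system.
terse_tool = {
    "name": "get_wx",
    "description": "",
    "parameters": {
        "type": "object",
        "properties": {"loc": {"type": "string"}, "u": {"type": "string"}},
    },
}

# Self-explanatory schema: names and descriptions carry the meaning.
agent_ready_tool = {
    "name": "get_weather_forecast",
    "description": "Return the forecast for a city for the next 24 hours.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Berlin'."},
            "unit": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
                "description": "Temperature unit for the result.",
            },
        },
        "required": ["city"],
    },
}

print(missing_descriptions(terse_tool))
print(missing_descriptions(agent_ready_tool))
```

A check like this only catches absent text. Whether a description is actually clear to a model still has to be evaluated by running the agent against it.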