The Gemini Interactions API

The video introduces Google’s Gemini Interactions API, which moves beyond simple chat models to support agent-based interactions: optional server-side conversation history, multimodal inputs and outputs, function calling, and tool integration. It also highlights the API’s support for asynchronous background tasks and structured responses, notes some limitations in citation URL handling, and frames the release as a significant evolution in Google’s AI developer tools.

The video opens by tracing how LLM APIs have evolved, motivating the new Gemini Interactions API. Early APIs, such as OpenAI’s completions API, were simple text-in, text-out interfaces with no conversation memory or role distinctions. These gave way to chat completions APIs, which introduced roles such as system, user, and assistant, enabling better context management for conversations. Later, function calling and structured response schemas became important as models began supporting agents and multimodal inputs and outputs, setting the stage for the Gemini Interactions API.
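The shift from completions to chat completions is easiest to see in the request shape. A minimal illustration, using the common role-tagged message convention (field names follow that convention, not any one vendor’s exact schema):

```python
# A completions-style request: raw text in, text out, no memory or roles.
completion_request = {"prompt": "Translate 'hello' to French."}

# A chat-completions-style request: a list of role-tagged messages that
# carries the conversation context along with every call.
chat_request = {
    "messages": [
        {"role": "system", "content": "You are a helpful translator."},
        {"role": "user", "content": "Translate 'hello' to French."},
        {"role": "assistant", "content": "Bonjour."},
        {"role": "user", "content": "Now to Spanish."},
    ]
}

# Because the server keeps no state, the client must resend the full
# message list on every turn -- the overhead the Interactions API's
# server-side history is designed to remove.
roles = [m["role"] for m in chat_request["messages"]]
print(roles)  # ['system', 'user', 'assistant', 'user']
```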

Google’s Gemini Interactions API is designed with agents in mind rather than just models, reflecting the trend that users interact with systems capable of multiple model calls, tool usage, and code execution rather than simple chat models. A key feature is optional server-side conversation history, allowing developers to persist state without resending all previous messages, improving token efficiency and cost. The API also supports background execution for long-running tasks, enabling asynchronous processing without maintaining a client connection.
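How server-side history and background execution change the request shape can be sketched as follows. This is a local illustration only: the field names (`store`, `previous_interaction_id`, `background`) are assumptions for the sake of the example, not confirmed parameter names from the API.

```python
# First turn: send the prompt; with server-side history enabled, the
# server persists the conversation and returns an interaction id.
first_request = {
    "model": "gemini-3-pro",
    "input": "What is the capital of France?",
    "store": True,  # opt in to server-side history (assumed flag)
}

# Later turn: reference the stored interaction instead of resending all
# previous messages, which saves tokens (and cost) on every call.
followup_request = {
    "model": "gemini-3-pro",
    "input": "And what is its population?",
    "previous_interaction_id": "interaction-123",  # assumed field name
}

# Long-running work can be pushed to the background (assumed flag); the
# client polls for the result instead of holding a connection open.
background_request = {**followup_request, "background": True}

# No message history travels with the follow-up turn.
assert "messages" not in followup_request
```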

The API supports rich multimodal capabilities, allowing inputs such as images, audio, video, and PDFs, and outputs including generated images and audio. The video demonstrates how to encode images and audio in base64 for input and how to specify output modalities like images using models such as Gemini 3 Pro. Structured outputs are simplified through integration with Pydantic models, enabling developers to define JSON schemas for complex, nested responses, which is useful for tasks like moderation or multi-part text generation.
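The base64 encoding step and the structured-output schema can be sketched locally with the standard library alone. The video uses Pydantic models to define the schema; the equivalent JSON Schema is written out directly here, and the request field names (`input`, `output_modalities`) are illustrative assumptions rather than the API’s exact schema.

```python
import base64

# Encode binary media as base64 so it can travel inside a JSON body
# (standard practice for inline media in request payloads).
fake_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16  # stand-in image bytes
encoded = base64.b64encode(fake_png).decode("ascii")

request = {
    "model": "gemini-3-pro",
    "input": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image", "mime_type": "image/png", "data": encoded},
    ],
    # Ask for image output alongside text (assumed field name).
    "output_modalities": ["text", "image"],
}

# A nested JSON schema of the kind a Pydantic model would generate,
# e.g. for a moderation-style structured response.
response_schema = {
    "type": "object",
    "properties": {
        "verdict": {"type": "string", "enum": ["allow", "flag"]},
        "reasons": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["verdict", "reasons"],
}

# Round-trip check: base64 is lossless, so decoding recovers the bytes.
assert base64.b64decode(encoded) == fake_png
```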

Function calling and tool integration are also enhanced. Developers can pass in lists of tools, including built-in ones like Google Search, code execution, and URL context tools. The API supports remote MCP servers, allowing external services to be integrated as tools. The video shows examples of code execution for mathematical calculations and using a weather service via an MCP server. However, the presenter notes frustration with Google’s handling of URLs in citations, which use redirect URLs rather than raw URLs, limiting usability for exporting or sharing clickable links.
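A tools list mixing built-in tools, a custom function declaration, and a remote MCP server might look like the sketch below. The entry shapes and the `get_weather` function are hypothetical, chosen to mirror the weather-via-MCP example from the video; only the tool names (Google Search, code execution, URL context) come from the source.

```python
tools = [
    {"type": "google_search"},   # built-in web search tool
    {"type": "code_execution"},  # built-in code sandbox
    {"type": "url_context"},     # built-in page-fetch tool
    {
        "type": "function",
        "name": "get_weather",   # hypothetical custom function
        "description": "Current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
    {
        # Remote MCP server exposed as a tool (assumed entry shape).
        "type": "mcp_server",
        "url": "https://example.com/mcp",
    },
]

builtin = [t["type"] for t in tools if t["type"] not in ("function", "mcp_server")]
print(builtin)  # ['google_search', 'code_execution', 'url_context']
```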

Finally, the video highlights the ability to call agents directly through the API, specifically showcasing the Deep Research agent (pro preview). This agent can run tasks in the background, returning detailed, well-cited responses asynchronously. While the agent performs well, the presenter reiterates concerns about citation URLs. Overall, the Gemini Interactions API unifies multiple functionalities, supports persistent server-side state, and opens the door to more advanced agent-based applications, marking a significant step forward for Google’s AI offerings and developer tools.
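The asynchronous pattern behind background agent calls reduces to “submit, then poll until done.” A self-contained sketch of that loop, with the status fetch injected as a function so it can be simulated locally (the status values `running`/`completed` are assumptions for illustration):

```python
import time

def poll_until_done(fetch_status, interval_s=0.0, max_polls=100):
    """Poll a background interaction until it leaves the running state.

    `fetch_status` stands in for a GET on the interaction resource.
    """
    for _ in range(max_polls):
        state = fetch_status()
        if state["status"] != "running":
            return state
        time.sleep(interval_s)
    raise TimeoutError("background task did not finish in time")

# Simulated server: still running twice, then completed with a result.
responses = iter([
    {"status": "running"},
    {"status": "running"},
    {"status": "completed", "output": "… well-cited findings …"},
])
result = poll_until_done(lambda: next(responses))
print(result["status"])  # completed
```

Because no client connection is held open while the task runs, the same pattern works for tasks that take minutes, such as the deep research agent’s long report generation.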