We’re introducing three audio models in the API

OpenAI has introduced three new real-time audio models in its API, including GPT Realtime Translate, which provides live translation across 70 languages, and GPT Realtime 2, a voice-agent model capable of reasoning, multitasking, and integrating with external systems for natural, context-aware interactions. Together, these models advance voice technology by enabling fluent multilingual communication and actionable voice-driven workflows, enhancing productivity and user experience across diverse applications.

OpenAI is introducing three new real-time audio models in its API, designed to make voice interaction more intelligent and responsive. The demo showcases two of them: GPT Realtime Translate and GPT Realtime 2. GPT Realtime Translate offers live translation: a user speaks in one language while the model translates and outputs audio in another language in real time. The demo highlights how smoothly the model handles multiple languages, including switching between languages mid-conversation and incorporating technical terms without difficulty. With support for translation across 70 languages, it is a powerful tool for breaking down language barriers in applications such as media, customer support, and education.
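As a rough illustration of how a developer might configure such a translation session, the sketch below builds a `session.update` event in the shape used by OpenAI's existing Realtime API. The model identifier and the exact instruction wording are assumptions; the announcement does not specify the API name for GPT Realtime Translate.

```python
import json

# Placeholder identifier -- the real API name for GPT Realtime Translate
# is not given in the announcement.
TRANSLATE_MODEL = "gpt-realtime-translate"  # assumed, not confirmed


def build_translation_session(target_language: str) -> dict:
    """Return a session.update event asking the model to act as a live
    interpreter, translating all incoming speech into target_language."""
    return {
        "type": "session.update",
        "session": {
            "instructions": (
                "You are a live interpreter. Translate everything the "
                f"speaker says into {target_language} and speak only the "
                "translation."
            ),
            "voice": "alloy",
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm16",
        },
    }


event = build_translation_session("Spanish")
print(json.dumps(event, indent=2))
```

Because the target language lives in the session instructions, switching languages mid-conversation is just another `session.update` with a new target.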

The GPT Realtime Translate model works by listening to the speaker and beginning translation as soon as key words or verbs are detected, resulting in a natural conversational flow that mimics a dialogue between two people. The live audio output is captured directly from the device, with no edits, providing an authentic demonstration of the model’s capabilities. This real-time responsiveness and accuracy make it feel magical and highly practical for global communication needs.
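The low-latency behavior described above comes from streaming audio to the model as it is captured rather than waiting for the speaker to finish. A minimal sketch of the client side, using the Realtime API's `input_audio_buffer.append` event shape (chunk sizes here are illustrative):

```python
import base64


def audio_chunk_events(chunks):
    """Yield one input_audio_buffer.append event per raw PCM16 chunk.

    Streaming small chunks as they are recorded is what lets the model
    start translating as soon as it has heard enough -- e.g. a key verb --
    instead of waiting for the full utterance.
    """
    for chunk in chunks:
        yield {
            "type": "input_audio_buffer.append",
            # Realtime API audio payloads are base64-encoded
            "audio": base64.b64encode(chunk).decode("ascii"),
        }


# Example: three fake 20 ms chunks of 16-bit PCM audio at 8 kHz
fake_chunks = [b"\x00\x01" * 160 for _ in range(3)]
events = list(audio_chunk_events(fake_chunks))
print(len(events), events[0]["type"])
```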

The second model, GPT Realtime 2, focuses on intelligent voice agents capable of reasoning and taking actions based on user instructions. The demo features a personal voice assistant that can access and interpret calendar information, respond to queries, and perform tasks such as updating a CRM system. This model supports parallel tool calling and reasoning, allowing it to manage multiple tasks simultaneously while keeping the user informed through preambles that explain ongoing processes. This ensures a smooth and transparent interaction, even when actions take a few seconds to complete.
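The parallel tool calling described above can be sketched as follows. The tool names (`get_calendar_events`, `update_crm_record`), their schemas, and the stub implementations are hypothetical stand-ins for real integrations; the schema shape follows the Realtime API's function-tool format, and the model would deliver each call's name and JSON arguments to the client for execution.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Illustrative tool schemas -- names and fields are not from the announcement.
TOOLS = [
    {
        "type": "function",
        "name": "get_calendar_events",
        "description": "List calendar events for a given date.",
        "parameters": {
            "type": "object",
            "properties": {"date": {"type": "string"}},
        },
    },
    {
        "type": "function",
        "name": "update_crm_record",
        "description": "Attach a note to a contact record in the CRM.",
        "parameters": {
            "type": "object",
            "properties": {
                "contact": {"type": "string"},
                "note": {"type": "string"},
            },
        },
    },
]


# Stub implementations standing in for real calendar/CRM integrations.
def get_calendar_events(date):
    return [{"time": "10:00", "title": "Design review"}]


def update_crm_record(contact, note):
    return {"contact": contact, "status": "updated"}


HANDLERS = {
    "get_calendar_events": get_calendar_events,
    "update_crm_record": update_crm_record,
}


def run_tool_calls(calls):
    """Execute several model-issued tool calls concurrently.

    Each call is a (name, json_arguments) pair. Running them in a thread
    pool mirrors the parallel tool calling in the demo: the agent can
    check the calendar and update the CRM at the same time, speaking a
    preamble while both are in flight.
    """
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(HANDLERS[name], **json.loads(args))
            for name, args in calls
        ]
        return [f.result() for f in futures]


results = run_tool_calls([
    ("get_calendar_events", '{"date": "today"}'),
    ("update_crm_record", '{"contact": "Dana", "note": "Follow up Friday"}'),
])
print(results)
```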

A key feature of GPT Realtime 2 is its ability to maintain conversational context and stay engaged without interrupting the user until prompted. This creates a natural and fluid dialogue experience, where the voice agent listens continuously and responds appropriately when addressed. The model’s integration capabilities allow it to connect with various systems, dashboards, services, and devices, enabling it to act within existing workflows and products seamlessly.
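The "listen continuously, respond only when addressed" behavior is governed by turn detection. A minimal sketch of the relevant session settings, using the Realtime API's server-side voice activity detection fields (the threshold and silence values here are illustrative defaults, not figures from the announcement):

```python
def build_agent_session(silence_ms: int = 700) -> dict:
    """Return a session.update event where server-side VAD decides when a
    user's turn has ended, so the agent waits through pauses instead of
    interrupting mid-thought."""
    return {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "server_vad",
                # Illustrative values: sensitivity of speech detection and
                # how long a silence must last before the turn is closed.
                "threshold": 0.5,
                "silence_duration_ms": silence_ms,
            },
        },
    }


config = build_agent_session()
print(config["session"]["turn_detection"])
```

Raising `silence_duration_ms` makes the agent more patient, which is a reasonable trade-off for an assistant meant to stay quiet until prompted.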

Overall, these new real-time audio models represent a significant advancement in voice technology, enabling live translation, intelligent reasoning, and actionable voice interactions. They open up possibilities for voice to become a primary interface across many domains, enhancing communication and productivity. OpenAI expresses excitement about the potential applications developers will create using these models, emphasizing their ability to preserve context, support multiple languages, and integrate deeply with user systems.