From Transcription to Live Music: Gemini's Audio Stack — Thor Schaeff, Google DeepMind

Thor Schaeff from Google DeepMind presents the Gemini family of audio-capable models, which cover advanced audio understanding, multilingual transcription, emotion detection, and customizable speech synthesis, and which support context-aware conversational applications, including on edge devices. He also showcases real-time multimodal speech-to-speech with Gemini 3.1 Flash Live and demonstrates DeepMind’s music generation model Lyria 3, including an integration called Life Jukebox that combines speech and music in an interactive application.

Thor Schaeff from Google DeepMind gives an overview of recent advances in AI audio technology at DeepMind, focusing on the Gemini series of models. Thor, who joined the team just before the release of Gemini 3, highlights the recent launch of Gemma 4, which brings multimodal capabilities, including audio understanding, to models that can run efficiently on edge devices. The Gemini models go beyond transcription: they pick up on audio nuances such as emotion, pacing, accents, and overlapping speech, and they can reason over audio across multiple languages and dialects.

A key demonstration uses the Gemini 3 Flash preview model, which analyzes an audio recording in a single API call and returns speaker identification, timestamps, language detection, emotion classification, and translations. This structured output lets developers build applications that go beyond plain transcription and capture the context and emotional tone of a conversation. The demo includes multilingual and overlapping speech, showing that the model handles both.
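The structured output described above might be modeled and consumed as follows. This is a minimal sketch: the `Segment` fields, the `segments` key, and the sample JSON are assumptions made for illustration, not the schema shown in the talk.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    """One utterance in the analyzed recording (hypothetical schema)."""
    speaker: str
    start: float  # seconds from the start of the recording
    end: float
    language: str
    emotion: str
    text: str
    translation: str


def parse_segments(raw_json: str) -> List[Segment]:
    """Parse a JSON response of the assumed shape into typed segments."""
    payload = json.loads(raw_json)
    return [Segment(**item) for item in payload["segments"]]


def speaking_time(segments: List[Segment]) -> Dict[str, float]:
    """Total seconds per speaker; overlapping speech counts for each speaker."""
    totals: Dict[str, float] = {}
    for seg in segments:
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + (seg.end - seg.start)
    return totals


# Made-up response, standing in for what a model call might return.
SAMPLE = """
{"segments": [
  {"speaker": "A", "start": 0.0, "end": 2.5, "language": "de",
   "emotion": "excited", "text": "Das ist ja toll!",
   "translation": "That is great!"},
  {"speaker": "B", "start": 2.0, "end": 4.0, "language": "en",
   "emotion": "neutral", "text": "Let's get started.",
   "translation": "Let's get started."}
]}
"""

segments = parse_segments(SAMPLE)
print(speaking_time(segments))
```

Typed segments like these make it straightforward to build features such as per-speaker statistics or emotion timelines on top of a single response.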

For speech generation, DeepMind’s approach differs from traditional text-to-speech systems: it uses a smaller set of base voices that can be modified on the fly through “director’s notes” to convey specific accents, emotions, and styles. This builds on the audio understanding in the Gemini models, which makes voice synthesis highly customizable and context-aware. Examples include an Irish-accented voice and a Singaporean English accent, which show the model’s flexibility and realism in speech performance.
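One way to think about director’s notes is as natural-language style directions placed ahead of the text to be spoken. The sketch below assembles such a prompt; the helper `build_tts_prompt` and the "Director's notes / Transcript" layout are hypothetical and are not a format confirmed by the talk.

```python
from typing import Sequence


def build_tts_prompt(transcript: str, notes: Sequence[str]) -> str:
    """Prefix the text to be spoken with natural-language style directions."""
    cleaned = [n.strip() for n in notes if n.strip()]
    if not cleaned:
        # No directions given: send the transcript unchanged.
        return transcript
    header = "\n".join(f"- {n}" for n in cleaned)
    return f"Director's notes:\n{header}\n\nTranscript:\n{transcript}"


prompt = build_tts_prompt(
    "Welcome back to the show!",
    ["Accent: Irish English", "Tone: warm and upbeat", "Pace: unhurried"],
)
print(prompt)
```

Keeping the notes separate from the transcript means the same base voice and text can be re-rendered with a different accent or mood by changing only the notes.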

Thor also introduces Gemini 3.1 Flash Live, a real-time, multimodal speech-to-speech model that ingests text, audio, and video over WebSockets and returns audio responses in real time along with transcripts. Unlike traditional pipelines that separate audio processing from language understanding, this model builds the intelligence directly into the audio processing, which makes conversations more natural and responsive. Developers can try it at no cost in Google AI Studio, using the provided code examples and agent skills to help with integration.
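Streaming audio over a WebSocket generally means cutting the microphone signal into small chunks and sending each one as its own message. The sketch below shows only that client-side framing step. The 16 kHz, 16-bit mono input format and the JSON field names (`type`, `mime_type`, `data`) are assumptions for illustration, not the documented wire format of the model.

```python
import base64
import json
from typing import Iterator


def pcm_chunks(pcm: bytes, sample_rate: int = 16000,
               chunk_ms: int = 100) -> Iterator[bytes]:
    """Split 16-bit mono PCM into fixed-duration chunks (last may be shorter)."""
    bytes_per_chunk = sample_rate * 2 * chunk_ms // 1000  # 2 bytes per sample
    for offset in range(0, len(pcm), bytes_per_chunk):
        yield pcm[offset:offset + bytes_per_chunk]


def audio_frame(chunk: bytes, sample_rate: int = 16000) -> str:
    """Wrap one chunk in a JSON text frame; field names are hypothetical."""
    return json.dumps({
        "type": "audio",
        "mime_type": f"audio/pcm;rate={sample_rate}",
        "data": base64.b64encode(chunk).decode("ascii"),
    })


# 250 ms of silence at 16 kHz, 16-bit mono: 4000 samples, 8000 bytes.
silence = bytes(8000)
frames = [audio_frame(c) for c in pcm_chunks(silence)]
print(len(frames))  # 3 frames: 100 ms, 100 ms, 50 ms
```

Short chunks keep latency low because the model can start processing speech before the speaker has finished; in a real client each frame would be sent over the open socket as soon as it is produced.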

Finally, Thor showcases DeepMind’s music generation capabilities with Lyria 3, which can create music with lyrics and comes as two models: one for short jingles and one for full-length songs. He demonstrates an interactive application called Life Jukebox that combines Gemini’s real-time conversational abilities with Lyria’s music generation to produce a custom German techno Schlager song about the UK startup scene. The demo shows how speech, music, and real-time interaction can be combined into a single AI-driven media experience.