Why TTS Models Now Look Like LLMs — Samuel Humeau, Mistral

Samuel Humeau from Mistral discusses the evolution of text-to-speech models into architectures resembling large language models, emphasizing their application in real-time, low-latency conversational agents and showcasing advanced voice cloning capabilities. He highlights the technical challenges of streaming audio generation, the balance between quality and efficiency, and the ethical considerations surrounding voice cloning, while advocating for continued innovation to make speech interfaces more natural and responsive.

Humeau surveys recent trends in text-to-speech (TTS) technology and introduces Mistral's newly released open-source TTS model. He stresses that TTS matters most in real-time applications, particularly conversational agents, where latency is critical. The ideal system starts streaming audio as soon as a large language model (LLM) produces its first text tokens, which reduces perceived delay and makes interactions feel more natural. Humeau demonstrates the model with a voice cloning example, showing that it can replicate a person’s voice from just a few seconds of audio and carry that voice over to other accents or languages.
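The idea of starting audio before the LLM has finished can be illustrated with a minimal sketch. It assumes a common workaround in which text tokens are buffered until a sentence boundary and each sentence is handed to the synthesizer; the talk does not say this is how Mistral's system works. The names `stream_speech` and `fake_tts` are hypothetical, and `fake_tts` merely stands in for a real synthesizer.

```python
from typing import Callable, Iterable, Iterator, List


def stream_speech(
    text_tokens: Iterable[str],
    synthesize: Callable[[str], bytes],
    boundaries: str = ".!?",
) -> Iterator[bytes]:
    """Yield one audio chunk per sentence as soon as the sentence is complete.

    Assumption: sentence-level chunking. A truly streaming TTS model could
    start even earlier, after only a few text tokens.
    """
    buffer: List[str] = []
    for token in text_tokens:
        buffer.append(token)
        # Flush when the token ends with sentence-final punctuation.
        if token.rstrip().endswith(tuple(boundaries)):
            yield synthesize("".join(buffer).strip())
            buffer.clear()
    # Flush any trailing text that did not end with punctuation.
    tail = "".join(buffer).strip()
    if tail:
        yield synthesize(tail)


def fake_tts(text: str) -> bytes:
    """Hypothetical stand-in for a synthesizer: returns the text as bytes."""
    return text.encode("utf-8")


llm_tokens = ["Hello", " there", ".", " How", " are", " you", "?"]
chunks = list(stream_speech(llm_tokens, fake_tts))
print(chunks)
```

Because `stream_speech` is a generator, the first chunk is available as soon as the first sentence is complete, before the remaining tokens have been produced.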

The talk traces the technical evolution of speech generation, from early methods such as stitching together recorded words to modern approaches that treat audio generation as a sequence modeling problem, much like language modeling. The currently dominant architecture uses an autoregressive decoder that generates audio in chunks or patches rather than sample by sample, trading off quality against computational cost. Because raw audio contains tens of thousands of samples per second, models first compress it into tokens that each represent a short frame, which makes transformer-based architectures practical. Humeau explains that their model uses 37 tokens per 80-millisecond frame. At 12.5 frames per second that works out to roughly 460 tokens per second (about 500 in round numbers), which is still a lot but manageable with optimized architectures.
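The token-rate figure can be checked with a few lines of arithmetic. The sketch below uses only the two numbers quoted in the talk (37 tokens per frame, 80 ms per frame); the function name and the 10-second utterance are illustrative choices.

```python
FRAME_MS = 80          # frame duration quoted in the talk
TOKENS_PER_FRAME = 37  # tokens per frame quoted in the talk


def tokens_per_second(frame_ms: float, tokens_per_frame: int) -> float:
    """Audio tokens the decoder must produce per second of speech."""
    frames_per_second = 1000.0 / frame_ms
    return frames_per_second * tokens_per_frame


rate = tokens_per_second(FRAME_MS, TOKENS_PER_FRAME)
print(f"{1000.0 / FRAME_MS} frames/s, {rate} tokens/s")
print(f"10 s of speech needs {int(rate * 10)} tokens")
```

The exact rate is 462.5 tokens per second, so the "about 500" in the talk is a rounded figure.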

How audio generation is conditioned on the text input varies across implementations. Some models receive the entire text up front, while others accept streaming text and update their context while audio is being generated. Mistral’s released model belongs to the former category: it takes the full text and a voice sample before producing any audio. Humeau nonetheless points to ongoing research into streaming architectures that either interleave text and audio tokens in a single sequence or process them as two parallel streams, with the goal of further reducing latency and improving real-time responsiveness, especially for longer utterances.
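The difference between the two conditioning styles can be made concrete by looking at where the first audio token sits in the sequence. The sketch below is a toy illustration, not Mistral's layout: the function names, the `(kind, value)` token representation, and the fixed ratio of two audio frames per text token are all assumptions chosen for the example.

```python
from itertools import islice
from typing import List, Sequence, Tuple

Token = Tuple[str, str]  # (kind, value), where kind is "voice", "text" or "audio"


def full_context_layout(
    voice: Sequence[Token], text: Sequence[Token], audio: Sequence[Token]
) -> List[Token]:
    """Voice sample and the whole text come first, then all audio."""
    return list(voice) + list(text) + list(audio)


def interleaved_layout(
    voice: Sequence[Token],
    text: Sequence[Token],
    audio: Sequence[Token],
    frames_per_text_token: int = 2,  # hypothetical fixed ratio
) -> List[Token]:
    """Each text token is followed by a few audio frames."""
    sequence = list(voice)
    audio_iter = iter(audio)
    for token in text:
        sequence.append(token)
        sequence.extend(islice(audio_iter, frames_per_text_token))
    sequence.extend(audio_iter)  # audio left over once the text is exhausted
    return sequence


def first_audio_index(sequence: Sequence[Token]) -> int:
    """Position of the first audio token in the sequence."""
    return next(i for i, (kind, _) in enumerate(sequence) if kind == "audio")


voice = [("voice", "v0")]
text = [("text", f"t{i}") for i in range(4)]
audio = [("audio", f"a{i}") for i in range(8)]

full = full_context_layout(voice, text, audio)
mixed = interleaved_layout(voice, text, audio)
print(first_audio_index(full), first_audio_index(mixed))
```

In the full-context layout the first audio token can only appear after every text token, so its position grows with the length of the text. In the interleaved layout it appears right after the first text token, which is why such layouts are attractive for long utterances.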

Humeau also touches on the practical implications of voice cloning technology, noting its increasing accessibility and potential impact on brand identity. Companies are beginning to treat vocal identity as seriously as visual branding, and this trend is likely to become mainstream. However, Mistral currently restricts open access to the voice cloning encoder to prevent misuse, offering only pre-made voices for public use. This cautious approach balances innovation with ethical considerations around voice impersonation.

In conclusion, Humeau highlights the synergy between large language models and TTS systems, where speech interfaces can leverage the powerful capabilities of LLMs for versatile and natural communication. He encourages exploration of their open-source model and technical report for deeper understanding. The future of TTS lies in refining streaming architectures to minimize latency and enhance user experience, making speech a seamless interface for AI agents.