Mati Staniszewski, co-founder of ElevenLabs, discusses how the company was inspired to create expressive, emotional voice synthesis technology to overcome the monotony of traditional dubbing and unlock the potential of voice as a primary AI interface across languages and applications. ElevenLabs has developed advanced audio AI models enabling realistic voice replication, emotional intelligence, and versatile voice agents for uses ranging from customer support to education, while maintaining a remote-first, quality-focused culture that balances innovation with practical deployment.
Mati Staniszewski, co-founder of ElevenLabs, shares the human story behind the company’s founding, tracing it back to his childhood friend and co-founder, P. Both are from Poland, where monotonous, emotionless voice dubbing is prevalent in the media, and that experience sparked their vision of technology that enables expressive, emotional voice synthesis across languages. They also recognized the broader potential of audio AI, including breaking language barriers, improving accessibility to audio content, and enabling voice as a primary interface for future humanoid robots and AI systems.
ElevenLabs took a unique approach to building frontier audio models, starting in 2022, when audio AI was still a niche field with few researchers. Audio models were smaller than those in other AI domains, which require massive compute and data, but they depended on extensively transcribed and annotated data, a problem ElevenLabs focused on solving. The company built a remote-first team by recruiting top researchers globally based on their work rather than their location, and monetized early to sustain independent development. Over time, it expanded its model suite from text-to-speech to speech-to-text, dubbing, real-time streaming, conversational voice agents, and even music generation, covering the entire audio AI research spectrum.
Staniszewski highlights several breakthrough moments for ElevenLabs, including replicating his own voice with emotion, creating AI that can laugh, and enabling voice translation that preserves the original speaker’s voice, as seen in viral examples involving public figures. The company is now focused on advancing emotional intelligence in voice agents, enabling them to detect and respond to human emotions dynamically, and developing audio general intelligence that can seamlessly combine different audio modalities like narration and singing in one continuous voice.
Regarding voice agents, ElevenLabs sees significant opportunities beyond traditional customer support, including revenue-generating sales interactions and citizen support in education and healthcare. They cite examples like Deliveroo’s voice agents managing restaurant hours and the Ukrainian government’s voice app providing frontline information during the war. The future of voice agents includes highly trusted, authenticated interactions where voice is a secure interface for accessing services, and the potential for interactive educational experiences with AI versions of famous teachers or experts.
On company culture and product development, ElevenLabs maintains small, flat teams with technical talent embedded across all departments to enhance automation and efficiency. They emphasize the importance of quality and domain-specific customization in audio models, balancing short-term revenue needs with long-term investments in data annotation and emotional nuance. Staniszewski also discusses the challenges and opportunities in voice AI, including the current limitations in emotional interaction and music generation, and the evolving landscape where voice agents may negotiate autonomously or communicate with each other, potentially using more efficient, non-human-like communication methods.