Kyutai STT & TTS - A Perfect Local Voice Solution?

The video reviews Kyutai’s newly released speech-to-text and text-to-speech models, highlighting their impressive speed, accuracy, and voice cloning capabilities using pre-made embeddings, though the actual voice cloning model remains unreleased due to privacy concerns. Despite some limitations, these compact models offer promising local voice processing solutions with creative features like voice blending, and the host anticipates further advancements and broader language support in the future.

The video revisits the Kyutai project, focusing on its recent releases of speech-to-text (STT) and text-to-speech (TTS) models. The host recalls an earlier video where they explored Moshi, a pipeline in which automatic speech recognition (ASR) feeds into a large language model whose responses are then spoken via TTS. Initially, the model was somewhat limited in intelligence but showed promise thanks to its low latency in processing speech input and generating speech output. Fast forward a few months, and Kyutai has now officially released separate STT and TTS models, primarily supporting English and French, which demonstrate impressive speed and accuracy, especially when run on capable hardware.
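As a rough sketch of what local transcription with one of the released STT checkpoints might look like, here is a minimal example, assuming the checkpoint id "kyutai/stt-2.6b-en" and that the model is usable through the Hugging Face transformers ASR pipeline; both assumptions should be verified against Kyutai's own documentation:

```python
# Minimal sketch: transcribing an audio file with a Kyutai STT checkpoint.
# The checkpoint id and pipeline support are assumptions for illustration --
# check Kyutai's docs for the officially supported inference path.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="kyutai/stt-2.6b-en",  # assumed checkpoint id (English-only variant)
)

result = asr("meeting_recording.wav")  # path to a local audio file
print(result["text"])
```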

The speech-to-text model is highlighted for its quick and accurate transcription, though it currently supports only English and French. The TTS model, a relatively small 1.6-billion-parameter model, offers multiple voices and generates speech output rapidly. The quality of the TTS is compared favorably to other well-known systems like Chatterbox, Dia, and ElevenLabs. A standout feature is its ability to perform voice cloning from a 10-second voice sample, capturing intonation and voice characteristics effectively, even with unusual voices.

However, the video expresses some frustration that Kyutai has not released the voice embedding model used for cloning, citing privacy concerns about keeping voice cloning consensual. Instead, they provide a repository of pre-made voice embeddings derived from datasets such as Expresso and VCTK. This restricts users from fine-tuning or creating new voice clones, and the models currently support only English and French, though Kyutai is exploring ways to expand language support in the future.
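For readers who want to try the pre-made voices, a minimal sketch of pulling one embedding file from the Hugging Face Hub follows; the repository id "kyutai/tts-voices" and the file path are assumptions for illustration, so browse the actual repository for the real file names:

```python
# Minimal sketch: fetching one pre-made voice embedding from Hugging Face.
# The repo id and filename below are placeholders -- inspect the voice
# repository to find the files that actually exist.
from huggingface_hub import hf_hub_download

voice_path = hf_hub_download(
    repo_id="kyutai/tts-voices",            # assumed repository of pre-made voices
    filename="expresso/example_voice.wav",  # hypothetical path; pick a real one
)
print("Voice embedding downloaded to:", voice_path)
```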

The host then demonstrates how to use the released models and voice embeddings through code examples from Kyutai’s GitHub repository. They show how to load different pre-made voice embeddings, generate speech from text, and even blend two voice embeddings to create a hybrid voice. This blending showcases the potential for creative voice manipulation, although creating custom voice embeddings would require more data and the actual voice cloning model, which remains unreleased. The host hopes that community interest might encourage Kyutai to release the cloning model eventually.
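The blending idea itself comes down to weighted interpolation between two embedding vectors. Below is a minimal sketch of that step, assuming each embedding loads as a plain PyTorch tensor of the same shape; the file names and loading format are placeholders rather than Kyutai's actual API:

```python
# Minimal sketch: blending two voice embeddings by linear interpolation.
# Assumes each embedding is a plain tensor saved with torch.save; Kyutai's
# real loading path differs, so treat the file handling as a placeholder.
import torch

voice_a = torch.load("voice_a_embedding.pt")  # hypothetical file names
voice_b = torch.load("voice_b_embedding.pt")

alpha = 0.5  # 0.0 = pure voice A, 1.0 = pure voice B
hybrid_voice = (1.0 - alpha) * voice_a + alpha * voice_b

# The hybrid tensor would then stand in for a single pre-made voice
# embedding when conditioning the TTS model.
torch.save(hybrid_voice, "hybrid_voice_embedding.pt")
```

Varying alpha between 0 and 1 gives a continuum of hybrid voices between the two source speakers, which is what makes the blending demo feel like a creative tool rather than a fixed voice picker.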

In conclusion, the Kyutai STT and TTS models represent a significant advancement in local voice processing technology, offering efficient and high-quality speech recognition and synthesis in a compact form. While the withheld voice cloning model is a drawback, the available tools still provide exciting opportunities for experimentation, especially with voice blending and multi-voice support. The host looks forward to future developments, including potential MLX versions for easier local deployment, and encourages viewers to engage with the project and share their thoughts. Overall, Kyutai's progress marks a promising step toward versatile, low-latency local voice solutions.