The video introduces the Kokoro 82M local text-to-speech model, highlighting its impressive performance and accessibility for developers seeking alternatives to external APIs. It showcases the model’s capabilities, including the creation of custom voices through blending techniques, and provides a walkthrough for running Kokoro locally, encouraging viewers to explore its potential for building conversational agents without API costs.
In the video, the presenter discusses the growing interest in local text-to-speech (TTS) systems as developers look for alternatives to external APIs such as those from OpenAI and Google. The focus is on a TTS model called Kokoro 82M, which has gained popularity for producing high-quality speech despite its small size: at roughly 82 million parameters, it is lightweight enough to run on a local machine without a GPU. The model is available on platforms like Hugging Face and GitHub, and the quality of its output quickly carried it to the top of the TTS leaderboard.
Kokoro 82M is notable for being trained on less than 100 hours of audio, yet it delivers high-quality results. The model is based on the StyleTTS 2 architecture, which is documented in a research paper. The community has begun to build on the model, producing external projects such as a GitHub repository for running Kokoro locally and a FastAPI implementation that mimics OpenAI’s speech endpoint, so existing OpenAI client code can target a local server instead. The presenter emphasizes the model’s accessibility and the potential for future enhancements.
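Because the FastAPI project mentioned above exposes an OpenAI-compatible speech endpoint, existing client code should work against it with little more than a base-URL change. A minimal sketch, assuming a local server on port 8880 (that project’s documented default); the model name "kokoro" and the voice name "af_bella" are assumptions based on that project’s defaults, not details from the video:

```python
# Point the standard OpenAI Python client at a local OpenAI-compatible TTS server.
# The base URL, model name, and voice below are assumed defaults and may differ.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8880/v1", api_key="not-needed")

response = client.audio.speech.create(
    model="kokoro",    # name the local server registers for the model (assumption)
    voice="af_bella",  # a Kokoro voice id (assumption)
    input="This request never leaves the local machine.",
)
response.write_to_file("speech.mp3")  # the default response format is MP3
```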
The video provides a walkthrough of how to use Kokoro, starting with a Google Colab example to demonstrate its functionality. The presenter explains that the model consists of two main components: the TTS model itself and voice embeddings that define the characteristics of different voices. Users can choose from various voices, including American and British options, as well as voices in other languages like French, Japanese, Korean, and Chinese. The ability to generate audio from text is showcased, highlighting the model’s versatility.
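A minimal sketch of that flow, assuming the community `kokoro` pip package (a later packaging of the model) rather than the exact Colab notebook shown in the video; the voice name "af_bella" follows the repository’s naming scheme, where the prefix encodes language and gender (a = American English, f = female):

```python
# pip install kokoro soundfile
from kokoro import KPipeline
import soundfile as sf

# lang_code "a" selects American English; other codes select
# the other supported languages (e.g. "b" for British English).
pipeline = KPipeline(lang_code="a")

text = "Kokoro is a lightweight text-to-speech model that runs locally."

# The pipeline yields (graphemes, phonemes, audio) chunks for longer inputs.
for i, (graphemes, phonemes, audio) in enumerate(pipeline(text, voice="af_bella")):
    sf.write(f"output_{i}.wav", audio, 24000)  # Kokoro generates 24 kHz audio
```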
Additionally, the video delves into the process of creating custom voices by blending existing voice embeddings. The presenter explains techniques such as weighted averaging and spherical interpolation (slerp) for combining embeddings into new voice characteristics, letting users experiment with different voice combinations and produce unique outputs. The video also mentions the possibility of training a separate model to generate new embeddings without retraining the TTS model itself, opening up further customization possibilities.
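A minimal sketch of both blending techniques, assuming the voice embeddings have been loaded as PyTorch tensors of identical shape (e.g. via torch.load on the .pt voicepack files distributed with the model); the file paths in the usage comments are illustrative:

```python
import torch

def weighted_average(v1: torch.Tensor, v2: torch.Tensor, w: float) -> torch.Tensor:
    """Linear blend: w parts of v1 to (1 - w) parts of v2."""
    return w * v1 + (1.0 - w) * v2

def slerp(v1: torch.Tensor, v2: torch.Tensor, t: float) -> torch.Tensor:
    """Spherical interpolation: rotates from v1 toward v2 along the arc between
    them, which preserves vector magnitude better than a straight average when
    the embeddings point in different directions."""
    a, b = v1.flatten(), v2.flatten()
    cos_omega = torch.dot(a, b) / (a.norm() * b.norm())
    omega = torch.acos(cos_omega.clamp(-1.0, 1.0))  # angle between the voices
    if omega.abs() < 1e-6:
        # Nearly parallel vectors: slerp degenerates to linear interpolation.
        return weighted_average(v1, v2, 1.0 - t)
    return (torch.sin((1 - t) * omega) * v1 + torch.sin(t * omega) * v2) / torch.sin(omega)

# Illustrative usage: blend two voicepacks 60/40 and save the custom voice.
# bella = torch.load("voices/af_bella.pt", weights_only=True)
# sarah = torch.load("voices/af_sarah.pt", weights_only=True)
# custom = weighted_average(bella, sarah, 0.6)
# torch.save(custom, "voices/af_custom.pt")
```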
Finally, the presenter demonstrates how to run Kokoro locally using an ONNX package, which improves performance and ease of use. The setup is straightforward, requiring only the installation of the necessary packages and copying the model files into place (a sketch of this setup follows below). Once configured, users can generate audio quickly and efficiently. The video concludes by encouraging viewers to explore Kokoro and to consider pairing it with a speech-to-text system to build local conversational agents without incurring any API costs. The presenter closes by inviting feedback and interaction from viewers, promoting engagement with the content.
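A minimal sketch of the local setup, assuming the community kokoro-onnx package (pip install kokoro-onnx soundfile) and that the ONNX model and voice files have been downloaded from that project’s releases; the exact file names vary between versions:

```python
import soundfile as sf
from kokoro_onnx import Kokoro

# The model and voice files are downloaded separately from the project's
# releases; the names below match an early release and may differ in newer ones.
kokoro = Kokoro("kokoro-v0_19.onnx", "voices.json")

samples, sample_rate = kokoro.create(
    "Hello from a fully local text-to-speech model.",
    voice="af_bella",
    speed=1.0,
    lang="en-us",
)
sf.write("hello.wav", samples, sample_rate)
```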