The video reviews KittenTTS, an ultra-compact, open-source text-to-speech model by Kitten ML that runs efficiently on CPUs and is suitable for browsers and mobile devices due to its small size (as low as 25 MB). While voice quality decreases slightly with smaller models, KittenTTS remains intelligible and practical for lightweight applications, and the presenter highlights its rapid development and potential for broader use.
The video reviews KittenTTS, a new text-to-speech (TTS) model developed by Kitten ML. The presenter highlights the appealing name and the impressively small size of the models. KittenTTS comes in three main versions: Mini (80 million parameters), Micro (40 million parameters), and Nano (15 million parameters), with the smallest also available in an 8-bit quantized version that brings the size down to just 25 MB. The presenter notes that while these models may not produce the highest-quality voices compared to larger models like Quent TTS (which has 1.7 billion parameters), their compactness makes them suitable for running in browsers, on mobile phones, and on other edge devices.
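As a rough back-of-the-envelope check on those sizes, a model's on-disk footprint is approximately parameter count × bytes per parameter. The parameter counts below come from the video; the bytes-per-parameter assumptions (2 bytes for fp16 weights, 1 byte for 8-bit quantization) are illustrative, and real files carry some extra container overhead:

```python
# Rough size estimates for the KittenTTS variants mentioned in the video.
# Parameter counts are from the video; the storage assumptions (fp16 for
# regular releases, 1 byte/parameter for 8-bit quantization) are guesses
# for illustration, not measured values.

VARIANTS = {
    "Mini": 80_000_000,
    "Micro": 40_000_000,
    "Nano": 15_000_000,
}

def size_mb(params: int, bytes_per_param: float) -> float:
    """Approximate on-disk size in megabytes (1 MB = 1e6 bytes)."""
    return params * bytes_per_param / 1e6

for name, params in VARIANTS.items():
    print(f"{name}: ~{size_mb(params, 2):.0f} MB at fp16, "
          f"~{size_mb(params, 1):.0f} MB at 8-bit")
```

This lines up with the quoted figure for the quantized Nano: 15 million parameters at roughly one byte each is about 15 MB of weights, with metadata and embeddings bringing the file toward the 25 MB mark.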
A key advantage of KittenTTS is its CPU optimization: it does not require a GPU to run efficiently, which makes it highly accessible for lightweight deployment scenarios such as embedding in web applications or mobile devices. The project is open source under the Apache 2.0 license, allowing broad usage and modification. Although KittenTTS is currently in developer preview, the presenter expects further improvements and more robust releases, especially given the model's rapid development from early versions to the current 0.8 release.
The presenter demonstrates how to use KittenTTS via a Google Colab notebook, emphasizing that no GPU is needed for inference. The package allows users to load all model variants and select from several pre-made voices. The voices are created in a style similar to Kokoro TTS, but KittenTTS achieves much smaller model sizes. The demonstration includes audio samples from each model size, showing that while there is some degradation in voice quality as the models get smaller, the loss is not dramatic, and the voices remain intelligible and usable.
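The workflow shown in the notebook boils down to a few lines. The sketch below follows the pattern from the project's README at the time of writing; the exact package name, model repo id, and voice identifiers vary between releases, so treat those strings as assumptions and check the project's documentation for the current ones:

```python
# Sketch of CPU-only KittenTTS inference, modeled on the project's README.
# The model id and voice name below are assumptions and may differ by
# release; no GPU is required.
# pip install kittentts soundfile
from kittentts import KittenTTS
import soundfile as sf

# Hugging Face repo id for a Nano-sized model (assumed naming scheme).
m = KittenTTS("KittenML/kitten-tts-nano-0.1")

# Generate audio from text with one of the bundled voices.
audio = m.generate(
    "KittenTTS runs entirely on the CPU.",
    voice="expr-voice-2-f",  # assumed voice id; consult the model's voice list
)

# Write the waveform to disk at the model's 24 kHz sample rate.
sf.write("output.wav", audio, 24000)
```

Swapping the repo id is all it takes to compare the Mini, Micro, and Nano variants, which is essentially what the presenter's listening tests do.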
Listening tests reveal that the smallest, quantized models introduce some audio artifacts and struggle with natural pauses and punctuation, but still maintain recognizable and consistent voices. Certain voices, such as "Luna" and "Hugo", hold up better in the smaller, quantized models, and the presenter suggests these could be ideal for quick-loading browser extensions or lightweight mobile applications. The ability to run the models locally and efficiently opens up new possibilities for TTS on devices with limited resources.
The video concludes by noting that KittenTTS models are distributed as ONNX files with accompanying NumPy files for the voice embeddings, similar to other TTS systems. The presenter encourages viewers to support the project, which appears to be largely the work of a single developer. The open-source nature and rapid progress of KittenTTS point to a promising future for ultra-compact, locally run TTS systems, and the presenter invites feedback from anyone who has tried the models in real-world applications.