ChatTTS - Conversational TTS Step by Step

The video introduces ChatTTS, a new Text-to-Speech system focused on conversational speech, trained on a large dataset of Chinese and English conversation data. It allows users to customize the speech output by annotating the text input with special tokens that control aspects like speaker voice, speed, and paralinguistic features, and the creators plan an open-source release to enable further development and customization.

The video discusses a new Text-to-Speech (TTS) system called ChatTTS that focuses on conversational text rather than narration or speeches. It is trained on a large dataset of Chinese and English conversation data; although the developers claim 10 million hours of training data, the narrator suggests the actual amount may be closer to 100,000 hours. The system supports both English and Chinese and offers the potential for customization for specific applications or voices. However, the model can be slow to run, and the current licensing restricts commercial use, although there are plans to release an open-source version for further development.

To use ChatTTS, users can work with the code on GitHub and download the model weights from Hugging Face. After installing the required packages and loading the models, users can run batch inference to generate speech from multiple text inputs at once. The video demonstrates how to manipulate the text input to control aspects like the speaker's voice, speed, and paralinguistic features. By fixing a specific speaker embedding, users can keep the generated voice consistent and reuse the same embedding across future generations.
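The workflow described above can be sketched as follows. This is a minimal illustration based on the ChatTTS project's published usage pattern, not a definitive API reference: exact function names (e.g. `chat.load()` vs. the older `chat.load_models()`, and the `InferCodeParams` wrapper) have varied between releases of the library, so check the README of the version you install.

```python
# Sketch of the setup and batch-inference workflow shown in the video.
# Assumes the ChatTTS package is installed, e.g. via
#   pip install ChatTTS
# or by cloning the GitHub repo and installing its requirements.
import ChatTTS

chat = ChatTTS.Chat()
chat.load()  # downloads the model weights from Hugging Face on first run

# Batch inference: one waveform (a NumPy array) is returned per input string.
texts = [
    "Hello, welcome to ChatTTS.",
    "It is designed for conversational speech rather than narration.",
]
wavs = chat.infer(texts)

# Sample a random speaker embedding and reuse it for a consistent voice.
# Saving this embedding lets you reproduce the same speaker later.
spk = chat.sample_random_speaker()
params_infer_code = ChatTTS.Chat.InferCodeParams(spk_emb=spk)
wavs_same_voice = chat.infer(texts, params_infer_code=params_infer_code)
```

The returned waveforms can then be written out with any audio library (the video plays them directly in a notebook).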

The video then turns to refined text: input text annotated with special tokens that guide the prosody and paralinguistic features of the generated speech. By adding tokens for laughter or pauses, users can influence the rhythm and intonation of the output, and the tokens can be experimented with to achieve the desired effect. The narrator demonstrates how refined text changes the generated speech by introducing breaks and laughter at specific points in the input.
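The token syntax can be illustrated without running the model. The token names below (`[laugh]`, `[uv_break]`, `[lbreak]`) follow the conventions demonstrated in the video; the stripping helper is hypothetical, included only to show that the tokens are plain inline markers layered on top of ordinary text:

```python
import re

# Refined text: ordinary text annotated with inline control tokens.
# [uv_break] inserts a pause, [laugh] inserts laughter, [lbreak] marks
# the end of a line of speech.
refined = "What is [uv_break] your favorite english food? [laugh][lbreak]"

def strip_control_tokens(text: str) -> str:
    """Hypothetical helper: remove control tokens to recover plain text."""
    return re.sub(r"\s*\[[a-z_0-9]+\]\s*", " ", text).strip()

print(strip_control_tokens(refined))
# → What is your favorite english food?
```

In practice, ChatTTS can also generate such annotations itself via its text-refinement step, which is how the video obtains a refined version of a plain input sentence before synthesis.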

ChatTTS lets users fine-tune the generated speech by manipulating parameters and tokens within the text input, providing a platform for experimenting with different voices, speeds, and prosodic features to create customized output. While the current licensing limits commercial use, the creators plan to release an open-source version that will allow further development and customization. The video concludes by highlighting ChatTTS as an impressive TTS system that may see wider adoption once that open-source version is available.

Overall, the video provides a comprehensive overview of ChatTTS, detailing its training data, customization options, and potential for refinement. Users are guided through the process of setting up and using the system, including controlling speaker selection and refining text inputs for desired speech outputs. The narrator emphasizes the system’s capabilities, such as creating natural-sounding voices and incorporating prosodic elements into generated speech. Despite current limitations on commercial use, the future release of an open-source version holds promise for further development and widespread application of ChatTTS.