Dia 1.6B TTS for NotebookLM Podcasts

The video introduces Dia, a new text-to-speech model developed by two undergraduates, Toby and Jay. With 1.6 billion parameters, it produces audio quality comparable to established commercial systems. The creators, operating as Nari Labs, have released the model on GitHub and Hugging Face, allowing users to generate audio from text and explore its capabilities, though it has some limitations with longer inputs.

In the video, the presenter discusses Dia, a new text-to-speech (TTS) system developed by two undergraduates, Toby and Jay, who set out to replicate the NotebookLM podcast experience. Despite being relatively inexperienced in AI, they built a 1.6-billion-parameter TTS model whose quality has drawn comparisons to established systems from ElevenLabs and OpenAI. Operating under the name Nari Labs, they have made the model available on GitHub and Hugging Face so that anyone can experiment with it.

Dia is described as an open-weights text-to-dialogue model that gives users full control over scripts and voices. The video showcases its audio quality through several examples, including side-by-side comparisons with other TTS systems. The creators drew inspiration from prior work such as Google's SoundStorm and the Parakeet model, which shaped their approach to building Dia. They faced significant challenges, however, particularly in securing the computational resources needed to train the model.
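Dia's dialogue scripts use inline `[S1]`/`[S2]` speaker tags, with non-verbal cues such as `(laughs)` embedded directly in a line. As a rough illustration of that format (the helper function below is my own sketch, not part of the Dia library):

```python
def build_dialogue_script(turns):
    """Format (speaker, line) pairs into a [S1]/[S2]-tagged script.

    Dia reads speaker tags inline in the text; non-verbal cues like
    (laughs) can be written directly inside a line.
    """
    parts = []
    for speaker, line in turns:
        parts.append(f"[S{speaker}] {line.strip()}")
    return " ".join(parts)

script = build_dialogue_script([
    (1, "Welcome back to the show."),
    (2, "Thanks! Great to be here. (laughs)"),
])
print(script)
# [S1] Welcome back to the show. [S2] Thanks! Great to be here. (laughs)
```

A script assembled this way is what you would hand to the model's generation call in place of plain prose, giving you turn-by-turn control over who speaks.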

To overcome the compute challenge, the creators used grants from Google's TPU Research Cloud, which gave them access to TPUs for training. They initially wrote the model in JAX to make the most of the TPUs, then converted it to PyTorch for broader accessibility. The presenter credits them for navigating these technical hurdles and making the model genuinely usable for the community.

The video also covers practical applications of Dia, demonstrating how users can generate audio from text and even perform voice cloning. The presenter walks through the installation process and shows how to generate dialogue scripts, highlighting the model's ability to produce varied voices. Some limitations are noted, however, such as the audio tending to speed up on longer text inputs, which the presenter tries to address with various techniques.
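One plausible mitigation for the long-input speed-up (my assumption here; the video does not confirm which techniques the presenter uses) is to split a long script into shorter pieces at speaker-turn boundaries, generate audio for each piece, and concatenate the results. A sketch of the splitting step:

```python
import re

def chunk_script(script, max_chars=300):
    """Split a [S1]/[S2]-tagged script into chunks of at most max_chars,
    breaking only at speaker-tag boundaries so each chunk opens on a turn.

    Illustrative helper, not part of the Dia library.
    """
    # Lookahead split keeps each tag attached to its own turn's text.
    turns = [t.strip() for t in re.split(r"(?=\[S\d\])", script) if t.strip()]
    chunks, current = [], ""
    for turn in turns:
        if current and len(current) + len(turn) + 1 > max_chars:
            chunks.append(current)
            current = turn
        else:
            current = f"{current} {turn}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed to the model's generate call in turn; keeping chunks short keeps the pacing of each generated segment closer to normal, at the cost of stitching the audio together afterwards.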

In conclusion, the Dia TTS system is an exciting development in text-to-speech, particularly given its origins with two passionate undergraduates. While it may not yet be suitable for real-time applications, it excels at generating high-quality audio for podcast-style content. The presenter encourages viewers to explore the model, compare its quality against other TTS systems, and consider its potential applications in various contexts, wrapping up with an invitation for feedback and further discussion in the comments.