The NEW Best ASR - NVIDIA Nemotron 3.5 ASR

NVIDIA’s Nemotron 3.5 is a 600-million-parameter streaming ASR model that transcribes 40 languages in real time. It uses cache-aware streaming for faster processing and supports word boosting, which improves accuracy on uncommon terms without retraining. It also works with speaker diarization in the NeMo framework, making it a versatile and efficient option for multi-language, real-time speech-to-text applications.

The video introduces NVIDIA’s new streaming Automatic Speech Recognition (ASR) model, Nemotron 3.5, a 600-million-parameter checkpoint that transcribes 40 languages with a single model. Unlike previous self-hosted ASR models, which were better suited to offline or batch processing, Nemotron 3.5 is designed for real-time streaming applications. Its cache-aware streaming mechanism caches encoder states so that overlapping audio chunks are not recomputed, which significantly improves processing speed: up to 17 times faster on high-end GPUs like the NVIDIA H100.
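
To make the caching idea concrete, here is a minimal sketch in Python. It is not the Nemotron encoder: the two causal convolution layers, their integer weights, and every name in it are hypothetical stand-ins. It only shows that keeping each layer's recent inputs in a cache gives the same output as an offline pass while touching each frame once, whereas re-encoding an overlapping buffer touches some frames repeatedly.

```python
# Toy model of cache-aware streaming. The "encoder" is two causal 1-D
# convolutions with made-up integer weights; it is NOT the Nemotron encoder.
KERNELS = [[1, 2, 3], [1, -1]]  # hypothetical weights, each kernel length >= 2


def causal_conv(frames, kernel, history):
    """Convolve `frames`, using `history` (last len(kernel)-1 inputs) as left context."""
    padded = history + frames
    k = len(kernel)
    out = [sum(w * x for w, x in zip(kernel, padded[i:i + k]))
           for i in range(len(frames))]
    return out, padded[-(k - 1):]  # outputs plus the updated cache


class CachedEncoder:
    """Keeps one cache per layer so each new chunk is processed exactly once."""

    def __init__(self, kernels):
        self.kernels = kernels
        self.caches = [[0] * (len(k) - 1) for k in kernels]  # zero left padding
        self.frames_processed = 0

    def step(self, chunk):
        x = list(chunk)
        self.frames_processed += len(x)
        for i, kernel in enumerate(self.kernels):
            x, self.caches[i] = causal_conv(x, kernel, self.caches[i])
        return x


def encode_with_overlap(audio, kernels, chunk_size):
    """Baseline without caching: re-encode the left context with every chunk."""
    context = sum(len(k) - 1 for k in kernels)  # frames of left context needed
    out, processed = [], 0
    for start in range(0, len(audio), chunk_size):
        lo = max(0, start - context)
        buffer = audio[lo:start + chunk_size]
        processed += len(buffer)
        encoded = CachedEncoder(kernels).step(buffer)
        out.extend(encoded[start - lo:])  # keep only the outputs for the new frames
    return out, processed


audio = list(range(1, 21))  # 20 fake feature frames
chunk_size = 4

offline = CachedEncoder(KERNELS).step(audio)

encoder = CachedEncoder(KERNELS)
streamed = []
for start in range(0, len(audio), chunk_size):
    streamed.extend(encoder.step(audio[start:start + chunk_size]))

overlapped, overlap_processed = encode_with_overlap(audio, KERNELS, chunk_size)

print("streamed matches offline:", streamed == offline)
print("frames processed with cache:", encoder.frames_processed)
print("frames processed with overlap:", overlap_processed)
```

In this toy setup the saving is small because the context is only three frames; with a deep encoder and a long left context, the redundant work avoided by caching grows accordingly.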

Nemotron 3.5 supports multiple languages with 19 languages working out-of-the-box, 13 at production level, and 8 additional languages requiring fine-tuning for optimal performance. The model uses conditional generation based on language codes, so it can either auto-detect the language or be set to a specific one. While punctuation and capitalization in streaming mode are not perfect, the model delivers strong transcription accuracy across latency settings ranging from 80 milliseconds to over one second, so users can choose between smaller chunks with lower latency and larger chunks with higher latency.
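
The relationship between a latency setting and chunk size is simple arithmetic, sketched below. The 16 kHz sample rate is an assumption (the video summary does not state it), and `StreamConfig`, its fields, and `iter_chunks` are hypothetical names for illustration, not part of NeMo.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional

SAMPLE_RATE_HZ = 16_000  # assumption: a common ASR input rate, not confirmed by the source


@dataclass(frozen=True)
class StreamConfig:
    """Hypothetical settings object for a streaming session."""
    chunk_ms: int = 80              # latency setting, in milliseconds
    language: Optional[str] = None  # a language code, or None to auto-detect

    @property
    def samples_per_chunk(self) -> int:
        return SAMPLE_RATE_HZ * self.chunk_ms // 1000


def iter_chunks(samples: List[float], cfg: StreamConfig) -> Iterator[List[float]]:
    """Split audio into fixed-size chunks; the final chunk may be shorter."""
    step = cfg.samples_per_chunk
    for start in range(0, len(samples), step):
        yield samples[start:start + step]


one_second = [0.0] * SAMPLE_RATE_HZ  # one second of silence as placeholder audio

low_latency = StreamConfig(chunk_ms=80, language="en")
high_latency = StreamConfig(chunk_ms=1000)  # language left as auto-detect

low_chunks = list(iter_chunks(one_second, low_latency))
high_chunks = list(iter_chunks(one_second, high_latency))

print(low_latency.samples_per_chunk, len(low_chunks))
print(high_latency.samples_per_chunk, len(high_chunks))
```

Smaller chunks mean the model is called more often on less audio, which is where the latency trade-off comes from.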

A standout feature of Nemotron 3.5 is word boosting, a decode-time technique that increases the likelihood of specific words or phrases appearing in the transcription without retraining the model. This is particularly useful for uncommon words such as product names, drug names, or personal names that the model might otherwise misrecognize. The video demonstrates how word boosting can correct transcription errors for names like “Sam Witteveen” and terms like “Qwen,” enhancing accuracy in real-time streaming scenarios.
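
The effect of word boosting can be illustrated with a toy rescoring step. This is a simplified sketch, not NeMo's implementation: real decode-time boosting acts inside the decoder's search, while this example only adds a bonus to finished candidate transcripts. The candidate scores and bonus values are made up, and `boost_score` and `pick_best` are hypothetical helpers.

```python
def boost_score(text, boosts):
    """Sum the bonuses of every boosted phrase found in `text` (case-insensitive)."""
    lowered = text.lower()
    return sum(bonus for phrase, bonus in boosts.items()
               if phrase.lower() in lowered)


def pick_best(hypotheses, boosts=None):
    """Return the transcript with the highest model score plus boost bonus."""
    boosts = boosts or {}
    best_text, _ = max(hypotheses,
                       key=lambda h: h[1] + boost_score(h[0], boosts))
    return best_text


# (transcript, log-score) pairs; the numbers are invented for illustration.
hypotheses = [
    ("my name is sam whitty veen and this is quen", -4.1),
    ("my name is Sam Witteveen and this is Qwen", -6.3),
]

# Hypothetical boost list: phrase -> bonus added to the score when it appears.
boosts = {"Sam Witteveen": 2.0, "Qwen": 1.5}

without_boost = pick_best(hypotheses)
with_boost = pick_best(hypotheses, boosts)

print("without boosting:", without_boost)
print("with boosting:   ", with_boost)
```

Because the bonus is applied at decode time, the boost list can change per session or per customer without touching the model weights.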

The video also touches on speaker diarization capabilities within the NeMo framework, which can segment audio by speaker and assign speaker labels. Although live diarization was challenging to get working perfectly, it performs well for recorded content like podcasts. The system can capture speaker embeddings to identify speakers if sample audio is available, enabling more personalized and accurate speaker attribution in transcriptions.
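
Matching a segment to a known speaker is commonly done by comparing embeddings, which the sketch below illustrates with cosine similarity. The three-dimensional vectors, the 0.7 threshold, and the `identify` helper are all made up for illustration; real speaker embeddings have far more dimensions, and this is not the NeMo diarization API.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify(embedding, enrolled, threshold=0.7):
    """Return the enrolled speaker closest to `embedding`, or "unknown"."""
    if not enrolled:
        return "unknown"
    name, similarity = max(
        ((n, cosine(embedding, e)) for n, e in enrolled.items()),
        key=lambda pair: pair[1],
    )
    return name if similarity >= threshold else "unknown"


# Hypothetical enrollment embeddings captured from sample audio of each speaker.
enrolled = {
    "host": [0.9, 0.1, 0.0],
    "guest": [0.1, 0.8, 0.3],
}

# Hypothetical embeddings for two diarized segments.
segment_a = [0.85, 0.2, 0.05]  # close to the host's enrollment vector
segment_b = [0.0, 0.1, 0.9]    # not close to anyone enrolled

label_a = identify(segment_a, enrolled)
label_b = identify(segment_b, enrolled)

print(label_a, label_b)
```

The threshold keeps unfamiliar voices from being forced onto the nearest enrolled speaker; its value would need tuning on real data.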

Overall, the presenter praises Nemotron 3.5 for its ease of local deployment, versatility in handling multiple languages, and advanced features like cache-aware streaming and word boosting. The model is seen as a powerful tool for replacing traditional speech-to-text stacks, especially for streaming use cases. The video concludes with an invitation for viewers to share their experiences and feedback, and hints at the possibility of a second channel focused on deeper, code-oriented tutorials related to ASR and NVIDIA’s speech technologies.