Is This The Fastest ASR?

IBM’s Granite series includes three ASR models that trade off accuracy, features, and speed. The Base model leads on accuracy and multilingual support, the Plus model adds speaker diarization and precise timestamps, and the 2BN model reaches very high transcription throughput through a novel non-autoregressive editing approach. All three are small enough to run locally on a range of GPUs, so each transcription workload can be matched to the variant that suits it.

IBM has been developing an intriguing suite of open models called the Granite series, which includes language, vision, speech, and embedding models. Among these, their speech models, particularly the automatic speech recognition (ASR) models, stand out. IBM offers three distinct ASR models, each designed with different strengths to address various transcription needs, potentially surpassing popular models like Whisper and Parakeet in certain aspects.

The first model, Granite Speech 4.1 Base, is a strong all-rounder with about 2 billion parameters. It leads the Open ASR Leaderboard with a word error rate (WER) of 5.33%, meaning roughly 5 errors (substituted, dropped, or inserted words) per 100 reference words, averaged across diverse datasets. It supports seven languages for transcription and offers bidirectional speech translation, punctuation, truecasing, and keyword biasing, which improves recognition of domain-specific terms. This model is the natural choice for general-purpose transcription that needs both accuracy and speed.
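To make the WER figure concrete, here is a minimal sketch of how word error rate is commonly computed: the word-level edit distance between the reference and the model output, divided by the number of reference words. The function name and the whitespace tokenization are assumptions for illustration; leaderboards typically normalize text (casing, punctuation) before scoring, which this sketch does not do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missed)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors, WER is not exactly "one minus the fraction of correct words", and it can exceed 100% when the output contains many extra words.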

The second model, the Plus variant, adds speaker diarization, which labels the different speakers in a conversation and is valuable for podcasts or meetings. It also provides accurate word-level timestamps, surpassing some specialized tools like WhisperX. Additionally, it supports incremental decoding, which handles long audio by processing it in chunks with overlapping context. However, it supports fewer languages, drops translation, and has a slightly higher word error rate than the Base model.
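The idea of splitting long audio into chunks with overlapping context can be sketched as follows. This is not the Granite implementation: the function name, the 30-second chunk length, and the 5-second overlap are hypothetical values chosen for illustration, and the sketch only computes the time spans that would be fed to a model.

```python
def overlapping_chunks(
    total_seconds: float,
    chunk_seconds: float = 30.0,   # hypothetical chunk length
    overlap_seconds: float = 5.0,  # hypothetical shared context between chunks
) -> list[tuple[float, float]]:
    """Return (start, end) spans covering the audio, each overlapping the previous one."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap_seconds must be in [0, chunk_seconds)")

    step = chunk_seconds - overlap_seconds
    spans = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break  # the final chunk reached the end of the audio
        start += step
    return spans


if __name__ == "__main__":
    for span in overlapping_chunks(70.0):
        print(span)
```

Each chunk after the first begins before the previous one ends, so words cut off at a chunk boundary appear whole in the next chunk. How the overlapping transcripts are then merged is a separate design decision that this sketch leaves open.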

The third model, Granite Speech 4.1 2BN, focuses on extreme throughput using a non-autoregressive architecture. Unlike traditional autoregressive models, which generate tokens one at a time, this model uses a novel approach called Non-AutoRegressive LLM-based Editing (NLE), which first produces a draft transcript and then refines it through editing. This yields far faster transcription, up to 1820 times real-time on powerful GPUs like the Nvidia H100, which works out to about two seconds for an hour of audio. However, it lacks translation, keyword biasing, speaker diarization, and timestamps.
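The arithmetic behind the throughput figure is straightforward: a speed of N times real-time means N seconds of audio are processed per second of compute. A small sketch, with a hypothetical helper name, using only the 1820x figure quoted above:

```python
def seconds_to_transcribe(audio_seconds: float, real_time_multiple: float) -> float:
    """Wall-clock time needed when the model runs at `real_time_multiple` x real-time."""
    if real_time_multiple <= 0:
        raise ValueError("real_time_multiple must be positive")
    return audio_seconds / real_time_multiple


if __name__ == "__main__":
    one_hour = 60 * 60  # 3600 seconds of audio
    print(f"{seconds_to_transcribe(one_hour, 1820):.2f} s")  # prints "1.98 s"
```

This is the best-case number on the quoted hardware; actual throughput on other GPUs, batch sizes, or audio lengths may differ.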

The video also covers practical aspects of running these models locally, highlighting that they are relatively small and can run on various GPUs, though some features require specific setups such as FlashAttention. The presenter shares personal experiments with transcription quality, timestamp accuracy, and strategies for handling long-form audio and keyword biasing. Overall, IBM’s Granite speech models represent a compelling advancement in ASR technology, offering options tailored to different transcription needs, and the presenter encourages further exploration and development in this space.