NOT USING Granite 4.1 ASR - The Fastest ASR?

IBM’s Granite series offers a suite of open Automatic Speech Recognition (ASR) models: the highly accurate Granite Speech 4.1 2B, the Plus variant with speaker diarization, and the non-autoregressive Granite Speech 4.1 2B NAR model optimized for bulk transcription. Each model trades speed, accuracy, and features differently, which makes the suite competitive with offerings from other tech giants and suitable for transcription needs ranging from general use to structured multi-speaker scenarios.

IBM has been developing a suite of open models under its Granite series, covering language, vision, speech, and embeddings. Although they have not received widespread recognition, these models are competitive with offerings from other tech giants such as Microsoft. The Granite Vision model, for example, excels at document understanding, particularly OCR tasks with the Docling model. This video, however, focuses on the speech models, specifically the ASR suite, which consists of three distinct models designed for different use cases and bottlenecks.

The first model, Granite Speech 4.1 2B, is the base model and currently leads the Open ASR Leaderboard on Hugging Face with a word error rate (WER) of 5.33%, indicating high transcription accuracy across multiple languages, including English, French, German, Spanish, Portuguese, and Japanese. It supports bidirectional speech translation, punctuation, truecasing, and keyword biasing, which improves recognition of domain-specific terms. This autoregressive model is a reliable workhorse for general transcription and is fast, transcribing an hour of audio in about 16 seconds.
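
To make the WER figure concrete, here is a minimal sketch of how a word error rate can be computed: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference. The function name `word_error_rate` is illustrative. The normalization is also an assumption: this sketch only lowercases and splits on whitespace, while leaderboards typically apply a more thorough text normalizer before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: simple normalization (lowercase + whitespace split).
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

On this scale, a WER of 5.33% corresponds to roughly one word error for every 19 reference words.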

The second model, the Plus variant, adds speaker diarization, allowing it to label different speakers in a transcript, which is valuable for podcasts or meetings. It also provides word-level timestamps with high accuracy, outperforming some specialized models like WhisperX. Additionally, it supports incremental decoding for handling long audio by processing overlapping chunks while maintaining speaker consistency. The trade-offs are that this model supports fewer languages, drops translation capabilities, and has a slightly higher word error rate, so it is best suited to structured transcription tasks where speaker attribution is important.
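
The overlapping-chunk idea can be sketched as a simple windowing function. This is not the Plus model's actual incremental decoding, which the summary does not detail; it only shows how a long recording might be split into overlapping time spans. The function name and the 30-second chunk and 5-second overlap defaults are illustrative assumptions.

```python
def overlapping_chunks(
    total_seconds: float,
    chunk_seconds: float = 30.0,   # assumed window length, not from the source
    overlap_seconds: float = 5.0,  # assumed overlap, not from the source
) -> list[tuple[float, float]]:
    """Return (start, end) spans that cover the audio with a fixed overlap."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap_seconds must be in [0, chunk_seconds)")

    total = float(total_seconds)
    step = chunk_seconds - overlap_seconds  # how far each new window advances
    spans: list[tuple[float, float]] = []
    start = 0.0
    while start < total:
        end = min(start + chunk_seconds, total)
        spans.append((start, end))
        if end >= total:
            break  # the last window already reaches the end of the audio
        start += step
    return spans


if __name__ == "__main__":
    for span in overlapping_chunks(70, chunk_seconds=30, overlap_seconds=5):
        print(span)
```

The shared region between neighboring spans is what would let a decoder reconcile words and speaker labels across chunk boundaries.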

The third model, Granite Speech 4.1 2B NAR, is a non-autoregressive model focused on throughput and speed. Unlike traditional autoregressive models that generate tokens sequentially, this model uses a novel non-autoregressive LLM-based editing (NLE) approach that edits a draft transcript generated by a frozen encoder. This allows for massively parallel processing, achieving real-time factors up to 1,820 on powerful GPUs like the H100, meaning it can transcribe an hour of audio in about two seconds. While it sacrifices features like translation, keyword biasing, speaker attribution, and timestamps, it is ideal for bulk transcription of large audio datasets.
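
The speed figures can be checked with a little arithmetic. The sketch below assumes that "real-time factor" here means the inverse real-time factor (audio duration divided by processing time), which is consistent with the one-hour-in-about-two-seconds claim. The function names are illustrative.

```python
def seconds_to_transcribe(audio_seconds: float, rtfx: float) -> float:
    """Processing time implied by an inverse real-time factor (RTFx)."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


def rtfx_from_timing(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor implied by a measured processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


if __name__ == "__main__":
    one_hour = 3600.0
    # NAR model at the quoted peak of 1,820: just under 2 seconds per hour.
    print(f"{seconds_to_transcribe(one_hour, 1820):.2f} s")
    # Base model at about 16 seconds per hour of audio.
    print(f"RTFx = {rtfx_from_timing(one_hour, 16):.0f}")
```

By the same arithmetic, 1,000 hours of audio at the peak factor would take roughly 33 minutes of pure compute, ignoring model loading and I/O overhead.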

The video also covers practical aspects of running these models locally, highlighting the need for compatible hardware and software such as GPUs with CUDA and FlashAttention support. The presenter shares personal experiments with chunking strategies, keyword biasing, and fine-tuning for domain-specific use cases like podcasts or court transcripts. Overall, IBM’s Granite speech models are a compelling and versatile set of ASR tools with promising speed and accuracy, and the presenter hopes IBM continues to develop and support this open model ecosystem.
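
Before trying a local run, it can help to check whether the machine has the pieces mentioned above. The sketch below makes two assumptions that the summary does not confirm: that the stack is PyTorch-based, and that FlashAttention is installed as the `flash_attn` Python package. The function name is illustrative.

```python
import importlib.util


def check_asr_environment() -> dict[str, bool]:
    """Report whether PyTorch, a CUDA device, and flash_attn are available."""
    # Assumption: the models run on PyTorch and FlashAttention ships as the
    # `flash_attn` package; neither package name comes from the source.
    has_torch = importlib.util.find_spec("torch") is not None
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None

    cuda_available = False
    if has_torch:
        import torch  # imported lazily so the check works without PyTorch

        cuda_available = bool(torch.cuda.is_available())

    return {
        "torch": has_torch,
        "cuda": cuda_available,
        "flash_attn": has_flash_attn,
    }


if __name__ == "__main__":
    for name, ok in check_asr_environment().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```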