Granite 4.0: Small AI Models, Big Efficiency

The video highlights IBM’s Granite 4.0 series of small, efficient AI language models, which combine Transformer layers with the Mamba 2 state space model to deliver high performance, exceptional memory efficiency, and scalable context lengths on affordable hardware. These open-source models range from 3 billion to 32 billion parameters, several using Mixture-of-Experts (MoE) architectures, and offer faster speeds and lower costs than larger models, signaling a promising shift toward practical, resource-friendly AI.

The video discusses IBM’s Granite series of large language models (LLMs), with a particular focus on the newly released Granite 4.0 family. The speaker shares a personal connection to the Granite models, especially Granite.13B.V2, a 13-billion-parameter model released in 2024 that was transparent about its training data, including US patents and IBM documentation the speaker had contributed to. This transparency and personal relevance made the Granite models particularly meaningful to the speaker. The new Granite 4.0 models promise higher performance, faster speeds, and significantly lower operational costs than both previous Granite models and larger competing models.

Granite 4.0 includes several models tailored for different use cases. The Small model is a 32-billion-parameter Mixture-of-Experts (MoE) model with 9 billion active parameters, designed for enterprise tasks on a single GPU. The Tiny model, also an MoE, has 7 billion total parameters with 1 billion active, and is optimized for low latency and edge deployments. Additionally, there are two Micro models with 3 billion parameters each, intended for lightweight local use; one uses a hybrid architecture similar to Tiny and Small, while the other is a traditional Transformer. These models are small, efficient, and designed to run with minimal compute resources.
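The gap between total and active parameters in the MoE models comes from routing: each token is sent to only a few expert subnetworks, so most weights sit idle on any given forward pass. A minimal NumPy sketch of top-k expert routing (illustrative only, not Granite's actual router; all dimensions and names are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_forward(x, experts, router_w, k=2):
    """Route a token vector x to its top-k experts by router score,
    then mix their outputs with softmax weights."""
    scores = x @ router_w                       # one score per expert
    topk = np.argsort(scores)[-k:]              # indices of the k best experts
    w = np.exp(scores[topk] - scores[topk].max())
    w /= w.sum()                                # softmax over selected experts only
    return sum(wi * experts[i](x) for wi, i in zip(w, topk))

# toy setup: 8 experts, each a small linear map; only k=2 run per token
d = 16
expert_mats = [rng.standard_normal((d, d)) for _ in range(8)]
experts = [lambda x, M=M: x @ M for M in expert_mats]
router_w = rng.standard_normal((d, 8))

token = rng.standard_normal(d)
out = moe_forward(token, experts, router_w, k=2)
print(out.shape)  # (16,)
```

Because only 2 of the 8 experts execute per token, compute per token tracks the active parameter count rather than the total, which is why a 32B-total model can run inference at roughly 9B-parameter cost.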

A key highlight of Granite 4.0 is its exceptional memory efficiency. For example, the Micro model requires only about 10 GB of GPU memory for production workloads involving long contexts and multiple concurrent batches, whereas comparable models typically need four to six times more memory. This efficiency extends to the Tiny and Small models as well, with Granite 4.0’s hybrid design reducing memory requirements by up to 80% while delivering faster speeds and higher performance. The models also maintain high throughput as batch size and context length increase, unlike many other models that slow down under these conditions.
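Much of that memory gap comes from the KV cache: a pure Transformer must store keys and values for every layer, token, and sequence, so its cache grows linearly with both context length and batch size, while a Mamba-style layer keeps a fixed-size recurrent state per sequence regardless of context. A back-of-the-envelope sketch with made-up dimensions (not Granite's actual configuration):

```python
def transformer_kv_cache_bytes(layers, kv_heads, head_dim, context, batch,
                               bytes_per_elem=2):
    # keys + values cached for every layer, head, and token (fp16 => 2 bytes)
    return 2 * layers * kv_heads * head_dim * context * batch * bytes_per_elem

def ssm_state_bytes(layers, state_dim, batch, bytes_per_elem=2):
    # a Mamba-style layer keeps one fixed-size state per sequence,
    # independent of how long the context is
    return layers * state_dim * batch * bytes_per_elem

# toy dimensions, purely illustrative
kv_32k = transformer_kv_cache_bytes(32, 8, 128, 32_768, 8)
kv_128k = transformer_kv_cache_bytes(32, 8, 128, 131_072, 8)
ssm = ssm_state_bytes(32, 128 * 64, 8)

print(kv_128k / kv_32k)  # 4.0: KV cache grows linearly with context length
```

Replacing most attention layers with fixed-state Mamba layers means only the minority of Transformer blocks pay the growing KV-cache cost, which is consistent with the large memory savings the video describes.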

The architecture behind Granite 4.0 is a hybrid of Transformers and Mamba 2, a successor to the Mamba state space model (SSM) introduced in 2023. Mamba’s computational needs scale linearly with context length, unlike Transformers’, which scale quadratically: doubling the context window only doubles the computational cost with Mamba, rather than quadrupling it as with Transformers. Granite 4.0 interleaves nine Mamba blocks for every Transformer block, leveraging Mamba’s efficiency at capturing global context and Transformers’ strength at handling local details. The MoE approach used in the Tiny and Small models activates only the relevant expert subnetworks for each token during inference, further enhancing efficiency.
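The linear-versus-quadratic claim is easy to check with the standard asymptotic cost formulas (constants omitted; this is an illustration of the scaling laws, not a measurement of Granite itself):

```python
def attention_flops(n, d):
    # self-attention compute scales as n^2 * d: every token attends
    # to every other token (quadratic in context length n)
    return n * n * d

def ssm_flops(n, d):
    # a Mamba-style selective scan processes tokens sequentially,
    # scaling as n * d (linear in context length n)
    return n * d

d = 1  # per-dimension comparison; absolute constants omitted
print(attention_flops(8192, d) / attention_flops(4096, d))  # 4.0: doubling context quadruples cost
print(ssm_flops(8192, d) / ssm_flops(4096, d))              # 2.0: doubling context doubles cost
```

With nine Mamba blocks per Transformer block, the quadratic term applies to only a tenth of the layers, so the hybrid's overall cost curve stays close to linear until contexts get very long.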

Finally, Granite 4.0 models use no positional encoding (NoPE), avoiding the limitations of traditional positional encodings such as RoPE, which struggle with sequences longer than those seen during training. This design choice, combined with Mamba’s linear scaling, theoretically allows Granite 4.0 models to handle unconstrained context lengths, limited only by available hardware and memory. The video concludes by contrasting the trend toward ever-larger AI models with the emerging path of smaller, highly efficient models like Granite 4.0 that run on affordable GPUs. These models are open source and available on platforms such as Hugging Face and watsonx.ai, showcasing the potential of small language models in practical applications.