I Decoupled Attention from Weights - Gemma 4 26B

The presenter demonstrates a novel approach in which the attention mechanism of the Gemma 4 26B model runs locally on a laptop while the large feed-forward network (FFN) expert weights are hosted remotely across multiple machines. Inference still works across the network, although latency limits its speed. By offloading the memory- and compute-heavy FFN to remote servers, this decoupling allows large models to run on consumer hardware, potentially democratizing access to large language models through flexible, networked deployment.

In this video, the presenter demonstrates the Gemma 4 26B model running at around 24 tokens per second on a laptop, but reveals that only the attention mechanism is running locally. The large expert weights, which constitute the bulk of the model, are hosted remotely on separate machines and communicate with the local attention via HTTP over the network. This setup allows the model to be distributed across multiple machines, even ones located miles apart, while maintaining reasonable performance.

To establish a baseline, the presenter first benchmarks the 4-billion-parameter Gemma 3 model running entirely locally on a GPU, achieving about 83 tokens per second. Running the Gemma 4 26B mixture-of-experts model fully locally on the laptop yields a slower speed of around 22 tokens per second. The key insight is that the transformer architecture processes data sequentially through layers: in each layer, attention reads the residual stream and the feed-forward network (FFN) then applies the model’s knowledge. The FFN is the largest part of the model and the main bottleneck in terms of memory and computation.
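The layer-by-layer flow described above can be sketched in a few lines of Python. This is a toy illustration, not the presenter's code: `run_layers`, `toy_attention`, and `toy_ffn` are hypothetical names, the vectors are plain lists, and normalization and real attention math are left out. It only shows that each layer adds an attention update and then an FFN update to the residual stream, in strict order.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Block = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum, used to write a block's output back into the residual stream."""
    return [x + y for x, y in zip(a, b)]


def run_layers(x: Vector, layers: Sequence[Tuple[Block, Block]]) -> Vector:
    """Pass the residual stream through each (attention, ffn) pair in order."""
    for attention, ffn in layers:
        x = add(x, attention(x))  # attention reads the stream and adds its update
        x = add(x, ffn(x))        # the FFN reads the updated stream and adds its own
    return x


# Toy stand-ins (assumptions for illustration only).
def toy_attention(x: Vector) -> Vector:
    return [0.5 * v for v in x]


def toy_ffn(x: Vector) -> Vector:
    return [1.0 for _ in x]


out = run_layers([1.0, 2.0], [(toy_attention, toy_ffn)] * 2)
print(out)  # [4.75, 7.0]
```

Because `ffn` is just a callable here, it could equally be a function that sends the vector to another machine and waits for the reply, which is the substitution the rest of the video relies on.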

The innovation lies in decoupling attention from the FFN. Attention, which is relatively small and requires a GPU, runs locally on the laptop, while the large FFN expert weights are split into shards and hosted as network services on one or more remote servers. The sharding can be done by expert or by layer, allowing flexible distribution of the model across multiple machines. The presenter demonstrates running these shards first on local servers and then remotely on cloud instances, showing that the system still produces correct outputs despite the distributed setup.
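One way to picture the two sharding options is as a lookup from an expert index or a layer index to a shard. The sketch below is an assumption about how such a mapping might look, not the presenter's implementation: the function names, the contiguous-block layout, and the `SHARD_URLS` endpoints are all hypothetical.

```python
from typing import List

# Hypothetical shard endpoints; the real hosts and ports are not given in the video.
SHARD_URLS: List[str] = [
    "http://shard-0.example:8080",
    "http://shard-1.example:8080",
]


def _block_index(item: int, total: int, num_shards: int) -> int:
    """Assign items to shards in contiguous blocks of ceil(total / num_shards)."""
    if not 0 <= item < total:
        raise ValueError("index out of range")
    per_shard = -(-total // num_shards)  # ceiling division
    return item // per_shard


def shard_for_expert(expert_id: int, num_experts: int, num_shards: int) -> int:
    """Sharding by expert: each shard holds a block of experts for every layer."""
    return _block_index(expert_id, num_experts, num_shards)


def shard_for_layer(layer: int, num_layers: int, num_shards: int) -> int:
    """Sharding by layer: each shard holds all experts for a block of layers."""
    return _block_index(layer, num_layers, num_shards)


# With 8 experts over 2 shards, experts 0-3 go to shard 0 and 4-7 to shard 1.
url = SHARD_URLS[shard_for_expert(5, num_experts=8, num_shards=len(SHARD_URLS))]
print(url)  # http://shard-1.example:8080
```

The local attention process would then send the residual vector to the selected URL over HTTP and add the returned FFN output back into the stream.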

However, running the FFN shards remotely over the internet introduces significant latency, reducing token generation speed to about 1.8 tokens per second. This slowdown is due to the sequential nature of transformer inference, which requires multiple back-and-forth communications between attention and FFN layers. Despite this, the demonstration proves that attention and FFN can be fully decoupled and distributed, opening the door to running large models on consumer-grade hardware without requiring all components to be on a single GPU.
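A back-of-the-envelope model shows why per-layer round trips dominate. The sketch below assumes one network round trip per layer per token and no overlap between requests; `tokens_per_second` is a hypothetical helper, and the numbers in the example are illustrative placeholders, not measurements from the video.

```python
def tokens_per_second(num_layers: int, round_trip_s: float, compute_s: float) -> float:
    """Estimate decode speed when every layer waits on one remote FFN call.

    num_layers   -- layers that each need an attention -> FFN -> attention hop
    round_trip_s -- network round-trip time per remote FFN call, in seconds
    compute_s    -- total local plus remote compute time per token, in seconds
    """
    seconds_per_token = compute_s + num_layers * round_trip_s
    return 1.0 / seconds_per_token


# Illustrative values only: 10 layers, 50 ms per round trip, compute ignored.
rate = tokens_per_second(num_layers=10, round_trip_s=0.05, compute_s=0.0)
print(round(rate, 2))  # 2.0
```

Under this model the network cost grows linearly with the number of layers, so even modest internet latency can outweigh the compute time for each token.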

The presenter concludes by highlighting the potential of this approach. If the sequential dependency between attention and FFN can be further optimized or parallelized, it could enable fast inference of very large models on modest hardware setups, such as clusters of CPUs or small devices. In the presenter's view, the FFN side is effectively solved because it runs efficiently on CPUs, so only attention requires a GPU. This decoupling and distribution strategy could democratize access to large language models by reducing hardware requirements and enabling flexible deployment across networks.