The video reveals that Google’s Gemma 4 language model includes a powerful hidden multi-token prediction (MTP) feature that significantly improves performance. The feature was removed from the public versions due to integration challenges with popular AI tools and remains accessible only through Google’s proprietary LiteRT framework for edge devices. The video explains the technical benefits of MTP and speculative decoding, criticizes Google’s decision to withhold this advancement from the wider community, and encourages viewers to stay informed about practical AI developments.
The video discusses a hidden feature in Google’s Gemma 4 language model: multi-token prediction (MTP), which was removed from the publicly available versions of the model. MTP significantly boosts performance by letting the model predict multiple tokens ahead, effectively enabling a form of “time travel” for large language models (LLMs). A Google employee explained that MTP was removed from the Hugging Face versions of Gemma 4 because of integration issues with popular tools like llama.cpp and Transformers, which are widely used to run these models. The full-featured models with MTP are still available, but only through Google’s proprietary LiteRT framework, designed for edge devices like phones, which limits broader access and adoption.
LiteRT is Google’s on-device framework, optimized for running models on limited-compute hardware, and it supports MTP, providing better performance than the standard public versions of Gemma 4. Although the framework itself is open source, the publicly released models lack the code needed to unlock MTP’s benefits, and the LiteRT models are distributed in a compiled format that cannot be modified. This has produced a wide disparity in downloads and usage: the LiteRT versions are far less popular than the standard formats, which has hindered the adoption and optimization of Gemma 4 in the wider AI community.
The video then explains speculative decoding (SD), the foundational idea behind MTP. Speculative decoding uses a smaller, faster draft model to propose tokens ahead of a larger, more computationally expensive target model, which then verifies those proposals in a single pass. This can significantly increase tokens per second (TPS) because the large model no longer has to run once per generated token. An extension the video calls speculative speculative decoding (SSD) nests the idea, with an even smaller model drafting for the draft model. These methods require the draft and target models to share the same vocabulary, and mispredicted drafts are simply rejected during verification, so errors from the smaller model cost speedup rather than correctness.
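The draft-and-verify loop described above can be sketched in a few lines of Python. This is a toy illustration, not real inference: the `DRAFT` and `TARGET` lookup tables are hypothetical stand-ins for a small and a large model, and verification is greedy (accept drafted tokens until the first mismatch, then substitute the target's own token).

```python
# Toy sketch of speculative decoding. The lookup tables are hypothetical
# stand-ins: DRAFT plays the small, fast model (and is wrong about "on");
# TARGET plays the large model whose output must be matched exactly.
DRAFT = {"the": "cat", "cat": "sat", "sat": "on", "on": "a"}
TARGET = {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}

def draft_next(ctx):
    return DRAFT.get(ctx[-1], "<eos>")

def target_next(ctx):
    return TARGET.get(ctx[-1], "<eos>")

def speculative_step(ctx, k=4):
    """One round: draft k tokens cheaply, then verify with the target model."""
    # 1. The small model drafts k tokens autoregressively (cheap).
    proposal, c = [], list(ctx)
    for _ in range(k):
        tok = draft_next(c)
        proposal.append(tok)
        c.append(tok)
    # 2. The large model checks the drafted positions (one batched pass in a
    #    real system). Accept matches; at the first mismatch, substitute the
    #    target's own token -- the output is always what the target alone
    #    would have produced, only faster.
    accepted, c = [], list(ctx)
    for tok in proposal:
        correct = target_next(c)
        if tok == correct:
            accepted.append(tok)
            c.append(tok)
        else:
            accepted.append(correct)
            break
    return accepted

# The draft gets three tokens right before erring on "on" -> "a", so one
# verification round yields four tokens: cat, sat, on, the.
```

The key property is in step 2: because every emitted token is either confirmed or supplied by the target model, a bad draft model can only slow generation down, never change the output.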
Multi-token prediction (MTP) builds on speculative decoding by folding the drafting mechanism directly into the large model itself, eliminating the need for separate small and large models. This integration offers more consistent speedups but is more complex to implement and train. MTP was pioneered by models like DeepSeek and Qwen, while Google originally developed speculative decoding. Despite its advantages, MTP is not widely supported in popular tools because of its architectural complexity, which likely contributed to its removal from the public Gemma 4 models.
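The single-model idea can be sketched with the same toy style as before. Here the hypothetical `MAIN` and `MTP` tables stand in for a model's ordinary next-token head and an extra head that drafts the token *after* the next one; when the draft verifies, the decoder emits two tokens per forward pass instead of one.

```python
# Toy sketch of MTP-style decoding with a single "model" that has two heads.
# Both dicts are made-up stand-ins: MAIN predicts the next token, and the
# extra MTP head drafts the token after that from the same forward pass.
MAIN = {"a": "b", "b": "c", "c": "d", "d": "a"}   # next-token head
MTP = {"a": "c", "b": "d", "c": "a", "d": "b"}    # token-after-next head

counter = {"passes": 0}  # count forward passes to show the speedup

def forward(seq):
    """One forward pass: returns (next token, MTP draft of the token after)."""
    counter["passes"] += 1
    last = seq[-1]
    return MAIN[last], MTP[last]

def generate(prompt, n):
    """Generate n tokens, emitting two per pass whenever the draft verifies."""
    seq = list(prompt)
    while len(seq) - len(prompt) < n:
        nxt, draft = forward(seq)
        seq.append(nxt)
        # Verify the draft against the main head. In a real model this check
        # piggybacks on the next forward pass; a table lookup keeps it tiny.
        if MAIN.get(nxt) == draft:
            seq.append(draft)  # accepted: two tokens from one forward pass
    return seq
```

In this sketch the MTP head always agrees with the main head, so `generate(["a"], 6)` produces six tokens in three forward passes; a real, well-trained MTP head verifies often enough to approach that best case, which is where the speedup comes from without any separate draft model.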
The presenter expresses confusion and disappointment over Google’s decision to exclude MTP from the public Gemma 4 releases, arguing that including it would have been harmless and beneficial for the community. They highlight the irony that Google, which contributed foundational research to these techniques, chose to withhold this performance boost. The video concludes with an invitation to subscribe for more accessible explanations of AI concepts, emphasizing the importance of practical knowledge for those experimenting with local AI models on modest hardware.