Timothy Carambat explains how the new multi-token prediction (MTP) feature in llama.cpp boosts token generation speed by around 25% without sacrificing accuracy, making local AI model inference faster without additional hardware. He emphasizes that software optimizations like this are key to efficient, sustainable local AI usage, letting users run powerful models on their own devices instead of relying on cloud services.
In this video, Timothy Carambat, an AI enthusiast and creator of the open-source AI application AnythingLLM, discusses a significant software improvement in llama.cpp, the popular tool for running models locally. The update introduces multi-token prediction (MTP), a feature that can boost token generation speed by around 25% or more with no trade-off in accuracy. The enhancement is hardware-agnostic: it benefits users regardless of their setup, although better hardware naturally yields better results. Timothy emphasizes that software optimizations like MTP are crucial for advancing local AI capabilities, even when no new models are released.
Timothy explains MTP by first introducing speculative decoding, in which a smaller draft model guesses several tokens ahead and the larger model verifies those guesses, keeping the ones it agrees with. Because verifying several tokens at once is cheaper than generating them one at a time, inference gets faster. However, this approach requires running two models simultaneously, which is resource-intensive and not feasible for many local users. MTP simplifies this by performing multi-token prediction within a single model, eliminating the need for a separate draft model. This makes the technique accessible to a much broader audience through tools like llama.cpp, which recently integrated MTP support.
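The draft-and-verify idea described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not llama.cpp's implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for a large model and its cheap predictor, tokens are plain integers, and verification uses simple greedy matching. With MTP, the draft guesses would come from extra prediction heads inside the same model rather than from a second model; the accept-or-reject loop is the part the sketch tries to show.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]


def greedy_decode(target: NextFn, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: the target model emits one token per forward pass."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(
    target: NextFn, draft: NextFn, prompt: List[Token], n_new: int, k: int = 1
) -> Tuple[List[Token], int]:
    """Draft k tokens ahead, then let the target verify them.

    Returns the generated sequence and the number of verification passes.
    """
    tokens = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(tokens) < goal:
        # 1) The cheap predictor proposes k tokens ahead.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2) The target checks the proposal. A real engine does this in one
        #    batched forward pass; here it is simulated token by token.
        passes += 1
        accepted: List[Token] = []
        for guess in proposal:
            truth = target(tokens + accepted)
            accepted.append(truth)  # always keep the target's own token
            if truth != guess:
                break  # first wrong guess: discard the rest of the draft
        else:
            # Every guess matched, so the same pass yields one bonus token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], passes


# Hypothetical toy "models": the draft agrees with the target except after token 5.
def toy_target(tokens: List[Token]) -> Token:
    return (tokens[-1] * 3 + 1) % 7


def toy_draft(tokens: List[Token]) -> Token:
    return 0 if tokens[-1] == 5 else toy_target(tokens)


if __name__ == "__main__":
    baseline = greedy_decode(toy_target, [4], 12)
    fast, passes = speculative_decode(toy_target, toy_draft, [4], 12, k=1)
    print("identical output:", fast == baseline)
    print("verification passes:", passes, "for 12 new tokens")
```

Because the target's own token is kept at every position, the output matches plain greedy decoding exactly, which is the intuition behind the "no loss in accuracy" claim; the speedup comes from needing fewer verification passes than generated tokens.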
To use MTP, users must download updated versions of models that support the feature, such as the DeepSeek V3 and V4 series, Nemotron 3 Super and Ultra, and Qwen 3.5 and 3.6. Timothy notes that older models without MTP support will not work with the latest llama.cpp versions that include MTP, so users need to update both their models and their software. He also cautions that some specialized models, such as mixture-of-experts (MoE) models, may not see significant performance gains from MTP.
Timothy demonstrates the performance improvement with benchmarks, showing that enabling MTP with one token predicted ahead raises generation speed from 45 to about 55 tokens per second, a gain of roughly 22%, close to the 25% headline figure. Predicting more tokens ahead (for example, three or six) can sometimes reduce performance: each extra predicted token adds computational overhead, and guesses further ahead are more likely to be wrong and discarded. Users should therefore tune the MTP settings to their hardware and needs. Overall, MTP offers a straightforward way to gain faster inference with no loss in output quality.
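The arithmetic behind these numbers, and the reason longer look-ahead can backfire, can be sketched with a simple cost model. Only the 45 and 55 tokens-per-second figures come from the video; the acceptance probability `p`, the per-token overhead `cost_per_draft`, and the assumption that each guess is accepted independently are hypothetical simplifications chosen for illustration.

```python
def speedup_percent(baseline_tps: float, new_tps: float) -> float:
    """Relative gain in tokens per second, as a percentage."""
    return (new_tps / baseline_tps - 1.0) * 100.0


def expected_tokens_per_pass(p: float, n: int) -> float:
    """Expected tokens produced by one verification pass with n drafted tokens.

    Assumes each drafted token is accepted independently with probability p.
    The pass always yields at least one token (the target's own), plus each
    further token only if every guess before it was right: 1 + p + ... + p**n.
    """
    return sum(p ** i for i in range(n + 1))


def relative_throughput(p: float, n: int, cost_per_draft: float) -> float:
    """Throughput relative to plain decoding (1.0 means no change).

    Assumes one pass costs 1 + cost_per_draft * n units, where a plain
    single-token pass costs 1 unit.
    """
    return expected_tokens_per_pass(p, n) / (1.0 + cost_per_draft * n)


if __name__ == "__main__":
    # Figures quoted in the video: 45 -> about 55 tokens per second.
    print(f"measured gain: {speedup_percent(45, 55):.1f}%")

    # Hypothetical acceptance rate and overhead, for illustration only.
    p, cost = 0.5, 0.2
    for n in (1, 3, 6):
        print(f"draft {n} ahead: {relative_throughput(p, n, cost):.2f}x")
```

Under these made-up parameters, drafting one token ahead helps while drafting six ahead ends up slower than not drafting at all, which mirrors the tuning advice above. Real acceptance rates and overheads depend on the model, the prompt, and the hardware, so the setting is best chosen by measurement.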
In conclusion, Timothy highlights the importance of ongoing software innovations like MTP in making local AI more efficient and sustainable. He expresses a vision where users can run powerful AI models on their own devices without relying on cloud services, which he sees as economically and ecologically unsustainable. For Timothy, AI is a practical tool to assist with everyday tasks rather than a futuristic AGI, and improvements like MTP help unlock the potential of AI on existing hardware, making it more accessible and efficient for everyone.