Researchers Are Getting Really Creative Training LLMs

The video discusses innovative training methods for large language models, highlighting Meta's multi-token prediction (MTP) approach, which improves performance by predicting multiple tokens simultaneously but is hard to tune and performs poorly on smaller models. It then introduces Token Order Prediction (TOP), a promising auxiliary objective that teaches models to predict the relative order of future tokens, improving their grasp of grammar and syntax while outperforming both MTP and standard next-token prediction on most NLP benchmarks.

The video explores innovative approaches to training large language models (LLMs), focusing on a concept called multi-token prediction (MTP), introduced by Meta in April 2024. Traditionally, LLMs predict one token at a time to generate text, but MTP trains models to predict multiple tokens simultaneously. This forces the model to look further ahead in the text, potentially improving its ability to generate coherent and contextually relevant responses. Experiments showed that a 13-billion-parameter model trained to predict four tokens at once solved 12% more problems on HumanEval and 17% more on MBPP compared to a standard next-token prediction model.
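To make the objective concrete, here is a minimal sketch of how multi-token training targets differ from next-token targets. This is an illustration under assumptions, not Meta's implementation: token strings stand in for token IDs, and the function name `mtp_targets` is hypothetical.

```python
def mtp_targets(tokens, n=4):
    """For each position, collect the next n tokens the model must
    predict jointly. Standard next-token prediction is the special
    case n=1."""
    targets = []
    for i in range(len(tokens) - n):
        targets.append(tokens[i + 1 : i + 1 + n])
    return targets

tokens = ["the", "cat", "sat", "on", "the", "mat"]
print(mtp_targets(tokens, n=2))
# at position 0 the model must predict both "cat" and "sat"
```

In a real model each of the `n` future tokens is typically predicted by its own output head over a shared trunk, and the per-head cross-entropy losses are summed.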

However, MTP has its limitations. While it excels in tasks that benefit from looking ahead, such as math calculations and coding, its accuracy drops on standard natural language processing (NLP) tasks like semantic analysis. Additionally, tuning MTP is challenging because the optimal number of tokens to predict depends on the specific task, and predicting too many tokens at once can increase training loss. Smaller models, especially those under one billion parameters, tend to perform worse with MTP, indicating that this approach is more suitable for larger models.

An interesting variation of MTP was proposed by DeepSeek, which used it as an auxiliary training objective rather than the main prediction task. In this setup, the model is trained with MTP but still generates tokens one at a time during inference. This approach provides some benefit, but it also introduces noise into the learning signal, because predicting an exact sequence of multiple tokens can be difficult and the targets are sometimes effectively arbitrary. Despite these challenges, the core idea of encouraging models to consider future tokens remains valuable.
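The auxiliary-objective setup can be sketched as a weighted sum of losses: the MTP head shapes the shared representation during training, but only the next-token head is used at inference. This is a simplification under assumptions; the weight `lam` and the function name are hypothetical, not DeepSeek's actual values.

```python
def combined_loss(ntp_loss, aux_mtp_loss, lam=0.3):
    """Auxiliary-objective training: the down-weighted MTP loss
    regularizes the shared trunk, while the next-token loss remains
    the primary signal. At inference the MTP head is dropped, so
    generation is still one token at a time."""
    return ntp_loss + lam * aux_mtp_loss

print(combined_loss(2.0, 4.0))  # 2.0 + 0.3 * 4.0 = 3.2
```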

Building on these ideas, the video introduces a new concept called Token Order Prediction (TOP). Unlike MTP, which requires predicting exact future tokens, TOP asks the model to predict the relative order of upcoming tokens within a fixed window. This auxiliary objective teaches the model which tokens are likely to appear sooner rather than later, without requiring it to guess the exact sequence. TOP is easier for the model to learn and can improve its understanding of grammar and syntax by capturing relationships between tokens spaced a few steps apart. Experiments show that TOP outperforms both next-token prediction and MTP on most NLP benchmarks, especially in larger models.
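A simplified way to see what "relative order" means is to build proximity ranks for each position: every token appearing in the next few steps is labeled with how soon it first appears. This is a hypothetical sketch of the target construction, not the paper's exact ranking loss, and the window size and function name are assumptions.

```python
def top_targets(tokens, window=4):
    """For each position, map every token that appears within the next
    `window` tokens to its proximity rank (0 = appears soonest). The
    model learns the ordering of future tokens, not their exact
    sequence."""
    targets = []
    for i in range(len(tokens)):
        upcoming = tokens[i + 1 : i + 1 + window]
        ranks = {}
        for dist, tok in enumerate(upcoming):
            ranks.setdefault(tok, dist)  # keep soonest occurrence only
        targets.append(ranks)
    return targets

tokens = ["the", "cat", "sat", "on", "the", "mat"]
print(top_targets(tokens, window=3)[0])
# {'cat': 0, 'sat': 1, 'on': 2}
```

The intuition: a wrong guess about *which* token comes next is heavily penalized under MTP, but under an ordering objective the model still gets credit for knowing roughly *when* tokens will appear, which is a softer and less noisy signal.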

Overall, while TOP is still an early experiment, it shows promise as a softer and more effective auxiliary objective than MTP. It encourages models to reason about future tokens in a less noisy and more structured way, potentially leading to better performance across a range of tasks. The video calls for further research and ablation studies to fully understand TOP's benefits and how it compares with other methods. The presenter also invites viewers to share their thoughts and subscribe to their newsletter for more updates on cutting-edge AI research.