Road to 5 Million Tokens: Breaking Barriers in Long Context Training — Max Ryabinin, Together AI

Max Ryabinin from Together AI presents research on extending the training context length of transformer models to 5 million tokens. The talk walks through the memory and compute obstacles and the techniques used to overcome them: fully sharded data parallelism, context parallelism with DeepSpeed Ulysses, activation checkpointing, and a new method called Untied Ulysses. The work shows that combining these strategies with pipeline parallelism makes it possible to train extremely long-context models efficiently.

Max Ryabinin, VP of Research and Development at Together AI, presents their research project focused on extending the context length in transformer-based language models to 5 million tokens. Together AI is an AI-native cloud platform offering services and infrastructure for AI development, including model creation, fine-tuning, reinforcement learning, and inference deployment. Max emphasizes the growing community interest in training models with long context windows, driven by applications like agents that require extensive token contexts and video generation tasks needing temporal consistency over multiple frames.

Scaling transformer models to very long sequences runs into two main challenges: the compute cost of attention grows quadratically with sequence length because every token interacts with every other token, and activation memory grows linearly with sequence length. Max illustrates these issues using examples from Hugging Face and explains that simply increasing the context length quickly exceeds the memory available on GPUs. The goal of the project was to explore existing and novel optimization techniques to push the limits of long-context training on a single node with 8 H100 GPUs.
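The two scaling trends above can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the model dimensions (`hidden=4096`, `layers=32`) are hypothetical values not taken from the talk, the FLOP count covers only the two attention matrix multiplications and ignores causal masking (which roughly halves it), and the memory figure counts one 16-bit hidden-state tensor per layer.

```python
def attention_flops(seq_len: int, hidden: int, layers: int) -> int:
    """Approximate forward FLOPs of the attention matmuls.

    QK^T and (scores @ V) each cost about 2 * seq_len^2 * hidden FLOPs
    per layer, so the total grows quadratically with seq_len.
    """
    return layers * 4 * seq_len**2 * hidden


def activation_bytes(seq_len: int, hidden: int, layers: int,
                     bytes_per_value: int = 2) -> int:
    """Bytes needed to keep one hidden-state tensor per layer.

    This grows linearly with seq_len.
    """
    return layers * seq_len * hidden * bytes_per_value


if __name__ == "__main__":
    HIDDEN, LAYERS = 4096, 32  # hypothetical model size, for illustration only
    for n in (128_000, 1_000_000, 5_000_000):
        flops = attention_flops(n, HIDDEN, LAYERS)
        mem_gb = activation_bytes(n, HIDDEN, LAYERS) / 1e9
        print(f"{n:>9} tokens: {flops:.3e} attention FLOPs, {mem_gb:,.1f} GB of hidden states")
```

Doubling the sequence length quadruples the attention FLOPs but only doubles the stored activations, yet at millions of tokens even the linear term is large relative to GPU memory under these assumptions.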

To address these challenges, Together AI applied several strategies in sequence. First, fully sharded data parallelism distributes model parameters across GPUs, which reduces memory load but does not solve the problem on its own because attention activations remain. Next, they used context parallelism, specifically DeepSpeed Ulysses, which splits the sequence across GPUs and partitions the attention computation by heads, significantly reducing per-GPU memory usage. Activation checkpointing discards intermediate activations in the forward pass and recomputes them during backpropagation, lowering memory demands further. Finally, offloading some activations to CPU memory and chunking element-wise computations kept large temporary buffers under control, enabling training with up to 3 million tokens.
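The layout change at the heart of Ulysses-style context parallelism can be sketched without any GPU code. The toy simulation below is not DeepSpeed's implementation: the function names are hypothetical, "tensors" are nested lists of `(token, head)` labels, and the all-to-all exchange is simulated in a single process. It assumes the sequence length and the number of heads are both divisible by the number of ranks.

```python
def shard_by_sequence(seq_len, num_heads, world_size):
    """Before attention: rank r holds a contiguous slice of tokens, all heads."""
    chunk = seq_len // world_size
    return [
        [[(t, h) for h in range(num_heads)]
         for t in range(r * chunk, (r + 1) * chunk)]
        for r in range(world_size)
    ]


def all_to_all_seq_to_heads(shards, num_heads):
    """Simulated all-to-all: rank r ends up with every token, but only its heads."""
    world_size = len(shards)
    heads_per_rank = num_heads // world_size
    out = []
    for r in range(world_size):
        lo, hi = r * heads_per_rank, (r + 1) * heads_per_rank
        # Collect head slice [lo, hi) from every source rank, in token order.
        out.append([row[lo:hi] for src in shards for row in src])
    return out


if __name__ == "__main__":
    shards = shard_by_sequence(seq_len=8, num_heads=4, world_size=2)
    by_heads = all_to_all_seq_to_heads(shards, num_heads=4)
    for rank, local in enumerate(by_heads):
        print(f"rank {rank}: {len(local)} tokens x {len(local[0])} heads")
```

After the exchange each rank sees the full sequence for its own heads, so it can run ordinary attention locally; a second all-to-all in the opposite direction would restore the sequence-sharded layout for the rest of the layer.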

To go beyond 3 million tokens, Together AI developed a further optimization called Untied Ulysses, which refines context parallelism by dividing the attention heads into smaller chunks and iterating over them sequentially. Because the same buffers are reused across iterations, peak memory allocation drops without sacrificing throughput. Their experiments showed that this method matches or exceeds the memory efficiency of existing transformer training implementations while scaling to 5 million tokens. Combining these techniques with pipeline parallelism (U-Pipe) allows further memory savings and efficient resource utilization during training.
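The head-chunking idea described above can be illustrated with a toy, single-process sketch. This is not the authors' implementation: the function names are hypothetical, each head has dimension 1 to keep the code short, there is no inter-GPU communication, and "peak memory" is approximated by the number of heads whose buffers are live at once.

```python
import math


def attend_one_head(q, k, v):
    """Plain softmax attention for one head; q, k, v are per-token scalars."""
    out = []
    for qi in q:
        scores = [qi * kj for kj in k]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(sum(w * vj for w, vj in zip(weights, v)) / total)
    return out


def attention_in_head_chunks(q, k, v, heads_per_chunk):
    """q, k, v are indexed [head][token]. Returns (output, peak_live_heads)."""
    num_heads = len(q)
    output = [None] * num_heads
    peak_live_heads = 0
    for start in range(0, num_heads, heads_per_chunk):
        chunk = range(start, min(start + heads_per_chunk, num_heads))
        # Only this chunk's buffers are live; the next iteration reuses them.
        peak_live_heads = max(peak_live_heads, len(chunk))
        for h in chunk:
            output[h] = attend_one_head(q[h], k[h], v[h])
    return output, peak_live_heads


if __name__ == "__main__":
    heads, tokens = 8, 4
    q = [[0.1 * (h + 1) * (t + 1) for t in range(tokens)] for h in range(heads)]
    k = [[0.2 * (h + 1) - 0.05 * t for t in range(tokens)] for h in range(heads)]
    v = [[float(t - h) for t in range(tokens)] for h in range(heads)]
    full, peak_full = attention_in_head_chunks(q, k, v, heads_per_chunk=heads)
    chunked, peak_chunked = attention_in_head_chunks(q, k, v, heads_per_chunk=2)
    print("same result:", full == chunked)
    print("peak live heads:", peak_full, "vs", peak_chunked)
```

Because heads are independent in attention, processing them a few at a time leaves the result unchanged while bounding how many head-sized buffers exist simultaneously, which is the property the method relies on.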

In conclusion, Max highlights that training models with extremely long context lengths is complex but achievable, provided several memory optimization techniques are combined. He stresses the importance of profiling tools such as the PyTorch profiler for identifying bottlenecks and optimizing resource usage. Together AI’s publicly available paper describes the methods for overcoming memory barriers in long-context transformer training in detail, paving the way for more capable models in applications that demand extensive contextual understanding.