POV: Chinese AI Lab Teaching Everyone How To Save Millions of Dollars

The video showcases ByteDance Seed, a leading Chinese AI lab, introducing Pre-trained Model Averaging (PMA), a cost-effective technique that merges model checkpoints during large language model pre-training to save up to 15% of compute while maintaining performance. The method smooths out training oscillations, enabling early performance estimation, improved stability, and recovery from training issues, and it marks a significant advance in efficient AI model training.

The video discusses the advancements made by ByteDance Seed, the AI lab of ByteDance (the company behind Douyin and TikTok), which is rapidly becoming a leading force in AI research, arguably surpassing even DeepSeek. Seed has introduced cutting-edge concepts and models, including its video model Seedance 1.0, which outperforms Google's Veo 3 in video and audio generation. The lab's research is notable not only for its innovation but also for its substantial budget, which lets it compete with global giants like Google and OpenAI. The video focuses on an idea the lab has revived, model merging, in the context of large language model (LLM) pre-training, an area that has remained relatively unexplored because of the high costs involved.

Model merging, traditionally used in image generation to combine styles, is far less common in LLM pre-training because of the immense computational expense and risk. Training large models from scratch can cost millions of dollars, which makes experimental approaches like model merging risky and costly without any guaranteed improvement. Moreover, sharing detailed methodology could aid competitors, something top AI labs typically avoid. ByteDance Seed nevertheless invested millions to explore model merging during pre-training, demonstrating that it can provide accurate scaling-law predictions and free performance boosts, as well as a way to recover training if issues arise mid-run.

The core technique introduced is called Pre-trained Model Averaging (PMA): model checkpoints are saved at fixed token intervals during training, and these snapshots are averaged into a single model. The merged model can approximate the performance of a fully annealed model, effectively previewing final quality while saving significant training time and compute, up to 15% of the compute budget and several days of training. PMA was tested on models ranging from hundreds of millions to tens of billions of parameters, covering both dense and mixture-of-experts architectures, and the results showed that a simple moving average (SMA) outperformed more complex averaging schemes.
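The merging step itself is simple to picture. Below is a minimal PyTorch sketch of averaging saved checkpoints into one model; the file names, checkpoint steps, and the `average_checkpoints` helper are illustrative assumptions, not code from the paper or the video.

```python
# Minimal sketch of PMA-style checkpoint averaging (a simple moving average over
# weights). Assumes checkpoints were saved at fixed token intervals as state_dict
# files; the paths and intervals below are hypothetical.
import torch


def average_checkpoints(checkpoint_paths):
    """Average the weights of several checkpoints into one merged state_dict."""
    merged = None
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        if merged is None:
            # Accumulate in float32 so low-precision or integer entries average cleanly.
            merged = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                merged[k] += v.float()
    n = len(checkpoint_paths)
    return {k: v / n for k, v in merged.items()}


# Example: merge the last five checkpoints (file names are made up) and evaluate
# the merged model as a preview of what annealing would eventually produce.
paths = [f"checkpoint_step_{s}.pt" for s in (80_000, 85_000, 90_000, 95_000, 100_000)]
merged_state = average_checkpoints(paths)
# model.load_state_dict(merged_state)
```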

The video explains why model merging works so well: during the constant learning rate phase, model weights oscillate noisily around an optimal point. PMA acts like a low-pass filter, smoothing out these high-frequency oscillations in one step, similar to what learning rate annealing achieves gradually. This means that PMA can replace or reduce the need for the annealing phase, although annealing might still provide marginal gains if extended beyond the tested token budgets. The simplicity and effectiveness of SMA make it a practical choice, preserving useful variance from earlier training stages that more complex methods might overlook.
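The low-pass-filter intuition is easy to reproduce with a toy simulation: snapshots of a single weight oscillating around its optimum land much closer to that optimum once averaged. The NumPy sketch below is purely illustrative; the noise level and checkpoint count are made up.

```python
# Toy illustration of averaging as a low-pass filter: during the constant
# learning-rate phase a weight oscillates noisily around its optimum, and the
# simple moving average of recent snapshots sits much closer to it.
import numpy as np

rng = np.random.default_rng(0)
optimum = 1.0
# Twenty "checkpoints" of a single weight, oscillating noisily around the optimum.
snapshots = optimum + 0.1 * rng.standard_normal(20)

last_error = abs(snapshots[-1] - optimum)          # error using only the final checkpoint
sma_error = abs(snapshots[-10:].mean() - optimum)  # error of the 10-checkpoint average

print(f"error of final checkpoint:  {last_error:.4f}")
print(f"error of 10-checkpoint SMA: {sma_error:.4f}")
# In expectation the SMA error shrinks by roughly 1/sqrt(10), the same kind of
# variance reduction that gradually annealing the learning rate would achieve.
```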

Finally, the video highlights the practical benefits of PMA: cost reduction, early performance estimation, and increased training stability. It can save 10-20% of training time and budget during hyperparameter tuning or scaling-law experiments, and it can help recover from training instabilities or crashes by averaging stable checkpoints, as sketched after this paragraph. PMA's stabilizing effect could also benefit distributed training setups by smoothing noise across multiple model copies. The video concludes by emphasizing the significance of this research in the ongoing pre-training era and encourages viewers to follow the creator's newsletter for more cutting-edge AI research updates. The video is also sponsored by RunPod, a platform offering serverless GPU services that simplify AI model deployment and training.
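To make the recovery use case concrete, here is how restarting from an average of the last stable checkpoints might look, reusing the `average_checkpoints` helper from the earlier sketch; the spike scenario, file names, and the trigger for recovery are assumptions for illustration only.

```python
# Sketch of recovering from a loss spike by restarting from the simple moving
# average of recent stable checkpoints, rather than rolling back to a single one.
# Reuses average_checkpoints() from the sketch above; everything else is made up.

def recover_from_spike(model, stable_checkpoint_paths):
    """Reset the model to the SMA of recent stable checkpoints before resuming."""
    merged_state = average_checkpoints(stable_checkpoint_paths)
    model.load_state_dict(merged_state)
    # Optimizer state (e.g. Adam moments) would typically be reset or reloaded
    # separately before training continues from this smoothed starting point.
    return model


# Example: the loss spiked after step 90k, so restart from the average of the
# three checkpoints saved before the spike (file names are illustrative).
# model = recover_from_spike(model, ["ckpt_84000.pt", "ckpt_86000.pt", "ckpt_88000.pt"])
```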