Qwen3 Next - Behind the Curtain

The video highlights Qwen 3 Next, an innovative 80 billion parameter sparse MoE model that achieves impressive training efficiency and inference speed by activating only 3 billion parameters during inference and utilizing multi-token prediction. Despite being trained on fewer tokens, it performs competitively with larger models, showcasing significant advancements in open-source Chinese language models and promising future improvements in AI research.

The video discusses the recent rapid advancements and releases from the company Qwen, highlighting their impressive lineup of models including the successful Qwen 3 Coder, a translation model, image generation and editing models, and the newly released Qwen 3 ASR. The focus of the video is on the Qwen 3 Next model, an experimental release that, while not necessarily outperforming the largest Qwen 3 Mixture of Experts (MoE) models, shows significant potential in training efficiency and inference speed. The model is designed to be cheaper to train, so it can see more tokens for the same budget, and more efficient during inference, making it both quicker and smarter in practical use.

Qwen 3 Next is an 80 billion parameter MoE model that activates only 3 billion parameters during inference, a remarkable reduction compared to previous models like Qwen 3 235B, which had 22 billion active parameters. This sparsity comes from a much larger pool of experts (512 in this model versus 128 in earlier versions), allowing specialization across different tasks. Its other key innovations are a hybrid attention mechanism and multi-token prediction; generating several tokens per forward pass improves inference efficiency and aligns with recent research trends in language modeling.
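To make the sparsity idea concrete, here is a minimal PyTorch sketch of top-k expert routing: per token, a router picks a handful of the 512 experts and only those run. The hidden size, the top-k value, and the expert structure below are illustrative assumptions, not Qwen's published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Toy top-k MoE layer: only `top_k` of `num_experts` experts run per
    token, which is how a large total parameter count can coexist with a
    small *active* parameter count. Sizes here are hypothetical."""
    def __init__(self, d_model=64, num_experts=512, top_k=10):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.router(x)                 # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # normalize over chosen experts
        out = torch.zeros_like(x)
        for t in range(x.size(0)):              # run only the selected experts
            for slot in range(self.top_k):
                e = idx[t, slot].item()
                out[t] += weights[t, slot] * self.experts[e](x[t])
        return out

layer = SparseMoELayer()
tokens = torch.randn(4, 64)
print(layer(tokens).shape)   # torch.Size([4, 64])
```

The naive per-token loop is only for readability; real implementations batch tokens by expert, but the routing logic is the same.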

The model was trained on 15 trillion tokens, a substantial amount though less than the 36 trillion tokens used for some other Qwen models. Despite this, Qwen 3 Next performs comparably or better on certain benchmarks while requiring less than 10% of the compute cost of larger models. This efficiency suggests that future versions trained on the full corpus or larger models could set new state-of-the-art standards. The video also notes the availability of different versions of the model, including a “thinking” version that produces longer, more detailed responses, and an instruct version for simpler outputs.
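As a rough sanity check on the compute claim, a back-of-envelope estimate using the common ~6 × N × D training-FLOPs heuristic (N active parameters, D training tokens) lands in the same ballpark. The heuristic and the resulting ratio are assumptions for illustration, not figures quoted in the video, and they ignore differences in attention cost and hardware utilization.

```python
# Crude training-compute comparison via the ~6 * N * D FLOPs heuristic,
# where N = active parameters and D = training tokens (an assumption for
# illustration, not a figure from the video).

def train_flops(active_params: float, tokens: float) -> float:
    return 6 * active_params * tokens

next_flops = train_flops(3e9, 15e12)     # Qwen 3 Next: ~3B active, 15T tokens
big_flops  = train_flops(22e9, 36e12)    # Qwen 3 235B: ~22B active, 36T tokens

print(f"Qwen 3 Next : {next_flops:.2e} FLOPs")
print(f"Qwen 3 235B : {big_flops:.2e} FLOPs")
print(f"ratio       : {next_flops / big_flops:.1%}")  # roughly 5-6%
```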

In practical testing, the presenter explores the model’s performance on various tasks, noting its ability to generate thoughtful and coherent responses, often thinking in English even when asked to respond in other languages like Thai or Chinese. The multi-token prediction is evident in streaming outputs, where multiple tokens are generated simultaneously, enhancing speed. The model also shows promise in agentic tasks and function calling, especially when used with Qwen’s own agent framework, though further testing with other frameworks is suggested.
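For anyone who wants to observe the streaming behaviour themselves, below is a minimal sketch of querying the instruct model through an OpenAI-compatible endpoint (for example a locally hosted server) with streaming enabled. The base_url, API key, and checkpoint name are assumptions for illustration, not details given in the video.

```python
# Minimal sketch of streaming a chat completion from an OpenAI-compatible
# endpoint serving the instruct model. Endpoint and model name are assumed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",  # assumed checkpoint name
    messages=[{"role": "user", "content": "Reply in Thai: what is a Mixture of Experts?"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        # With multi-token prediction / speculative decoding on the server,
        # each streamed chunk may carry several tokens at once.
        print(delta, end="", flush=True)
```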

Overall, the video highlights Qwen 3 Next as a significant step forward in open-source Chinese language models, showcasing innovations in sparsity, multi-token prediction, and training efficiency. The openness of Qwen’s experiments contrasts with proprietary models, raising questions about how other labs like Meta will compete. The presenter encourages viewers to try the model themselves and expresses interest in exploring other Qwen models like the ASR in future videos, emphasizing the exciting direction of Chinese AI research and development.