The video highlights the release of DeepSeek-R1 V2, an open-source AI model that shows significant performance improvements over its predecessor, including better accuracy, better usability, and new features such as JSON output and function calling. Additionally, DeepSeek has released a smaller, more efficient distilled model, a post-trained Qwen3 8B, which achieves top results among open-source models on the AIME 2024 benchmark; both models are available for download and testing.
The video provides an update on the release of DeepSeek-R1 V2, an open-source state-of-the-art AI model announced by DeepSeek on May 29th. Although DeepSeek described it as a minor upgrade, the official benchmarks and performance metrics suggest significant improvements over the previous version. The weights for R1 V2 were published a day before the announcement, allowing third-party evaluators to assess its capabilities. The update emphasizes improved front-end capabilities, reduced hallucinations, and new support for JSON output and function calling, marking notable advances in usability and reliability.
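DeepSeek's API follows the OpenAI-compatible Chat Completions request shape, so the new JSON output and function calling features would be exercised through the standard `response_format` and `tools` fields. The sketch below only assembles such a request payload; the model identifier and the `get_weather` tool are illustrative assumptions, not details from the video.

```python
# Hedged sketch: building an OpenAI-style chat request that enables the two
# features the release highlights -- JSON output and function calling.
# The model name and the example tool are assumptions for illustration.
import json


def build_function_call_request(user_prompt: str) -> dict:
    """Assemble a chat-completion payload with JSON mode and one declared tool."""
    return {
        "model": "deepseek-reasoner",  # assumed identifier; check DeepSeek's docs
        "messages": [{"role": "user", "content": user_prompt}],
        # JSON mode: ask the model to return a valid JSON object.
        "response_format": {"type": "json_object"},
        # Function calling: declare a tool the model may choose to invoke.
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",  # hypothetical example tool
                    "description": "Look up the current weather for a city.",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
    }


request = build_function_call_request("What's the weather in Paris? Reply in JSON.")
print(json.dumps(request, indent=2))
```

In a real call, this dictionary would be passed to an OpenAI-compatible client pointed at DeepSeek's endpoint; the payload construction alone shows which fields the new features touch.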
Performance benchmarks show that R1 V2 delivers roughly a 20% performance increase over R1 V1. In competitive evaluations, R1 V2 now beats Gemini 2.5 Pro (05-06) about half the time and sits only slightly behind OpenAI's o3. Its Aider leaderboard score is close to Claude 4's, despite R1 V2 being a freshly released open-weight model. These results show that R1 V2 is rapidly establishing itself as a top contender among open-source models, rivaling some of the most advanced proprietary ones.
Third-party benchmarks further highlight R1 V2's strengths and weaknesses. It ranks as the third-best model for creative writing on EQ-Bench, just behind OpenAI's o3 and o4 Mini, although it scores worse in this area than both its predecessor R1 and DeepSeek's own V3 0324. For general tool use, it beats Gemini 2.5 Pro on certain tests but still lags behind models like Claude 4 and GPT-4o. Its coding and scientific reasoning abilities have improved but remain inconsistent, with some outputs appearing strange or questionable, indicating room for further refinement.
The model shows a preference for structured outputs, such as nested bullet points, similar to Gemini, but tends to produce fewer paragraphs, resembling OpenAI's o3. Its reasoning process is decent on logical questions but can occasionally be questionable. Despite these limitations, R1 V2 maintains respectable performance across a range of tasks. Generation speed remains slow on the official website, and there is no indication that the context window has grown, which could limit its usability for longer interactions.
Finally, DeepSeek has distilled R1 V2's chain-of-thought reasoning into a smaller, more efficient model: a post-trained Qwen3 8B. This distilled model achieves state-of-the-art results among open-source models on the AIME 2024 benchmark, surpassing the base Qwen3 8B by 10% and matching the performance of Qwen3 235B in thinking mode. Although it still lags in coding and scientific accuracy, this is a significant boost for an 8B model. Both the full and distilled models are available for download on Hugging Face, and users can try R1 V2 on DeepSeek's official website.
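Since both checkpoints are published on Hugging Face, fetching them typically goes through `huggingface_hub`. The repository IDs below are my assumptions about the canonical names, not something stated in the video; the actual download call is left commented out because the full model's weights run to hundreds of gigabytes.

```python
# Hedged sketch: resolving the two released checkpoints to Hugging Face repos.
# The repo IDs are assumptions -- verify them on DeepSeek's Hugging Face page.
MODELS = {
    "r1-v2-full": "deepseek-ai/DeepSeek-R1-0528",                  # assumed repo id
    "r1-v2-distill-8b": "deepseek-ai/DeepSeek-R1-0528-Qwen3-8B",   # assumed repo id
}


def download(alias: str, models: dict = MODELS) -> str:
    """Resolve a short alias to its repo id (real download call left commented)."""
    repo_id = models[alias]
    # from huggingface_hub import snapshot_download
    # return snapshot_download(repo_id)  # full model is hundreds of GB
    return repo_id


print(download("r1-v2-distill-8b"))
```

The 8B distillation is the practical choice for local testing; the full model generally requires multi-GPU serving infrastructure.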