How DeepSeek Built The Current "Best" Math Prover AI

DeepSeek Prover V2 is the leading AI model for formal mathematical theorem proving, developed through a recursive pipeline that generates a vast dataset of formal proofs verified by the Lean 4 proof assistant. This approach enables the system to produce rigorously checked proofs, outperforming previous models on challenging benchmarks and highlighting the potential for highly reliable, logically sound AI reasoning in mathematics.

The video discusses how DeepSeek developed the current "best" AI model for formal mathematical theorem proving, DeepSeek Prover V2. Unlike models that produce approximate reasoning or natural-language explanations, this model generates proofs that are rigorously verified within a formal system. The breakthrough stems from a novel pipeline that synthesizes a high-quality dataset of formal proofs, enabling the model to learn precise logical reasoning. On the challenging PutnamBench benchmark of Putnam competition problems, DeepSeek Prover V2 significantly outperforms previous models, solving 49 out of 658 questions, nearly five times as many as the prior state of the art.
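To make "rigorously verified" concrete, here is a minimal Lean 4 theorem (my own illustrative example, not one from the benchmark). The Lean kernel mechanically checks every step, so an accepted proof is correct by construction, which is exactly the guarantee the article contrasts with natural-language answers:

```lean
-- A trivial but fully formal theorem: for every natural number n, n + 0 = n.
-- `rfl` succeeds because `n + 0` reduces to `n` by definition of addition;
-- if the claim were false, Lean would reject the proof outright.
theorem add_zero_eq (n : Nat) : n + 0 = n := by
  rfl
```

A model that emits Lean proofs therefore cannot "hallucinate" a result past the checker: either the proof compiles or it does not.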

The core innovation lies in how the model is trained using a recursive, hierarchical pipeline built around the Lean 4 formal proof system. The process begins with a large generalist language model, DeepSeek V3, which sketches proof outlines in natural language and decomposes complex theorems into smaller sub-goals. A smaller, specialized prover model then fills in the details of these sub-goals, breaking them down further if they are too difficult. This recursive approach generates a large corpus of verified formal proofs, which are then stitched together to create the "cold-start" training dataset, enabling the large model to learn formal reasoning without human-labeled ground truth.
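The decomposition step can be pictured directly in Lean. The sketch below is a hypothetical illustration of the idea (the theorem and sub-goal split are mine, not taken from the paper): the generalist model emits an outline whose sub-goals are stubbed with `sorry`, and the small prover's job is to replace each `sorry` with an actual proof before the pieces are stitched back together:

```lean
-- Hypothetical proof sketch: the sum of two even naturals is even.
-- Each `sorry` marks a sub-goal handed off to the cheap prover model.
theorem sum_even (a b : Nat) (ha : a % 2 = 0) (hb : b % 2 = 0) :
    (a + b) % 2 = 0 := by
  have h1 : (a + b) % 2 = (a % 2 + b % 2) % 2 := by sorry
  have h2 : (a % 2 + b % 2) % 2 = 0 := by sorry
  rw [h1, h2]
```

Once every `sorry` is discharged and the whole file compiles, the completed proof becomes one verified training example.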

This synthetic data generation approach is highly efficient because it relies on a two-model setup: the large language model for outlining and decomposition, and a smaller, computationally inexpensive model for filling in proof details. The process produces a vast amount of verified proofs that serve as training data, allowing the model to internalize formal reasoning patterns. After training on this data, the model is further refined through reinforcement learning, where it is rewarded for generating correct proofs and maintaining logical structure, ensuring consistency and accuracy in its reasoning process.

Remarkably, the research reveals that larger models tend to internalize step-by-step reasoning, often producing explanations even when not explicitly prompted. Yet on certain challenging problems, the smaller 7B-parameter model outperformed the 671B model, discovering proof tactics involving finite cardinalities that the bigger model never found. This suggests that smaller models can develop specialized skills that larger, more general models overlook, underscoring the value of targeted training and skill discovery in formal reasoning tasks.

Overall, DeepSeek Prover V2 represents a significant advancement in AI for formal mathematics, emphasizing rigorous, verifiable reasoning over approximate solutions. Unlike traditional math-solving models that produce answers or natural language explanations, this system guarantees correctness within a formal proof system like Lean 4. The approach holds promise for future AI developments, potentially leading to more reliable, logically sound reasoning systems that can tackle complex mathematical and scientific problems with unprecedented precision.