The lecture explores how deep reinforcement learning can enhance large language models’ reasoning abilities by framing problem-solving as a Markov decision process and using techniques like advantage-weighted regression and Direct Preference Optimization to improve credit assignment and data efficiency. It highlights that combining these RL methods with advanced base models capable of meta-reasoning enables the generation of complex, accurate solution sequences, significantly advancing LLM performance on challenging tasks like math problem-solving.
This lecture on deep reinforcement learning (RL) for large language model (LLM) reasoning focuses on how RL techniques can improve the reasoning capabilities of language models, particularly on complex math problems. Language models are traditionally trained by next-token prediction, where the model learns to predict the next token in a sequence given the previous tokens. This works well when data is plentiful, but it struggles to reach expert-level reasoning performance because high-quality data is scarce, especially for technical problems like those found in math competitions. The lecture uses math problem-solving as a concrete example to explore how RL can address these limitations by framing reasoning as a Markov decision process (MDP), where the model generates a sequence of solution steps and receives a sparse reward based on the correctness of the final answer.
The reasoning problem is modeled as an MDP in which the initial state is the problem statement and each action is one step of the solution. The model produces a sequence of steps ending in a final answer, and it receives a reward only if that answer is correct. This setup allows RL methods to improve the model’s reasoning by learning from both correct and incorrect solution trajectories. The lecture introduces three main training paradigms: supervised fine-tuning (SFT) on human-written solutions; rejection fine-tuning (RFT), which samples solutions from the model itself and trains only on those that reach the correct answer; and full reinforcement learning, which can learn from all trajectories, including incorrect ones. Empirical results show that RFT can be twice as data-efficient as SFT, but excessive reliance on self-generated data can hurt generalization, because a solution can reach the correct final answer while still containing spurious or irrelevant steps.
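The MDP framing and the rejection step of RFT can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: the `Trajectory` class, the exact-match grading in `sparse_reward`, and the hand-written samples are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    problem: str        # initial state: the problem statement
    steps: List[str]    # actions: one entry per solution step
    final_answer: str   # answer produced by the last step


def sparse_reward(traj: Trajectory, reference_answer: str) -> float:
    """Return 1.0 only when the final answer matches the reference, else 0.0."""
    return 1.0 if traj.final_answer.strip() == reference_answer.strip() else 0.0


def rejection_filter(samples: List[Trajectory], reference_answer: str) -> List[Trajectory]:
    """Keep the model-generated trajectories whose final answer is correct."""
    return [t for t in samples if sparse_reward(t, reference_answer) == 1.0]


# Hypothetical samples for one problem; in practice these come from the model.
samples = [
    Trajectory("What is 6 * 7?", ["6 * 7 = 42"], "42"),
    Trajectory("What is 6 * 7?", ["6 + 7 = 13"], "13"),
    Trajectory("What is 6 * 7?", ["7 is prime", "6 * 7 = 42"], "42"),
]
rft_data = rejection_filter(samples, reference_answer="42")
print(len(rft_data), "of", len(samples), "samples kept for fine-tuning")
```

The third sample passes the filter even though its first step contributes nothing to the answer. Because the reward looks only at the final answer, this kind of step survives rejection filtering.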
To address spurious steps (incorrect or irrelevant solution steps that do not change the final answer but degrade model performance when trained on), the lecture discusses advantage functions derived from rollout-based Q-value estimates. Sampling several completions from a partial solution prefix and measuring how often they reach the correct answer gives an estimate of the expected future reward from that prefix, and comparing the estimates before and after a step assigns credit or blame to that individual step. Steps with positive advantage are considered useful and retained for training, while those with negative advantage are discarded. This approach, akin to advantage-weighted regression in offline RL, helps the model focus on learning correct reasoning patterns and reduces overfitting to spurious steps. Experiments show that advantage filtering lowers test error and reduces the number of spurious steps compared to naive rejection fine-tuning.
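A rough sketch of rollout-based credit assignment follows. It assumes the advantage of a step is the change in estimated success rate when the step is appended to the prefix. The `toy_rollout` function is a hypothetical stand-in for sampling a completion from the model and grading it, and the success probabilities inside it are invented for illustration.

```python
import random
from typing import Callable, List

# rollout_fn(prefix, rng) -> True if one sampled completion of `prefix` is correct.
RolloutFn = Callable[[List[str], random.Random], bool]


def estimate_q(prefix: List[str], rollout_fn: RolloutFn,
               rng: random.Random, n_rollouts: int = 200) -> float:
    """Monte Carlo estimate of the chance that completing `prefix` succeeds."""
    wins = sum(rollout_fn(prefix, rng) for _ in range(n_rollouts))
    return wins / n_rollouts


def step_advantages(problem: str, steps: List[str], rollout_fn: RolloutFn,
                    seed: int = 0, n_rollouts: int = 200) -> List[float]:
    """Advantage of each step = Q(prefix with step) - Q(prefix without it)."""
    rng = random.Random(seed)
    prefix = [problem]
    q_before = estimate_q(prefix, rollout_fn, rng, n_rollouts)
    advantages = []
    for step in steps:
        prefix = prefix + [step]
        q_after = estimate_q(prefix, rollout_fn, rng, n_rollouts)
        advantages.append(q_after - q_before)
        q_before = q_after
    return advantages


def toy_rollout(prefix: List[str], rng: random.Random) -> bool:
    """Toy stand-in for sampling a completion from the model and grading it."""
    if any("subtract 1" in s for s in prefix):
        p_success = 0.0   # a harmful step makes the correct answer unreachable
    elif any("6 * 7 = 42" in s for s in prefix):
        p_success = 1.0   # the key step makes success certain
    else:
        p_success = 0.5   # from the bare problem, half the completions succeed
    return rng.random() < p_success


steps = ["6 * 7 = 42", "42 looks too large, so subtract 1"]
advantages = step_advantages("What is 6 * 7?", steps, toy_rollout)
# Keep only steps with positive estimated advantage (threshold 0 is a choice).
kept_steps = [s for s, a in zip(steps, advantages) if a > 0]
print(advantages, kept_steps)
```

In this toy setup the first step raises the estimated success rate and is kept, while the second drops it to zero and is discarded. With a real model the estimates are noisy, so the number of rollouts per prefix and the filtering threshold both become tuning choices.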
Building on advantage estimation, the lecture explores how these ideas integrate with offline RL methods such as Direct Preference Optimization (DPO). Instead of discarding incorrect trajectories, DPO uses preference pairs constructed from rollouts to train the model to prefer better solution continuations. This method retains more data diversity and further improves data efficiency, achieving up to an eightfold reduction in the amount of human-written data needed compared to SFT. The lecture also touches on practical considerations such as how to segment solutions into steps (often by sentences or logical blocks) and the trade-offs between generating high-quality versus diverse solution traces during training.
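As a hedged illustration, the sketch below pairs continuations of a shared prefix by their estimated value and evaluates the standard DPO loss for a single pair. The pairing rule and the `min_gap` threshold are assumptions for this example, not the lecture's recipe. The log-probabilities are made-up numbers; in practice they come from the policy being trained and from a frozen reference model.

```python
import math
from typing import List, Tuple


def build_preference_pairs(continuations: List[Tuple[str, float]],
                           min_gap: float = 0.2) -> List[Tuple[str, str]]:
    """Pair continuations of one shared prefix as (preferred, dispreferred).

    Each input item is (text, estimated value); a pair is emitted only when
    the value gap is at least `min_gap` (an illustrative threshold).
    """
    ranked = sorted(continuations, key=lambda c: c[1], reverse=True)
    pairs = []
    for i, (good, v_good) in enumerate(ranked):
        for bad, v_bad in ranked[i + 1:]:
            if v_good - v_bad >= min_gap:
                pairs.append((good, bad))
    return pairs


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair: -log sigmoid(beta * (chosen gain - rejected gain))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable -log(sigmoid(margin)), i.e. softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


pairs = build_preference_pairs([("x = 42", 0.9), ("x = 13", 0.1), ("x = 41", 0.8)])
loss_at_reference = dpo_loss(-5.0, -5.0, -5.0, -5.0)      # policy equals reference
loss_after_improvement = dpo_loss(-3.0, -7.0, -5.0, -5.0)  # policy favors the chosen one
print(pairs, loss_at_reference, loss_after_improvement)
```

The loss equals log 2 when the policy matches the reference. It falls as the policy raises the likelihood of the preferred continuation relative to the dispreferred one, so an incorrect continuation still provides a training signal as the losing side of a pair.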
Finally, the lecture discusses online RL approaches in which the model continuously generates solutions, receives sparse or dense rewards, and updates its policy using policy gradient methods like GRPO. Modern “thinking models” such as DeepSeek-R1 and OpenAI’s o-series combine these RL techniques with more capable base models that can perform meta-reasoning steps like verification and backtracking. These advances allow models to produce longer, more complex solution sequences and achieve significantly better reasoning performance. Overall, the lecture highlights that while the core RL training objectives remain similar, improvements in base model capabilities and more sophisticated credit assignment methods have driven recent breakthroughs in LLM reasoning.
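The group-relative advantage commonly associated with GRPO can be sketched as follows: several solutions are sampled for the same problem, and each reward is normalized by the mean and standard deviation of that group. The function name, the use of the population standard deviation, and the `eps` constant are assumptions for this example. The policy-gradient update that would consume these advantages is omitted.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and std of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std; an illustrative choice
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled solutions to one problem, graded with a sparse 0/1 reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct samples receive a positive advantage and incorrect ones a negative advantage, without a separately learned value function. A group in which every sample earns the same reward yields all-zero advantages, and so contributes no learning signal.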