The lecture reviews advancements in reinforcement learning for language models, focusing on simplifying RLHF through algorithms like DPO and GRPO, and highlights recent successful applications of RL with verifiable, outcome-based rewards to train reasoning models efficiently and stably. It emphasizes the challenges of noisy human feedback in RLHF and presents verifiable rewards as a promising solution for scalable, interpretable, and controllable language model fine-tuning.
The lecture begins with a recap of Reinforcement Learning from Human Feedback (RLHF), focusing on the Direct Preference Optimization (DPO) algorithm. DPO optimizes language model policies on pairwise human preference data by reparameterizing the reward as a scaled log-ratio between the policy and a reference policy and maximizing the likelihood of the observed preferences. This turns RLHF into a supervised learning problem with an alternative parameterization of the same objective. Variants such as SimPO and length-normalized DPO have been explored to improve performance, but empirical results vary significantly with the environment, base model, and data, highlighting how hard it is to generalize findings in RLHF research. The lecture also discusses common failure modes of RLHF, such as overoptimization, where models improve a proxy reward without corresponding gains in true human preference, and the tendency of RLHF-trained models to be less calibrated than supervised models.
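A minimal sketch of the DPO objective as described above, assuming the summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model have already been computed; the function and argument names are illustrative, not from the lecture:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO: the implicit reward is beta * log(pi / pi_ref); maximize the
    Bradley-Terry likelihood that the chosen response beats the rejected one."""
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin = preference cross-entropy loss.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

Because the reward is expressed as a policy ratio, no separate reward model or RL loop is needed; the loss is plain supervised learning on preference pairs.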
The discussion then shifts to policy optimization algorithms, which are foundational in reinforcement learning. Policy gradient methods optimize expected reward by sampling rollouts from the current policy and updating parameters along the policy gradient. Pure on-policy methods are inefficient, however, because they require fresh rollouts at every update. Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) address this by reusing a batch of rollouts for several updates: importance sampling corrects for the mismatch with the current policy, and a trust-region constraint (TRPO) or clipping of the importance ratio (PPO) keeps each update stable. PPO, the workhorse in practice, also requires maintaining a learned value function for variance reduction and involves many implementation details, making it challenging to deploy effectively. The lecture highlights this complexity, including reward shaping and generalized advantage estimation, which are necessary for stable training but add to the algorithm's intricacy.
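As a concrete reference for the clipped update, here is a minimal sketch of PPO's surrogate loss, assuming log-probabilities and advantages are already computed (in full PPO the advantages would come from generalized advantage estimation with the learned value function); the names are illustrative:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective: reuse a batch of old rollouts for several
    gradient steps, with the importance ratio clipped to keep updates small."""
    ratio = torch.exp(logp_new - logp_old)                       # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                 # maximize the surrogate
```

The clipping is what permits limited off-policy reuse of rollouts without the update drifting too far from the policy that generated them.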
To simplify PPO, the lecture introduces Group Relative Policy Optimization (GRPO), which removes the need for a value function and generalized advantage estimation. GRPO instead reduces variance with a simple baseline, the mean reward within a group of responses to the same prompt, and normalizes the resulting advantages by the standard deviation of rewards in the group. This fits language model RL naturally, since the multiple candidate responses sampled per prompt form a group. While GRPO is simpler and more efficient, dividing by the group standard deviation and length-normalizing the per-response loss raise theoretical concerns, since both can bias training toward overly long or short responses. Empirical studies suggest that removing these biases leads to more stable training and better control over response length.
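A sketch of the group-relative advantage computation described above, assuming one tensor of scalar rewards per prompt's group of sampled responses; names and the optional flag are illustrative:

```python
import torch

def grpo_advantages(group_rewards, eps=1e-6, normalize_std=True):
    """GRPO baseline: subtract the group mean reward; optionally divide by the
    group standard deviation (the term some follow-up work argues should be
    dropped to avoid bias)."""
    advantages = group_rewards - group_rewards.mean()
    if normalize_std:
        advantages = advantages / (group_rewards.std() + eps)
    return advantages
```

Every token of a response then shares that response's scalar advantage in the policy-gradient update, so no value network is needed.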
The lecture then reviews three recent influential papers applying RL with verifiable rewards to train reasoning models: DeepSeek-R1 (whose GRPO recipe originates from DeepSeekMath), Kimi k1.5, and Qwen 3. DeepSeek-R1 demonstrates that outcome-based rewards with GRPO can reach OpenAI o1-level reasoning performance using a simple RL recipe, without complex search or process reward models. It also emphasizes the importance of supervised fine-tuning (SFT) on chain-of-thought data before RL to improve interpretability and performance. Kimi k1.5 adopts a similar approach but uses a different RL algorithm closer in spirit to DPO, incorporates curriculum learning by sampling problems according to difficulty, and introduces a length reward to encourage concise reasoning chains. The authors also address practical infrastructure challenges in RL training, such as synchronizing model weights between training and inference workers.
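The lecture does not spell out the length reward exactly, but a plausible group-relative formulation in that spirit (the coefficients and the gating on correctness are assumptions here) looks like this:

```python
def length_reward(length, min_len, max_len, correct):
    """Reward shorter responses within a sampled group; never reward an
    incorrect response merely for being short. (Illustrative sketch.)"""
    if max_len == min_len:
        return 0.0
    lam = 0.5 - (length - min_len) / (max_len - min_len)   # in [-0.5, 0.5]
    return lam if correct else min(0.0, lam)
```

Adding such a term to the verifiable outcome reward nudges the model toward concise chains of thought without rewarding brevity on wrong answers.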
Finally, Qwen 3 builds on these works by introducing “thinking mode fusion,” which lets a single model switch between answering with and without a chain of thought, giving a controllable inference-time trade-off between reasoning depth and cost. Qwen 3 also shows that effective RL fine-tuning can be achieved with surprisingly few examples, highlighting the sample efficiency of RL with verifiable rewards. Across all three papers, the common theme is leveraging verifiable, outcome-based rewards in narrow domains to avoid reward hacking and enable stable, scalable RL training for language models. The lecture concludes by emphasizing that while RLHF remains challenging due to noisy human feedback, RL with verifiable rewards offers a promising path forward for training advanced reasoning models.
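For concreteness, the inference-time toggle looks roughly like the following, based on the publicly documented Qwen 3 chat template; the `enable_thinking` keyword and the model name are taken from that documentation rather than from the lecture and should be treated as assumptions:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
messages = [{"role": "user", "content": "What is 17 * 23?"}]

# Thinking mode: the template opens a chain-of-thought block before the answer.
with_cot = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)

# Non-thinking mode: same weights, direct answer, cheaper inference.
without_cot = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
```

The same fused model serves both modes, which is the inference-time trade-off the lecture highlights.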