Stanford CS224R Deep Reinforcement Learning | Spring 2025 | Lecture 9: RL for LLMs

The lecture explores methods for aligning large language models with human preferences, highlighting the limitations of instruction fine-tuning and the advantages of Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) in improving model behavior. It also discusses ongoing challenges such as reward hacking and noisy feedback, emphasizing the evolving nature of techniques to create effective, human-aligned AI assistants.

The lecture begins by discussing the evolution of large language models (LLMs), highlighting the trend of increasing model size and computational power over the past decade. While pre-training LLMs on vast internet text corpora enables them to predict the next token in a sequence, this process imparts not only language understanding but also some grasp of world knowledge and reasoning. However, pre-training alone does not suffice to create effective chatbots or assistants, as these require models to follow instructions and align with human preferences, which is where fine-tuning and reinforcement learning techniques come into play.
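The pre-training objective described above — predicting the next token — is just cross-entropy on the model's output distribution at each position. A minimal numeric sketch (toy vocabulary and probabilities chosen for illustration, not from the lecture):

```python
import math

def next_token_loss(probs, target_id):
    """Cross-entropy for a single next-token prediction.
    Pre-training minimizes the average of this loss over
    every position in the training corpus."""
    return -math.log(probs[target_id])

# Toy vocabulary of 4 tokens: the model assigns these probabilities
# to the next token, and the true next token has id 2.
probs = [0.1, 0.2, 0.6, 0.1]
loss = next_token_loss(probs, 2)  # small when the model puts mass on the truth
```

The loss shrinks as the model concentrates probability on the observed token, which is why scale and data volume translate directly into better predictions.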

Instruction fine-tuning is introduced as a method to adapt pre-trained models to specific tasks by training them on pairs of instructions and desired outputs. This approach improves the model’s ability to generalize and respond appropriately to a wide range of queries, especially as model size increases. Despite its effectiveness, instruction fine-tuning has limitations: it requires expensive labeled data, struggles with tasks lacking a single correct answer (such as creative writing), and treats all errors equally without nuanced feedback. Moreover, the fine-tuning objective may not perfectly align with the goal of assisting users, leading to a mismatch between training and deployment.
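Mechanically, instruction fine-tuning reuses the same next-token loss, but averages it only over the response tokens so the model is trained to produce the answer, not to reproduce the instruction. A sketch under that assumption (the masking convention is common practice; the specific numbers are hypothetical):

```python
def sft_loss(token_logprobs, loss_mask):
    """Supervised fine-tuning loss: average negative log-likelihood
    over response tokens only (mask = 1); instruction tokens (mask = 0)
    are conditioned on but contribute no loss."""
    masked = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    return sum(masked) / len(masked)

# Hypothetical per-token log-probs for "instruction + response";
# the first three tokens are the instruction and are masked out.
logprobs = [-0.5, -1.2, -0.3, -0.7, -0.9]
mask     = [0,    0,    0,    1,    1]
loss = sft_loss(logprobs, mask)  # averages only the last two tokens
```

Note that this loss treats every wrong token identically, which is exactly the "all errors are equal" limitation the lecture raises.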

To address these challenges, the lecture delves into Reinforcement Learning from Human Feedback (RLHF), which optimizes models based on human preferences rather than fixed labels. RLHF involves training a reward model to predict human preferences between different outputs and then using policy gradient methods to optimize the language model to maximize expected reward. Because policy gradients require only sampling from the model, not differentiating through generated text, this approach can optimize the otherwise non-differentiable preference signal; a KL divergence penalty is added to prevent the model from drifting too far from its initial behavior. RLHF has demonstrated significant improvements, sometimes producing outputs preferred over human-written ones, but it is computationally intensive and complex to implement.
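The two ingredients above can be sketched numerically: the reward model is typically trained with a Bradley-Terry (logistic) loss on preference pairs, and the policy is then optimized against the reward minus a KL penalty toward the reference model. This is a toy per-example sketch of those standard objectives, not the lecture's exact implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry loss: train the reward model so that
    sigmoid(r_chosen - r_rejected) matches the human preference."""
    return -math.log(sigmoid(r_chosen - r_rejected))

def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Per-sample RLHF objective: reward minus a KL penalty
    (estimated here as the log-prob gap on the sampled response)
    that keeps the policy near the pre-RL reference model."""
    return reward - beta * (logp_policy - logp_ref)

# If the reward model cannot separate the pair, its loss is log 2.
tied_loss = reward_model_loss(1.0, 1.0)
# A high reward is discounted when the policy has drifted from the reference.
effective = kl_penalized_reward(2.0, logp_policy=-1.0, logp_ref=-1.5)
```

The `beta` coefficient trades off reward maximization against staying close to the initial model; setting it too low invites the reward hacking discussed later.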

An alternative to RLHF, Direct Preference Optimization (DPO), is presented as a simpler and more efficient method. DPO leverages a theoretical connection between the reward model and the language model’s policy, allowing direct optimization of the language model parameters using preference data without an explicit RL loop. By framing the problem as a binary classification task on preference pairs, DPO avoids costly sampling and complex optimization steps while still incorporating a KL constraint to maintain model stability. Empirical results show that DPO performs comparably to RLHF, making it popular in open-source communities for aligning LLMs with human preferences.
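The binary-classification framing of DPO reduces to a single closed-form loss per preference pair: push up the policy's margin on the chosen response over the rejected one, measured relative to the reference model. A minimal sketch of that standard loss (toy log-probabilities, `beta` as in the KL-constrained objective):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss on one preference pair: a logistic loss on the
    policy's preference margin relative to the reference model.
    No sampling or explicit reward model is needed."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Before training, policy == reference, the margin is 0,
# and the loss sits at log 2; gradient steps then widen the margin.
initial = dpo_loss(-1.0, -2.0, ref_logp_chosen=-1.0, ref_logp_rejected=-2.0)
```

Because the loss depends only on log-probabilities the model already computes, each update costs about as much as a supervised fine-tuning step, which is the efficiency gain over the full RLHF loop.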

The lecture concludes by reflecting on the broader implications and ongoing challenges in aligning LLMs with human intent. Issues such as reward hacking, noisy and inconsistent human preferences, and the difficulty of capturing long-term user satisfaction remain open problems. The field is rapidly evolving, with efforts to improve feedback collection, incorporate multi-faceted evaluation metrics, and personalize models to individual users. Despite these challenges, techniques like RLHF and DPO have become foundational in developing effective, human-aligned chatbots and assistants, marking significant progress in the deployment of large language models.