Recent research reveals that reinforcement learning from verifiable rewards (RLVR) in large language models primarily amplifies existing reasoning abilities rather than fostering new problem-solving strategies, with fine-tuning affecting only a small subset of model parameters. Additionally, some apparent successes of RLVR may stem from spurious reward signals specific to certain models, underscoring the need for rigorous testing across diverse architectures and a focus on refining rather than expanding model capabilities.
The recent developments in reinforcement learning (RL) applied to large language models (LLMs) have taken an unexpected turn, challenging earlier optimism about RL's ability to help models discover new reasoning strategies. Traditionally, after pre-training a language model to predict the next token, RL from human feedback (RLHF) is used to fine-tune the model into an instruction-following chatbot by optimizing responses against human preferences. However, this method relies on subjective reward models and limited labeled data. A newer approach, reinforcement learning from verifiable rewards (RLVR), has gained traction by using deterministic, binary feedback to evaluate correctness, making it especially suitable for domains like math and coding where answers can be objectively checked. RLVR often employs group relative policy optimization (GRPO), which samples several responses to the same prompt and scores each one relative to the group's average reward, removing the need for a separate value model and enabling stable, efficient fine-tuning.
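The group-relative scoring at the heart of GRPO can be sketched in a few lines: rewards for a batch of responses to one prompt are normalized by the group's own mean and standard deviation, so each sample's advantage is defined relative to its siblings rather than by a learned value function. This is an illustrative sketch, not DeepSeek's implementation; the function name and the toy rewards are assumptions for the example.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled response's
    reward against the mean and std of its own group, so a response is
    judged only relative to its siblings for the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    std = rewards.std()
    if std == 0:
        # All samples scored the same: no learning signal for this group.
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / std

# Four sampled answers to one verifiable math problem (1 = correct):
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # correct answers get +1.0, wrong get -1.0
```

In the full algorithm these advantages weight a clipped policy-gradient update, much as in PPO, but the advantage estimate itself needs no critic network.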
One of the most exciting early findings with RLVR, particularly demonstrated by DeepSeek R1, was that models seemed to develop self-reflective problem-solving strategies and longer reasoning chains, suggesting that RLVR could foster novel reasoning capabilities. However, subsequent research has cast doubt on these claims. Studies have shown that RLVR primarily amplifies existing knowledge and reasoning paths within the base model rather than creating new ones. For example, longer responses generated during RLVR training were found to be a side effect of the optimization process rather than a direct cause of improved accuracy. Moreover, large-scale sampling experiments revealed that base models without RLVR could match or outperform their RLVR-trained counterparts when given enough attempts (that is, at high pass@k), indicating that RLVR sharpens but does not expand the model's underlying reasoning abilities.
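The sampling experiments behind this claim typically report pass@k: the probability that at least one of k sampled answers is correct. The standard unbiased estimator makes the point concrete; the numbers below are illustrative, not drawn from any specific study.

```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k answers drawn
    (without replacement) from n total samples, c of them correct, is right."""
    if n - c < k:
        return 1.0  # too few wrong samples to fill k draws; a correct one is guaranteed
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Illustrative: a base model that solves a problem in only 5 of 200 samples
# still approaches certainty when allowed many attempts.
print(pass_at_k(200, 5, 1))    # 0.025
print(pass_at_k(200, 5, 128))  # > 0.99
```

This is why the comparison matters: an RLVR model that wins at pass@1 can still lose to its own base model at pass@256, which is exactly the "sharpening, not expanding" pattern described above.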
Further investigations into the mechanics of RL fine-tuning revealed that only a small subset of the model's parameters—often between 5% and 30%—are actually updated during RL training, with the rest remaining untouched. This sparsity is not random but targeted, focusing on specific network components that influence output probabilities. This finding supports the idea that RL fine-tuning acts more like a selective amplifier of certain behaviors already encoded during pre-training than a comprehensive rewiring of the model's reasoning processes. In contrast, supervised fine-tuning tends to update a much larger portion of the model's parameters, suggesting a fundamentally different mode of learning.
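Measuring this update sparsity is conceptually simple: compare the base and fine-tuned checkpoints parameter by parameter and count what fraction changed. A minimal sketch, using plain NumPy arrays to stand in for real checkpoint tensors (the helper name and exact-match tolerance are assumptions, not the papers' methodology):

```python
import numpy as np

def update_sparsity(base_params, tuned_params, tol=0.0):
    """Fraction of parameters whose values differ between a base
    checkpoint and its fine-tuned counterpart (dicts of name -> array)."""
    changed = total = 0
    for name, base in base_params.items():
        changed += int((np.abs(tuned_params[name] - base) > tol).sum())
        total += base.size
    return changed / total

# Toy checkpoint: RL-style training that touched only 2 of 10 weights.
base = {"layer.w": np.zeros(10)}
tuned = {"layer.w": np.zeros(10)}
tuned["layer.w"][:2] = 0.1
print(update_sparsity(base, tuned))  # 0.2
```

On real checkpoints one would iterate over every named tensor in the state dict the same way; the reported 5–30% figures are the aggregate of exactly this kind of count.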
A particularly striking revelation came from research showing that RLVR's apparent success in some models, especially the Qwen series, may be due to spurious reward signals rather than genuine learning. Experiments demonstrated that even random or incorrect reward labels could improve performance in these models, likely because Qwen models have a strong internal heuristic for code reasoning that RLVR inadvertently amplifies regardless of reward quality. This phenomenon does not generalize to other model families like Llama 3, where nonsensical rewards degrade performance. This highlights the critical need to test RL methods across diverse models to avoid misleading conclusions based on idiosyncratic behaviors of specific architectures.
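The reward ablations in this line of work are easy to state in code: swap the verifier's signal for a coin flip or an inverted label and run the same RLVR loop. A hedged sketch of the reward schemes being compared (the function and mode names are illustrative, not taken from the paper):

```python
import random

def assign_reward(is_correct, mode="verifiable", rng=random):
    """Reward schemes compared in spurious-reward ablations:
    'verifiable' is the standard RLVR signal, 'random' ignores
    correctness entirely, 'inverted' rewards only wrong answers."""
    if mode == "verifiable":
        return 1.0 if is_correct else 0.0
    if mode == "random":
        return 1.0 if rng.random() < 0.5 else 0.0
    if mode == "inverted":
        return 0.0 if is_correct else 1.0
    raise ValueError(f"unknown mode: {mode}")

print(assign_reward(True))              # 1.0
print(assign_reward(True, "inverted"))  # 0.0
```

The surprising finding is that on Qwen models even the "random" and "inverted" schemes improved benchmark scores, while on Llama 3 they hurt—evidence that a pre-existing heuristic, not the reward signal, was doing the teaching.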
In summary, while RLVR initially promised a new era of reasoning discovery in LLMs, current evidence suggests it mainly enhances existing capabilities rather than creating new ones. The unique strengths of pre-training remain central, and RL’s role may be more about refining and focusing model behavior than expanding its fundamental knowledge. Future progress likely depends on scaling up compute, diversifying training environments, improving reward design, and enhancing exploration strategies to push beyond the base model’s latent abilities. Meanwhile, techniques like distillation show promise for transferring and compressing reasoning skills across models. Overall, the field is recalibrating its expectations for RL in LLMs, emphasizing rigorous validation and generalization across architectures.