Introduction

TL;DR

  • The 2025 paper from Tsinghua University rigorously demonstrates that reinforcement learning (RL) with verifiable rewards (RLVR) increases the efficiency of sampling correct answers, but does not add new reasoning behaviors to language models.
  • In pass@k evaluation (whether at least one of k samples is correct), RLVR-tuned models excel at low k but underperform base models at high k, indicating no expansion in reasoning capacity.
  • All correct outputs from RL-trained models were already present in the base model’s distribution.
  • The implication: Surpassing LLM reasoning limits likely requires a fundamentally new paradigm rather than more RL.

RL and Reasoning: What the Study Says

Tsinghua University researchers systematically evaluated RLVR across math, coding, and visual reasoning benchmarks and multiple LLM families. Their key assertion: RLVR “sharpens” the output distribution, increasing sampling efficiency without expanding the actual set of correct reasoning paths.[1][2][3][4][5][6]

  • RLVR automates reward signals in tasks with verifiable correctness, such as math or programming
  • On benchmarks such as GSM8K, MATH500, and LiveCodeBench, they compared base and RLVR-tuned models using pass@k metrics (a minimal pass@k sketch follows this list)
  • RL significantly improved single-shot accuracy (pass@1), but at higher k values the base model solved more distinct problems
  • Manual inspection confirmed that RLVR-tuned models produced no solutions the base model could not also generate given enough samples
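
To make the pass@k metric concrete, here is a minimal sketch of the standard unbiased estimator popularized by the HumanEval/Codex evaluation; the function name and the example numbers are illustrative assumptions, not taken from the paper's code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    from n generations is correct, given c of the n generations are correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: with 16 correct generations out of 256, single-shot
# accuracy is modest, but coverage saturates once k is large enough.
print(pass_at_k(n=256, c=16, k=1))    # 0.0625
print(pass_at_k(n=256, c=16, k=128))  # ~1.0
```

The contrast between the two calls mirrors the paper's headline result: sharpening the distribution helps at k = 1, but the advantage disappears once enough samples are drawn.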

Why it matters:
This deflates the view that RL-fine-tuned LLMs truly “learn to reason” beyond pretraining—instead, RL simply reweights existing strategies.

Experiments and Findings

RLVR appears helpful for optimizing known strategies: tuned models sample reward-yielding paths more frequently, but they do not develop fundamentally novel problem-solving approaches.
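
To ground the term “verifiable reward,” below is a minimal sketch of the kind of binary, rule-based reward such pipelines typically use for math-style tasks; the "#### answer" extraction convention (borrowed from GSM8K's answer format) and the function name are assumptions for illustration, not the paper's exact setup.

```python
def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the final extracted answer matches the reference
    exactly, else 0.0. No learned reward model is involved."""
    marker = "####"  # assumed answer delimiter, as in GSM8K-style data
    if marker not in model_output:
        return 0.0
    predicted = model_output.rsplit(marker, 1)[-1].strip()
    return 1.0 if predicted == ground_truth.strip() else 0.0

# A chain of thought ending in the reference answer earns 1.0; anything else
# earns 0.0, which is what makes the signal "verifiable."
print(verifiable_reward("3 * 14 = 42, so ... #### 42", "42"))  # 1.0
print(verifiable_reward("3 * 13 = 39, so ... #### 39", "42"))  # 0.0
```

Because such a reward can only reweight outputs the model already samples, it raises the frequency of known solution paths rather than injecting behaviors absent from the base distribution.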

  • In-depth experiments across math benchmarks (e.g., AIME24, GSM8K), coding (HumanEval+, MBPP+), and visual reasoning show that, given sufficient sampling, base models can solve the problems that RLVR-tuned models solve (a toy coverage check follows this list)
  • RL-fine-tuned models were more deterministic but less diverse, narrowing their reasoning boundary
  • Visual reasoning mirrored math and code: RLVR improved efficiency but not scope
  • Distillation, unlike RLVR, sometimes added new solution types
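
The coverage argument behind these bullets reduces to a set comparison: collect, for each model, the problems solved at least once within a sampling budget k, then check whether the RL-tuned model's solved set stays inside the base model's. The helper and toy data below are hypothetical, sketching the comparison rather than reproducing the paper's evaluation harness.

```python
from typing import Dict, List, Set

def solved_within_k(samples: Dict[str, List[bool]], k: int) -> Set[str]:
    """Problem IDs solved at least once within the first k samples."""
    return {pid for pid, flags in samples.items() if any(flags[:k])}

# Hypothetical per-sample correctness flags (True = verified correct answer).
base_samples = {"p1": [False, True, True], "p2": [True, False, False], "p3": [False, False, True]}
rlvr_samples = {"p1": [True,  True, True], "p2": [True, True,  True],  "p3": [False, False, False]}

# RLVR is ahead at k=1 (higher sampling efficiency) ...
print(solved_within_k(rlvr_samples, 1) >= solved_within_k(base_samples, 1))  # True
# ... but at larger k the base model's coverage is a superset (wider boundary).
print(solved_within_k(rlvr_samples, 3) <= solved_within_k(base_samples, 3))  # True
```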

Why it matters:
The results reveal that fine-tuned LLMs are not inherently “smarter”—they just produce the same set of solutions more efficiently, with potential loss in creative exploration.

Implications and Future Directions

  • RLVR, as currently applied, does not break the base model’s ceiling in reasoning
  • Success in practical benchmarks may mask a reduction in solution diversity and problem-solving range
  • Fundamental advances in LLM reasoning will likely hinge on paradigms beyond simple RL training
  • Whether scaling model size, training data, or the RL environment itself can lift this ceiling remains an open research question

Why it matters:
The study challenges the assumption that RL is enough to advance LLM reasoning, underscoring the need for novel learning frameworks.

Conclusion

As of late 2025, RLVR systematically boosts single-sample accuracy but does not truly push the boundaries of LLM reasoning capability. Care should be taken not to conflate more efficient sampling with genuine conceptual expansion.

Key takeaways:

  • RLVR increases sampling efficiency but not reasoning diversity.
  • All RLVR-correct solutions are within base model capacity.
  • RLVR-tuned models have narrower reasoning boundaries at large k.
  • Future LLM breakthroughs will likely require new training paradigms.
  • Findings were consistent across model families, benchmarks, and RL algorithms.

Summary

  • RLVR boosts sampling efficiency but does not expand reasoning capacity beyond the base model
  • Pass@k evaluation reveals RLVR-tuned models excel at low k but underperform base models at high k
  • All correct RLVR outputs were already in the base model’s distribution
  • Future reasoning advances will likely require fundamentally new paradigms beyond current RL approaches

#ReinforcementLearning #LLM #Reasoning #AIResearch #RLVR #Benchmarking #OpenAI #DeepSeek #AILimits #2025

References