Reinforcement learning (RL) has recently emerged as the favored path to boost large language models’ reasoning abilities. The latest headline-grabbing claim? That even random or incorrect reward signals can help models like Qwen2.5 become better reasoners.
But a new paper, “Reasoning or Memorization?”, cuts through the hype with scalpel-like precision. It argues that what looked like emergent reasoning in Qwen2.5 may in fact be a textbook case of data contamination. If so, the implications are serious: much of what we thought we knew about RL-driven reasoning gains could be little more than sophisticated memory retrieval.
## The Setup: Qwen2.5, RLVR, and Surprising Gains
The Qwen2.5 family of models, especially the math-tuned variants, has delivered impressive scores on benchmarks like MATH-500, AIME, and AMC. Even more remarkably, some RL protocols using random or spurious reward signals seemed to boost performance further, contradicting the long-held assumption that RL requires accurate feedback.
That mystery led the authors to investigate two hypotheses:
- Qwen2.5 just has better baseline math skills, and even bad reward signals act as regularization.
- Data contamination—Qwen2.5 may have memorized benchmark solutions during pretraining on large-scale web corpora.
Only one of these explanations held up under scrutiny.
## The Contamination Test: Partial-Prompt Completion
To detect leakage, the authors applied a clever trick: feed the model just the first 40-60% of benchmark questions and check how well it completes the rest. The result? Qwen2.5 could reconstruct the remaining text and still arrive at the correct answer over 50% of the time on MATH-500. LLaMA-3, by contrast, scored under 5%.
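Here is a minimal sketch of that probe, assuming a Hugging Face causal LM, a character-level 60% cut, greedy decoding, and a simple prefix-based exact-match check; the paper's exact truncation and matching rules may differ:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-Math-7B"  # or "meta-llama/Llama-3.1-8B" for the baseline
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

def completes_benchmark_item(question: str, keep_ratio: float = 0.6) -> bool:
    """Show the model only the first `keep_ratio` of a benchmark question and
    test whether it reproduces the withheld remainder verbatim."""
    cut = int(len(question) * keep_ratio)
    prefix, withheld = question[:cut], question[cut:]
    inputs = tok(prefix, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    continuation = tok.decode(out[0][inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)
    return continuation.strip().startswith(withheld.strip())
```

A model that has merely seen the benchmark during pretraining will pass this check far more often than one that has to reason its way to the missing text.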
The smoking gun came from LiveMathBench, a newer benchmark compiled after Qwen2.5’s release. Here, Qwen’s partial completion and answer accuracy plummeted to near-zero—right alongside LLaMA’s.
| Benchmark | Qwen2.5-Math-7B (EM, 60% prompt) | LLaMA3.1-8B (EM, 60% prompt) |
|---|---|---|
| MATH-500 | 54.6% | 3.8% |
| LiveMathBench | 0.0% | 0.0% |
These figures suggest that much of Qwen’s earlier success on math benchmarks stems not from reasoning, but from rote memory.
## RandomCalculation: A Clean Slate
To go further, the authors created RandomCalculation, a synthetic dataset composed of randomly generated multi-step arithmetic problems guaranteed to be out-of-distribution.
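To illustrate the idea (not the authors' exact generator), a dataset like this could be produced with a few lines of code; the operators, operand ranges, and step count below are placeholder choices:

```python
import random
import operator

OPS = [("+", operator.add), ("-", operator.sub), ("*", operator.mul)]

def random_problem(n_steps: int = 4, lo: int = 1, hi: int = 99):
    """Build a random multi-step arithmetic expression and its exact answer."""
    expr = str(random.randint(lo, hi))
    value = int(expr)
    for _ in range(n_steps):
        sym, fn = random.choice(OPS)
        operand = random.randint(lo, hi)
        expr = f"({expr} {sym} {operand})"
        value = fn(value, operand)
    return f"Compute {expr}.", value
```

Because every problem is freshly sampled, none of it can have appeared in any pretraining corpus.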
On this clean slate, the story flipped: when trained using accurate reward signals (e.g., numerical closeness to the correct answer), Qwen2.5 gradually improved, surpassing even its own Max@16 zero-shot upper bound. But with random, inverted, or noisy rewards? Performance became unstable or collapsed entirely.
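The contrasted reward regimes can be sketched as simple scoring functions; the exponential-closeness shape below is an illustrative choice, not necessarily the paper's exact formulation:

```python
import math
import random

def closeness_reward(pred: float, target: float) -> float:
    """Accurate signal: higher when the prediction is numerically closer."""
    return math.exp(-abs(pred - target))

def random_reward(pred: float, target: float) -> float:
    """Spurious signal: ignores the answer entirely."""
    return random.random()

def inverted_reward(pred: float, target: float) -> float:
    """Adversarial signal: rewards being far from the truth."""
    return 1.0 - closeness_reward(pred, target)
```

Only the first function ties the learning signal to correctness; on an uncontaminated benchmark, the other two give the policy nothing useful to climb.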
This demolishes the claim that Qwen2.5’s success under spurious reward was due to inherent reasoning strength. Instead, it was contaminated data giving the illusion of progress.
## Implications: A Crisis in Benchmarking?
This study is more than a Qwen exposé. It’s a warning for the entire RL-for-LLM field:
- Benchmarks can lie. If your model memorized the test during pretraining, reward signals become irrelevant.
- Reward-fidelity matters. On leakage-free datasets, only accurate feedback leads to improvement.
- Multi-model comparison is essential. The LLaMA baseline helped surface that Qwen’s advantage was not universal.
We’ve seen similar crises before—in computer vision, in adversarial examples, and in the early days of reading comprehension. This one strikes at the core of LLM evaluation: are we testing reasoning, or regurgitation?
## Where Do We Go from Here?
The authors offer two practical solutions:
- Synthetic benchmark construction, like RandomCalculation, to ensure test-time purity.
- Reward sensitivity audits, testing whether models truly learn or merely reinforce their biases (a minimal sketch follows this list).
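As a hypothetical harness for the second recommendation: `train_and_eval` below is a stand-in for your own RLVR fine-tuning and evaluation loop, not a real API; only the comparison logic is shown.

```python
def reward_sensitivity_audit(train_and_eval, reward_fns: dict, tolerance: float = 0.02):
    """Re-run the same RL recipe under different reward regimes and flag cases
    where spurious rewards 'improve' accuracy, a contamination red flag."""
    baseline = train_and_eval(reward_fn=None)  # no RL fine-tuning at all
    results = {}
    for name, fn in reward_fns.items():
        score = train_and_eval(reward_fn=fn)
        results[name] = score
        if name != "accurate" and score - baseline > tolerance:
            print(f"WARNING: spurious reward '{name}' lifted accuracy "
                  f"{baseline:.3f} -> {score:.3f}; suspect benchmark leakage.")
    return results

# Example wiring (reusing the reward functions sketched earlier):
# reward_sensitivity_audit(train_and_eval,
#     {"accurate": closeness_reward, "random": random_reward, "inverted": inverted_reward})
```

If a random or inverted reward produces gains comparable to the accurate one, the benchmark, not the model, is probably doing the work.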
For those deploying RL to align reasoning behavior—especially in finance, law, or science—this study is a reminder to trust, but verify.
Cognaptus: Automate the Present, Incubate the Future.