Memory Games: The Data Contamination Crisis in Reinforcement Learning

TL;DR for operators

A model that improves after training on random rewards has not necessarily discovered a secret route to reasoning. It may simply be remembering the exam.

The paper behind this article investigates a strange result in reinforcement learning for large language models: Qwen2.5 models appeared to improve on public math benchmarks even when the reward signal was random, inverted, or based on wrong majority-voted answers.¹ That sounds exciting, in the same way that a finance team “beating forecast” after seeing next quarter’s numbers is exciting. Technically impressive, commercially dangerous, and not something one should build governance around.

The authors argue that the effect is largely explained by data contamination. Qwen2.5 appears able to reconstruct parts of public benchmark questions and recover correct answers from incomplete prompts. On MATH-500, for example, Qwen2.5-Math-7B reconstructs the missing continuation with 54.60% exact match when given only the first 60% of the problem, and still reaches 39.20% exact match when given only the first 40%. In partial-prompt answer accuracy, the same model reaches 63.8% on MATH-500 with 80% of the question and 41.2% with only 40%. That is less “reasoning under uncertainty” and more “the model has seen the worksheet.”

To test whether Qwen’s strong mathematical capacity alone explains the random-reward gains, the authors build RandomCalculation, a clean benchmark of generated arithmetic expressions with 1 to 20 computation steps. On this benchmark, the magic fades. Correct reward signals produce steady improvements. Random and incorrect rewards become unstable, marginal, or destructive. Inverted rewards rapidly degrade performance.

For business users, the lesson is not “avoid Qwen” or “RL is fake.” The useful lesson is narrower and more practical: public benchmark gains are weak evidence unless the benchmark is clean, the evaluation is replicated across model families, the base-model ceiling is understood, and the improvement transfers to fresh private tasks. Benchmark theatre is still theatre, even when the actors wear lab coats.

The suspicious result is not the conclusion; it is the crime scene

The puzzle begins with a result that sounds almost too convenient. Reinforcement learning with verifiable rewards, or RLVR, normally depends on a reward function that can tell correct from incorrect answers. In math tasks, this is attractive because final answers can often be checked mechanically. A correct answer receives reward; an incorrect answer does not. Elegant, cheap, and less dependent on a separate reward model.

But recent work had reported something stranger: Qwen2.5-Math-7B could improve on standard math benchmarks even under spurious rewards. The paper tests three such reward variants:

Reward signal	What it does	Why it should be suspicious
Random reward	Assigns reward independently of correctness	Should not consistently teach mathematical reasoning
Inverted reward	Rewards incorrect answers and penalises correct ones	Should actively damage reasoning if the reward is followed
Majority-voted incorrect reward	Rewards agreement with an incorrect answer selected from model outputs	May teach consistency with wrong outputs, not correctness
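The reward variants in the table can be written down in a few lines, which makes it easier to see why none of the spurious ones should teach mathematics. The following is a minimal sketch, not the paper's implementation: the function names, the 0.5 reward probability, and the rule for picking the "majority-voted incorrect" target (the most frequent wrong answer among the samples) are all assumptions made for illustration.

```python
import random
from collections import Counter


def correct_reward(answer: str, truth: str) -> float:
    """Standard verifiable reward: 1 for a correct final answer, else 0."""
    return 1.0 if answer == truth else 0.0


def random_reward(answer: str, truth: str, rng: random.Random, p: float = 0.5) -> float:
    """Ignores both arguments: reward is a coin flip (p is an assumed value)."""
    return 1.0 if rng.random() < p else 0.0


def inverted_reward(answer: str, truth: str) -> float:
    """Rewards incorrect answers and gives nothing to correct ones."""
    return 1.0 - correct_reward(answer, truth)


def majority_incorrect_rewards(answers: list[str], truth: str) -> list[float]:
    """Reward agreement with the most common *wrong* answer in a sample group.

    Assumption: if every sampled answer is correct, no reward is given.
    """
    wrong = [a for a in answers if a != truth]
    if not wrong:
        return [0.0] * len(answers)
    target, _ = Counter(wrong).most_common(1)[0]
    return [1.0 if a == target else 0.0 for a in answers]
```

Only `correct_reward` carries information about the right answer. The other three either ignore it or point away from it, which is why gains under them deserve an audit rather than applause.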

On MATH-500, random reward and majority-voted incorrect reward still noticeably boost Qwen2.5-Math-7B. Llama3.1-8B-Instruct does not show the same durable benefit and may degrade. That model-family asymmetry matters. If random reward were discovering some general hidden reasoning pathway, one would expect it to generalise more cleanly across comparable model families. It does not.

The authors also find a more mundane confound: prompt formatting. Applying the official chat template substantially degrades performance for the Qwen base models. Some of the apparent RL gains therefore look like recovery from a poor starting configuration rather than new reasoning skill. The model learns how to behave under the template, and the curve starts looking better. Very convenient, but not the same thing as learning mathematics.

This is the paper’s first move: it refuses to treat the benchmark curve as self-explanatory. A rising accuracy line can mean ability, format adaptation, memory retrieval, or some untidy blend of all three. The paper then asks which explanation survives cleaner tests.

Partial prompts turn benchmark leakage into something measurable

Data contamination is often discussed vaguely, as if it were a smell. This paper makes it testable.

The authors introduce two diagnostics:

Diagnostic	Question it asks	What a high score suggests
Partial-prompt completion rate	Can the model reconstruct the missing part of a benchmark question from only its prefix?	The model may have memorised the problem text
Partial-prompt answer accuracy	Can the model produce the correct answer from an incomplete problem?	The model may be retrieving the answer rather than solving the visible problem

This is a useful operational move. It does not require access to the pretraining corpus, which most buyers and researchers do not have. It probes the deployed model directly. If a model can complete an exam question from a fragment and then supply the right answer, one does not need a philosophical debate about “emergent reasoning.” One needs a cleaner exam.
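Both diagnostics are simple enough to sketch. The version below is a hedged illustration rather than the authors' code: `generate` and `answer` are hypothetical callables standing in for whatever model interface you use, the prompt is truncated by character count (the paper's truncation unit is not specified here), and "exact match" is taken to mean equality after whitespace normalisation.

```python
from typing import Callable, Sequence


def _normalise(text: str) -> str:
    """Collapse whitespace so trivial spacing differences do not count."""
    return " ".join(text.split())


def split_prompt(problem: str, keep_ratio: float) -> tuple[str, str]:
    """Return (visible prefix, hidden remainder), cut by character count."""
    cut = int(len(problem) * keep_ratio)
    return problem[:cut], problem[cut:]


def completion_exact_match_rate(
    problems: Sequence[str], generate: Callable[[str], str], keep_ratio: float
) -> float:
    """Share of problems whose hidden remainder the model reproduces exactly."""
    if not problems:
        return 0.0
    hits = 0
    for problem in problems:
        prefix, hidden = split_prompt(problem, keep_ratio)
        if _normalise(generate(prefix)) == _normalise(hidden):
            hits += 1
    return hits / len(problems)


def partial_prompt_answer_accuracy(
    items: Sequence[tuple[str, str]], answer: Callable[[str], str], keep_ratio: float
) -> float:
    """Share of (problem, truth) pairs answered correctly from the prefix only."""
    if not items:
        return 0.0
    hits = 0
    for problem, truth in items:
        prefix, _ = split_prompt(problem, keep_ratio)
        if _normalise(answer(prefix)) == _normalise(truth):
            hits += 1
    return hits / len(items)


# Toy stand-in for a contaminated model: it has "seen" these two problems.
SEEN = {"What is 2 + 2 times 3?": "8", "Find x if 3x = 12.": "4"}


def memorising_generate(prefix: str) -> str:
    """Hypothetical model that completes any problem it has memorised."""
    for problem in SEEN:
        if problem.startswith(prefix):
            return problem[len(prefix):]
    return ""
```

Running the same functions at keep ratios of 0.8, 0.6, and 0.4 on both a public benchmark and a fresh one gives the contrast the paper relies on: a model that only scores well on the public set is showing familiarity, not skill.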

The numbers are difficult to wave away. On MATH-500, Qwen2.5-Math-7B shows 65.80% exact-match completion when given 80% of the problem, 54.60% with 60%, and 39.20% with 40%. Qwen2.5-7B, not even the math-specialised variant, also shows substantial exact-match completion on the same benchmark: 40.20% with 80%, 21.20% with 60%, and 8.20% with 40%.

The same pattern appears on AMC and AIME2024. It weakens sharply on newer or cleaner benchmarks such as AIME2025, MinervaMath, and LiveMathBench. On LiveMathBench, Qwen2.5-Math-7B’s exact-match completion falls to 0.00% at 60% and 40% prompts, and its partial-prompt answer accuracy drops close to Llama’s level. That contrast is the point. The model is not merely “good at math fragments.” It is unusually good at fragments from certain public benchmarks.

The paper also inspects generated outputs. In several examples, Qwen produces coherent reasoning chains and syntactically valid Python-like code, though the code is not actually executed. This matters because memorisation does not always look like copying. It can look like a polished solution path. The model may not be reproducing raw text; it may be reconstructing the familiar pattern of question, solution, and answer. Benchmark leakage can wear a very convincing suit.

The clean benchmark removes the magic trick

A good contamination argument needs more than a leakage probe. It needs a counterfactual: what happens when the model faces problems it could not have memorised?

The authors build RandomCalculation for exactly that purpose. The dataset generator creates arithmetic expressions using integers from 0 to 100, fractions, squares, cubes, and the four basic arithmetic operations. It generates 20 sub-datasets, from 1-step to 20-step calculations, with 1,000 unique problems per sub-dataset. The task is deliberately unglamorous: evaluate the expression step by step and return the final value.
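A generator in this spirit is easy to sketch. The code below is an illustration under stated assumptions, not the RandomCalculation source: it treats one "step" as one binary operation, nests expressions strictly left to right, draws denominators from 1 to 100, and uses exact rational arithmetic so the reference answer has no rounding error. The function names are hypothetical.

```python
import random
from fractions import Fraction


def random_operand(rng: random.Random) -> tuple[str, Fraction]:
    """Draw an integer, fraction, square, or cube built from integers 0..100."""
    kind = rng.choice(["int", "fraction", "square", "cube"])
    n = rng.randint(0, 100)
    if kind == "int":
        return str(n), Fraction(n)
    if kind == "fraction":
        d = rng.randint(1, 100)  # assumed range; avoids a zero denominator
        return f"({n}/{d})", Fraction(n, d)
    if kind == "square":
        return f"{n}^2", Fraction(n**2)
    return f"{n}^3", Fraction(n**3)


def generate_problem(steps: int, rng: random.Random) -> tuple[str, Fraction]:
    """Build an expression with `steps` binary operations and its exact value."""
    text, value = random_operand(rng)
    for _ in range(steps):
        op = rng.choice("+-*/")
        operand_text, operand_value = random_operand(rng)
        while op == "/" and operand_value == 0:  # never divide by zero
            operand_text, operand_value = random_operand(rng)
        if op == "+":
            value += operand_value
        elif op == "-":
            value -= operand_value
        elif op == "*":
            value *= operand_value
        else:
            value /= operand_value
        text = f"({text} {op} {operand_text})"
    return text, value


def build_subset(steps: int, size: int, seed: int) -> list[tuple[str, Fraction]]:
    """Collect `size` unique problems; assumes steps >= 1 so enough exist."""
    rng = random.Random(seed)
    seen: dict[str, Fraction] = {}
    while len(seen) < size:
        text, value = generate_problem(steps, rng)
        seen.setdefault(text, value)
    return list(seen.items())
```

Because every problem is produced from a seed at evaluation time, there is nothing for a pretraining corpus to have memorised. That is the entire design goal.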

This is not meant to be a grand benchmark for mathematical creativity. It is a controlled instrument. Its virtue is freshness and verifiability. Because the problems are generated after the model release, the benchmark is far less likely to be in the pretraining corpus. Because the answers are mechanically checkable, the reward signal can be made precise. In other words, it is boring in exactly the way an evaluation benchmark sometimes needs to be boring.

Zero-shot performance behaves as one would expect on a clean multi-step calculation task. As computation steps increase, performance declines. The model’s difficulty rises with the task’s actual complexity. That is already different from public benchmark fragments where the model can recover answers from partial prompts.

The authors then run RLVR on 5-step and 10-step RandomCalculation subsets, using 700 training problems and 300 validation problems for each. Since exact binary reward is too sparse for high-precision decimal answers, they design a continuous reward between 0 and 1 that penalises absolute and relative error. This is an implementation detail with interpretive importance: the clean task is not set up to make RL fail trivially. The reward is shaped so that learning has a reasonable signal.
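To make the idea of a continuous reward concrete, here is one plausible shape for it. This is a sketch under assumptions, not a transcription of the paper's formula: the equal weighting of the two error terms, the cap of each term at 1, and the `eps` stabiliser are choices made here for illustration.

```python
def continuous_reward(predicted: float, target: float, eps: float = 1e-6) -> float:
    """Reward in [0, 1] that shrinks with both absolute and relative error.

    Assumed form: each error term is capped at 1 and contributes half
    of the total penalty, so an exact answer scores 1 and a wildly wrong
    answer scores close to 0.
    """
    abs_err = abs(predicted - target)
    rel_err = abs_err / (abs(target) + eps)
    return 1.0 - 0.5 * min(abs_err, 1.0) - 0.5 * min(rel_err, 1.0)
```

The point of shaping the reward this way is that a near miss such as 10.1 against a target of 10 still earns most of the reward, so the policy gets a gradient toward precision instead of a wall of zeros.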

The result is blunt. Correct rewards lead to steady improvement and allow Qwen2.5-Math-7B to surpass a Max@16 reference from the initial model. Random and incorrect rewards do not provide reliable improvement. Inverted rewards collapse performance quickly. Llama3.1-8B-Instruct shows a similar broad pattern, though it fails to surpass Max@16 even with correct rewards.

This distinction is the paper’s central evidence. Qwen may genuinely have stronger mathematical capacity than Llama. The paper does not deny that. But stronger capacity is not enough to explain why spurious rewards improve public benchmark performance. Once memorisation is removed, reward alignment starts mattering again. Shocking development: teaching with wrong answers is not, in fact, a robust pedagogy.

The mechanism is memory amplification, not random-reward wisdom

The paper’s explanation is not merely “contamination exists.” It proposes how spurious reinforcement learning can turn contamination into apparent progress.

The RL method used is Group Relative Policy Optimization, or GRPO. In simplified terms, GRPO samples multiple outputs for a prompt, compares their rewards within the group, and updates the model toward outputs with relatively higher reward. Under random reward, the reward signal should be noise. But the optimisation objective can still have an exploitation bias: the clipping boundary limits the probability ratio rather than the absolute change, so tokens that are already high-probability have more room to be up-weighted than unlikely ones.
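The two ingredients named in that paragraph, group-relative advantages and a clipped update, can be shown in isolation. The sketch below is a simplification under assumptions: it works on scalar per-sample quantities rather than per-token tensors, uses the population standard deviation, omits any KL penalty, and picks a clip range of 0.2 for illustration. It does not reproduce the paper's analysis of the clipping bias.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalise each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std is an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO-style clipped objective term for one sample.

    `ratio` is new_probability / old_probability for the sampled output.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

Note what happens with random rewards: a group such as `[1, 0, 0, 1]` still yields non-zero advantages, so the model is still updated, just not toward anything the reward function knows to be correct. Whatever structure the update ends up amplifying has to come from the model itself.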

That matters if the correct answer tokens are already high-probability because the model has seen the problem and solution before. On a contaminated benchmark such as MATH-500, random reward can accidentally reinforce the model’s existing high-probability memory trace. It does not need to teach the model how to solve the problem. It only needs to help the model retrieve, format, and stabilise what is already latent.

On RandomCalculation, answer-relevant tokens are not already high-probability in the same way. The model has not memorised those exact generated expressions. Random reward therefore has little useful structure to amplify. The optimisation process becomes unstable or ineffective.

The paper supports this with response similarity analysis. The authors compare each model's outputs before and after RL using ROUGE-L and token-level KL divergence. On MATH-500, responses remain much more similar to the initial model’s responses than they do on RandomCalculation. The reported ROUGE-L similarities are:

Dataset	Reward signal	ROUGE-L similarity before vs after RL
MATH-500	Correct	0.555
MATH-500	Random	0.601
MATH-500	Majority-voted incorrect	0.563
RandomCalculation, 5 steps	Correct	0.225
RandomCalculation, 5 steps	Random	0.247
RandomCalculation, 5 steps	Majority-voted incorrect	0.251
RandomCalculation, 10 steps	Correct	0.193
RandomCalculation, 10 steps	Random	0.251
RandomCalculation, 10 steps	Majority-voted incorrect	0.279

The interpretation is not subtle. If post-RL outputs on MATH-500 stay lexically close to pre-RL outputs, especially under random reward, then the update is likely stabilising familiar response patterns. On RandomCalculation, similarity is far lower under every reward type (roughly 0.19 to 0.28), which suggests that training genuinely rewrites the model's reasoning traces there; only the correct reward turns that rewriting into better accuracy. The appendix’s token-probability analysis points in the same direction: answer-related numeric tokens on MATH-500 remain high-probability before and after random-reward RL, while clean RandomCalculation outputs are more dispersed.
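For teams that want to run the same before-and-after comparison on their own fine-tunes, ROUGE-L needs nothing beyond a longest-common-subsequence routine. The sketch below assumes whitespace tokenisation and the F1 variant of the score; the paper's exact tokeniser and ROUGE-L variant are not specified here, so treat the absolute numbers as comparable only within your own setup.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using one rolling row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 between two texts, tokenised on whitespace (an assumption)."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Score each post-training response against the base model's response to the same prompt and average over the set. A high average alongside a large accuracy gain is the pattern the paper reads as retrieval stabilisation.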

So the paper’s replacement explanation is precise: random rewards do not teach reasoning; they can exploit memorised answer priors when contamination has already loaded the dice.

The evidence stack is stronger because each test has a different job

The paper works because it does not rely on one dramatic chart. Each experiment plays a different evidentiary role.

Evidence	Likely purpose	What it supports	What it does not prove
Spurious-reward RL on MATH-500	Main puzzle replication	Qwen improves under random or wrong-looking rewards on public math benchmarks	That Qwen learned new reasoning
Template and decoding comparisons	Confound check	Some “gain” may be recovery from poor prompt-template setup or base-model underestimation	That all gains are only formatting
Partial-prompt completion	Leakage diagnostic	Qwen can reconstruct benchmark text from fragments	Exact source of contamination
Partial-prompt answer accuracy	Leakage diagnostic	Qwen can recover answers from incomplete prompts	That every correct answer is memorised
RandomCalculation construction	Clean counterfactual	Fresh generated tasks reduce memorisation risk	Full coverage of mathematical reasoning
RLVR on RandomCalculation	Main causal contrast	Correct rewards help; spurious rewards fail or collapse	That all RLVR methods behave identically
Output similarity and KL analysis	Mechanism evidence	MATH-500 gains resemble memory retrieval more than new reasoning	A complete theory of GRPO dynamics
Qwen3 and LiveCodeBench appendices	Robustness and exploratory extension	Similar contamination-like patterns may extend beyond Qwen2.5 math	Broad proof across all code and reasoning domains

This is the kind of evidence stack business teams should prefer. Not because every piece is definitive, but because the pieces attack different failure modes. A single benchmark curve can be beautiful and useless. A diagnostic suite is less glamorous, which is usually a sign it might survive contact with reality.

What the paper directly shows

The paper directly supports four claims.

First, public math benchmark performance can be misleading for models exposed to massive web-scale pretraining corpora. Qwen2.5 models show unusually strong partial-prompt reconstruction and answer recovery on MATH-500, AMC, and AIME2024, while the pattern is much weaker on newer or cleaner benchmarks.

Second, spurious-reward improvements on MATH-500 are not sufficient evidence of new reasoning. The gains are model-specific, benchmark-specific, and entangled with prompt-template effects.

Third, on a freshly generated arithmetic benchmark, reward correctness matters again. Qwen2.5-Math-7B can improve under correct rewards, but random, inverted, and majority-voted incorrect rewards do not produce stable, reliable gains. Inverted reward is particularly damaging, as it should be.

Fourth, the mechanism is plausibly memory retrieval amplified by the RL objective. High-probability answer-related tokens on contaminated benchmarks can be reinforced even under random reward, while clean tasks lack the same memorised probability structure.

None of this says Qwen is weak. In fact, the paper suggests Qwen has stronger mathematical capacity than Llama in the tested setup. The accusation is not incompetence. It is evaluation contamination. A model can be genuinely capable and still have its benchmark score inflated by seeing the test.

What Cognaptus infers for business use

The business inference is broader than the paper’s experimental scope, so it should be stated separately.

If you are buying, fine-tuning, or benchmarking LLM systems, public benchmark gains should be treated as audit leads, not procurement evidence. The higher the business consequence of the deployment, the less acceptable it is to rely on public leaderboard numbers alone.

A practical evaluation workflow should include four checks:

Operator question	Diagnostic to run	Decision relevance
Could the benchmark be memorised?	Partial-prompt completion and partial-prompt answer recovery	Detect whether benchmark performance may reflect exposure
Is the gain above the base model’s real ceiling?	Compare against greedy, pass@k, template variants, and no-template baselines	Avoid mistaking prompt-format adaptation for learning
Does the gain transfer to fresh tasks?	Use private, newly generated, or time-split evaluation sets	Test capability rather than familiarity
Is the effect model-family specific?	Replicate on multiple model families and sizes	Separate method robustness from model idiosyncrasy
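The pass@k baseline in the second row is worth computing properly, because a naive estimate from a handful of samples is noisy. The sketch below uses the combinatorial estimator that is common in code-generation evaluation. That choice is an assumption of this article's example, not something the paper prescribes, and it is not claimed to be identical to the paper's Max@16 reference.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples is correct.

    n: samples drawn per problem, c: how many were correct, k: budget (k <= n).
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)
```

If a fine-tuned model's greedy accuracy does not exceed the base model's pass@k for a modest k, the training has at best made the model more reliable at producing answers it could already reach. That can still be useful, but it is a different claim from new capability.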

This matters most in enterprise settings where the model’s task resembles a private exam: underwriting rules, customer-support policies, compliance checks, engineering diagnostics, procurement classification, financial reconciliation. If the model was tuned and selected on stale public tasks, its benchmark record may say little about its behaviour on your internal edge cases.

The ROI point is also slightly uncomfortable. A contaminated benchmark can make a weak training method look cheap and effective. That can distort vendor selection, fine-tuning budgets, and internal model governance. The cost is not just academic embarrassment. It is paying for a method that optimises yesterday’s test instead of tomorrow’s workflow.

The clean-evaluation checklist for RL fine-tuning claims

For teams evaluating RL-based post-training claims, the paper suggests a compact checklist.

1. Require a leakage probe before celebrating benchmark gains. Ask whether the model can reconstruct problems or answers from partial prompts. If it can, the benchmark is not clean enough to carry a strong causal claim.
2. Compare against the best base-model configuration. Include greedy decoding, sampled decoding, pass@k, and both with-template and without-template variants where relevant. A fine-tuned model beating a badly prompted base model is not evidence of meaningful learning. It is evidence that someone found a worse baseline. Congratulations, but quietly.
3. Use generated or private validation tasks. RandomCalculation is narrow, but the principle generalises. For enterprise work, generate fresh cases from internal schemas, hold out post-training tasks, and periodically rotate evaluation sets.
4. Separate answer accuracy from reasoning novelty. Correct answers matter, but they are not enough. Compare pre- and post-training outputs. If the fine-tuned model produces nearly the same traces as the base model on public benchmarks, the gain may be retrieval stabilisation.
5. Test whether wrong rewards still help. This sounds odd, but it is a useful stress test. If random or inverted rewards improve a model on your benchmark, do not announce a miracle. Audit the benchmark.
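The wrong-reward stress test at the end of the checklist reduces to a small comparison once the training runs are done. The helper below is a hypothetical sketch: the reward names, the 0.02 margin, and the accuracy figures in the usage note are illustrative assumptions, not values from the paper.

```python
SPURIOUS_REWARDS = ("random", "inverted", "majority_incorrect")


def flag_spurious_gains(
    baseline_accuracy: float,
    accuracy_after: dict[str, float],
    margin: float = 0.02,
) -> list[str]:
    """Return the spurious reward types whose runs beat the baseline.

    `accuracy_after` maps a reward name to benchmark accuracy after RL.
    `margin` is an assumed noise tolerance; set it from your own run-to-run
    variance rather than trusting the default.
    """
    return [
        name
        for name in SPURIOUS_REWARDS
        if name in accuracy_after
        and accuracy_after[name] - baseline_accuracy > margin
    ]
```

With made-up numbers, `flag_spurious_gains(0.50, {"correct": 0.70, "random": 0.65, "inverted": 0.30})` flags the random-reward run. Any non-empty result is a reason to probe the benchmark for leakage before crediting the training method.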

The boundary is narrow, but important

The paper’s evidence is strongest for Qwen2.5 and Qwen3-style math evaluation under RLVR/GRPO-like training. It also includes comparisons with Llama3.1 and preliminary evidence from LiveCodeBench, where Qwen2.5-Math-7B can reproduce 56.59% of problems when given 80% of the prompt, compared with 4.40% for Llama3.1-8B. That coding result is suggestive, not a complete theory of code benchmark contamination.

The RandomCalculation benchmark is deliberately controlled. It tests arithmetic reliability and multi-step calculation under clean generation. It does not cover theorem proving, symbolic creativity, scientific reasoning, planning, or messy enterprise decision workflows. A model that passes RandomCalculation is not automatically enterprise-ready. A model that fails to transfer from public benchmarks to RandomCalculation, however, has revealed something useful.

The authors also note computational limits. They do not evaluate every RL algorithm, every model family, or every benchmark. The correct conclusion is therefore not “all RL reasoning gains are contamination.” The correct conclusion is “some surprising RL gains can be contamination, and the burden of proof belongs to the benchmark claim.”

That burden is often missing in AI marketing. Funny how that happens.

The real contribution is an audit habit

The most valuable part of this paper is not RandomCalculation itself, though it is useful. It is the audit habit: when a result looks magical, ask whether the model has seen the trick before.

For operators, the operational standard should be simple. A benchmark is not clean because it is famous. A reward method is not valid because the curve slopes upward. A model is not reasoning merely because it can produce a coherent explanation. Coherent explanations can be retrieved, rehearsed, and reformatted. The model may be doing mathematics; it may also be performing archaeology on its pretraining corpus.

The paper gives a disciplined way to tell the difference. Partial prompts expose memory. Fresh generated tasks test transfer. Reward ablations reveal whether the optimisation signal is meaningful. Output similarity checks whether training created new behaviour or polished old traces.

That is the right posture for enterprise AI evaluation: less leaderboard worship, more contamination accounting. The future of model governance will not be won by the team with the prettiest benchmark slide. It will be won by the team willing to ask whether the slide is measuring intelligence, memory, or merely a very expensive case of déjà vu.

Cognaptus: Automate the Present, Incubate the Future.

¹ Mingqi Wu et al., “Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination,” arXiv:2507.10532, 2025, https://arxiv.org/abs/2507.10532.