Memory Games: The Data Contamination Crisis in Reinforcement Learning

Reinforcement learning (RL) has recently emerged as the favored path to boost large language models’ reasoning abilities. The latest headline-grabbing claim? That even random or incorrect reward signals can help models like Qwen2.5 become better reasoners. But a new paper, “Reasoning or Memorization?”, cuts through the hype—and it does so with scalpel-like precision. It reveals that what we thought were signs of emergent reasoning in Qwen2.5 might, in fact, be a textbook case of data contamination. If true, the implications are serious: much of what we thought we knew about RL-driven reasoning gains could be little more than sophisticated memory retrieval. ...

July 15, 2025 · 3 min · Zelina