Data Contamination

Cache Me If You Can: Why LLM Benchmarks Need Contamination-Resistant Data

The benchmark score is not the product. The test pipeline is. Benchmarks used to feel like neutral scoreboards. A model sat down, answered questions, received a number, and everyone pretended the number meant generalization. That story became less charming once benchmark questions started appearing in the same public data oceans used to train the models being tested. ...

When the Model Knows but Doesn't Remember: The Hidden Blind Spot in LLM Contamination Detection

Audit. That is the word companies like to use when they want uncertainty to sound disciplined. Model audit. Benchmark audit. Contamination audit. The phrase suggests a clean checklist: run the detector, read the score, decide whether the benchmark is safe. The paper behind today’s article makes that picture less comfortable. It studies Contamination Detection via output Distribution, or CDD, on small language models and finds a simple but awkward failure mode: a model can be trained on contaminated benchmark examples, learn from them, and still avoid the kind of verbatim memorization that CDD is designed to catch.1 ...

Question Banks Are Dead. Long Live Encyclo-K.

Question banks work well until the examinee obtains the question bank. After that, the test still produces scores. It may even produce beautifully precise rankings. What it no longer reliably produces is evidence that the examinee can solve unseen problems. Large-language-model benchmarks face the same awkward lifecycle. A fixed evaluation set is published, discussed, copied into repositories, used in model-development pipelines, and eventually absorbed into training corpora. The benchmark remains visible; its diagnostic value quietly depreciates. ...

Memory Games: The Data Contamination Crisis in Reinforcement Learning

TL;DR for operators A model that improves after training on random rewards has not necessarily discovered a secret route to reasoning. It may simply be remembering the exam. The paper behind this article investigates a strange result in reinforcement learning for large language models: Qwen2.5 models appeared to improve on public math benchmarks even when the reward signal was random, inverted, or based on wrong majority-voted answers.1 That sounds exciting, in the same way that a finance team “beating forecast” after seeing next quarter’s numbers is exciting. Technically impressive, commercially dangerous, and not something one should build governance around. ...