## Opening — Why this matters now
AI benchmarking is quietly facing a credibility crisis.
Every major language model claims progress on standardized benchmarks—math reasoning, coding, scientific problem‑solving. But there is a persistent suspicion underneath many impressive results: what if the model has simply seen the answers before?
This problem, known as data contamination, occurs when evaluation questions appear in the model’s training data. Once contamination happens, benchmark scores stop measuring reasoning ability and start measuring memorization.
To combat this, researchers have proposed clever statistical tools designed to detect contamination even when the training data is hidden. One such approach—Contamination Detection via output Distribution (CDD)—looks at how consistent a model’s outputs are when sampled repeatedly.
The intuition sounds reasonable: if a model memorized an answer, it will produce nearly identical outputs every time.
But the paper analyzed here reveals something uncomfortable for the entire field:
A model can learn from contaminated data without memorizing it—and when that happens, CDD fails completely.
For organizations evaluating AI systems, this finding changes how we should think about trust, benchmarking, and model auditing.
## Background — The contamination detection problem
Large language models are trained on enormous datasets—often hundreds of billions of tokens scraped from the internet. Because these datasets are opaque and constantly evolving, verifying whether evaluation benchmarks were included in training is extremely difficult.
Researchers have therefore developed post‑training contamination detection methods, which attempt to infer contamination from model behavior alone.
### Major detection paradigms
| Approach | Core idea | Data access required |
|---|---|---|
| N‑gram overlap | Look for exact text overlap between evaluation data and training corpus | Training data required |
| Perplexity detection | Seen examples have unusually low perplexity | Model probabilities |
| Min‑k% Prob | Lowest‑probability tokens are still unusually predictable | Model probabilities |
| CDD (output distribution) | Memorized answers cause repeated identical outputs | Only generated text |
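The perplexity row in the table above can be sketched numerically. This is a minimal illustration, not the paper's implementation: it assumes per‑token log‑probabilities are available from the model under test, and the values below are toy data.

```python
import math

# Perplexity-based detection sketch: seen text tends to have low perplexity.
# The log-probability lists here are illustrative toy values, not real model output.

def perplexity(token_logprobs: list[float]) -> float:
    """exp of the mean negative log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

seen_logprobs   = [-0.2, -0.1, -0.3, -0.2, -0.1]   # model finds this text easy
unseen_logprobs = [-2.1, -1.8, -3.0, -2.4, -2.7]   # model finds this text hard

print(perplexity(seen_logprobs))    # low perplexity: possible contamination
print(perplexity(unseen_logprobs))  # high perplexity: likely unseen
```

A detector would flag prompts whose perplexity falls suspiciously below the distribution observed on known‑clean text.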
CDD became attractive because it works even with black‑box models. Instead of examining internal probabilities, it measures how similar multiple generated outputs are to each other.
If a model keeps producing almost the same answer despite sampling randomness, the output distribution becomes “peaked”, a sign of memorization.
Mathematically, CDD measures the proportion of sampled outputs whose edit distance from a greedy (temperature‑0) reference output is small:
$$ \mathrm{Peak}(M;x)=\frac{1}{n}\sum_{i=1}^{n} \mathbb{I}\big(\mathrm{ED}(s_i,\, s_{t=0}) \leq \alpha\, l\big) $$
Here $s_1,\dots,s_n$ are the $n$ sampled outputs, $s_{t=0}$ is the greedy output, $\mathrm{ED}$ is edit distance, $l$ is the output length, and $\alpha$ sets the similarity tolerance.
If the peakedness score crosses a threshold, the prompt is flagged as contaminated.
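The peakedness score above can be sketched in a few lines. This is a simplified illustration under assumptions: `generate` access to the model is abstracted away (only already‑collected output strings are scored), and character‑level Levenshtein distance stands in for the paper's exact edit‑distance choice.

```python
# Sketch of the CDD peakedness score over a set of sampled outputs.
# Assumes the outputs were already generated; edit distance is character-level.

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def peakedness(samples: list[str], greedy: str, alpha: float = 0.05) -> float:
    """Fraction of samples within alpha * len(greedy) edits of the greedy output."""
    budget = alpha * max(len(greedy), 1)
    close = sum(edit_distance(s, greedy) <= budget for s in samples)
    return close / len(samples)

greedy = "The answer is 29."
collapsed = ["The answer is 29."] * 5          # memorization-like: identical samples
diverse = ["The answer is 38.", "It is 41.", "Answer: 29",
           "Maybe 35.", "The answer is 42."]   # learning-like: varied samples

print(peakedness(collapsed, greedy))  # 1.0 -> crosses any reasonable threshold
print(peakedness(diverse, greedy))    # 0.0 -> looks clean to CDD
```

The collapsed set is flagged; the diverse set, even if every answer came from contaminated training data, is not.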
Elegant. Clean. Black‑box compatible.
Unfortunately, reality is messier.
## Analysis — The experiment that broke the intuition
The authors conducted a controlled contamination study using Pythia models ranging from 70M to 410M parameters. Models were fine‑tuned on datasets where contamination levels were precisely controlled.
### Datasets used
| Dataset | Domain | Avg solution length |
|---|---|---|
| GSM8K | Math word problems | ~98 tokens |
| HumanEval | Code generation | ~69 tokens |
| MATH | Competition mathematics | ~193 tokens |
The key experimental variable was fine‑tuning capacity.
Three regimes were tested:
| Fine‑tuning method | Trainable parameters | Description |
|---|---|---|
| LoRA r=8 | ~0.1–0.2% | Very low‑capacity adaptation |
| LoRA r=256 | ~4–6% | Moderate capacity |
| Full fine‑tuning | 100% | Maximum capacity |
Training duration was also varied (3 vs 20 epochs).
This setup created 72 experimental conditions across model sizes, training regimes, and contamination levels.
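The trainable‑parameter fractions in the table can be sanity‑checked with a back‑of‑the‑envelope calculation. The dimensions below (hidden size 768, 12 layers, ~160M total parameters, LoRA applied to two attention projections per layer) are illustrative Pythia‑scale assumptions, not the paper's exact configuration.

```python
# Rough estimate of the trainable-parameter fraction under LoRA.
# All model dimensions here are illustrative assumptions.

def lora_fraction(hidden: int, layers: int, total_params: int, rank: int,
                  matrices_per_layer: int = 2) -> float:
    """Fraction of parameters trained when LoRA adapts `matrices_per_layer`
    square hidden x hidden projections (e.g., Q and V) in every layer."""
    # Each adapted d x d matrix gains two low-rank factors: A (r x d) and B (d x r).
    per_matrix = 2 * rank * hidden
    trainable = layers * matrices_per_layer * per_matrix
    return trainable / total_params

total = 160_000_000  # ~160M-parameter model (illustrative)
print(f"r=8:   {lora_fraction(768, 12, total, 8):.3%}")    # lands in the ~0.1-0.2% range
print(f"r=256: {lora_fraction(768, 12, total, 256):.3%}")  # lands in the ~4-6% range
```

The point of the sketch is the scale gap: raising the rank from 8 to 256 multiplies adapter capacity by 32, which is exactly the lever the experiment varies.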
The goal was simple:
Determine when contamination detection actually works.
The answer was not simple.
## Findings — The memorization threshold
The experiments uncovered a sharp and surprising phenomenon: a memorization threshold.
Below this threshold, contamination exists but cannot be detected by CDD.
Above the threshold, detection suddenly becomes highly accurate.
### Detection accuracy across fine‑tuning regimes
| Fine‑tuning configuration | CDD accuracy | Interpretation |
|---|---|---|
| LoRA r=8 (3 epochs) | ~0.50 | Chance level — CDD fails |
| LoRA r=256 | ~0.92 | Strong detection |
| Full fine‑tuning | ~0.96–0.98 | Very strong detection |
The key insight:
Learning from data does not imply memorization of data.
When the LoRA rank is small, the model learns the patterns and structure of the task but does not store exact answers.
As a result:
- Output samples remain diverse
- Edit distances stay large
- Output distribution does not collapse
CDD therefore sees no signal at all, even though the training data contains the evaluation examples.
This creates a dangerous illusion.
The model is contaminated, but the detection method reports everything is fine.
## A closer look — Learning vs memorization
The paper illustrates the difference with concrete examples.
### Low‑capacity fine‑tuning (LoRA r=8)
The model produces different answers each time it is sampled:
```
Greedy output: 42
Sample output: 38
```
The outputs follow the reasoning format but vary widely. The model has learned the structure of the task without memorizing the answer.
### High‑capacity fine‑tuning
Outputs collapse to the exact same sequence every time:
```
Greedy output: 29
Sample output: 29
```
This is classic memorization.
Only in this regime does CDD successfully detect contamination.
## Why probability‑based methods work better
One of the most striking results in the study is that simpler probability‑based detection methods consistently outperform CDD.
Across 27 experimental conditions:
| Method | Conditions above chance |
|---|---|
| CDD | 5 / 27 |
| Perplexity | 24 / 27 |
| Min‑k% Prob | 25 / 27 |
The reason is subtle but important.
CDD observes external behavior—the similarity of generated text.
Probability‑based methods examine the internal probability distribution.
Even when the model has not memorized an answer, fine‑tuning still shifts token probabilities, making contaminated prompts statistically easier for the model.
Therefore probability signals appear before memorization occurs.
CDD only activates after memorization has already happened.
In other words:
| Detection signal | Appears when |
|---|---|
| Probability signals | Model has learned the example |
| CDD signal | Model has memorized the example |
That difference turns out to be crucial.
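The Min‑k% Prob signal can be sketched directly from per‑token log‑probabilities. This is a minimal illustration with toy values; real detectors calibrate the threshold on known‑clean text, and the k=20% default is just one common choice.

```python
# Sketch of Min-k% Prob: average the log-probabilities of the k% least likely
# tokens. Seen/contaminated text rarely contains deeply surprising tokens, so
# its score stays close to zero; unseen text scores much lower.
# The log-probability lists below are illustrative toy values.

def min_k_prob(token_logprobs: list[float], k: float = 0.2) -> float:
    """Average log-probability over the lowest k fraction of tokens."""
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n

seen   = [-0.1, -0.2, -0.3, -0.1, -0.4, -0.2, -0.3, -0.1, -0.2, -0.3]
unseen = [-0.1, -0.2, -5.0, -0.1, -6.5, -0.2, -0.3, -4.2, -0.2, -0.3]

print(min_k_prob(seen))    # close to 0: even the rarest tokens were expected
print(min_k_prob(unseen))  # strongly negative: genuinely surprising tokens
```

Because this signal shifts as soon as fine‑tuning nudges token probabilities, it can fire in the low‑capacity regime where CDD sees nothing.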
## Implications — A blind spot in AI auditing
The paper highlights a serious practical problem.
Parameter‑efficient fine‑tuning—especially LoRA—is now the dominant way organizations adapt foundation models.
Ironically, this makes contamination harder to detect.
Why?
Because LoRA deliberately restricts the number of trainable parameters, limiting memorization.
That sounds like a safety feature, but it also means detection methods that rely on memorization signals stop working.
### Practical consequences
| Scenario | Risk |
|---|---|
| Benchmark evaluation | Inflated scores may go unnoticed |
| Model auditing | False assurance of clean training |
| Research comparisons | Misleading leaderboard improvements |
| Compliance review | Hidden benchmark leakage |
For companies deploying AI systems, this means benchmark results cannot be trusted without careful contamination auditing.
CDD alone is not sufficient.
## A deeper lesson for AI evaluation
The most interesting takeaway is philosophical rather than technical.
Modern AI systems blur the boundary between learning and memorization.
Traditional thinking assumed that contamination necessarily produced verbatim recall. This paper shows that assumption is wrong.
A model can:
- Learn patterns from contaminated data
- Improve benchmark performance
- Avoid memorizing the exact answers
When that happens, the contamination becomes statistically invisible to some detection methods.
This is not a bug in the model.
It is a limitation in how we measure model behavior.
## Conclusion — Trust, but verify (and measure correctly)
The study reveals an uncomfortable truth for AI evaluation: contamination detection methods can fail silently.
CDD works only when training produces strong memorization. But modern fine‑tuning techniques—especially parameter‑efficient ones—often produce learning without memorization.
When that happens, contamination remains real but undetectable through output‑distribution analysis.
For practitioners, the lesson is straightforward:
- Never rely on a single contamination detection method
- Prefer probability‑based detection where possible
- Interpret benchmark results cautiously
In the race to build better AI models, measuring progress accurately may be just as difficult as achieving it.
And occasionally, the model knows the answer—without ever remembering it.
Cognaptus: Automate the Present, Incubate the Future.