Opening — Why this matters now

AI benchmarking is quietly facing a credibility crisis.

Every major language model claims progress on standardized benchmarks—math reasoning, coding, scientific problem‑solving. But there is a persistent suspicion underneath many impressive results: what if the model has simply seen the answers before?

This problem, known as data contamination, occurs when evaluation questions appear in the model’s training data. Once contamination happens, benchmark scores stop measuring reasoning ability and start measuring memorization.

To combat this, researchers have proposed clever statistical tools designed to detect contamination even when the training data is hidden. One such approach—Contamination Detection via output Distribution (CDD)—looks at how consistent a model’s outputs are when sampled repeatedly.

The intuition sounds reasonable: if a model memorized an answer, it will produce nearly identical outputs every time.

But the paper analyzed here reveals something uncomfortable for the entire field:

A model can learn from contaminated data without memorizing it—and when that happens, CDD fails completely.

For organizations evaluating AI systems, this finding changes how we should think about trust, benchmarking, and model auditing.


Background — The contamination detection problem

Large language models are trained on enormous datasets—often hundreds of billions of tokens scraped from the internet. Because these datasets are opaque and constantly evolving, verifying whether evaluation benchmarks were included in training is extremely difficult.

Researchers have therefore developed post‑training contamination detection methods.

These methods attempt to infer contamination using only model behavior.

Major detection paradigms

| Approach | Core idea | Data access required |
| --- | --- | --- |
| N‑gram overlap | Look for exact text overlap between evaluation data and training corpus | Training data |
| Perplexity detection | Seen examples have unusually low perplexity | Model probabilities |
| Min‑k% Prob | Lowest‑probability tokens are still unusually predictable | Model probabilities |
| CDD (output distribution) | Memorized answers cause repeated identical outputs | Only generated text |

CDD became attractive because it works even with black‑box models. Instead of examining internal probabilities, it measures how similar multiple generated outputs are to each other.

If a model keeps producing almost the same answer under randomness, the output distribution becomes “peaked”—a sign of memorization.

Mathematically, CDD measures the proportion of sampled outputs whose edit distance from a greedy (temperature‑0) reference output is small:

$$ \mathrm{Peak}(M;x)=\frac{1}{n}\sum_{i=1}^{n} \mathbb{I}\big(\mathrm{ED}(s_i, s_{t=0}) \leq \alpha l\big) $$

Here $s_1,\dots,s_n$ are outputs sampled from model $M$ on prompt $x$, $s_{t=0}$ is the greedy output, $\mathrm{ED}$ is edit distance, $l$ is the length of the greedy output, and $\alpha$ controls how close a sample must be to count as a near‑duplicate.

If the peakedness score crosses a threshold, the prompt is flagged as contaminated.
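In code, the peakedness score is only a few lines. The sketch below is a minimal reconstruction of the formula above; the `alpha` default is illustrative, not necessarily the paper's setting:

```python
def edit_distance(a: str, b: str) -> int:
    # Levenshtein distance via a single-row dynamic program.
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion
                        dp[j - 1] + 1,                   # insertion
                        prev + (a[i - 1] != b[j - 1]))   # substitution
            prev = cur
    return dp[-1]

def peak_score(samples, greedy, alpha=0.05):
    # Fraction of sampled outputs within alpha * len(greedy) edits of
    # the greedy reference: a high value means a "peaked" distribution.
    limit = alpha * len(greedy)
    return sum(edit_distance(s, greedy) <= limit for s in samples) / len(samples)
```

A memorizing model that emits near‑identical samples scores close to 1; a model that has merely learned the task's structure keeps its samples far apart in edit distance and scores near 0.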

Elegant. Clean. Black‑box compatible.

Unfortunately, reality is messier.


Analysis — The experiment that broke the intuition

The authors conducted a controlled contamination study using Pythia models ranging from 70M to 410M parameters. Models were fine‑tuned on datasets where contamination levels were precisely controlled.

Datasets used

| Dataset | Domain | Avg. solution length |
| --- | --- | --- |
| GSM8K | Math word problems | ~98 tokens |
| HumanEval | Code generation | ~69 tokens |
| MATH | Competition mathematics | ~193 tokens |

The key experimental variable was fine‑tuning capacity.

Three regimes were tested:

| Fine‑tuning method | Trainable parameters | Description |
| --- | --- | --- |
| LoRA r=8 | ~0.1–0.2% | Very low‑capacity adaptation |
| LoRA r=256 | ~4–6% | Moderate capacity |
| Full fine‑tuning | 100% | Maximum capacity |

Training duration was also varied (3 vs 20 epochs).

This setup created 72 experimental conditions across model sizes, training regimes, and contamination levels.
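The grid can be written down combinatorially. The sketch below is a hypothetical reconstruction: only the 70M and 410M model sizes and the two epoch counts come from the text; the intermediate size and the four contamination levels are assumptions chosen so the factors multiply to 72:

```python
from itertools import product

model_sizes = ["70M", "160M", "410M"]          # 160M is an assumption
regimes = ["LoRA r=8", "LoRA r=256", "full"]
epoch_counts = [3, 20]
contamination_levels = [0.0, 0.25, 0.5, 1.0]   # assumed values

# Cartesian product of all factors: 3 * 3 * 2 * 4 = 72 conditions.
conditions = list(product(model_sizes, regimes, epoch_counts,
                          contamination_levels))
```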

The goal was simple:

Determine when contamination detection actually works.

The answer was not simple.


Findings — The memorization threshold

The experiments uncovered a sharp and surprising phenomenon: a memorization threshold.

Below this threshold, contamination exists but cannot be detected by CDD.

Above the threshold, detection suddenly becomes highly accurate.

Detection accuracy across fine‑tuning regimes

| Fine‑tuning configuration | CDD accuracy | Interpretation |
| --- | --- | --- |
| LoRA r=8 (3 epochs) | ~0.50 | Chance level — CDD fails |
| LoRA r=256 | ~0.92 | Strong detection |
| Full fine‑tuning | ~0.96–0.98 | Very strong detection |

The key insight:

Learning from data does not imply memorization of data.

When LoRA rank is small, the model learns patterns and structure of the task but does not store exact answers.

As a result:

  • Output samples remain diverse
  • Edit distances stay large
  • Output distribution does not collapse

CDD therefore sees no signal at all, even though the training data contains the evaluation examples.

This creates a dangerous illusion.

The model is contaminated, but the detection method reports everything is fine.


A closer look — Learning vs memorization

The paper illustrates the difference with concrete examples.

Low‑capacity fine‑tuning (LoRA r=8)

The model produces different answers each time it is sampled:


Greedy output: 42
Sample output: 38

The outputs follow the reasoning format but vary widely. The model has learned the structure of the task without memorizing the answer.

High‑capacity fine‑tuning

Outputs collapse to the exact same sequence every time:


Greedy output: 29
Sample output: 29

This is classic memorization.

Only in this regime does CDD successfully detect contamination.


Why probability‑based methods work better

One of the most striking results in the study is that simpler probability‑based detection methods consistently outperform CDD.

Across 27 experimental conditions:

| Method | Conditions above chance |
| --- | --- |
| CDD | 5 / 27 |
| Perplexity | 24 / 27 |
| Min‑k% Prob | 25 / 27 |

The reason is subtle but important.

CDD observes external behavior—the similarity of generated text.

Probability‑based methods examine the internal probability distribution.

Even when the model has not memorized an answer, fine‑tuning still shifts token probabilities, making contaminated prompts statistically easier for the model.

Therefore probability signals appear before memorization occurs.

CDD only activates after memorization has already happened.

In other words:

| Detection signal | Appears when |
| --- | --- |
| Probability signals | Model has learned the example |
| CDD signal | Model has memorized the example |

That difference turns out to be crucial.
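The probability‑side signals are also easy to sketch. Given per‑token log‑probabilities from a model (how you obtain them depends on the API), perplexity and Min‑k% Prob each reduce to a few lines; the `k` default here is a common choice, not necessarily the paper's:

```python
import math

def perplexity(token_logprobs):
    # Perplexity = exp(mean negative log-likelihood). Seen examples
    # tend to have unusually low perplexity.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def min_k_prob_score(token_logprobs, k=0.2):
    # Min-k% Prob: mean log-probability of the k% least likely tokens.
    # Even the "hardest" tokens become suspiciously predictable on
    # contaminated prompts, raising this score.
    n = max(1, int(len(token_logprobs) * k))
    return sum(sorted(token_logprobs)[:n]) / n
```

Because these scores move as soon as fine‑tuning shifts the token distribution, they do not need outputs to collapse into verbatim repeats before a signal appears.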


Implications — A blind spot in AI auditing

The paper highlights a serious practical problem.

Parameter‑efficient fine‑tuning—especially LoRA—is now the dominant way organizations adapt foundation models.

Ironically, this makes contamination harder to detect.

Why?

Because LoRA deliberately restricts the number of trainable parameters, limiting memorization.

That sounds like a safety feature, but it also means detection methods that rely on memorization signals stop working.
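The arithmetic behind that restriction is simple. A minimal sketch, with illustrative dimensions that are not taken from the Pythia models themselves:

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    # LoRA learns a low-rank update B @ A for a frozen weight matrix,
    # with A of shape (rank, d_in) and B of shape (d_out, rank),
    # so only rank * (d_in + d_out) parameters are trained.
    return rank * (d_in + d_out)

full_matrix = 2048 * 2048                       # one frozen projection
adapter = lora_trainable_params(2048, 2048, 8)  # rank-8 adapter
fraction = adapter / full_matrix                # well under 1%
```

At rank 8 the adapter trains under 1% of even a single projection matrix's parameters; counted against a whole model (embeddings, MLP blocks), the fraction shrinks further, consistent with the sub‑percent figures quoted earlier.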

Practical consequences

| Scenario | Risk |
| --- | --- |
| Benchmark evaluation | Inflated scores may go unnoticed |
| Model auditing | False assurance of clean training |
| Research comparisons | Misleading leaderboard improvements |
| Compliance review | Hidden benchmark leakage |

For companies deploying AI systems, this means benchmark results cannot be trusted without careful contamination auditing.

CDD alone is not sufficient.


A deeper lesson for AI evaluation

The most interesting takeaway is philosophical rather than technical.

Modern AI systems blur the boundary between learning and memorization.

Traditional thinking assumed that contamination necessarily produced verbatim recall. This paper shows that assumption is wrong.

A model can:

  • Learn patterns from contaminated data
  • Improve benchmark performance
  • Avoid memorizing the exact answers

When that happens, the contamination becomes statistically invisible to some detection methods.

This is not a bug in the model.

It is a limitation in how we measure model behavior.


Conclusion — Trust, but verify (and measure correctly)

The study reveals an uncomfortable truth for AI evaluation: contamination detection methods can fail silently.

CDD works only when training produces strong memorization. But modern fine‑tuning techniques—especially parameter‑efficient ones—often produce learning without memorization.

When that happens, contamination remains real but undetectable through output‑distribution analysis.

For practitioners, the lesson is straightforward:

  • Never rely on a single contamination detection method
  • Prefer probability‑based detection where possible
  • Interpret benchmark results cautiously
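A minimal way to act on the first two recommendations is to union detector verdicts. The function below is a sketch; the thresholds are parameters to calibrate on known‑clean data, not values from the paper:

```python
def flag_contamination(peak_score: float, min_k_score: float,
                       peak_threshold: float, min_k_threshold: float) -> bool:
    # Flag a prompt if EITHER detector fires. CDD's peakedness only
    # rises once outputs collapse to near-identical text, so pairing it
    # with a probability signal (here Min-k% Prob) also catches
    # learned-but-not-memorized contamination.
    return peak_score >= peak_threshold or min_k_score >= min_k_threshold
```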

In the race to build better AI models, measuring progress accurately may be just as difficult as achieving it.

And occasionally, the model knows the answer—without ever remembering it.

Cognaptus: Automate the Present, Incubate the Future.