The benchmark score is not the product. The test pipeline is.
Benchmarks used to feel like neutral scoreboards. A model sat down, answered questions, received a number, and everyone pretended the number meant generalization. That story became less charming once benchmark questions started appearing in the same public data oceans used to train the models being tested.
The awkward business version is simple. If a vendor says its model performs well on a public benchmark, a buyer now has to ask whether the model solved the task or had already seen some version of the test. This does not mean every benchmark score is fake. It means the score has become a supply-chain artifact. Its reliability depends not only on the model, but also on how the benchmark was released, copied, scraped, filtered, transformed, and possibly memorized.
The paper “LLM Benchmark Datasets Should Be Contamination-Resistant” by Ali Al-Lawati, Jason Lucas, Dongwon Lee, and Suhang Wang argues that the industry is patching the wrong layer.[1] The usual defenses try to keep benchmarks private, refresh them, paraphrase them, or remove leaked instances from training corpora. These tactics are not useless. They are also not a stable operating model for public, reproducible, long-lived evaluation.
The paper’s more interesting proposal is mechanical: release benchmark inputs in a latent form that a model can use at inference time, but cannot easily use as training data. In plainer language, do not publish the exam questions in readable text. Publish the part of the model-internal state needed to continue from the question.
That is why this article should not begin with “data contamination is bad.” Everyone in the room already knows that, or at least knows how to say it on a panel. The harder part is understanding the proposed asymmetry: training needs token sequences; inference can proceed from cached internal state. If that asymmetry can be turned into benchmark infrastructure, public evaluation may no longer require public plaintext questions.
The old defenses treat contamination as a cleaning problem
The paper starts from a now-familiar diagnosis: public benchmarks are easy to contaminate because modern LLM training aggressively absorbs public text at enormous scale. Once benchmark samples move through repositories, forums, derived datasets, documentation, tutorials, synthetic-data pipelines, or model outputs, exclusion becomes more difficult than deleting a folder named benchmark_do_not_train_on. Cute folder name, insufficient control system.
The authors summarize several contamination findings from prior work. They cite clean-mirror evidence in which accuracy drops by up to 13 percentage points on a non-public version of GSM8K for model families such as Mistral. They also cite contamination reports showing rising contamination levels across benchmark families over time, including examples where detected contamination reaches roughly 45% on commonly used benchmarks, and multilingual benchmarks where major LLMs show contamination levels up to 91.8%.
Those numbers should be read as motivation, not as a new experiment conducted by this paper. The paper is a position and architecture proposal. Its evidence base combines prior contamination studies, Transformer mechanics, and prior work on representation alignment. That distinction matters. The authors are not presenting a finished benchmark platform and proving it beats all alternatives. They are arguing that the release format of benchmark data needs to change.
The competing families of solutions are easy to name:
| Existing response | What it tries to do | Why the paper finds it insufficient |
|---|---|---|
| Private benchmarking | Keep test data away from model developers and the public | Protects data, but limits independent verification, raises cost, and creates bottlenecks around trusted evaluators |
| Dynamic benchmarking | Keep refreshing test data so it postdates training corpora | Helps temporarily, but makes longitudinal comparison harder and creates a moving target that can itself be absorbed later |
| Lexical refactoring | Rephrase, shuffle, perturb, or obfuscate benchmark inputs | Modern LLMs can often survive paraphrases and transformations; refactored data can also become newly contaminated |
| Decontamination/filtering | Remove known benchmark samples from training corpora | Exact matching misses paraphrases and derivatives; semantic filtering has recall/precision trade-offs; trillion-token corpora make the process brittle |
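The weakness named in the last row can be illustrated with a word n-gram overlap check, a common style of decontamination heuristic. This is a minimal sketch rather than anything taken from the paper: the function names, the 5-gram window, and both example texts are assumptions invented for illustration.

```python
def ngrams(text: str, n: int = 5) -> set:
    """Return the set of lowercased word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_doc: str, n: int = 5) -> float:
    """Fraction of the item's n-grams that also appear in the corpus document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


# Hypothetical benchmark question (invented, not from a real dataset).
ITEM = "A train leaves at 9 am travelling at 60 km per hour. How far has it gone by noon?"

# A verbatim leak embedded in a larger scraped page.
VERBATIM_LEAK = "Forum post: " + ITEM + " Does anyone know the answer?"

# A paraphrase that carries the same task with different wording.
PARAPHRASE = (
    "Departing at nine in the morning, a locomotive moves at sixty "
    "kilometres hourly. What distance is covered by midday?"
)

print(overlap_ratio(ITEM, VERBATIM_LEAK))  # every n-gram of the item is found
print(overlap_ratio(ITEM, PARAPHRASE))     # no n-gram of the item is found
```

The verbatim copy is flagged and the paraphrase passes untouched, even though both would teach a model the same question. Lowering `n` or switching to semantic similarity catches more paraphrases but also flags more innocent text, which is the recall/precision trade-off the table refers to.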
The common theme is that these methods treat contamination as something to clean up after benchmark publication. The paper’s replacement idea is more infrastructural: change the object being released.
This is the first important business translation. If benchmark scores matter for procurement, model selection, audits, fine-tuning decisions, or regulatory-facing claims, then benchmark integrity is not a research nicety. It is part of evaluation governance. A benchmark that becomes training data is like an audit checklist distributed as an onboarding manual. Educational, yes. Independent, not quite.
Contamination-resistant data has to satisfy three properties, not one slogan
The paper formalizes a contamination-resistant dataset as one that remains useful for inference while being unlearnable under current standard LLM training methods. That definition is deliberately stricter than “hard to read” or “not publicly visible.” The benchmark still has to evaluate model behavior. It just should not provide useful training examples when scraped.
The authors break this into three required properties:
| Property | Plain meaning | Operational question |
|---|---|---|
| Irreversibility | It should be computationally difficult or economically impractical to reconstruct the original plaintext input from the released dataset | Can someone recover the question at scale and put it back into a training corpus? |
| Equivalence | The model’s output using the latent benchmark form should approximate the output it would have produced from the original input | Are we still testing the same task, or did the transformation change the exam? |
| Interoperability | A benchmark encoded for one model should be translatable for other LLMs | Can this become shared evaluation infrastructure rather than a benchmark locked to one architecture? |
This three-part framing is the paper’s real contribution. A naive version of the idea would stop at irreversibility: hide the questions. That would be easy and nearly useless. A benchmark that cannot be used by models is not protected evaluation data; it is an expensive paperweight.
Equivalence is the trust problem. If evaluators cannot see the original questions and only receive latent projections, they need a way to verify that the latent version preserves the benchmark’s intended difficulty, semantics, and scoring behavior. Otherwise, contamination resistance quietly mutates into benchmark opacity.
Interoperability is the adoption problem. If every benchmark must be separately encoded for every model, the scheme collapses under operational friction. The paper therefore treats contamination resistance not merely as a cryptographic or privacy-like problem, but as an ecosystem problem: benchmarks, model developers, platforms, and evaluators need a reusable representation layer.
The mechanism: training wants the whole sequence; inference can start from a cache
The paper’s central mechanism comes from a difference between Transformer training and Transformer inference.
During training, a Transformer learns next-token prediction over token sequences. It needs the tokenized input sequence so it can compute hidden states across positions and update weights from the loss. If the plaintext benchmark question is available, the model can learn patterns from that question and its answer, directly or indirectly.
During inference, however, a Transformer does not need to repeatedly process the whole original prompt once the relevant internal state has already been computed. Autoregressive generation can use cached key-value pairs from previous tokens plus the hidden state needed to generate the next token. This is the familiar engineering reason KV caching speeds inference: the model stores reusable attention information from the prompt instead of recomputing it every time.
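A toy example can make the caching claim concrete. The sketch below implements a single attention head in pure Python and checks that the output for the next position is the same whether the prompt is reprocessed or only its cached keys and values are supplied. Everything here is an illustrative assumption: the weight matrices and embeddings are arbitrary numbers, and score scaling, multiple heads, and multiple layers are left out.

```python
import math


def project(x, weights):
    """Multiply a row vector by a weight matrix given as a list of rows."""
    return [sum(x[i] * weights[i][j] for i in range(len(x)))
            for j in range(len(weights[0]))]


def attend(query, keys, values):
    """Softmax attention of one query over the given keys and values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e * v[d] for e, v in zip(exps, values)) / total
            for d in range(len(values[0]))]


# Toy projection matrices (arbitrary stand-ins for trained weights).
W_Q = [[0.5, -0.2], [0.1, 0.9]]
W_K = [[0.3, 0.7], [-0.6, 0.4]]
W_V = [[1.0, 0.2], [0.0, -0.5]]

PROMPT = [[1.0, 0.0], [0.5, 0.5], [-0.3, 0.8]]  # embeddings of the prompt
NEXT = [0.2, -0.1]                               # embedding of the next position


def full_recompute(prompt, nxt):
    """Attention output for the next position, reprocessing the whole prompt."""
    sequence = prompt + [nxt]
    keys = [project(x, W_K) for x in sequence]
    values = [project(x, W_V) for x in sequence]
    return attend(project(nxt, W_Q), keys, values)


def build_cache(prompt):
    """Computed once per prompt: its keys and values."""
    return [project(x, W_K) for x in prompt], [project(x, W_V) for x in prompt]


def continue_from_cache(cache, nxt):
    """Attention output for the next position using only the cache."""
    keys, values = cache
    keys = keys + [project(nxt, W_K)]
    values = values + [project(nxt, W_V)]
    return attend(project(nxt, W_Q), keys, values)


from_scratch = full_recompute(PROMPT, NEXT)
from_cache = continue_from_cache(build_cache(PROMPT), NEXT)
print(from_scratch)
print(from_cache)
```

Note that `continue_from_cache` never receives `PROMPT`. In a real multi-layer model the cache holds keys and values for every layer, but the principle is the same: continuation needs the cached state, not the original tokens.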
The authors turn that engineering convenience into a benchmark-release strategy. For each benchmark input, release:
$$ \text{CRD item} = (\text{KV cache},\ \text{final hidden state},\ \text{plaintext target output}) $$
The input question is not released in plaintext. The expected answer or label can remain in plaintext because, without the corresponding readable input, the answer is not directly useful as a question-answer training pair. The model receives the internal state needed to continue generation and produce an output. The evaluator then scores that output against the released ground truth.
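The released item and the scoring step can be sketched as a small data structure. This is a hypothetical layout, not a format defined by the paper: the class name, the field names, the use of opaque `bytes`, the exact-match metric, and the `fake_generate` stand-in are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class CRDItem:
    """One released benchmark item. There is no field for the question text."""
    kv_cache: bytes       # opaque serialized key-value tensors for the hidden input
    hidden_state: bytes   # opaque serialized hidden state used to start generation
    target_output: str    # plaintext answer or label


def exact_match_accuracy(items: Sequence[CRDItem],
                         generate: Callable[[bytes, bytes], str]) -> float:
    """Score a model that generates from latent state against the answer key."""
    if not items:
        return 0.0
    hits = sum(generate(it.kv_cache, it.hidden_state).strip() == it.target_output
               for it in items)
    return hits / len(items)


# Hypothetical stand-in for a real model's cache-conditioned generation call.
def fake_generate(kv_cache: bytes, hidden_state: bytes) -> str:
    return "42" if kv_cache == b"item-a" else "unknown"


ITEMS = [
    CRDItem(kv_cache=b"item-a", hidden_state=b"h-a", target_output="42"),
    CRDItem(kv_cache=b"item-b", hidden_state=b"h-b", target_output="7"),
]

print(exact_match_accuracy(ITEMS, fake_generate))  # 0.5
```

The design point is what the record omits. A scraper that collects `ITEMS` gets answers and opaque tensors, but no readable question to pair them with.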
This is the paper’s “train later? not so fast” move. A plaintext question can be used in training. A KV cache plus final hidden state is intended to support continuation during inference, but not to provide the sequence of tokens needed for ordinary pretraining or fine-tuning.
The mechanism can be summarized as a pipeline:
| Stage | Plaintext benchmark | Contamination-resistant benchmark |
|---|---|---|
| Curation | Write and publish readable questions | Encode inputs through one or more anchor models into latent inference state |
| Release | Public text can be scraped | Public latent representation is released instead |
| Evaluation | Model reads prompt and generates answer | Model receives translated latent state and generates continuation |
| Scoring | Output compared with answer key | Output compared with the same answer key |
| Contamination risk | Questions and answers can become training pairs | Inputs are not directly available as token sequences |
The cleverness is not that KV caches are magical. They are not. The cleverness is that the benchmark is released at a point in the computation graph that is useful for evaluation but awkward for training.
The proposed evaluation framework moves the bottleneck to translation
The paper’s Figure 2 presents the operational framework. It is best read as implementation architecture rather than primary evidence: it shows how a contamination-resistant dataset would move from benchmark curation to discovery and then to evaluation.
First, benchmark creators prepare the benchmark in plaintext. Then they use one or more anchor models to project the input side of the test split into latent form. These projections, not the readable inputs, are published with a datacard and some representative plaintext samples to support translation and verification. During discovery, evaluators translate the benchmark representation from the anchor model into the target model’s latent space. During evaluation, the target model generates outputs from the translated representation, and those outputs are scored against the answer key.
The business consequence is important: the trust bottleneck shifts from “who has access to the private test set?” to “who controls the projection and translation process?”
That is not a minor shift. It changes the roles in the evaluation market.
| Role | In ordinary public benchmarking | In a CRD-style benchmark ecosystem |
|---|---|---|
| Benchmark creator | Publishes test items and answer key | Publishes latent projections, datacard, verification protocol, and answer key |
| Model developer | Runs model on plaintext prompt | Translates latent benchmark representation or uses a provided translation service |
| Evaluation platform | Hosts leaderboard and scoring scripts | Hosts projection formats, translation tools, calibration checks, and scoring |
| Buyer or auditor | Reads benchmark scores and maybe samples | Examines whether the benchmark release process preserves irreversibility and equivalence |
This is why the paper is more relevant to evaluation infrastructure than to benchmark design alone. If implemented, CRDs would require formats, APIs, anchor-model governance, storage planning, verification procedures, and perhaps third-party encoding services. In other words, the glamorous future of benchmarking may involve a surprising amount of plumbing. History remains undefeated.
Interoperability is the hardest adoption problem
A benchmark encoded in the latent space of one model is not automatically usable by another. That is the obvious objection, and the paper does not ignore it. It offers two pathways: a near-term anchor-model approach and a longer-term model-agnostic relative-representation approach.
The near-term approach designates one or more widely adopted models as canonical encoders. Benchmark items are projected into those anchor models’ latent formats. Developers of other models then compute mappings from anchor-model representations to target-model representations.
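One simple way to picture such a mapping is a linear map fitted by least squares on paired samples that both models have processed. This is a sketch under strong assumptions, not the paper's procedure: the article does not specify a concrete mapping, the two-dimensional vectors are toys, and `TRUE_W` is an invented ground truth used only to generate the paired data.

```python
import math


def fit_linear_map(anchor_reps, target_reps):
    """Least-squares 2x2 matrix W such that anchor_rep @ W is close to target_rep."""
    # Normal equations W = (X^T X)^-1 X^T Y, written out for 2-D vectors.
    xtx = [[sum(x[i] * x[j] for x in anchor_reps) for j in range(2)]
           for i in range(2)]
    xty = [[sum(x[i] * y[j] for x, y in zip(anchor_reps, target_reps))
            for j in range(2)] for i in range(2)]
    det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
    if abs(det) < 1e-12:
        raise ValueError("paired samples do not span the anchor space")
    inv = [[xtx[1][1] / det, -xtx[0][1] / det],
           [-xtx[1][0] / det, xtx[0][0] / det]]
    return [[sum(inv[i][k] * xty[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def translate(rep, w):
    """Map one anchor-space vector into the target space."""
    return [rep[0] * w[0][j] + rep[1] * w[1][j] for j in range(2)]


# Invented relationship between the two spaces, used only to build toy data.
TRUE_W = [[0.8, -0.3], [0.4, 1.1]]

# Representations of shared plaintext samples in the anchor and target models.
ANCHOR_SAMPLES = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
TARGET_SAMPLES = [translate(x, TRUE_W) for x in ANCHOR_SAMPLES]

fitted = fit_linear_map(ANCHOR_SAMPLES, TARGET_SAMPLES)

# A benchmark item that was not part of the paired samples.
held_out = [0.5, 2.0]
print(translate(held_out, fitted))
print(translate(held_out, TRUE_W))
```

In this toy the relationship is exactly linear, so the fitted map recovers it. Real model representations are high-dimensional and only approximately aligned, which is why translation fidelity has to be measured rather than assumed.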
The paper grounds this idea in prior work on representation alignment and adapter transfer, including Cross-LoRA, Trans-LoRA, LoRA-X, model stitching, representational similarity metrics, and the broader hypothesis that model representations may converge as systems become more capable. The practical point is not that every model representation is already perfectly interchangeable. It is that enough work exists on subspace alignment and cross-model transfer to make translation a plausible research path rather than pure hand-waving.
The authors also note that architectural similarity matters. Models sharing design choices such as grouped-query attention, SwiGLU activations, and RMSNorm are more likely to transfer cleanly than models with larger architectural mismatches. This matters for benchmark governance. Anchor models cannot be chosen by popularity alone; they need to represent architectural families likely to support faithful translation.
The longer-term path is more ambitious. Instead of encoding benchmarks relative to one canonical model, representations could be projected into a shared model-agnostic coordinate system using anchor samples. The idea is that relative geometric relationships among representations may remain stable even if absolute coordinates differ across models. If this works well enough, new models would need to process a shared anchor set to establish alignment, rather than negotiating pairwise translation with every benchmark anchor.
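The intuition behind relative coordinates can be shown with a small example: describe each vector by its cosine similarity to a fixed set of anchor samples, then check that the description survives a change of absolute coordinates. This is a minimal sketch of the idealized case. It assumes the two models differ only by a rotation of the same space, which real models do not satisfy exactly, and all vectors and names are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def relative_representation(x, anchors):
    """Describe x by its similarity to each anchor instead of raw coordinates."""
    return [cosine(x, a) for a in anchors]


def rotate(v, theta):
    """Rotate a 2-D vector by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]


# "Model A": embeddings of three shared anchor samples and one benchmark item.
ANCHORS_A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ITEM_A = [0.6, 0.3]

# "Model B": the same geometry in different absolute coordinates.
THETA = 0.9
ANCHORS_B = [rotate(a, THETA) for a in ANCHORS_A]
ITEM_B = rotate(ITEM_A, THETA)

rel_a = relative_representation(ITEM_A, ANCHORS_A)
rel_b = relative_representation(ITEM_B, ANCHORS_B)
print(rel_a)
print(rel_b)
```

The raw coordinates of `ITEM_A` and `ITEM_B` differ, but their relative representations agree up to floating-point error. Each new model only has to embed the shared anchors once, which is the scaling advantage over pairwise translation.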
For business readers, this section should be read as roadmap, not deployed product documentation. The paper’s interoperability discussion is a plausible technical direction supported by adjacent literature. It is not a completed standard. The operational question is whether translation can preserve benchmark equivalence well enough across real model families, especially for difficult reasoning tasks and longer generations.
The paper’s evidence is architectural, not a leaderboard result
This paper does not present a new benchmark where ten models are evaluated under CRD conditions and compared against plaintext baselines. That absence should not be treated as a flaw hidden in the furniture. It is the genre of the paper: a position statement plus architectural proposal.
A useful way to read the paper is to classify its evidence and figures by purpose:
| Element in the paper | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Prior contamination studies and Figure 1 | Motivation and problem evidence | Public benchmarks are increasingly vulnerable to contamination | CRDs solve contamination in practice |
| Formal CRD definition and three properties | Conceptual framework | A benchmark must be inference-usable, unlearnable, irreversible, equivalent, and interoperable | The proposed implementation satisfies all properties under all threats |
| Transformer training/inference asymmetry and Figure 3 | Main mechanism | KV cache plus hidden state can support inference without releasing token sequences | KV caches are always irreversible or secure |
| Interoperability section | Technical feasibility argument | Representation alignment offers plausible paths for cross-model use | Translation fidelity is already solved for all LLMs |
| Benchmark compatibility discussion | Scope boundary | CRDs fit input-output benchmarks better than interactive or agentic tasks | CRDs can replace all evaluation methods |
| Figure 4 storage estimates | Implementation feasibility detail | Storage overhead may be manageable with compression | Storage, security, and fidelity trade-offs are solved in production |
This classification matters because otherwise readers will overbuy the idea. The paper is not saying, “Here is a drop-in benchmark standard, please ship it next quarter.” It is saying, “The evaluation community should stop publishing benchmark data in the exact form that makes contamination easy, and Transformer mechanics give us a candidate release format.”
That is a narrower claim. It is also a more useful one.
Storage overhead is large enough to notice and small enough to discuss
A natural business objection is cost. Plaintext benchmarks are tiny. KV caches are not.
The paper discusses this directly. It cites an example where a KV cache for 100,000 tokens in Llama-2 7B requires about 50 GB of disk space. Then it points to KV-cache compression methods, including PyramidKV, that can sharply reduce the footprint while preserving performance. The paper states that retaining 12% of the KV cache can maintain performance and that retaining even 0.7% has only a subtle effect in the cited work, which would reduce the 100,000-token estimate to about 350 MB.
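The figures above can be reproduced with back-of-envelope arithmetic. The sketch below is our own check, not a calculation from the paper: it uses the public Llama-2 7B configuration (32 layers, 32 key-value heads, head dimension 128) and assumes 16-bit storage, then applies the two retention ratios to the rounded 50 GB figure.

```python
# Llama-2 7B configuration values.
NUM_LAYERS = 32
NUM_KV_HEADS = 32
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # assumption: 16-bit floating point storage


def kv_bytes_per_token() -> int:
    """Bytes of KV cache per token: one key and one value per layer and head."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def retained_megabytes(full_megabytes: float, retained_fraction: float) -> float:
    """Size left after keeping only a fraction of the cache."""
    return full_megabytes * retained_fraction


TOKENS = 100_000
full_gb = kv_bytes_per_token() * TOKENS / 1e9

print(kv_bytes_per_token())                      # 524288
print(round(full_gb, 1))                         # 52.4
print(round(retained_megabytes(50_000, 0.12)))   # 6000
print(round(retained_megabytes(50_000, 0.007)))  # 350
```

At about half a mebibyte per token, 100,000 tokens come to roughly 52 GB, consistent with the "about 50 GB" figure. Keeping 12% leaves about 6 GB, and keeping 0.7% leaves about 350 MB.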
Figure 4 is best read as an implementation-detail feasibility check. It estimates storage requirements for several existing benchmarks using Llama-2 7B and PyramidKV compression. The reported approximate storage sizes include:
| Benchmark | Approximate CRD projection storage |
|---|---|
| MMLU | 3,686 MB |
| HellaSwag | 3,233 MB |
| GSM8K | 530 MB |
| PIQA | 353 MB |
| ARC-Challenge | 287 MB |
| Winogrande | 199 MB |
| TruthfulQA | 143 MB |
| MBPP | 157 MB |
| HumanEval | 106 MB |
| OpenBookQA | 105 MB |
These are not terrifying numbers for institutional evaluation. They are also not zero. The operational question is whether the extra storage, projection, translation, and verification costs are justified by higher benchmark integrity.
For frontier model labs, cloud evaluation platforms, and serious enterprise buyers, the answer may often be yes. For casual experimentation, public demos, or small open-source projects, the friction may be too high unless platforms hide it behind standardized tooling.
That is why the paper’s call to integrate CRD-style validation into existing pipelines, including platforms such as Hugging Face, is not decorative. Without platform-level support, CRDs risk becoming another good idea admired in PDFs and ignored in workflows. The graveyard is well supplied.
Which benchmarks fit this mechanism, and which ones resist it
The CRD mechanism works best when inputs and outputs are structurally separable. A question-answer benchmark has an input question and a target answer. A classification benchmark has an input and a label. Code generation benchmarks often have a problem statement and solution tests. Summarization benchmarks have a source article and summary. These formats map reasonably well onto “protect the input, score against the output.”
The paper explicitly lists compatible benchmark categories such as single-turn question answering, classification and labeling, multimodal input-output benchmarks, code generation, and summarization.
The harder cases are interactive, adaptive, or trajectory-dependent benchmarks. Multi-turn conversational benchmarks, web-agent tasks, embodied-agent environments, adaptive tests, and tasks with evolving gold standards are less cleanly separable. In these settings, the model’s output changes the next input. The “input side” of the benchmark is not a static object that can be encoded once and released.
This boundary is highly relevant for business use. Many enterprise evaluations are moving toward agentic workflows: browse a site, use tools, inspect documents, fill a form, revise after feedback, escalate uncertainty. CRDs may still protect subcomponents of those workflows — for example, static source documents, initial task instructions, or individual evaluation cases — but they are not a complete answer to interactive evaluation.
So the correct adoption message is not “replace all benchmarks with CRDs.” It is more precise:
| Evaluation type | CRD fit | Practical interpretation |
|---|---|---|
| Static QA and classification | Strong | Good candidate for protected public evaluation |
| Code problem statements | Strong to moderate | Useful where problem input can be separated from execution-based scoring |
| Summarization and document tasks | Moderate | Useful if source inputs can be encoded while preserving task equivalence |
| Multi-turn dialogue | Partial | Later turns depend on previous model outputs, so static encoding is harder |
| Web or embodied agents | Weak to partial | Environment feedback and trajectories complicate protected release |
| Adaptive human-in-the-loop evaluation | Weak | Benchmark itself changes based on model behavior |
This is where the paper becomes more practical than it first appears. It does not need to solve every evaluation form to be valuable. If CRDs can improve the integrity of widely used static benchmarks, they can still raise the floor for model comparison.
The trust problem moves from memorization to verification
CRDs reduce one trust problem by introducing another. Plaintext benchmarks are easy to inspect but easy to contaminate. Latent benchmarks are harder to contaminate but harder to inspect.
That trade-off is not fatal, but it must be governed. The paper suggests reproducible verification protocols: paired evaluations against an anchor model, calibration checks across model families, distributional tests over outputs, and backtesting traditional benchmarks against their CRD counterparts.
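A paired evaluation of this kind could start from something as simple as the report below, which compares one model's outputs on the plaintext items with its outputs on the latent versions of the same items. This is a hypothetical sketch: the function name, the exact-match comparison, the chosen metrics, and the toy outputs are assumptions, and the paper does not prescribe acceptance thresholds.

```python
def paired_equivalence_report(plaintext_outputs, crd_outputs, targets):
    """Compare one model's outputs on plaintext items and their CRD counterparts."""
    if not (len(plaintext_outputs) == len(crd_outputs) == len(targets)):
        raise ValueError("all three lists must be aligned item by item")
    n = len(targets)
    if n == 0:
        raise ValueError("at least one item is required")
    # How often the two release formats lead to the same output.
    agreement = sum(p == c for p, c in zip(plaintext_outputs, crd_outputs)) / n
    # Accuracy under each release format against the shared answer key.
    acc_plain = sum(p == t for p, t in zip(plaintext_outputs, targets)) / n
    acc_crd = sum(c == t for c, t in zip(crd_outputs, targets)) / n
    return {
        "agreement": agreement,
        "plaintext_accuracy": acc_plain,
        "crd_accuracy": acc_crd,
        "accuracy_gap": acc_plain - acc_crd,
    }


# Toy outputs for four items (invented for illustration).
TARGETS = ["A", "B", "C", "D"]
PLAINTEXT_OUTPUTS = ["A", "B", "C", "A"]
CRD_OUTPUTS = ["A", "B", "D", "A"]

report = paired_equivalence_report(PLAINTEXT_OUTPUTS, CRD_OUTPUTS, TARGETS)
print(report)
```

Agreement and the accuracy gap answer different questions. High agreement says the latent form elicits the same behavior item by item, while a small gap alone could hide offsetting errors. Only the benchmark creator, who holds the plaintext, can run the plaintext side of this comparison in full.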
For a business evaluation team, the governance checklist would look something like this:
| Governance question | Why it matters |
|---|---|
| Who created the plaintext benchmark before encoding? | CRDs protect release format; they do not guarantee benchmark quality |
| Which anchor model encoded the inputs? | Anchor choice affects translation fidelity and architectural coverage |
| What plaintext subset is available for audit? | Some human inspection is needed without exposing the full benchmark |
| How is equivalence verified? | Latent inputs must preserve the original task, not create a new hidden task |
| What inversion threat model is assumed? | Irreversibility depends on attacker capability, architecture, and economics |
| How are translations calibrated for target models? | Translation drift can distort model performance |
| What compression method is used? | Compression reduces cost but may affect fidelity and security |
| Which benchmark categories are excluded or only partially supported? | Static and interactive evaluations have different compatibility profiles |
The governance burden is real. But compare it with the current alternative: public benchmarks are widely used, widely discussed, widely scraped, and then still treated as if their scores were clean reflections of unseen generalization. That is not governance. That is optimism with a spreadsheet.
What Cognaptus infers for business use
The paper directly argues for contamination-resistant benchmark release formats and outlines a mechanism based on Transformer inference state. From that, several business implications follow. These are inferences, not results experimentally demonstrated by the paper.
First, benchmark quality will increasingly depend on release engineering. A high-quality test set published as plaintext may have a shorter clean lifespan than a slightly more complex test set released through protected representations. In model evaluation, data format becomes part of validity.
Second, third-party evaluation platforms may become more valuable if they provide standardized CRD tooling. The hard parts — projection, translation, compression, calibration, and audit protocols — are exactly the kind of infrastructure individual buyers do not want to implement from scratch.
Third, model procurement should separate “benchmark score” from “benchmark contamination resistance.” A score on a public benchmark should be treated differently depending on whether the benchmark was plaintext, private, dynamic, decontaminated, or contamination-resistant. The number alone is no longer enough metadata.
Fourth, CRDs could create a more durable basis for longitudinal comparison. Dynamic benchmarks solve leakage by changing the test. That helps freshness but weakens comparability over time. A protected release format could, in principle, preserve a stable benchmark longer without handing future models the readable exam.
Fifth, CRDs may be especially relevant for regulated or high-stakes evaluation, where reproducibility and independent scrutiny matter. Private benchmarking protects data, but centralizes trust. Public CRDs could offer a middle path: open enough to reproduce, protected enough to resist direct training contamination.
The phrase “in principle” is doing honest work here. The paper provides a credible mechanism, not a procurement-ready certification regime.
Boundaries: where the idea is fragile
The paper’s limitations are not generic “more research is needed” wallpaper. They affect whether the proposal works.
Irreversibility is threat-model dependent. Prior work has shown that KV-cache inversion attacks can recover input information under some conditions. The paper notes that such attacks appear more feasible for standard multi-head attention than for grouped-query attention and related modern architectures, but that does not eliminate the risk. Defensive mechanisms such as noise, perturbation, differential privacy, compression, or obfuscation may be needed. In privacy-sensitive settings, withholding anchor-model weights or using third-party encoding services may also be appropriate.
Equivalence requires auditing. A latent benchmark that produces different behavior from the plaintext benchmark is not a protected version of the same test. It is a different test wearing a serious hat. Paired evaluation, calibration, distributional checks, and CRD/plaintext backtesting would be necessary before scores can be trusted.
Interoperability is plausible but not solved. Cross-model representation alignment is an active research area. Translation may work better among architecturally similar models and worse across larger design differences. Long generation sequences may accumulate drift. This is not a reason to dismiss the proposal; it is a reason to treat anchor-model selection and translation validation as first-class engineering tasks.
The mechanism is Transformer-specific. The paper explicitly notes that CRDs based on Transformer training/inference asymmetry do not directly apply to non-Transformer architectures such as Mamba. Even among Transformers, attention mechanisms, positional encodings, normalization choices, and other design differences may change both security and fidelity.
Interactive benchmarks remain difficult. Static input-output tasks are the natural home for CRDs. Agentic and adaptive evaluations need additional design work. Since many enterprise evaluations increasingly involve tool use and multi-step workflows, CRDs should be viewed as one layer in an evaluation stack, not the whole stack.
These boundaries do not weaken the article’s main interpretation. They sharpen it. The paper is strongest when read as a proposal to harden a specific failure point: public release of static benchmark inputs.
The real shift is from benchmark publishing to benchmark governance
The most useful business reading of this paper is not “KV caches will save benchmarks.” That version is too neat, and neatness is how technical debt enters wearing a tie.
The better reading is this: benchmark publication has become a governance problem. The industry can no longer assume that public test items remain held-out in a world of web-scale scraping, derivative datasets, continual pretraining, synthetic data generation, and model distillation. Once that assumption breaks, evaluation infrastructure has to change.
Contamination-resistant datasets offer one concrete direction. They preserve the public, reproducible spirit of benchmarking while trying to remove the most dangerous part of public release: plaintext inputs that can become training examples. The method uses a real asymmetry in Transformer computation, then faces the practical consequences through three properties: irreversibility, equivalence, and interoperability.
For AI buyers, the lesson is immediate: ask not only what benchmark a model reports, but how that benchmark was protected. For evaluation platforms, the opportunity is infrastructure: build the translation, compression, calibration, and audit layer that makes protected benchmarks usable. For researchers, the challenge is empirical: prove when CRD-style evaluation preserves task meaning, resists inversion, and compares models fairly.
The paper is not the final architecture of trustworthy LLM evaluation. It is a useful correction to a lazy assumption: that better benchmark scores can be trusted while the benchmark data itself circulates freely through the training economy.
In the next phase of AI evaluation, the cleanest test may not be the one nobody has seen. It may be the one everyone can run, but no model can conveniently memorize.
Cognaptus: Automate the Present, Incubate the Future.
[1] Ali Al-Lawati, Jason Lucas, Dongwon Lee, and Suhang Wang, “LLM Benchmark Datasets Should Be Contamination-Resistant,” arXiv:2605.19999, 2026.