TL;DR for operators
A model score is not a certificate. It is a timestamp.
That is the operational message of D. Sculley and co-authors’ position paper on GenAI evaluation.[1] Their argument is not that every static benchmark is useless, nor that competitions are magical truth machines with leaderboards attached. The argument is sharper: GenAI has broken the old bargain behind machine-learning evaluation.
Traditional benchmarks worked because the test set was supposed to estimate performance on unseen examples from the same distribution as the training data. GenAI systems are expected to do something harder: respond well to novel tasks, novel prompts, novel contexts, and sometimes novel problem types. In that world, leakage and contamination are not side issues. They are the evaluation problem wearing a polite hat.
For companies buying, comparing, or deploying GenAI systems, this changes the question from “Which model tops the benchmark?” to “How fresh, hidden, parallel, and workflow-relevant was the evaluation?” A static benchmark may still be useful for rough screening. It should not be treated as proof that a model can handle your next quarter’s customer tickets, compliance reviews, analyst workflows, code migrations, or delightful internal spreadsheet archaeology.
The practical replacement is not one benchmark. It is an evaluation process: fresh tasks, hidden test material, parallel model comparisons, frozen submissions where possible, post-deadline or future-collected data when available, and repeated meta-analysis across evaluation rounds. Less glamorous than a leaderboard screenshot. More useful, inconveniently.
The familiar benchmark story breaks at the word “novel”
A procurement team compares three GenAI vendors. Vendor A has a polished deck. Vendor B has lower latency. Vendor C waves around benchmark scores like a national flag. Someone asks the sensible question: “Which one is actually better?”
The old answer was: look at benchmark performance.
That answer made sense in classical supervised machine learning. A dataset was split into training and test portions. The test set was held out. Both sets were assumed to come from the same underlying distribution. If a model performed well on the test set, we treated that as evidence that it would perform well on future examples from the same distribution.
The paper revisits this logic carefully because the problem is not merely that GenAI models are bigger. It is that the object being evaluated has changed.
A conventional classifier might be asked to label images, transactions, or documents drawn from a reasonably defined population. A GenAI system is asked to answer questions, write code, reason through instructions, summarise messy context, follow policy, use tools, and recover gracefully from ambiguity. Its input space is effectively enormous. Its output space is worse. Its previous outputs may influence later outputs. The neat independence assumptions start looking less like a scientific foundation and more like a floor plan from a house that no longer exists.
The paper’s central move is to replace IID (independent and identically distributed) generalisation with novelty-centric generalisation. Under the IID view, a model generalises when it performs well on new samples from the same distribution. Under the novelty-centric view, a model generalises when it performs well on tasks it has not seen before, and not merely on near-neighbours of examples already absorbed during training or development.
That sounds obvious. It is also devastating.
If the real question is whether a GenAI system can handle genuinely novel tasks, then a public benchmark is already in trouble. Once the questions, answers, prompts, labels, or even similar examples have appeared online, they may become part of future training corpora, fine-tuning datasets, evaluation prompts, prompt-engineering folklore, or vendor optimisation loops. The benchmark does not need to be maliciously leaked to decay. It only needs to be useful enough for people to talk about it.
Nothing ruins a test quite like everyone studying the answer key.
Leakage is not a bug in evaluation; it is the main failure mode
The paper makes a useful distinction between a familiar worry and the more serious one.
The familiar worry is overfitting. If many researchers repeatedly optimise against the same benchmark, perhaps models become too specialised to that test set. In older machine-learning settings, this concern was real but sometimes less damaging than expected. Prior work cited in the paper found that public benchmark rankings could remain surprisingly consistent with fresh evaluation data in some contexts.
The more serious worry for GenAI is leakage. Leakage occurs when the evaluation gives a system access to information it should not have. In GenAI, contamination is a particularly important form of leakage: evaluation data, or close variants of it, appears in training data.
The paper’s rule of thumb is intentionally severe: treat a GenAI evaluation as leaked once the test data has been shared online or sent over the wire to a model provider.
This is not a claim that every provider trains on every prompt, or that every benchmark score is fraudulent. It is a statement about trust boundaries. If an evaluator cannot verify where the data went, how it was stored, whether it was logged, whether similar tasks were already present in training data, or whether future systems will be tuned against it, then the evaluation has lost part of its evidential force.
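One rough way to probe the "could it already have been seen?" question is word n-gram overlap between an evaluation item and whatever reference text you can actually inspect. The sketch below is an illustration, not something the paper prescribes: the function names, the default window of 8 words, and the idea of a locally available `corpus_docs` list are all assumptions made for the example.

```python
import re


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"\w+", text.lower())
    # An empty range (text shorter than n words) yields an empty set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(eval_item: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the eval item's n-grams that also occur in any corpus doc.

    Hypothetical helper: a high value suggests verbatim exposure; a low value
    does NOT prove the item is clean (paraphrases and translations slip by).
    """
    item_grams = ngrams(eval_item, n)
    if not item_grams:
        return 0.0
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)
```

A check like this only covers text you hold. It says nothing about a provider's training corpus or logs, which is exactly the trust-boundary problem described above.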
For enterprise use, this is the exact point where benchmark theatre begins.
A benchmark score may still be informative. It may tell you something about broad capability, tooling maturity, or the model family’s direction of travel. But it cannot bear the full weight of a deployment decision if the task requires novelty, confidentiality, or domain-specific reliability.
The sensible question is no longer:
“Did the model score well?”
It is:
“Could the model, vendor, or model-building ecosystem already have seen this task, this label, this pattern, or something close enough to matter?”
That question is more annoying. It is also more adult.
The paper’s evidence is case-based, not experimental, and that matters
This is a position paper, not a paper with a new model, benchmark table, ablation suite, or statistical experiment. Its evidence is mainly conceptual analysis, historical experience from competitions, and concrete leakage case studies.
That matters for interpretation. The paper does not prove that every AI competition is superior to every benchmark. It argues that competition structures contain mechanisms that are better suited to the core failure mode of GenAI evaluation.
A cleaner way to read the paper is as an engineering argument about evaluation infrastructure.
| Paper element | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| IID versus novelty-centric evaluation | Main conceptual mechanism | GenAI evaluation needs a stronger notion of generalisation than conventional holdout testing | That novelty can always be formally measured |
| Leakage case studies from competitions | Main evidence by failure analysis | Leakage can occur through metadata, ordering, synthetic artefacts, public descriptions, and many other channels | That competitions are automatically leak-free |
| Review of unreleased holdouts, dynamic benchmarks, and community benchmarks | Comparison with existing alternatives | Current anti-leakage approaches help, but each has practical weaknesses | That those approaches should be abandoned |
| Competition structures: hidden tests, parallel submissions, isolated execution | Proposed structural solution | Competitions can reduce leakage and improve comparability when carefully designed | That a leaderboard result directly transfers to a particular business workflow |
| Recommendations on meta-analysis | Field-level operational extension | Repeated competition evidence should be synthesised, not consumed as isolated spectacle | That meta-analysis is already common enough in AI evaluation practice |
The paper is strongest where it explains mechanisms. It is weakest if read as a universal ranking of evaluation formats. Competitions are not blessed objects. They are useful because they can enforce time, secrecy, parallelism, and novelty more naturally than static benchmark publication.
That distinction matters. Otherwise the argument collapses into “Kaggle good, benchmarks bad,” which is both lazy and spiritually LinkedIn.
Static benchmarks decay because publication changes the object being measured
A static benchmark has one great advantage: reproducibility. Everyone can run the same test. The same prompts. The same labels. The same scoring procedure. This makes comparison easy.
The problem is that GenAI evaluation often forces a trade-off between reproducibility and robustness. The more openly reproducible a benchmark becomes, the more widely its contents circulate. The more widely its contents circulate, the less confidence we should have that future model performance reflects genuine novelty-handling rather than prior exposure.
The paper frames this as a fundamental tension. A fully public static benchmark is reproducible precisely because it is available. But its availability is also what makes it vulnerable.
This creates an uncomfortable hierarchy:
- A public static benchmark is easy to compare against, but likely to decay.
- A private holdout set is harder to contaminate, but harder to audit and reproduce.
- A dynamic benchmark refreshes the test set, but costs more to maintain and may shift the target over time.
- A community benchmark gathers fresh prompts and votes, but depends on user sampling, human judgement quality, and scale.
- A well-designed competition can combine hidden evaluation, time limits, parallel comparison, and leakage-resistant data collection.
The paper does not dismiss unreleased holdout sets, dynamic benchmarks, or community evaluations. It treats them as partial solutions. Private expert-written tests reduce leakage risk but require trust in the people and systems handling them. Dynamic benchmarks use recent data, but recent public data is still public data. Community benchmarks such as head-to-head model arenas provide fresh user inputs, but they may struggle with specialised domains, costly verification, and sampling bias.
The competition format enters because it can make the evaluation event itself part of the design. Time becomes a control mechanism. Submission deadlines matter. Hidden data matters. Frozen models matter. Data that does not yet exist matters.
That is the mechanism. The leaderboard is just the visible bruise.
Competitions work because they compress comparison into one protected window
The paper defines an AI competition as a task with an objective evaluation function, multiple independent attempts, and a time-bound period. This sounds bureaucratic until you compare it with the normal research cycle.
In the usual sequential model, one team proposes a method, evaluates it, publishes, waits, and then another team responds. The process is slow, and each result may be exposed to changing baselines, changing models, and changing contamination risk.
In a competition, many teams attack the same problem at once. The comparison is parallel. The test set can remain hidden. The evaluation rules can be fixed before final scoring. In code competitions, submissions can be run in isolated backends without network access, reducing the risk that hidden test data leaks back to participants.
Parallelism solves one problem that ordinary benchmark reuse cannot: timing. If many systems are evaluated in the same protected window, their results are more directly comparable. The evaluation asks, “What could these systems do before this deadline, under these rules, on this hidden or future-collected data?” That is a narrower question than “Which model is best?” But narrower questions are often the ones that survive contact with evidence.
The paper also points to a social mechanism: competition communities scrutinise data aggressively. When leakage exists, thousands of motivated participants may find it. This is not a romantic view of crowds. It is a realistic view of incentives. If there is a leaderboard advantage hiding in timestamps, row ordering, file size, synthetic artefacts, or public summary statistics, someone will probably sniff it out. Congratulations, the foxes are now part of the security audit.
The authors list several leakage examples from competition history: class information leaking through file timestamps, label patterns hidden in row ordering, randomisation mistakes caused by reused seeds, synthetic data artefacts caused by numerical precision differences, and private test information inadvertently exposed through research-paper descriptions. These examples are not GenAI-specific. That is precisely why they are useful. They show that leakage is not a theoretical problem reserved for sloppy teams. It happens even when competent people are trying hard.
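The row-ordering case is easy to probe for before releasing a dataset. The sketch below is a minimal permutation test, offered as an illustration and not as the method any competition actually used: it asks whether neighbouring rows share a label more often than a random shuffle would produce. The function names and the default of 2000 trials are assumptions for the example.

```python
import random


def adjacent_agreement(labels: list[int]) -> float:
    """Share of neighbouring row pairs that carry the same label."""
    if len(labels) < 2:
        raise ValueError("need at least two rows")
    pairs = list(zip(labels, labels[1:]))
    return sum(a == b for a, b in pairs) / len(pairs)


def row_order_leak_pvalue(labels: list[int], trials: int = 2000, seed: int = 0) -> float:
    """Permutation p-value for 'row order carries label information'.

    Counts how often a random shuffle looks at least as ordered as the
    file as shipped. A small value means the ordering deserves a closer look.
    """
    rng = random.Random(seed)
    observed = adjacent_agreement(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if adjacent_agreement(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (trials + 1)
```

This catches only one pattern (labels clustered by position). Timestamps, file sizes, and precision artefacts each need their own probe, which is why many motivated eyes tend to find what a single checklist misses.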
For GenAI, where training data is vast, model access is often mediated through APIs, and prompts themselves may be logged or optimised against, the same problem becomes larger and less visible.
The strongest competition designs evaluate against data that did not exist
The paper’s most important practical contribution is its catalogue of leak-resistant competition structures. These are not decorative examples. They are design patterns.
Prospective ground truth
In prospective-ground-truth competitions, the test examples may exist, but the labels do not yet exist during the active modelling phase. The paper uses CAFA 5 protein function prediction as an example: competitors predicted protein functions before the relevant functional annotations had been determined, with final evaluation occurring later after new biological knowledge became available.
This is powerful because no model can memorise a label that no human has yet produced. It does not eliminate every possible shortcut, but it attacks contamination at the root.
For business evaluation, the analogue is future outcome testing. A bank could freeze fraud-risk models and evaluate them on transactions that settle after the model submission deadline. A support operation could evaluate proposed routing systems on tickets received after the evaluation design is locked. A sales organisation could test forecasting models against pipeline outcomes not yet realised. The important point is not the sector. It is the temporal firewall.
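In code, a temporal firewall can be as plain as a timestamp filter applied when the evaluation set is assembled. The sketch below assumes each case records when its input appeared and when its outcome was recorded; the `Case` fields, the `eligible_cases` name, and the `require_new_inputs` flag are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Case:
    case_id: str
    created_at: datetime   # when the input (ticket, transaction) first existed
    labelled_at: datetime  # when the outcome or label was recorded


def eligible_cases(
    cases: list[Case],
    freeze_time: datetime,
    require_new_inputs: bool = False,
) -> list[Case]:
    """Keep only cases whose ground truth postdates the submission freeze.

    With require_new_inputs=True the inputs must postdate the freeze as well,
    a stricter variant. Use timezone-aware datetimes throughout; Python
    refuses to compare aware and naive values.
    """
    kept = []
    for case in cases:
        if case.labelled_at <= freeze_time:
            continue  # the label already existed at freeze time
        if require_new_inputs and case.created_at <= freeze_time:
            continue  # the input already existed at freeze time
        kept.append(case)
    return kept


# Usage with made-up dates: only outcomes recorded after the freeze count.
freeze = datetime(2025, 1, 1, tzinfo=timezone.utc)
```

The filter is trivial on purpose. The hard part is organisational: recording trustworthy timestamps and refusing to move `freeze` after the results come in.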
Novel task generation
Another strategy is to create entirely new tasks for evaluation. The paper points to AI Mathematical Olympiad-style challenges, where fresh math problems are written specifically for the competition. This matters because internet-scale training makes old problem sets suspicious. A model that solves a known contest problem may be reasoning, remembering, pattern-matching, or reciting from contaminated training data. The score alone cannot disambiguate.
Novel task generation is expensive. Experts must design tasks that are genuinely new, valid, solvable, and scorable. But for high-stakes evaluation, expense is not the objection people think it is. The alternative is cheaper uncertainty with a nicer dashboard.
Enterprise teams can borrow the same principle without staging a public contest. Create fresh cases from recent internal work. Have domain experts write new policy interpretation tasks. Build synthetic-but-audited scenarios that differ structurally from training examples. Use adversarial internal prompts that reflect edge cases, not demo-room niceties.
Post-deadline data collection
In post-deadline evaluation, submissions are frozen first and tested later on newly collected data. The paper discusses examples including a multilingual chatbot preference prediction competition evaluated on conversations collected after the active training phase, and the Konwinski Prize, which freezes submitted models and evaluates them later on fresh GitHub issues.
This is one of the cleanest lessons for operators. If you want to know whether a GenAI coding agent can handle future engineering work, do not only test it on famous repository issues. Freeze the system. Wait. Evaluate on new issues. Annoying? Yes. More probative? Also yes.
A static benchmark asks whether the model can solve yesterday’s known puzzles. Post-deadline testing asks whether it can survive tomorrow’s unknown workload.
The second question is the one the invoice cares about.
What businesses should copy from competitions
Companies do not need to turn every vendor selection into a public Kaggle contest. Please do not make procurement more theatrical than it already is.
They should, however, copy the structural logic.
| Competition principle | Enterprise translation | Why it matters |
|---|---|---|
| Hidden test data | Keep evaluation prompts, documents, labels, and scoring rubrics confidential | Reduces optimisation against the test rather than the task |
| Time-bound submissions | Freeze model versions, prompts, tools, and settings before final evaluation | Prevents moving-target comparisons |
| Parallel evaluation | Test competing models under the same task window and rules | Makes results more comparable |
| Prospective or future data | Evaluate on outcomes or cases unavailable during system design | Tests novelty rather than memory |
| Isolated execution where possible | Restrict network access, logging, and external calls during sensitive tests | Reduces data leakage and uncontrolled tool use |
| Shared post-mortems | Record failures, edge cases, and scoring disputes | Converts evaluation into organisational learning |
| Meta-analysis across rounds | Compare patterns over repeated evaluations, not one leaderboard | Distinguishes durable capability from lucky task fit |
This reframes AI evaluation as governance infrastructure. It is not something to perform after the vendor demo. It is something to design before the demo.
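Freezing model versions, prompts, tools, and settings is easier to enforce when the frozen configuration has a fingerprint that can be checked at scoring time. The following is a minimal sketch using a SHA-256 digest over canonical JSON; the function names and the example configuration keys are assumptions, and it presumes the configuration is JSON-serialisable.

```python
import hashlib
import json


def submission_fingerprint(config: dict) -> str:
    """Stable SHA-256 hex digest of a frozen submission's configuration.

    Keys are sorted and whitespace is fixed so that the same settings always
    produce the same digest, whatever order they were written in.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_frozen(config: dict, recorded_fingerprint: str) -> bool:
    """True if the config presented at scoring time matches the frozen one."""
    return submission_fingerprint(config) == recorded_fingerprint


# Hypothetical submission: record the digest at the deadline, check it later.
frozen = {
    "model_version": "vendor-model-2025-01",  # placeholder identifier
    "system_prompt": "You are a support triage assistant.",
    "tools": ["search_kb"],
    "temperature": 0.0,
}
recorded = submission_fingerprint(frozen)
```

A digest proves only that the declared configuration did not change. It cannot show that a hosted model behind the same version string stayed the same, which is one reason to ask vendors whether the model itself can be frozen.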
A mature evaluation process would ask vendors to run locked versions of their systems on fresh, role-specific tasks under controlled conditions. It would separate general capability tests from workflow-specific tests. It would preserve enough examples for internal reproducibility while protecting the final evaluation set. It would repeat the process as models change. It would treat a surprising win as a hypothesis requiring follow-up, not as a coronation.
This is not anti-benchmark. It is anti-benchmark idolatry.
What the paper shows, what Cognaptus infers, and what remains uncertain
The paper argues directly that traditional benchmark assumptions are poorly aligned with GenAI’s novelty demands; that leakage and contamination are unusually severe threats; and that competition practices provide several concrete mechanisms for reducing those threats.
Cognaptus infers three business lessons.
First, public leaderboard scores are useful as weak signals, not strong guarantees. They can help narrow a market scan. They should not decide a deployment.
Second, internal evaluation should be treated as an ongoing process, not a one-off model bake-off. The paper’s recommendation to value repeatable procedures over static reproducibility maps neatly to enterprise governance. A company should be able to repeat the evaluation process, even if the exact hidden test set changes.
Third, procurement teams should ask vendors uncomfortable but necessary questions: What data might your model have seen? Are our evaluation prompts logged? Can we opt out of training? Can the model be frozen? Can tool calls be controlled? Can you run in an isolated environment? What happens when we test on future cases rather than public examples?
What remains uncertain is equally important.
Competitions improve empirical rigour, but they do not automatically establish ecological validity. A competition task may be well designed and still differ from a company’s real workflows. Objective metrics may be clean and still miss user trust, legal risk, escalation quality, or cost-to-serve. Human preference evaluation may capture broad appeal while failing at specialist accuracy. Domain experts can write fresh tasks, but fresh does not always mean representative.
The paper acknowledges this through its call for meta-analysis, including analysis of how competition problems connect to real-world performance. That is the right instinct. One competition result is evidence. A stream of well-designed competitions, analysed over time, becomes knowledge.
The enterprise equivalent is also meta-analysis: not “Model X won our March test,” but “Across six quarterly evaluations, Model X improved in extraction reliability, regressed in policy adherence, remained brittle on ambiguous customer intent, and became cheaper per resolved case.” That is the kind of sentence an operating team can use.
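A first pass at that kind of cross-round summary needs nothing more than per-metric averages and trends. The sketch below assumes each round is stored as a dictionary of metric scores in chronological order; the function name, the metric names, and the numbers are invented for illustration and are not results from the paper or from any real evaluation.

```python
from statistics import mean


def summarise_rounds(rounds: list[dict[str, float]]) -> dict[str, dict[str, float]]:
    """Per-metric mean and least-squares slope across ordered evaluation rounds.

    A positive slope means the metric rose from round to round on average.
    Metrics missing from a round are skipped for that round.
    """
    summary: dict[str, dict[str, float]] = {}
    metrics = set().union(*(r.keys() for r in rounds))
    for metric in sorted(metrics):
        points = [(i, r[metric]) for i, r in enumerate(rounds) if metric in r]
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        x_bar, y_bar = mean(xs), mean(ys)
        denom = sum((x - x_bar) ** 2 for x in xs)
        slope = (
            sum((x - x_bar) * (y - y_bar) for x, y in points) / denom
            if denom
            else 0.0  # a single observation has no trend
        )
        summary[metric] = {"mean": y_bar, "slope_per_round": slope}
    return summary


# Made-up scores for three quarterly rounds of the same frozen-process evaluation.
quarterly = [
    {"extraction": 0.70, "policy_adherence": 0.90},
    {"extraction": 0.75, "policy_adherence": 0.85},
    {"extraction": 0.80, "policy_adherence": 0.80},
]
report = summarise_rounds(quarterly)
```

With a handful of rounds a slope is a prompt for questions, not a statistical finding. Task mix and scoring rubrics change between rounds, so the numbers still need the post-mortems beside them.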
The boundary: competitions are a standard of rigour, not a substitute for judgement
The paper’s strongest phrase is that AI competitions should be treated as a gold standard for empirical rigour in GenAI evaluation. That should be read carefully.
Gold standard does not mean universal standard. Some domains cannot easily expose tasks publicly. Some evaluation outcomes take months or years to observe. Some workflows require subjective judgement, human trust, regulatory interpretation, or organisational context that objective leaderboards cannot fully encode. Some high-value enterprise tasks are too idiosyncratic for a general competition to approximate well.
There is also an institutional boundary. The authors draw heavily on experience from Kaggle and related competition ecosystems, and Kaggle’s history gives the argument credibility. It also means readers should remember the perspective: this is a defence of a competition-centred evaluation culture by people who understand that culture from the inside. That does not invalidate the argument. It does mean the reader should inspect where the mechanism transfers and where it does not.
For most businesses, the right conclusion is not “trust competitions.” It is “steal the parts of competitions that make cheating harder, comparison fairer, and novelty more real.”
That is a useful standard because it is operational. You can ask whether your evaluation has hidden data. You can ask whether tasks are fresh. You can ask whether submissions are frozen. You can ask whether models are compared in parallel. You can ask whether the final cases came into existence only after the model was tuned. You can ask whether repeated rounds tell the same story.
If the answer to most of these is no, then the benchmark may still be interesting. It is just not as evidentially heavy as the slide deck suggests.
The benchmark is dead. Long live the evaluation process.
The paper lands at an uncomfortable but productive conclusion: GenAI evaluation cannot rely on static benchmarks in the same way classical ML did. The old benchmark was a reusable object. The new benchmark needs to be closer to a repeatable process.
That change is annoying. It raises costs. It complicates reproducibility. It makes evaluation feel less like checking a score and more like running an operational discipline.
Good.
GenAI systems are already being asked to operate in fluid, high-context, high-ambiguity environments. A test that cannot survive public exposure, prompt logging, benchmark memorisation, or vendor optimisation should not be mistaken for a durable proof of intelligence. It is a snapshot, and sometimes a flattering one.
AI competitions matter because they show how evaluation can be redesigned around the real threat: not merely that models overfit, but that the test itself becomes part of the training ecosystem. Time limits, hidden data, parallel attempts, future labels, novel tasks, isolated execution, and post-deadline data collection are not leaderboard decorations. They are anti-contamination machinery.
For operators, the lesson is blunt. Stop asking whether a model has passed the benchmark. Ask whether the benchmark had any realistic chance of surprising the model.
That is where evaluation starts to become useful again.
Cognaptus: Automate the Present, Incubate the Future.
[1] D. Sculley, Will Cukierski, Phil Culliton, Sohier Dane, Maggie Demkin, Ryan Holbrook, Addison Howard, Paul Mooney, Walter Reade, Megan Risdal, and Nate Keating, “Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation,” arXiv:2505.00612, 2025, https://arxiv.org/abs/2505.00612.