The Sealed Score: Why AI Evaluation Needs an Exam Day

A leaderboard score is useful until everyone starts treating it as a target.

That is the uncomfortable business problem behind “LLM Olympiad: Why Model Evaluation Needs a Sealed Exam.”¹ The paper is not arguing that benchmarks are useless. That would be theatrical, and not especially true. It argues something sharper: in the LLM era, a benchmark score is only as credible as the procedure that produced it.

That sounds obvious. It is not how the industry usually behaves.

Model cards, vendor pages, funding decks, product comparisons, and procurement memos still rely heavily on benchmark rankings. A few points on a public leaderboard can become evidence of “frontier capability,” “enterprise readiness,” or “reasoning superiority.” Very elegant. Also very fragile. A model may perform well because it is broadly capable. It may also perform well because the test was known, the prompt was tuned, the aggregation favored it, or the provider quietly tried many variants and presented the prettiest one. The leaderboard does not always tell us which story we are reading.

Cruz and Aji’s proposal is simple enough to explain in one sentence: evaluate LLMs like an Olympiad exam, where the rules are public, the exact problems are sealed, submissions are frozen, evaluation is centrally run, and the full task set and scoring machinery are released afterward.

The important word is not “sealed.” It is “afterward.” This is not a proposal for another permanently private benchmark. It is a proposal for temporary secrecy during measurement, followed by transparency after scoring. In other words, the benchmark is not just a dataset. It is a controlled event.

That mechanism is the article’s center of gravity.

The benchmark is the procedure, not the file

Most benchmark debates still talk as if the benchmark were mainly a set of questions and labels. The paper asks us to look at the surrounding machinery: who knows the tasks, when they know them, who runs the submissions, which settings are fixed, what gets logged, and what is released for audit.

That shift matters because modern LLM evaluation has too many movable parts. In a classic supervised benchmark, one might worry about the dataset and the metric. With LLMs, the evaluation also depends on prompting conventions, demonstration order, decoding settings, output parsing, tool access, retry policy, context limits, aggregation choices, and sometimes the hidden behavior of an API endpoint. A score becomes less like a measurement and more like a negotiated product of many small decisions.

The LLM Olympiad proposal tries to lock down four parts of that process:

Mechanism	What it reduces	What it does not magically solve
Sealed tasks	Benchmark chasing, direct task targeting, some contamination pathways	All possible overlap with training data or task leakage
Frozen submissions	Last-minute tuning after task reveal, endpoint drift during scoring	Private pre-submission experimentation across many model variants
Centralized harness	Prompting and execution differences across teams	Bad harness design, scoring bugs, weak aggregation choices
Post-hoc release	Opaque trust in a private evaluator	Future contamination after released tasks enter training corpora

Read that table carefully. The paper’s proposal is not a purity ritual. It is not claiming to make evaluation incorruptible. It is trying to turn several invisible risks into visible, governable ones.

That is the useful part.

A sealed test without a common harness still leaves room for evaluation plumbing to decide the winner. A common harness without sealed tasks still allows the field to optimize against a known target. A hidden benchmark without later release creates a trust bottleneck: useful for the examiner, less useful for the community. The Olympiad idea works only because the mechanisms are coupled.

Why ordinary leaderboards are becoming low-assurance evidence

The paper frames today’s evaluation problem through three failure modes: fragility, contamination, and incentive misalignment. These are not three separate complaints. They are three ways a leaderboard score can stop meaning what buyers, researchers, and regulators want it to mean.

First, rankings are fragile. The paper cites prior work showing that small choices in evaluation design can change which system looks best. For LLMs, this fragility is amplified by prompt-based evaluation. Demonstration order, output formatting, parsing rules, retry behavior, and aggregation can all matter. The business translation is blunt: a vendor’s benchmark advantage may reflect capability, but it may also reflect better knowledge of the scoring ritual. One is product quality. The other is exam technique.

Second, contamination is no longer a side issue. Models are trained on web-scale corpora, and popular benchmarks are public, copied, discussed, mirrored, and embedded in tutorials. Even when exact test items are not present, close variants may be. The paper cites examples where cleaned or newly created versions of math benchmarks changed apparent model performance, including reported drops on fresh grade-school arithmetic variants. The precise numbers matter less than the direction of the warning: when the test is part of the world the model has consumed, generalization claims become harder to interpret.

Third, leaderboard incentives reward selective disclosure. If a team can test many private variants and publish only the strongest one, the public score is not a clean sample of model quality. It is the winner of an internal tournament. That may be rational behavior. It is also a poor foundation for high-assurance comparison.

These three problems reinforce each other. Public tasks invite targeting. Targeting encourages prompt and pipeline optimization. Private iteration turns that optimization into a selection game. The final score then arrives looking objective, because numbers have excellent manners.
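The selection game is easy to see in a toy simulation. The sketch below is not from the paper: it assumes every private variant has the same true accuracy, so any gap between variants is sampling noise, and it reports only the best score. The function names, the 70% accuracy, and the 200-item test size are hypothetical choices for illustration.

```python
import random
import statistics


def reported_score(true_accuracy: float, n_items: int, n_variants: int,
                   rng: random.Random) -> float:
    """Score n_variants on n_items and report only the best result.

    Assumption: all variants share the SAME true accuracy, so the
    differences between their scores are pure sampling noise.
    """
    scores = []
    for _ in range(n_variants):
        correct = sum(rng.random() < true_accuracy for _ in range(n_items))
        scores.append(correct / n_items)
    return max(scores)


def mean_reported(true_accuracy: float, n_items: int, n_variants: int,
                  trials: int = 2000, seed: int = 0) -> float:
    """Average the published (best-of-n) score over many simulated teams."""
    rng = random.Random(seed)
    return statistics.mean(
        reported_score(true_accuracy, n_items, n_variants, rng)
        for _ in range(trials)
    )


if __name__ == "__main__":
    # Compare publishing one run against publishing the best of several.
    for k in (1, 5, 20):
        print(k, round(mean_reported(0.70, 200, k), 3))
```

With one variant, the average published score sits near the true accuracy. As the number of private variants grows, the published score drifts upward even though no variant is actually better. That upward drift is the internal tournament showing through.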

The Olympiad protocol responds by changing the timing and control points of evaluation. Participants know the rules, budgets, tracks, and interfaces. They do not know the exact tasks. They commit submissions before scoring. The organizers run the evaluation under one harness. After scoring, the artifacts are released so the community can inspect what actually happened.

The goal is not to remove all uncertainty. The goal is to make high scores harder to manufacture and easier to interrogate.
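The "commit before scoring" step can be sketched as a fingerprint registry. This is a minimal illustration, assuming a submission can be serialized to bytes and that a SHA-256 digest is an acceptable fingerprint. The paper does not prescribe this mechanism, and `FreezeRegistry` and its methods are hypothetical names.

```python
import hashlib
import hmac
from typing import Dict


def fingerprint(artifact: bytes) -> str:
    """Return a SHA-256 hex digest used as the public commitment."""
    return hashlib.sha256(artifact).hexdigest()


class FreezeRegistry:
    """Hypothetical organizer-side record of committed submissions."""

    def __init__(self) -> None:
        self._commitments: Dict[str, str] = {}
        self._frozen = False

    def commit(self, team: str, artifact: bytes) -> str:
        """Register (or replace) a team's submission while the window is open."""
        if self._frozen:
            raise RuntimeError("submission window is closed")
        self._commitments[team] = fingerprint(artifact)
        return self._commitments[team]

    def freeze(self) -> None:
        """Hard freeze: no further commits are accepted."""
        self._frozen = True

    def verify(self, team: str, artifact: bytes) -> bool:
        """Check that the artifact being scored matches the commitment."""
        expected = self._commitments.get(team)
        if expected is None:
            return False
        return hmac.compare_digest(expected, fingerprint(artifact))
```

A digest only shows that the scored artifact is the committed one. It says nothing about a closed endpoint, where the organizer never holds the artifact at all.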

The useful trick: surprise in content, predictability in rules

A bad sealed exam would be a gotcha contest: arbitrary tasks, unclear scoring, mystery constraints, and a result that says more about organizer taste than model capability. The paper knows this, which is why its protocol separates task secrecy from rule secrecy.

The rules should be public. The task content should not be.

That distinction is easy to miss, but it is the core design principle. The paper recommends publishing a contest syllabus in advance: accepted interfaces, input and output formats, context and output limits, latency budgets, tool rules, submission contracts, scoring policies, aggregation rules, and release commitments. If tools are allowed, they should go through organizer-controlled proxies with budgets and logs. If outputs must be structured, schemas should be defined. If stochastic decoding is allowed, the number of runs and aggregation rule should be specified. If timeouts occur, the scoring policy should already exist.

This is not bureaucracy for its own sake. It is how the event avoids confusing “unseen tasks” with “unclear evaluation.”
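One way to picture a published syllabus is as a small, immutable configuration that the scoring code must obey. The sketch below is an assumption-laden illustration, not the paper's schema: the field names, the example limits, and the choice to score a timeout as zero are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass(frozen=True)
class Syllabus:
    """Hypothetical pre-published rules; frozen so they cannot change mid-event."""
    max_context_tokens: int
    max_output_tokens: int
    latency_budget_s: float
    runs_per_item: int
    timeout_score: float = 0.0  # assumption: a timed-out run counts as a failure


def item_score(run_scores: Sequence[Optional[float]], syllabus: Syllabus) -> float:
    """Aggregate the per-run scores for one item; None marks a timed-out run."""
    if len(run_scores) != syllabus.runs_per_item:
        raise ValueError(
            f"expected {syllabus.runs_per_item} runs, got {len(run_scores)}"
        )
    filled = [syllabus.timeout_score if s is None else s for s in run_scores]
    return mean(filled)


# Illustrative limits only; none of these numbers come from the paper.
example = Syllabus(max_context_tokens=8192, max_output_tokens=1024,
                   latency_budget_s=30.0, runs_per_item=3)
```

Under these assumed rules, `item_score([1.0, None, 1.0], example)` averages two successes and one timeout to two thirds. The point is that the timeout policy was written down before anyone timed out.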

A useful comparison is procurement. A company should not ask vendors to perform an unknown job under unknown rules and then call the result objective. But it can ask vendors to operate under a published interface and budget against fresh, unseen cases. That is closer to how real deployment uncertainty works. You know your product constraints. You do not know tomorrow’s exact user request.

The paper’s distinction between a model track and a system track is especially important for business readers. A raw model evaluation asks what the model can do with minimal external machinery. A system evaluation asks what the full orchestration layer can do, including retrieval, tools, routing, formatting, and policy enforcement. Collapsing these into one leaderboard is a small act of measurement vandalism. A tool-using agent and a base model are not the same object.

For enterprise adoption, the system track may eventually matter more. Most business value does not come from a model answering trivia in isolation. It comes from a model embedded in workflows: reading documents, calling tools, producing structured outputs, refusing unsafe actions, and remaining stable under messy input. The paper’s controlled harness and logged tool calls point toward a more realistic evaluation of those systems.

The figures and appendices are implementation scaffolding, not empirical proof

The paper is easy to overread because it includes several concrete examples: workflow diagrams, syllabus excerpts, consistency probes, submission cards, task templates, and a pilot plan. These are helpful. They are not experimental results.

That distinction matters for a Cognaptus article because business readers often ask, “So, did the method work?” Here, the honest answer is: the paper proposes the method and designs the operating protocol; it does not report a completed LLM Olympiad.

Paper element	Likely purpose	How to use it	What it does not prove
Comparison of open benchmarks, closed benchmarks, shared tasks, and Olympiad-style evaluation	Conceptual framing	Clarifies which tradeoffs the protocol tries to combine	That the Olympiad will outperform alternatives in practice
End-to-end workflow figure	Implementation overview	Shows the sequence from task solicitation to post-hoc release	That the workflow is operationally cheap or easy
Example syllabus excerpt	Rule-design illustration	Demonstrates how surprise can coexist with predictable constraints	That those exact limits are optimal
Consistency probe example	Diagnostic design	Shows how duplicated or perturbed items can flag instability	That consistency probes alone validate capability
Submission cards and task proposal examples	Operational templates	Clarify what participants and task authors would submit	That the proposed tasks are the final task set
Pilot plan	Feasibility roadmap	Defines a small first version with limited tracks and scale	That the event has already been run successfully

The proposed pilot is modest by design: two initial tracks, open-weights and closed-weights model submissions; optionally a non-competitive system track; 5–10 tasks; 100–500 instances per task; roughly 1,000–5,000 total instances; and a small number of consistency probes. The paper also proposes a practical timeline: task call, curation, submission window, hard freeze, evaluation period, results release, and post-hoc publication of tasks and code.

This is a sensible posture. Start small, reduce disputes, prove the harness works, then expand. Very unfashionable. Therefore probably wise.
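The consistency probes mentioned above can be sketched in a few lines. This is a simplified reading of the idea, assuming each probe group contains duplicated or lightly perturbed items that should all receive the same answer. The normalization rule and function names are hypothetical, and the paper's own probe design may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting does not count."""
    return " ".join(answer.lower().split())


def probe_consistency(results: Iterable[Tuple[str, str]]) -> Dict[str, bool]:
    """Map each probe group to True if all of its variants got the same answer.

    `results` holds (group_id, model_answer) pairs, one per probe variant.
    """
    groups: Dict[str, Set[str]] = defaultdict(set)
    for group_id, answer in results:
        groups[group_id].add(normalize(answer))
    return {gid: len(answers) == 1 for gid, answers in groups.items()}


def consistency_rate(results: Iterable[Tuple[str, str]]) -> float:
    """Fraction of probe groups answered consistently."""
    flags = probe_consistency(results)
    if not flags:
        raise ValueError("no probe results supplied")
    return sum(flags.values()) / len(flags)
```

A low rate flags instability. A high rate does not certify capability, since a model can be consistently wrong.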

The business value is assurance, not a prettier leaderboard

For enterprises, the paper’s value is not that it gives another leaderboard to quote. The industry already has enough league tables to keep marketing departments hydrated. The value is that it describes a higher-assurance evidence layer.

A normal benchmark score answers a limited question: how did this model perform on this known evaluation under some set of conditions? A sealed, frozen, centrally run event can answer a stronger question: how did this committed model or system perform on fresh tasks under disclosed constraints, using a common harness, with artifacts available for later inspection?

That difference matters in at least four business workflows.

Procurement: separate capability evidence from vendor storytelling

AI procurement often begins with claims that sound precise but are hard to audit: best on reasoning, strongest on coding, superior for enterprise use, safest in class. The buyer then receives benchmark charts, selective demos, and a cloud of adjectives. Some adjectives are expensive.

An Olympiad-style evaluation would not replace internal testing. It would give procurement teams a cleaner external signal. A result from a sealed event would still need interpretation by domain, cost, latency, privacy constraints, and integration requirements. But it would be harder to dismiss as pure leaderboard theater.

The procurement question changes from “Which vendor has the nicest chart?” to “Which system performs under a controlled surprise test, and under what budget?” That is progress. Not glamorous progress, but procurement rarely is.

Model governance: turn evaluation into an audit trail

The paper’s post-hoc release requirement is central for governance. Tasks, scoring scripts, harness details, run manifests, budgets, failure counts, and logs create a trail of evidence. That makes results more useful for internal audit, model risk committees, regulatory explanations, and vendor due diligence.

A governance team does not only need a score. It needs to know how the score was produced. Was the endpoint frozen? Were retries allowed? Were timeouts counted? Were tool calls logged? Were task families mixed? Were per-task results disclosed? Were closed-weight systems reported separately from open-weight systems?

Those questions are boring only until something fails in production. Then they become the entire meeting.

System evaluation: measure orchestration under constraints

Many enterprise AI failures are not base-model failures. They are system failures: retrieval pulls the wrong document, a tool call escapes its budget, a prompt injection changes behavior, JSON output breaks a downstream workflow, or the agent becomes unstable under minor input variation.

The paper’s system track is valuable because it treats tools and orchestration as part of the evaluated object. Tool access should be budgeted, proxied, and logged. Structured outputs should be validated. Failures should be counted. If a system succeeds only because it silently calls external resources outside the rules, that is not intelligence. That is a compliance incident wearing a lab coat.

For companies building agentic workflows, this suggests a practical evaluation architecture: separate model capability tests from system capability tests, and do not compare the two on the same leaderboard. A raw model, a retrieval-augmented system, and an autonomous agent should not be thrown into one ranking and asked to behave like comparable products.
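"Budgeted, proxied, and logged" can be made concrete with a small wrapper. The sketch below assumes tools are plain Python callables and that the budget is a simple call count. `ToolProxy` and `BudgetExceeded` are hypothetical names, and a real organizer proxy would also need to meter time, tokens, and network access.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


class BudgetExceeded(Exception):
    """Raised when a system asks for more tool calls than the rules allow."""


@dataclass
class ToolProxy:
    """Hypothetical organizer-controlled gateway for all tool access."""
    tools: Dict[str, Callable[..., Any]]  # allowlist of permitted tools
    max_calls: int                        # assumption: budget is a call count
    calls_made: int = 0
    log: List[Dict[str, Any]] = field(default_factory=list)

    def _record(self, name: str, args: tuple, status: str) -> None:
        self.log.append({"tool": name, "args": args, "status": status})

    def call(self, name: str, *args: Any) -> Any:
        """Run an allowed tool within budget; log every attempt, allowed or not."""
        if name not in self.tools:
            self._record(name, args, "rejected: tool not allowed")
            raise PermissionError(f"tool {name!r} is not on the allowlist")
        if self.calls_made >= self.max_calls:
            self._record(name, args, "rejected: budget exhausted")
            raise BudgetExceeded(f"budget of {self.max_calls} calls used up")
        self.calls_made += 1
        try:
            result = self.tools[name](*args)
        except Exception:
            self._record(name, args, "error")
            raise
        self._record(name, args, "ok")
        return result
```

Because rejected attempts are logged too, the record shows what a system did and also what it tried to do.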

Internal benchmarking: stop rewarding local overfitting

The paper is about community evaluation, but the logic applies inside firms. Many companies now maintain internal prompt suites and evaluation datasets. Over time, teams learn the suite. Prompts get tuned. Edge cases become familiar. The internal benchmark starts as a diagnostic and slowly becomes a training target.

A small internal Olympiad pattern could help: periodically create sealed tasks, freeze candidate systems, run them through one harness, and release the tasks afterward for learning. This would not replace regression tests. It would complement them. Regression tests ask whether the system still handles known cases. Sealed evaluations ask whether the system can handle new cases under the same rules.

That is a useful distinction. Regression protects yesterday. Sealed evaluation samples tomorrow.

Where the proposal is strongest—and where it is still vulnerable

The strongest part of the paper is its procedural realism. It does not pretend that evaluation credibility comes from a magic dataset. It comes from role separation, submission contracts, controlled harnesses, release bundles, task provenance, change-control policies, and explicit reporting. In short, evaluation becomes infrastructure.

The paper is also honest about the remaining risks.

Contamination can still happen. Sealed tasks reduce direct pre-event exposure, but they cannot guarantee zero overlap in a world of web-scale training data, synthetic data generation, leaks, paraphrases, and unknown data pipelines. Once tasks are released, they may also contaminate future models. The correct business interpretation is therefore “higher assurance than ordinary public leaderboards,” not “proof of no contamination.”

Closed endpoints remain lower assurance than organizer-run open artifacts. If a model is accessed through an API, organizers can commit to a version window and log behavior, but they cannot inspect weights or fully control backend changes. The paper’s suggestion to report open-weights and closed-weights submissions as separate assurance tiers is not a detail. It is essential.

The harness can be wrong. Centralizing evaluation removes participant-side variability but concentrates responsibility in the organizer’s machinery. A bad prompt template, flawed parser, poor timeout policy, or scoring bug can bias every result. That is why preregistration, public fingerprints, change control, and full re-runs after scoring bugs are not administrative decorations. They are the price of centralized authority.

Task quality is the hardest bottleneck. The proposal depends on high-quality, fresh, diverse, scorable tasks. Those are expensive to create. Task authors need incentives, recognition, conflict-of-interest rules, and possibly professional reward. Without that, the event risks becoming either too small to matter or too shallow to diagnose real capability.

Finally, access and fairness are not solved by good intentions. A sealed evaluation can privilege well-resourced participants if compute requirements, infrastructure, or submission packaging are too demanding. Budget classes, separate tracks, academic subsidies, and low-budget baselines are not optional if the event wants credibility beyond frontier labs.

What Cognaptus would infer for business practice

The paper directly proposes a protocol. It does not prove that the protocol will become the dominant evaluation standard. It does not provide a completed event with empirical rankings. It does not say enterprises should stop running their own domain tests. Good. That would be too convenient.

The business inference is more practical:

What the paper supports	Business interpretation	Boundary
Public benchmarks are exposed to fragility, contamination, and selective optimization	Treat leaderboard scores as low-assurance evidence unless procedure details are clear	Some public benchmarks remain useful for development, diagnosis, and comparability
Sealed tasks plus frozen submissions reduce targeting	Fresh, one-shot evaluations can better approximate preparedness	They do not prove universal capability or zero contamination
Centralized harnesses improve comparability	Vendor comparisons should control execution settings, budgets, parsing, retries, and tool use	Bad harnesses can create systematic error
Post-hoc release improves auditability	Procurement and governance teams should prefer results with reproducible artifacts and run manifests	Proprietary endpoints may only allow partial logs
Separate model and system tracks improve interpretation	Do not compare raw models, RAG systems, and agents as if they were identical products	Track design must match the deployment question

For Cognaptus-style automation work, the most important lesson is not “wait for an Olympiad.” It is to design evaluation pipelines with the same philosophy: freeze what is being tested, hide some fresh cases until evaluation, centralize the harness, log tool use, separate model and system results, and release enough artifacts internally for audit.

That is less exciting than a benchmark trophy. It is also more useful when money, workflow reliability, and accountability are involved.

The exam-day metaphor works because it limits both sides

The paper’s metaphor is effective because an exam constrains everyone.

It constrains participants: they cannot know the exact questions, tune against them, or change answers afterward.

It constrains organizers: they must publish rules, protect task secrecy, run a fair process, handle disputes, and release enough evidence for inspection.

It constrains readers: they should not treat a single number as the whole truth. A score is a summary of performance under a specific event design, not a certificate of general intelligence.

That last constraint is easy to forget. The Olympiad protocol would not end benchmark politics. It would create a cleaner arena for them. It would make certain games harder, certain claims easier to inspect, and certain procurement decisions less dependent on selective screenshots.

In AI evaluation, that counts as progress.

The field does not need fewer benchmarks. It needs more disciplined evidence. A sealed, frozen, centrally run, publicly auditable exam is one way to produce it.

Not because the exam is perfect. Because at least everyone knows when the test begins.

Cognaptus: Automate the Present, Incubate the Future.

¹ Jan Christian Blaise Cruz and Alham Fikri Aji, “LLM Olympiad: Why Model Evaluation Needs a Sealed Exam,” arXiv:2603.23292, 2026, https://arxiv.org/abs/2603.23292.
