Judge, Jury, and Benchmark: Why LLM Evaluation Needs Fresh Cases, Not Bigger Leaderboards

The procurement meeting is where public leaderboards go to look useful

Benchmark scores are comforting because they compress chaos into a number. One model is 87.3, another is 84.9, and suddenly the procurement meeting has the emotional texture of financial discipline. Very mature. Very measurable. Also, very possibly irrelevant.

The problem is simple. A company rarely wants “the best model on average”. It wants the best model for contract review, support triage, clinical note summarisation, SQL repair, claims handling, product search, or whatever unglamorous workflow actually pays the cloud bill. Public benchmarks are often too generic for that decision. Worse, the benchmark items may already be floating inside model training data, turning evaluation into a memory test with better typography.

The paper behind CoEval takes this procurement problem seriously: how do you rank language models for a custom task when you have no labelled data and no trustworthy benchmark?¹ Its answer is not “ask GPT-4 to judge everything”. That would be the industry’s favourite reflex: when uncertain, summon a stronger oracle and pretend dependency is methodology.

CoEval’s answer is more operational. It builds a fresh benchmark from the task description, makes candidate models answer it, scores them through a cross-family judge panel, and then uses the panel’s own behaviour to weight both judges and questions. The central idea is not merely automation. It is leakage control.

CoEval is an evaluation pipeline, not a smarter scoreboard

CoEval separates the evaluation job into three model roles (teacher, student, and judge), followed by an aggregation step:

Task description
      ↓
Teacher models: generate attributes, rubrics, and fresh benchmark items
      ↓
Student models: answer the generated items
      ↓
Judge models: score responses through a cross-family panel
      ↓
Aggregation: weight judges by agreement and items by discrimination
      ↓
Task-specific model ranking

This role separation matters because each stage closes a different failure channel.

The teacher stage creates the evaluation set. Instead of drawing from a static public benchmark, CoEval defines target attributes such as topic, difficulty, reasoning type, input length, severity, mechanism, or patient context. It then stratifies generation across those combinations. In plain business language: the framework forces the test to cover the edge cases you care about instead of letting the average case quietly eat the budget.
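A minimal sketch of what stratifying across attribute combinations could look like, assuming an even split of the item budget. The attribute names, their values, and the `plan_strata` helper are hypothetical; the paper's actual sampling procedure may differ.

```python
from itertools import product

def plan_strata(attributes, total_items):
    """Spread an item budget as evenly as possible over all attribute combinations."""
    names = list(attributes)
    combos = [dict(zip(names, values))
              for values in product(*(attributes[n] for n in names))]
    base, extra = divmod(total_items, len(combos))
    # The first `extra` strata get one additional item so counts sum to the budget.
    return [(combo, base + (1 if i < extra else 0)) for i, combo in enumerate(combos)]

# Hypothetical attributes for a drug-interaction task (illustrative values only).
attributes = {
    "severity": ["minor", "moderate", "major"],
    "mechanism": ["pharmacokinetic", "pharmacodynamic"],
    "difficulty": ["easy", "hard"],
}
plan = plan_strata(attributes, total_items=40)
for combo, count in plan[:3]:
    print(count, combo)
```

Every combination receives at least a few items, so a rare but expensive cell (say, major severity and hard difficulty) cannot be crowded out by the common cases.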

The student stage is ordinary but important: every candidate model answers the same generated prompts. These responses become more than outputs. They become diagnostic data.

The judge stage is where the paper’s main misconception gets corrected. Many readers will assume the important question is “Which judge model is strongest?” or “How many judges should we use?” CoEval suggests a more irritating answer: panel composition matters more than panel size. A larger panel can be worse if the added judges are weak, correlated, or biased in the same direction. Bigger committee, same nonsense. We have seen this movie outside AI as well.

Finally, the aggregation layer uses two label-free signals. A judge receives more weight when it agrees with the rest of the panel. An item receives more weight when it actually separates candidate models. This is the evaluation equivalent of refusing to spend time on questions everybody gets right and refusing to trust reviewers who appear to live in a different statistical universe.

The first proof point is objective QA, but its role is calibration

The paper’s first major experiment grounds CoEval against exact-match question answering, where correctness is objectively available. This is the main evidence, but it is important to classify it correctly. It is not the intended final use case, because CoEval is designed for situations without labelled data. It is a calibration test: when ground truth exists, does the label-free judge ensemble track it?

On SciQ and ARC-Challenge, three student models produced 573 responses across 191 datapoints. A frontier cross-family judge panel — GPT-4o, Claude Sonnet 4, and Gemini 2.5 Flash — scored the responses. The CoEval accuracy dimension reached Spearman $\rho = 0.859$ against exact-match correctness, with a datapoint-clustered 95% confidence interval of $[0.77, 0.94]$.
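For readers wondering what "datapoint-clustered" means in practice, here is a sketch under stated assumptions: a percentile bootstrap that resamples whole datapoints, so all responses to one question stay together. The data is synthetic and the function names are illustrative; this is not the paper's code.

```python
import random

def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return float("nan") if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def clustered_bootstrap_ci(rows, n_boot=500, seed=0):
    """rows: (datapoint_id, judge_score, is_correct). Resamples datapoints, not rows."""
    rng = random.Random(seed)
    clusters = {}
    for dp, score, correct in rows:
        clusters.setdefault(dp, []).append((score, correct))
    ids = list(clusters)
    stats = []
    for _ in range(n_boot):
        sample = [pair for dp in rng.choices(ids, k=len(ids)) for pair in clusters[dp]]
        rho = spearman([s for s, _ in sample], [c for _, c in sample])
        if rho == rho:  # drop degenerate resamples where rho is NaN
            stats.append(rho)
    stats.sort()
    return stats[int(0.025 * len(stats))], stats[int(0.975 * len(stats)) - 1]

# Synthetic data: 40 datapoints, 3 responses each; correct answers score higher.
rng = random.Random(1)
rows = []
for dp in range(40):
    for _model in range(3):
        correct = rng.random() < 0.8
        score = (0.8 if correct else 0.3) + rng.uniform(-0.2, 0.2)
        rows.append((dp, score, int(correct)))
low, high = clustered_bootstrap_ci(rows)
print(f"95% interval: [{low:.2f}, {high:.2f}]")
```

Resampling rows independently would treat three answers to the same question as three independent observations and would tend to make the interval look tighter than it deserves.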

More importantly for deployment, CoEval reproduced the true model ranking:

Model under evaluation	CoEval score	Ground-truth accuracy
gpt-4o-mini	0.963	0.969
gpt-3.5-turbo	0.921	0.942
llama-3.2-3b	0.843	0.832

The paper notes that CoEval scores were within roughly 0.02 of the true accuracies (the largest gap in the table is 0.021, for gpt-3.5-turbo) and preserved the correct order. That does not prove every subjective custom evaluation will be correct. It proves something narrower and still valuable: when the scoring construct is well aligned with objective correctness, the framework can recover the ranking without human annotation.

There is also a useful negative detail. The off-target “relevance” rubric dimension had only $\rho = 0.20$ with correctness, while the full rubric average had $\rho = 0.55$. That is not a failure. It shows that construct matching matters. If you want accuracy, evaluate accuracy. If you average accuracy with softer qualities, do not be shocked when the result becomes less like exact correctness. Measurement is not soup.

The judge panel works because it can disagree with itself productively

The paper’s most business-relevant mechanism is the judge panel. CoEval does not simply average many judges and call it wisdom. It studies when the panel becomes reliable.

In the reliability experiment, adding judges in descending order of agreement produced a non-monotonic result. A selected two-judge panel reached ICC$(3,k)=0.70$. Adding lower-agreement judges reduced reliability to 0.45 and then 0.40. The authors interpret this through the Spearman–Brown relation: if an added judge lowers the average inter-judge correlation faster than judge-count averaging helps, reliability falls.

This is a clean result because it attacks a common implementation habit. Many AI teams treat ensemble size as a safety knob. Add more judges. Add more votes. Add more committees. Surely truth will eventually become embarrassed and reveal itself.

CoEval’s result says: no, not if the added judges are poorly aligned with the evaluation construct or share unhelpful biases. Reliability comes from selecting judges whose errors are not correlated and whose scores meaningfully agree with competent peers. Size is useful only after composition is under control.
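The Spearman–Brown relation makes the trade-off concrete: for $k$ judges with mean inter-judge correlation $\bar{r}$, the reliability of their average is $k\bar{r} / (1 + (k-1)\bar{r})$. The sketch below inverts that relation for the reported two-judge figure and then tries a hypothetical three-judge panel. The 0.25 mean correlation is an illustrative assumption, not a number from the paper.

```python
def spearman_brown(mean_r, k):
    """Reliability of the average of k judges with mean inter-judge correlation mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)

def implied_mean_r(reliability, k):
    """Invert the relation: mean correlation implied by a k-judge reliability."""
    return reliability / (k - (k - 1) * reliability)

# The reported two-judge reliability of 0.70 implies pairwise agreement near 0.54.
r_two = implied_mean_r(0.70, k=2)

# Hypothetical: a weak third judge drags the mean correlation down to 0.25.
three_judge = spearman_brown(0.25, k=3)

print(round(r_two, 3), round(three_judge, 3))
```

In this hypothetical, the extra judge drops the mean correlation from about 0.54 to 0.25, and panel reliability falls from 0.70 to 0.50 despite the larger headcount.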

The paper also quantifies judge-choice regret. Across a benchmark-grounded set of three tasks and 900 pooled responses, individual judges ranged from anti-correlated with ground truth at $-0.04$ to positively correlated at $0.31$. The best single judge changed by task. A fixed single-judge strategy therefore requires exactly the information the user usually lacks: knowing in advance which judge will behave well for this task.

That is the practical reason a cross-family ensemble is attractive. Not because ensembles are fashionable. Because selecting one judge in a no-label regime is a hidden bet.

Weighting judges by agreement turns the panel into an audit mechanism

CoEval’s reliability-weighted aggregator is a label-free robustness mechanism. Each judge is weighted according to its mean agreement with the panel. In the benchmark-grounded test, this lifted aggregate correlation from the plain mean’s 0.238 to 0.246. That numerical gain is modest, but the more important result comes from the stress tests.

The authors inject deliberately broken judges: one random, one constant, and one anti-correlated. Under a naive mean, correlation with ground truth drops from 0.238 to 0.126. Under reliability weighting, the broken judges receive essentially zero weight and the correlation recovers to 0.228, close to the clean-panel figure.

This is a robustness and sensitivity test, not the main proof that CoEval ranks all models correctly. Its purpose is to show that peer agreement can detect independent judge failures without labels. That is operationally valuable. In production, you rarely know whether a judge model is failing on Tuesday because of a model update, prompt sensitivity, task mismatch, or plain incompetence dressed in JSON.
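The stress test can be mimicked on synthetic data. The sketch below assumes one plausible reading of "weighted according to its mean agreement with the panel": each judge's weight is its mean Pearson correlation with the other judges, clipped at zero and normalised. The clipping, the synthetic panel, and the function names are assumptions for illustration, not the paper's implementation.

```python
import random

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5

def agreement_weights(scores):
    """scores[j][i] is judge j's score for response i; no labels are used."""
    n = len(scores)
    raw = []
    for j in range(n):
        others = [pearson(scores[j], scores[k]) for k in range(n) if k != j]
        raw.append(max(0.0, sum(others) / len(others)))  # clip negative agreement
    total = sum(raw)
    return [w / total for w in raw] if total > 0 else [1.0 / n] * n

rng = random.Random(0)
truth = [rng.random() for _ in range(200)]  # hidden quality, used only to check

def noisy(values):
    return [v + rng.uniform(-0.3, 0.3) for v in values]

panel = [
    noisy(truth), noisy(truth), noisy(truth),  # three competent judges
    [rng.random() for _ in truth],             # random judge
    [0.5] * len(truth),                        # constant judge
    noisy([1 - v for v in truth]),             # anti-correlated judge
]
weights = agreement_weights(panel)
plain = [sum(col) / len(panel) for col in zip(*panel)]
weighted = [sum(w * s for w, s in zip(weights, col)) for col in zip(*panel)]

print("weights:", [round(w, 3) for w in weights])
print("plain mean vs truth:", round(pearson(plain, truth), 3))
print("weighted vs truth:  ", round(pearson(weighted, truth), 3))
```

Note that the competent judges outnumber the anti-correlated one here. Flip that balance and the same code will happily reward the wrong coalition.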

The boundary is equally important. The paper shows that the mechanism fails when a correlated coalition of bad judges outnumbers the competent judges. Once five systematically wrong judges face four good ones, the bad coalition becomes the apparent consensus and can invert the recovered ranking.

That boundary is not a footnote-shaped apology. It explains why cross-family design is central. Peer agreement only works when judge errors are sufficiently independent. Vendor diversity is not magic, but it is a practical proxy for reducing shared failure modes.

The item weighting result says evaluation should ignore easy questions

CoEval also weights items by discriminative power. An item is useful for ranking if candidate models perform differently on it. If every model gets the item right, the item may be educationally pleasant but operationally useless. If every model fails, it may be too hard or underspecified. Either way, it carries little ranking signal.
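As a sketch of the idea, the snippet below weights each item by the spread of model scores on it. Using the population standard deviation as the discrimination measure is an assumption made here for illustration; the paper's exact weighting may differ, and the scores are hypothetical.

```python
from statistics import pstdev

def item_weights(item_scores):
    """item_scores[i][m] is model m's score on item i; weight is spread across models."""
    spread = [pstdev(row) for row in item_scores]
    total = sum(spread)
    n = len(item_scores)
    return [s / total for s in spread] if total > 0 else [1.0 / n] * n

def model_scores(item_scores, weights):
    """Weighted average score per model."""
    n_models = len(item_scores[0])
    return [sum(w * row[m] for w, row in zip(weights, item_scores))
            for m in range(n_models)]

# Hypothetical scores for three models on four items.
item_scores = [
    [1.0, 1.0, 1.0],  # everyone passes: no ranking signal
    [0.0, 0.0, 0.0],  # everyone fails: no ranking signal
    [0.9, 0.6, 0.2],  # separates all three models
    [0.8, 0.8, 0.3],  # separates only the weakest model
]
weights = item_weights(item_scores)
ranking = model_scores(item_scores, weights)
print([round(w, 3) for w in weights], [round(s, 3) for s in ranking])
```

The two saturated items receive zero weight, so the ranking is decided entirely by the items on which the models actually differ.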

The doubly robust aggregator combines judge-agreement weighting and item-discrimination weighting. On a thirteen-model benchmark-grounded check over science QA and reasoning, it improved rank recovery from a plain-mean Spearman correlation of 0.88 to 0.95. Kendall correlation improved from 0.76 to 0.87.

Aggregator	Spearman	Kendall	Likely purpose of the test
Plain mean	0.88	0.76	Baseline
Item-weighted only	0.94	0.84	Ablation: item discrimination
Judge-weighted only	0.92	0.82	Ablation: judge reliability
Doubly robust	0.95	0.87	Main combined mechanism
Plain mean with injected random judge	0.85	0.71	Robustness stress test
Doubly robust with injected random judge	0.95	0.87	Robustness of combined weighting

This matters because many internal evaluations are quietly saturated. A company builds a test set from obvious examples, all decent models pass it, and then the team spends a week arguing over tiny score differences. CoEval’s item weighting formalises a better instinct: allocate ranking authority to cases that expose capability differences.

For enterprise teams, this is the difference between a benchmark as theatre and a benchmark as diagnostic infrastructure. The first produces a report. The second tells you which model should handle the hard queue.

Bias cancellation is not moral purity; it is error diversification

The verbosity-bias experiment asks whether judges reward longer responses regardless of quality. Its likely purpose is bias robustness: even if individual judges carry length bias, does the ensemble reduce it?

The answer in the paper is yes, under the tested panel. Individual judges had mixed-sign length biases. GPT-3.5-turbo showed $r=-0.177$, penalising length; SmolLM2 showed $r=+0.234$, rewarding length. The panel’s mean absolute bias was $|r|=0.153$. The ensemble score had length-score correlation $r=+0.010$ with a 95% bootstrap interval of $[-0.039,+0.057]$, which includes zero. The authors report this as a 93% reduction in length-bias magnitude relative to the per-judge mean.
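The 93% figure follows directly from the two magnitudes quoted above, comparing the ensemble's length-score correlation with the mean absolute per-judge correlation:

$$1 - \frac{|r_{\text{ensemble}}|}{\overline{|r_{\text{judge}}|}} = 1 - \frac{0.010}{0.153} \approx 0.93$$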

The paper is careful here, and the article should be too. The result does not prove that vendor diversity itself mechanically eliminates verbosity bias. In this panel, bias-sign diversity is correlated with vendor family and model capability: stronger OpenAI judges penalised length, while smaller open-weight judges rewarded it. The safer interpretation is that diverse judge errors cancelled. Vendor diversity is a convenient observable strategy for seeking that diversity, not a metaphysical property of company logos.

The business translation is simple. If all your judges are trained, tuned, or prompted in similar ways, they may reward the same bad proxy. If your panel contains genuinely different scoring tendencies, aggregation can cancel some of those tendencies. It is less elegant than “use the smartest judge”, but it has the advantage of not depending on one model being permanently sane.

Same-family self-preference is handled by architecture, not vibes

The paper treats self-preference as a design problem. Recent work has shown that judge-generator relatedness can inflate scores when a model judges outputs from its own family. CoEval’s architectural answer is vendor-disjoint scoring: when scoring a model under test, drop any judge from the same vendor family.

This is best read as a design property rather than a headline experiment. If a model never scores its own family, same-family self-preference cannot directly enter that model’s aggregate score. That is cleaner than measuring the bias afterwards and applying a patch while hoping the patch understands the disease.
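A minimal sketch of vendor-disjoint scoring, assuming a simple unweighted mean over the eligible judges. The vendor mapping, the scores, and the `vendor_disjoint_score` helper are illustrative and not taken from the paper.

```python
def vendor_disjoint_score(candidate, judge_scores, vendor_of):
    """Average only the judges whose vendor differs from the candidate's vendor."""
    eligible = [score for judge, score in judge_scores.items()
                if vendor_of[judge] != vendor_of[candidate]]
    if not eligible:
        raise ValueError("no judge outside the candidate's vendor family")
    return sum(eligible) / len(eligible)

# Illustrative vendor mapping and hypothetical judge scores for one response.
vendor_of = {
    "gpt-4o": "openai",
    "gpt-4o-mini": "openai",
    "claude-sonnet-4": "anthropic",
    "gemini-2.5-flash": "google",
}
judge_scores = {"gpt-4o": 0.90, "claude-sonnet-4": 0.80, "gemini-2.5-flash": 0.70}

score = vendor_disjoint_score("gpt-4o-mini", judge_scores, vendor_of)
print(round(score, 3))
```

The same-vendor judge's 0.90 never enters the average, so there is nothing to debias afterwards. The cost is a smaller panel for that candidate, which is why the helper refuses to score when no outside judge remains.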

The authors also measure residual same-family preference in the vertical case studies, where GPT-4o-mini appears both as a judge and a candidate. The residual effects are small and inconsistent in sign: +0.04 on clinical reasoning, -0.04 on legal, and -0.04 on drug interaction. Applying vendor-disjoint correction leaves rankings unchanged, with score shifts of at most 0.016.

Again, the practical lesson is architectural. If you know a failure channel exists, do not politely invite it into the scoring loop and then congratulate yourself for debiasing it later.

Fresh generation attacks contamination before it becomes a detective story

The contamination tests support another core mechanism: generate the benchmark at evaluation time, after the candidate models have been trained. CoEval’s items are created fresh from a task specification, which means they should not have been available in a candidate model’s pretraining corpus.

The paper provides two tests. The first is a verbatim overlap check. It compares 400 CoEval-generated items against 491 items from five public benchmarks: XSum, CNN/DailyMail, CodeSearchNet, SciQ, and ARC-Challenge. Across 110,784 distinct public 13-grams, the generated items show zero 13-gram overlap. Mean and maximum overlap are both 0.0000.

This is evidence of non-duplication, not a complete membership test against every model’s training data. The authors say this directly. A 13-gram overlap test cannot prove an item is absent from all pretraining corpora. But it does support the structural claim that fresh generation avoids direct reuse of known public benchmark items.
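An n-gram overlap check of this kind is short enough to sketch. The version below assumes lowercased whitespace tokenisation, which may not match the paper's tokeniser, and the demo uses 3-grams on toy sentences only so the example fits on a screen.

```python
def ngrams(text, n=13):
    """Set of word n-grams; empty when the text has fewer than n tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(generated_item, public_ngrams, n=13):
    """Share of the item's n-grams that also occur in the public benchmark set."""
    grams = ngrams(generated_item, n)
    if not grams:
        return 0.0
    return len(grams & public_ngrams) / len(grams)

# Toy demonstration with n=3; the check described above uses n=13.
public = ngrams("the quick brown fox jumps over the lazy dog", n=3)
fresh = overlap_fraction("a slow green turtle walks under a bright moon", public, n=3)
copied = overlap_fraction("quick brown fox jumps", public, n=3)
print(fresh, copied)
```

A paraphrased benchmark item would also score zero here, which is exactly why the result counts as evidence of non-duplication rather than proof of novelty.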

The second test is more vivid. The authors fine-tune a small Qwen2.5-0.5B model to memorise 200 public SciQ items. On the contaminated benchmark, the memoriser scores 1.00 and beats GPT-4o-mini at 0.845. On 100 fresh held-out items, the order reverses: GPT-4o-mini scores 0.81, while the memoriser scores 0.74. The contaminated model’s memorised-minus-fresh gap is 0.26, compared with 0.10 for the clean base model, so the paper attributes 0.16 of the apparent static-benchmark edge to pure memorisation.
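The attribution is a difference of differences, using only the numbers quoted above:

$$\underbrace{(1.00 - 0.74)}_{\text{memoriser gap}} - \underbrace{0.10}_{\text{clean base gap}} = 0.26 - 0.10 = 0.16$$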

This is the part procurement teams should frame and place near the coffee machine. A tiny memoriser can beat a frontier model on a contaminated static benchmark. That does not mean the tiny model is better. It means your benchmark got mugged and is too embarrassed to file a report.

The vertical cases show intended use, not final clinical or legal validation

The case studies move CoEval into its intended setting: no labelled task-specific data and no trustworthy public benchmark. The authors test three custom verticals: drug–drug interaction reasoning, clinical reasoning, and legal analysis. From one-line descriptions, CoEval generates 40 stratified items per vertical, creates rubrics, obtains answers from three candidate models, and scores them through a cross-family panel.

The results are useful, but they need to be interpreted with discipline. These are case-first demonstrations of workflow feasibility and ranking behaviour. They are not clinical safety certification, legal accuracy validation, or proof that generated synthetic items fully represent professional practice.

Vertical	Top-ranked model	Notable result	What it supports	What it does not prove
Drug–drug interaction	gpt-4o-mini	0.770 vs 0.682 vs 0.497, with non-overlapping intervals	The generated items and panel can produce a clear ranking	That the ranking is clinically definitive
Clinical reasoning	gpt-3.5-turbo	0.873 vs 0.864 for gpt-4o-mini, overlapping intervals	The system can expose near-ties instead of forcing fake certainty	That gpt-3.5-turbo is generally superior for clinical work
Legal analysis	gpt-4o-mini	0.982 vs 0.740 vs 0.709	The generated legal items strongly separate one model from the others	That the benchmark covers all legal reasoning risks

The most interesting detail is not only the winner. It is item discrimination. Across 120 generated vertical items, 71% were discriminative, defined as a score range of at least 0.15 across the three models. Drug-interaction and legal analysis were especially discriminative at 78% and 85%. Clinical reasoning was only 50% discriminative because the two stronger models were genuinely close.

That is healthy behaviour. A benchmark should separate models where they differ and admit uncertainty where they do not. A system that always produces dramatic gaps is not necessarily precise. It may simply be allergic to humility.
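The discriminative-item definition quoted above (a score range of at least 0.15 across the candidate models) is easy to apply to any internal evaluation set. The per-item scores below are hypothetical.

```python
def is_discriminative(scores, threshold=0.15):
    """True when the best and worst model scores differ by at least the threshold."""
    return max(scores) - min(scores) >= threshold

def discriminative_share(items, threshold=0.15):
    """Fraction of items that meet the discrimination threshold."""
    return sum(is_discriminative(s, threshold) for s in items) / len(items)

# Hypothetical per-item scores for three models.
items = [
    [0.90, 0.85, 0.40],  # range 0.50: discriminative
    [0.88, 0.86, 0.84],  # range 0.04: a near-tie
    [0.95, 0.60, 0.55],  # range 0.40: discriminative
    [0.70, 0.70, 0.65],  # range 0.05: a near-tie
]
share = discriminative_share(items)
print(share)
```

A low share is not automatically a defect. As with the clinical reasoning vertical, it can simply mean the candidates are close on that task.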

Domain-specific rankings are the business punchline

The final domain-divergence experiment explains why custom evaluation matters. CoEval generates 25 items each for four one-line domains: math word problems, code explanation, clinical reasoning, and legal analysis. Six candidate models answer them. The same cross-family panel scores all responses, producing 1,800 evaluations.

Three different models win across the four domains. GPT-4o-mini tops clinical reasoning and code explanation. Gemini Flash tops legal analysis. Claude 3.5 Haiku tops math word problems. The average cross-domain rank agreement is low, with Kendall $\tau=0.19$ averaged over six domain pairs. The least aligned pair, code explanation versus math word problems, has a negative point estimate of $\tau=-0.41$.
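To make the cross-domain statistic concrete, the sketch below computes Kendall's $\tau$ for every pair of domains and averages the results. It assumes rankings without ties (tau-a). The model names and ranks are hypothetical placeholders, not the paper's results.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall tau-a for two tie-free rankings given as {model: rank} dicts."""
    models = list(rank_a)
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        sign = (rank_a[m1] - rank_a[m2]) * (rank_b[m1] - rank_b[m2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) / 2
    return (concordant - discordant) / n_pairs

def mean_pairwise_tau(domain_ranks):
    """Average tau over all unordered pairs of domains."""
    taus = [kendall_tau(domain_ranks[a], domain_ranks[b])
            for a, b in combinations(domain_ranks, 2)]
    return sum(taus) / len(taus)

# Hypothetical ranks (1 = best) for three placeholder models in four domains.
domain_ranks = {
    "math": {"model_a": 1, "model_b": 2, "model_c": 3},
    "code": {"model_a": 3, "model_b": 2, "model_c": 1},
    "legal": {"model_a": 1, "model_b": 3, "model_c": 2},
    "clinical": {"model_a": 2, "model_b": 1, "model_c": 3},
}
average_tau = mean_pairwise_tau(domain_ranks)
print(round(average_tau, 3))
```

Four domains give six unordered pairs, matching the six domain pairs mentioned above. An average near zero means that knowing a model's rank in one domain says little about its rank in another.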

The pooled leaderboard ranks Gemini Flash first. But Gemini Flash is domain-best in only one of the four domains. GPT-4o-mini is best for clinical reasoning and code explanation. Claude 3.5 Haiku is best for math, even though the pooled board ranks it third.

This is the paper’s most direct business lesson. A generic leaderboard can be directionally useful for market watching, but it is not a procurement instrument. If your workflow has a specific distribution, scoring criterion, and error cost, the correct question is not “What model is generally best?” The correct question is “Which model wins on our task, under our rubric, on fresh cases, with judge failures controlled?”

That question is longer. It is also the one that matters.

What CoEval directly shows, and what Cognaptus infers

The paper’s evidence supports a specific operational interpretation. It should not be inflated into “synthetic evaluation solves model selection”. The more useful reading is narrower and stronger.

Layer	What the paper directly shows	Cognaptus business inference	Boundary
Fresh benchmark generation	Generated items show zero 13-gram overlap with five public benchmarks; on fresh items, a memoriser's contaminated-benchmark lead over GPT-4o-mini reverses	Regenerate evaluation sets for each use case and model-release cycle	N-gram overlap is not a full training-data membership test
Ground-truth alignment	CoEval tracks exact-match QA correctness at $\rho=0.859$ and reproduces the three-model ranking	Use objective anchor tasks to calibrate an internal evaluation pipeline where possible	Objective QA is easier to validate than subjective enterprise workflows
Judge-panel composition	A selected two-judge panel reaches ICC 0.70; adding low-agreement judges reduces reliability	Build judge panels for error diversity and agreement quality, not headcount	Peer agreement can fail under correlated bad coalitions
Label-free weighting	Doubly robust aggregation improves thirteen-model rank recovery from Spearman 0.88 to 0.95	Weight judges and items instead of treating all scores as equally informative	Weighting depends on the panel and item pool producing meaningful signals
Domain-specific rankings	Three different models top four generated domains	Replace generic leaderboard procurement with task-specific model tournaments	Generated domains must still reflect real operational requirements

This gives AI teams a practical model-selection loop:

1. Define the target workflow and its important attributes.
2. Generate a fresh, stratified evaluation set.
3. Let candidate models answer the same cases.
4. Score with a cross-family, vendor-disjoint judge panel.
5. Weight judges by agreement and items by discrimination.
6. Inspect confidence intervals and disagreement, not only the winner.
7. Rerun after model updates, prompt changes, or workflow drift.

That loop is not glamorous. Good. Glamour is usually where evaluation discipline goes to die.

The business value is renewable diagnosis, not one more leaderboard

CoEval should be understood as evaluation infrastructure. Its value is not that it creates a prettier score table. Its value is that it turns model selection into a repeatable diagnostic process.

For procurement, it can reduce dependence on public leaderboards that do not match the company’s task distribution. For deployment gating, it can compare candidate models before routing real users to them. For regression testing, it can rerun fresh cases after model releases. For vendor management, it can make model switching less dependent on sales decks and more dependent on observed performance under the buyer’s own conditions.

There is also a subtler governance benefit. CoEval records how the benchmark was generated, which attributes were covered, which judges scored the responses, where judges disagreed, and which items actually carried ranking signal. That audit trail matters when the model decision affects regulated workflows, customer experience, or operational cost.

A normal leaderboard answers, “Who is winning this public race?” CoEval asks, “Who should do this job?” Those are different questions. The industry keeps confusing them because the first one is easier to screenshot.

The boundary conditions are not optional reading

The paper’s limitations affect practical use, so they should be near the operating manual, not hidden in a decorative caution paragraph.

First, the strongest ground-truth validation is on exact-match QA. That is appropriate for calibration, but subjective domains still rely on ensemble agreement, generated rubrics, and confidence intervals rather than external truth.

Second, peer-agreement weighting is only safe when judge errors are not dominated by a correlated bad coalition. Cross-family design reduces that risk, but it does not abolish it. A badly selected panel can still manufacture consensus.

Third, generated benchmarks need domain review when the workflow is high-stakes. CoEval can produce realistic cases from one-line descriptions, and the appendix examples show useful attribute and rubric generation. But in medicine, law, finance, insurance, or compliance, synthetic realism is not the same as professional validity. Human experts may not need to label every item, but they should still inspect whether the attribute space matches the real risk surface.

Fourth, the contamination evidence is strong for freshness relative to tested public benchmarks and the controlled memorisation experiment. It is not a universal proof that every generated item is semantically novel relative to every pretraining corpus. Fresh generation reduces a major leakage channel; it does not grant evaluation immortality.

These boundaries do not weaken the paper’s main contribution. They define where the framework becomes useful instead of theatrical.

Evaluation should route learning pressure to the cases that matter

The cleanest way to read CoEval is as a routing system for evaluation attention. It routes benchmark generation toward the task’s declared attributes. It routes scoring away from same-family self-preference. It routes trust toward judges that agree with competent peers. It routes ranking weight toward items that actually separate models. And it routes procurement away from the public leaderboard’s comforting but often irrelevant average.

That is why the paper is more important as a mechanism than as a score report. The numbers are useful: $\rho=0.859$ against objective correctness, ICC 0.70 for a selected two-judge panel, Spearman 0.95 for doubly robust rank recovery, 93% verbosity-bias reduction, zero 13-gram overlap with five public benchmarks, three different winners across four generated domains. But the numbers matter because they support an evaluation architecture.

Most companies do not need another benchmark. They need a renewable way to ask, under their own task distribution, which model deserves the next unit of trust.

CoEval is not the final answer to LLM evaluation. It is something more immediately useful: a disciplined answer to the no-data, untrusted-benchmark problem. In an industry still far too willing to confuse benchmark familiarity with deployment evidence, that is progress. Slightly inconvenient progress, naturally. The best kind.

Cognaptus: Automate the Present, Incubate the Future.

¹ Alexander Apartsin and Yehudit Aperstein, “CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks,” arXiv:2606.03650, 2026, https://arxiv.org/pdf/2606.03650.
