When the Judge Needs Judging: LLM Evaluators Under Cross-Examination

The dashboard says the judge is fine. The document disagrees.

Judge is an easy word to trust. It suggests robes, procedure, and someone in the room who is supposed to be less confused than everyone else.

In AI evaluation, the word has become dangerously comfortable. Product teams now use LLMs to score summaries, rank chatbot answers, approve RAG outputs, compare model releases, and decide whether another model’s response is “good enough.” The attraction is obvious: human review is expensive, slow, and occasionally insists on context. An LLM judge is fast, scalable, and does not ask why the evaluation rubric was written five minutes before the sprint review.

The catch is not that LLM judges are useless. That would be too simple, and therefore emotionally satisfying. The sharper problem is that an LLM judge can look respectable on average while being unreliable on exactly the individual cases where the score is operationally used.

That is the mechanism explored in “Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations,” a paper by Manan Gupta and Dhruv Kumar.¹ The authors do not merely ask whether LLM judges correlate with humans overall. They ask a more business-relevant question: when an LLM judge gives this score, on this document, under this criterion, how much should we trust it?

That question matters because evaluation pipelines do not approve averages. They approve cases.

A benchmark table may say the judge is fine. A production system still has to decide whether this support answer can be sent, whether this summary can be published, whether this compliance response needs escalation, or whether this model release has actually improved. The paper’s central contribution is to show why the usual aggregate comfort can be misleading, then offer two diagnostic signals that can be used as a trust gate rather than as academic decoration.

The real failure mode is masked heterogeneity

Most LLM-evaluation discussions still lean on aggregate agreement metrics: system-level Kendall correlation, Pearson correlation with human scores, average score differences, or leaderboard stability. These are useful metrics, but they compress the distribution of reliability into a single number.

Compression is convenient. It is also where bad news goes to become quiet.

The paper’s first diagnostic examines pairwise preferences. Suppose an LLM judge compares several summaries for the same source document. If it says summary A is better than B, B is better than C, and C is better than A, the judge has created a directed 3-cycle. In plain language: the judge has contradicted itself across comparisons.

That cycle is not merely a philosophical inconvenience. It means the judge’s implied ranking is unstable. If a business uses that ranking to select the “best” generated response, the chosen output may depend on comparison order, aggregation method, or sampling noise. Excellent. We have automated decisiveness without automating consistency.

The authors compute a per-document violation rate:

$$ \rho(x)=\frac{\text{number of directed 3-cycles for document }x}{\binom{n}{3}} $$

Here, $x$ is the input document and $n$ is the number of system outputs being compared. This is the important design choice: the diagnostic is not just “how inconsistent is the judge overall?” It is “which documents make the judge inconsistent?”
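
As a minimal sketch of that formula, the snippet below counts directed 3-cycles for one document and divides by the number of triples. It assumes the judge's verdicts are stored as a set of `(winner, loser)` pairs with exactly one verdict per pair; the function names and the toy verdicts are illustrative and not taken from the paper.

```python
from itertools import combinations
from math import comb

def count_3cycles(prefers, items):
    """Count directed 3-cycles among pairwise verdicts.

    prefers: set of (winner, loser) tuples, one per unordered pair (assumed).
    """
    cycles = 0
    for a, b, c in combinations(items, 3):
        # A triple is cyclic if it reads a>b>c>a or a>c>b>a.
        forward = (a, b) in prefers and (b, c) in prefers and (c, a) in prefers
        backward = (a, c) in prefers and (c, b) in prefers and (b, a) in prefers
        if forward or backward:
            cycles += 1
    return cycles

def violation_rate(prefers, items):
    """rho(x): cyclic triples divided by all C(n, 3) triples."""
    return count_3cycles(prefers, items) / comb(len(items), 3)

items = ["A", "B", "C", "D"]
# Hypothetical verdicts: A>B, B>C, C>A form a cycle; D loses to everyone.
prefers = {("A", "B"), ("B", "C"), ("C", "A"),
           ("A", "D"), ("B", "D"), ("C", "D")}
rho = violation_rate(prefers, items)
print(rho)  # 1 cyclic triple out of 4 triples -> 0.25
```

Running this once per source document gives the per-document view the paper relies on, rather than a single pooled number.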

That distinction changes the interpretation. In the coherence test, aggregate violation rates across judges range only from 0.8% to 4.1%. A slide deck could present those values and everyone would nod, because below 5% feels reassuring. But the per-document view is much less polite: 33.3% to 50.0% of documents show at least one directed 3-cycle, and Mistral-Small reaches a worst-case per-document violation rate of 30.4%.

The aggregate number says, “mostly fine.” The right tail says, “some cases are a mess.”
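
To see how a small aggregate can coexist with an ugly tail, here is a sketch on synthetic numbers. The cycle counts are invented for illustration and are not the paper's data; the only figures borrowed from the setup are 30 documents and 8 systems, which give 56 triples per document.

```python
from math import comb
from statistics import median

N_SYSTEMS = 8
TRIPLES_PER_DOC = comb(N_SYSTEMS, 3)  # 56 triples per document

# Synthetic cycle counts for 30 documents (illustrative, NOT the paper's data).
cycle_counts = [0] * 18 + [1] * 6 + [2] * 4 + [9, 17]

per_doc_rate = [c / TRIPLES_PER_DOC for c in cycle_counts]
aggregate_rate = sum(cycle_counts) / (TRIPLES_PER_DOC * len(cycle_counts))
share_with_cycle = sum(c > 0 for c in cycle_counts) / len(cycle_counts)

print(f"aggregate rate:        {aggregate_rate:.1%}")      # 2.4%
print(f"median per-doc rate:   {median(per_doc_rate):.1%}")  # 0.0%
print(f"docs with >= 1 cycle:  {share_with_cycle:.1%}")    # 40.0%
print(f"worst per-doc rate:    {max(per_doc_rate):.1%}")   # 30.4%
```

The same list of counts produces a comforting headline and an alarming maximum; which one gets reported is a choice, not a property of the data.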

Signal	What a manager might conclude too quickly	What the paper shows
Low aggregate transitivity violation rate	The judge is generally consistent	Many documents still trigger at least one inconsistency
Median per-document violation rate of zero	Most cases are safe	A minority of documents drives severe instability
Similar headline judge performance	Pick the best model and move on	Reliability depends heavily on the criterion being judged
High coverage from conformal prediction	The score is precise	Coverage can be achieved with wide, uninformative prediction sets

This is the paper’s first useful correction to common intuition. The relevant business question is not whether the average judge is embarrassing. It is whether the pipeline can identify the cases where the judge should not be trusted automatically.

Transitivity is a smoke alarm, not a repair tool

A natural reaction to inconsistent pairwise rankings is to repair them. The paper tests exactly that. It compares several ranking methods, including Win Rate, Bradley-Terry, Schulze, MFAS-Copeland, and exact MFAS-ILP. MFAS means Minimum Feedback Arc Set: a way to find a ranking by reversing or removing the smallest number of conflicting edges in a directed graph.

This sounds like the kind of algorithmic fix that should make the system more elegant. It does not.

On coherence, MFAS-ILP improves Kendall’s $\tau$ against the human gold standard for one judge but matches or underperforms simpler methods for the other three. Win Rate and Bradley-Terry are often just as good or better. The paper interprets this as evidence that the cycles are sparse and concentrated, not a systematic directional bias that can be globally repaired.
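
A toy comparison may help show why repair has little to offer when cycles are sparse. The sketch below implements Win Rate and an exact minimum-feedback-arc-set ranking by brute force over orderings. The brute-force search is a stand-in chosen for brevity, not the ILP formulation the paper uses, and the verdicts are hypothetical.

```python
from itertools import permutations

def win_rate_ranking(prefers, items):
    """Rank by number of pairwise wins; ties keep the input order."""
    wins = {item: 0 for item in items}
    for winner, _loser in prefers:
        wins[winner] += 1
    return sorted(items, key=lambda item: -wins[item])

def mfas_ranking(prefers, items):
    """Exact MFAS ranking by brute force.

    Tries every ordering and keeps the first one that contradicts the
    fewest verdicts. Only practical for small n (8 items -> 40,320 orderings).
    """
    best, best_cost = None, None
    for order in permutations(items):
        pos = {item: k for k, item in enumerate(order)}
        # A verdict is contradicted when the winner is ranked below the loser.
        cost = sum(1 for w, l in prefers if pos[w] > pos[l])
        if best_cost is None or cost < best_cost:
            best, best_cost = list(order), cost
    return best, best_cost

items = ["A", "B", "C", "D"]
# Hypothetical verdicts with a single cycle A>B>C>A; D loses to everyone.
prefers = {("A", "B"), ("B", "C"), ("C", "A"),
           ("A", "D"), ("B", "D"), ("C", "D")}

print(win_rate_ranking(prefers, items))  # ['A', 'B', 'C', 'D']
print(mfas_ranking(prefers, items))      # (['A', 'B', 'C', 'D'], 1)
```

On this toy case both methods return the same order, and the exact method can only report that one verdict had to be overruled. It cannot say which of A, B, or C truly belongs on top, which is the information the ranking was supposed to supply.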

That result matters because it changes the role of the diagnostic. Transitivity checking is not mainly a better ranking algorithm. It is a warning system.

If the tournament graph is almost acyclic for most documents but unstable for a subset, the practical move is not to pretend a clever ranking method will cleanse the uncertainty. The practical move is to mark those documents as risky and route them differently.

Test in the paper	Likely purpose	What it supports	What it does not prove
Per-document 3-cycle rate	Main evidence	Aggregate consistency hides document-level inconsistency	That every cycle means the selected output is wrong
MFAS ranking comparison	Diagnostic/ablation-like check	Ranking repair does not consistently improve human agreement	That all ranking repair methods are useless in all settings
Violation rates across all criteria	Robustness-style extension	Fluency and consistency show more instability across judges	That these criteria are always unreliable outside SummEval

This is where the paper becomes useful for production evaluation. A transitivity violation should be treated less like a bug to smooth away and more like a smoke alarm. The right question is not, “How do we force the judge to produce a clean ranking?” It is, “Why did this case make the judge unstable, and should a human or stronger evaluator see it?”

Conformal prediction turns doubt into a routing signal

The second diagnostic is more directly deployable: split conformal prediction over 1–5 Likert scores.

The judge gives a score, $\hat{y}$, and the human target is the rounded average human score, $y^\ast$. The nonconformity score is the absolute residual:

$$ s_i = |\hat{y}_i - y_i^\ast| $$

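Using a calibration set, the method estimates a threshold $\hat{q}$. In standard split conformal prediction, this is the $(1-\alpha)$ quantile of the calibration residuals, with a small finite-sample correction. For a new case, it produces a prediction set: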

$$ C(x)=\{\, y \in \{1,2,3,4,5\} : |\hat{y}-y| \leq \hat{q} \,\} $$

The width of that set is the operational signal. A narrow set means the judge’s score is relatively informative. A wide set means the method cannot confidently narrow the plausible human-aligned score.
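
The procedure can be sketched in a few lines. This assumes the standard split conformal recipe, in which $\hat{q}$ is the $\lceil (n+1)(1-\alpha) \rceil$-th smallest calibration residual; the calibration pairs are hypothetical and the function names are illustrative.

```python
import math

def conformal_quantile(residuals, alpha):
    """Finite-sample-corrected quantile of the calibration residuals."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic
    if k > n:
        return math.inf  # too few calibration points: the set covers the scale
    return sorted(residuals)[k - 1]

def prediction_set(y_hat, q_hat, labels=(1, 2, 3, 4, 5)):
    """All Likert labels within q_hat of the judge's score."""
    return [y for y in labels if abs(y_hat - y) <= q_hat]

# Hypothetical calibration pairs: (judge score, rounded human score).
calibration = [(4, 4), (3, 4), (5, 4), (2, 2), (4, 5), (3, 3), (4, 2),
               (5, 5), (3, 2), (4, 4), (2, 3), (5, 4), (4, 4), (3, 4),
               (4, 3), (5, 5), (2, 2), (3, 3), (4, 4), (4, 4)]
residuals = [abs(judge - human) for judge, human in calibration]

q_hat = conformal_quantile(residuals, alpha=0.10)
print(q_hat)                     # 1
print(prediction_set(4, q_hat))  # [3, 4, 5]
print(prediction_set(5, q_hat))  # [4, 5]
```

Note that the set is clipped at the ends of the scale, so a judge score of 5 yields a narrower set than a score of 4 under the same threshold.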

For example:

Prediction set	Operational reading
$\{4\}$	The judge is highly specific; automated acceptance may be reasonable
$\{3,4\}$	The judge is still usable, with mild uncertainty
$\{2,3,4,5\}$	The score is weak evidence; consider a second review layer
$\{1,2,3,4,5\}$	The judge has effectively covered the whole scale; congratulations, the machine has discovered shrugging

The formal advantage is coverage. At target miscoverage level $\alpha=0.10$, the paper reports that all 16 judge-by-criterion combinations meet the 90% coverage guarantee. The appendix further reports results across $\alpha \in \{0.05, 0.10, 0.15, 0.20\}$, again meeting or exceeding target coverage.

But coverage alone is not the business win. A prediction set that always includes every possible score would cover the truth beautifully and tell you almost nothing. The key is informativeness: how narrow is the set, and does width actually predict judge error?
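
The trade-off is easy to demonstrate. The sketch below reports empirical coverage and mean set width for a given threshold on hypothetical held-out pairs; the data and the function name are assumptions for illustration only.

```python
def evaluate_sets(test_pairs, q_hat, labels=(1, 2, 3, 4, 5)):
    """Return (empirical coverage, mean prediction-set width)."""
    covered, total_width = 0, 0
    for y_hat, y_true in test_pairs:
        pred_set = [y for y in labels if abs(y_hat - y) <= q_hat]
        covered += y_true in pred_set
        total_width += len(pred_set)
    n = len(test_pairs)
    return covered / n, total_width / n

# Hypothetical held-out pairs: (judge score, rounded human score).
test_pairs = [(4, 4), (3, 4), (5, 3), (2, 2), (4, 5),
              (3, 3), (1, 2), (5, 5), (4, 4), (3, 2)]

print(evaluate_sets(test_pairs, q_hat=1))  # (0.9, 2.7)
print(evaluate_sets(test_pairs, q_hat=4))  # (1.0, 5.0)
```

The second threshold covers every case and says nothing about any of them, which is why coverage and width need to be reported together.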

The answer is mixed in exactly the useful way. Across all judges and criteria, set width correlates with actual judge-human disagreement: pooled Spearman $r_s=+0.576$ over 1,918 observations, with a vanishingly small p-value. At the criterion level, consistency shows the clearest width-error relationship, with $r_s=+0.34$; coherence and fluency are weaker but positive; relevance is the exception, with approximately zero correlation in the pooled reliability diagram.
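
For teams that want to run the same check on their own data, a rank correlation between set width and absolute judge-human error takes only the standard library. This is a sketch that uses average ranks for ties; the widths and errors below are made up, and the helper names are illustrative.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical per-case set widths and absolute judge-human errors.
widths = [2, 3, 3, 5, 5, 3, 2, 5]
errors = [0, 0, 1, 2, 1, 0, 0, 2]
print(spearman(widths, errors) > 0)  # True: wider sets go with larger errors here
```

A positive value on your own calibration data is the minimum evidence that width deserves a place in the routing logic; a value near zero, as the paper reports for relevance, is a reason to be cautious about using it there.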

So the conformal set is not magic. It is better than that: it is a measurable uncertainty signal with visible failure boundaries.

The criterion matters more than the badge on the judge

The most operational finding in the paper is not “GPT beats Qwen” or “Mistral fails.” The more annoying and more useful finding is that the evaluation criterion explains reliability more than the model identity.

The authors test four judges: GPT-4o-mini, LLaMA-3.1-70B, Qwen-2.5-72B, and Mistral-Small-3.1. They evaluate four criteria from SummEval: coherence, consistency, fluency, and relevance. The dataset is a cost-controlled subset: 30 documents, 8 summarization systems, and human Likert scores from SummEval rounded to integers for conformal calibration.

At $\alpha=0.10$, average prediction set sizes show the pattern clearly:

Judge	Coherence	Relevance	Consistency	Fluency
GPT-4o-mini	4.51	3.17	4.76	4.99
LLaMA-3.1-70B	3.04	2.82	4.97	4.93
Qwen-2.5-72B	3.62	2.99	4.48	4.76
Mistral-Small	4.47	3.06	4.98	4.99

Lower is better here, because smaller prediction sets are more informative. Relevance is consistently around 3.0 labels. Coherence is mixed but often usable. Consistency and fluency are near the full five-point scale across most judges.
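
A crude way to check the "criterion over judge" reading against the table is to average along each axis and compare the spreads. The numbers below are copied from the table above; the range-of-means comparison is a descriptive shortcut chosen for this sketch, not the paper's own analysis.

```python
# Average prediction-set sizes at alpha = 0.10, copied from the table above.
set_sizes = {
    "GPT-4o-mini":   {"coherence": 4.51, "relevance": 3.17, "consistency": 4.76, "fluency": 4.99},
    "LLaMA-3.1-70B": {"coherence": 3.04, "relevance": 2.82, "consistency": 4.97, "fluency": 4.93},
    "Qwen-2.5-72B":  {"coherence": 3.62, "relevance": 2.99, "consistency": 4.48, "fluency": 4.76},
    "Mistral-Small": {"coherence": 4.47, "relevance": 3.06, "consistency": 4.98, "fluency": 4.99},
}
criteria = ["coherence", "relevance", "consistency", "fluency"]

# Mean set size per criterion (averaged over judges) and per judge (over criteria).
by_criterion = {c: sum(row[c] for row in set_sizes.values()) / len(set_sizes)
                for c in criteria}
by_judge = {j: sum(row.values()) / len(row) for j, row in set_sizes.items()}

spread_criteria = max(by_criterion.values()) - min(by_criterion.values())
spread_judges = max(by_judge.values()) - min(by_judge.values())

print({c: round(v, 2) for c, v in by_criterion.items()})
print(f"spread across criteria: {spread_criteria:.2f}")
print(f"spread across judges:   {spread_judges:.2f}")
```

The criterion means span roughly 1.9 labels, while the judge means span roughly 0.4. Swapping the judge moves the average far less than changing what is being judged.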

This sounds counterintuitive until the task is unpacked.

Relevance asks whether a summary addresses the source document’s main content. That is a visible alignment problem. Coherence asks whether the summary hangs together structurally. Also visible, though model-dependent. Fluency, however, becomes difficult because modern generated summaries are usually all fluent enough. The judge has little signal to separate them. Consistency is difficult for the opposite reason: it requires factual cross-checking between source and summary, including omissions, distortions, and subtle entailment failures.

In business language: some evaluation criteria are cheap to automate because the signal is visible in the output. Others are expensive because the signal lives in the relationship between the output, source evidence, and domain-specific truth.

This is why procurement-style evaluator comparisons can mislead. “Which LLM is the best judge?” is often the wrong first question. Better questions are:

What exactly are we judging?
Does this criterion have enough observable variance?
Does it require source-grounded factual reasoning?
Can the pipeline measure uncertainty per case?
What happens when the uncertainty signal is wide?

A judge without a routing policy is not an evaluation system. It is just a confident spreadsheet with API costs.

Width agreement suggests hard cases are real, not just model quirks

A reasonable objection is that wide conformal sets may simply indicate a bad judge. Perhaps GPT is uncertain about one document, Qwen about another, and Mistral about a third. In that case, set width would be model noise, not document difficulty.

The paper tests this by measuring inter-judge agreement on prediction-set width: do different judges assign wide sets to the same documents?

For fluency, consistency, and relevance, the answer is mostly yes. Out of 18 judge pairs across those three criteria, 15 show significant positive agreement. Mean Spearman correlations are 0.38 for fluency, 0.32 for consistency, and 0.36 for relevance. Some pairs are especially strong: GPT/Mistral fluency reaches $r=+0.81$, and GPT/Qwen relevance reaches $r=+0.70$.

Coherence is the exception, with mean inter-judge agreement around 0.10. The authors suggest two possible reasons: the SummEval summaries may vary enough in coherence to make the task more discriminable, and model families may represent “coherence” somewhat differently.

For business use, the cross-judge agreement result is important. It implies that a wide set should often be interpreted as a warning about the case, not merely the model. If multiple model families tend to find the same document hard to score, adding another similar LLM judge may not solve the problem. It may just create a committee of uncertain machines, which is a familiar organizational design but not a quality guarantee.

What changes in a real evaluation pipeline

The paper directly studies SummEval summarization outputs, not customer-service tickets, legal memos, medical claims, or financial research notes. Still, the mechanism transfers cleanly enough to guide system design.

The practical lesson is not “stop using LLM judges.” It is “stop storing only the judge score.”

A production-grade AI evaluation layer should store at least:

Field	Purpose
Judge score	The model’s direct assessment
Criterion	The specific dimension being judged
Prediction set width	Per-instance uncertainty signal
Transitivity violation flag, where pairwise judging is used	Instability signal for ranking tasks
Escalation decision	Whether the case was auto-accepted, reviewed by another model, or sent to a human
Human override result	Feedback for calibration and audit
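
One possible shape for such a record is sketched below. The class and field names are hypothetical and would need to match whatever storage layer a team already uses.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class JudgeRecord:
    """One judged case, with the diagnostics stored next to the score."""
    judge_score: int                      # the model's direct assessment (1-5)
    criterion: str                        # e.g. "relevance"
    prediction_set: tuple                 # conformal set, e.g. (3, 4, 5)
    has_cycle: Optional[bool] = None      # None when no pairwise judging was run
    escalation: str = "pending"           # auto_accept / secondary_judge / human_review
    human_override: Optional[int] = None  # filled in only if a human re-scores

    @property
    def set_width(self) -> int:
        """Per-instance uncertainty signal: number of labels in the set."""
        return len(self.prediction_set)

record = JudgeRecord(judge_score=4, criterion="relevance", prediction_set=(3, 4, 5))
print(record.set_width)  # 3
```

Storing the full prediction set rather than only its width keeps the option of recomputing the signal if the routing thresholds change later.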

This changes the operating model from blind scoring to selective escalation.

A simple routing policy could look like this:

Diagnostic result	Suggested action	Business interpretation
Narrow prediction set, no cycle	Auto-accept or use in aggregate monitoring	Low review cost; acceptable risk
Medium prediction set	Use secondary judge or sample for audit	Useful but not decisive
Full-width prediction set	Human review	The score is not informative enough
Directed 3-cycle in pairwise ranking	Flag document or comparison set	Ranking instability, not just low confidence
Repeated wide sets under a criterion	Revisit rubric or require stronger evidence	The criterion may be poorly automated
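
The per-case rows of that table can be expressed as a small function. The thresholds here (`narrow_max=2`, a five-point scale) are illustrative assumptions rather than values from the paper, and the last row of the table is left out because it depends on aggregates over time, not on a single case.

```python
def route(set_width, has_cycle=False, scale_size=5, narrow_max=2):
    """Map per-case diagnostics to a list of suggested actions.

    narrow_max is a hypothetical cut-off; it should be tuned on your own data.
    """
    actions = []
    if has_cycle:
        # Ranking instability is reported separately from score uncertainty.
        actions.append("flag_comparison_set")
    if set_width >= scale_size:
        actions.append("human_review")
    elif set_width > narrow_max:
        actions.append("secondary_judge_or_audit_sample")
    elif not has_cycle:
        actions.append("auto_accept")
    return actions

print(route(1))                  # ['auto_accept']
print(route(3))                  # ['secondary_judge_or_audit_sample']
print(route(5))                  # ['human_review']
print(route(2, has_cycle=True))  # ['flag_comparison_set']
print(route(5, has_cycle=True))  # ['flag_comparison_set', 'human_review']
```

Returning a list rather than a single label keeps the two signals independent, so a case can be both flagged for ranking instability and sent to a human.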

This is not glamorous. That is part of its charm. Most enterprise AI improvement will not come from declaring that the model is “agentic.” It will come from mundane control points: uncertainty fields, escalation thresholds, audit logs, and feedback loops. The revolution, as usual, arrives wearing a spreadsheet.

What the paper proves, what Cognaptus infers, and what remains uncertain

The paper’s evidence is strong enough to support a useful design pattern, but not broad enough to justify universal rules. Keeping those apart prevents the usual slide-deck disease: a specific experiment mutates into a general law before anyone can stop it.

Layer	Statement	Confidence
Directly shown by the paper	Low aggregate transitivity violation rates can hide substantial per-document inconsistency	Strong within the tested SummEval subset
Directly shown by the paper	Split conformal prediction can provide coverage-guaranteed prediction sets for LLM judge Likert scores	Strong within the tested setup
Directly shown by the paper	Prediction set width correlates with judge-human disagreement in pooled results	Strong overall; varies by criterion
Directly shown by the paper	Criterion explains reliability more than judge identity	Strong within the four tested judges and four SummEval criteria
Cognaptus inference	Businesses should use width and cycle diagnostics as routing signals	Reasonable, especially for QA and evaluation pipelines
Cognaptus inference	Similar patterns may appear in RAG, agent evaluation, and content-review workflows	Plausible but requires task-specific validation
Still uncertain	Whether the same thresholds work outside summarization	Not established
Still uncertain	Whether different prompts, stronger models, or learned nonconformity scores would materially tighten sets	Open question

The limitations are not cosmetic. The paper uses 30 documents and 8 systems from SummEval. That is enough to expose the mechanism, not enough to declare universal thresholds. The judges use single prompt templates per criterion, so prompt sensitivity remains open. Human scores are averaged then rounded to integers, which introduces discretization into the calibration target. And split conformal prediction guarantees marginal coverage, not conditional coverage for every document type. In other words, the method can be reliable across the distribution while still being less comforting on specific hard subgroups.

That last point deserves emphasis. A conformal guarantee is not a personal promise to every document. It is a distribution-level guarantee. If your deployment data differs from the calibration data, or if your hard cases cluster in a way the calibration set did not capture, the apparent safety margin may be thinner than advertised.
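
The gap between marginal and conditional coverage is easy to reproduce on synthetic residuals. In the sketch below, a small "hard" subgroup has large judge-human errors; the data are invented, and the test set simply reuses the calibration mix to keep the arithmetic transparent.

```python
import math

def conformal_quantile(residuals, alpha):
    """Finite-sample-corrected quantile used in split conformal prediction."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(residuals)[k - 1] if k <= n else math.inf

# Synthetic (group, |judge - human|) pairs. NOT the paper's data.
easy = [("easy", 0)] * 60 + [("easy", 1)] * 32
hard = [("hard", 2)] * 4 + [("hard", 3)] * 4
calibration = easy + hard
test = easy + hard  # same mix as calibration, a simplifying assumption

q_hat = conformal_quantile([r for _, r in calibration], alpha=0.10)

def coverage(cases, group=None):
    """Share of cases whose residual falls within q_hat, optionally per group."""
    picked = [r for g, r in cases if group is None or g == group]
    return sum(r <= q_hat for r in picked) / len(picked)

print(q_hat)                   # 1
print(coverage(test))          # 0.92 overall: the 90% target is met
print(coverage(test, "easy"))  # 1.0
print(coverage(test, "hard"))  # 0.0 for the hard subgroup
```

The overall guarantee holds, and the hard subgroup is never covered. Tracking coverage per segment, where segments can be identified, is the practical counterpart to the caveat above.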

This does not weaken the paper’s business relevance. It clarifies it. The right deployment lesson is to calibrate on your own task, track uncertainty over time, and treat wide sets as operational events, not as inconvenient statistical footnotes.

The better judge is the one that knows when to step aside

The old evaluation question was: can an LLM judge replace human evaluation?

The better question is: can an LLM judge participate in an evaluation system that knows when automation is cheap and when it is pretending?

This paper makes that second question much easier to operationalize. Transitivity violations expose cases where pairwise preferences become unstable. Conformal prediction sets convert a single score into a score plus uncertainty. Together, they shift LLM-as-judge from a black-box grader into a triage component.

That is the correct level of ambition. Not oracle. Not toy. Component.

For businesses, the payoff is not only cheaper evaluation. Cheap evaluation is already available; the problem is that cheap evaluation can be confidently wrong. The payoff is cheaper diagnosis: knowing which cases can safely flow through automation and which ones deserve human attention, stronger models, better prompts, or a rewritten rubric.

The judge still gets to judge. It just has to submit evidence now.

A terrible precedent for overconfident machines, perhaps. A useful one for everyone else.

Cognaptus: Automate the Present, Incubate the Future.

¹ Manan Gupta and Dhruv Kumar, “Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations,” arXiv:2604.15302v1, 16 April 2026, https://arxiv.org/html/2604.15302.
