Opening — Why this matters now
LLM-as-judge has quietly become infrastructure. It ranks models, filters outputs, trains reward models, and increasingly decides what ships. The industry treats these judges as interchangeable instruments—different thermometers measuring the same temperature.
This paper suggests that assumption is not just wrong, but dangerously so.
Across thousands of evaluations, LLM judges show near-zero agreement with each other, yet striking consistency with themselves. They are not noisy sensors of a shared truth. They are stable, opinionated evaluators—each enforcing its own private theory of quality.
Background — The promise and the blind spot
The appeal of LLM-as-judge is obvious: scalable, cheap, and apparently consistent. Prior work has already catalogued biases (verbosity preference, position effects, prompt sensitivity) and treated them as correctable errors.
What’s been missing is a measurement-theoretic question: are judges even measuring the same thing?
Instead of assuming disagreement is noise around a common construct, this study flips the lens. It treats each judge as a measurement device and asks whether disagreement itself is structured.
Analysis — What the paper actually does
The authors run a deliberately strict protocol. Nine frontier models evaluate the same artifacts under the same rubric, across repeated runs, with no output repair. The setup spans:
- 3,240 total evaluations
- 9 judges across 5 provider families
- 5 rubric dimensions (intent, coverage, faithfulness, readability, mechanics)
- Multiple independent runs per item
Crucially, judges must justify scores with quoted evidence (“receipts”), enabling analysis not just of scores, but of how judges reason.
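To make the setup concrete, here is a minimal sketch of what one evaluation record could look like under this protocol. The field names and dataclass layout are illustrative assumptions, not the authors' actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Receipt:
    """A quoted span the judge cites as evidence for a score."""
    quote: str          # text the judge claims appears in the source
    dimension: str      # rubric dimension the quote is meant to support

@dataclass
class Evaluation:
    """One judge's verdict on one artifact in one run (illustrative schema)."""
    judge: str                      # provider/model identifier
    item_id: str                    # artifact being evaluated
    run: int                        # independent repetition index
    scores: dict[str, int] = field(default_factory=dict)   # dimension -> score
    receipts: list[Receipt] = field(default_factory=list)  # quoted evidence

# Hypothetical example record:
ev = Evaluation(
    judge="judge_a",
    item_id="item_017",
    run=2,
    scores={"intent": 4, "coverage": 3, "faithfulness": 5,
            "readability": 4, "mechanics": 4},
    receipts=[Receipt(quote="covers Q3 revenue in detail", dimension="coverage")],
)
```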
The core questions:
- Do judges agree with each other?
- Are judges consistent with themselves?
- If they disagree systematically, can we tell them apart?
Findings — The reliability paradox, visualized
1. Agreement is effectively zero
Inter-judge agreement is nearly nonexistent. Krippendorff's α hovers around 0.04, and some dimensions show negative agreement, meaning worse than chance. On readability and mechanics, one judge's high score predicts another's low score.
If these judges were measuring the same construct with noise, this would imply chaos.
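For readers who want to run the same agreement check on their own evaluation logs, here is a minimal sketch using the `krippendorff` PyPI package (an assumption; the paper does not specify its tooling). The score matrix below is synthetic and only mirrors the study's shape, not its data.

```python
import numpy as np
import krippendorff  # assumes the `krippendorff` PyPI package is installed

rng = np.random.default_rng(0)

# Synthetic stand-in: rows = judges, columns = rated units (item x dimension x run),
# values = ordinal rubric scores. Real values would come from the evaluation logs.
n_judges, n_units = 9, 360
scores = rng.integers(1, 6, size=(n_judges, n_units)).astype(float)

# Each column is one unit rated by all judges; "ordinal" matches a 1-5 rubric
# scale, and np.nan would mark missing ratings.
alpha = krippendorff.alpha(reliability_data=scores,
                           level_of_measurement="ordinal")
print(f"Inter-judge Krippendorff's alpha: {alpha:.3f}")
```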
2. Judges are stable—with themselves
They are not chaotic at all.
Within-judge reliability (ICC) reaches as high as 0.87. Several judges are remarkably self-consistent across runs. The same model, given the same content days apart, makes the same calls.
Low agreement plus high self-consistency yields a contradiction only if you assume a shared truth. Drop that assumption, and the picture snaps into focus.
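Within-judge reliability can be checked with an intraclass correlation over repeated runs. Below is a minimal one-way ICC(1,1) in plain NumPy; the paper does not state which ICC variant it reports, so treat this as one reasonable choice rather than the authors' exact metric.

```python
import numpy as np

def icc_1_1(ratings: np.ndarray) -> float:
    """One-way random-effects ICC(1,1).

    ratings: array of shape (n_items, k_runs) holding one judge's scores
    for the same items across independent runs.
    """
    n, k = ratings.shape
    item_means = ratings.mean(axis=1)
    grand_mean = ratings.mean()

    # Between-item and within-item mean squares from a one-way ANOVA.
    ms_between = k * ((item_means - grand_mean) ** 2).sum() / (n - 1)
    ms_within = ((ratings - item_means[:, None]) ** 2).sum() / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Synthetic example: a judge that mostly repeats its own calls across 3 runs.
rng = np.random.default_rng(1)
base = rng.integers(1, 6, size=50).astype(float)
runs = np.column_stack([base + rng.normal(0, 0.3, 50) for _ in range(3)])
print(f"Within-judge ICC(1,1): {icc_1_1(runs):.2f}")
```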
3. Each judge has a disposition
Judges differ systematically along multiple axes:
| Axis | What it reveals |
|---|---|
| Harshness / leniency | Some judges consistently score lower or higher than peers |
| Dimension emphasis | Certain judges overweight faithfulness, others structure or intent |
| Evidence validity | How often cited quotes actually exist in the source |
| Semantic grounding | Whether evidence truly supports the justification |
| Shotgun index | Tendency to spray citations without grounding |
These traits are stable. They recur across items, runs, and even domains.
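As a concrete illustration of how such dispositions might be quantified, the sketch below derives a few of these axes from raw evaluation records. The feature definitions (harshness as mean deviation from the panel, evidence validity as the share of receipts found verbatim in the source) are plausible reconstructions, not the paper's exact formulas.

```python
import numpy as np

def disposition_profile(judge_scores, panel_scores, receipts, source_text):
    """Sketch of per-judge disposition features (illustrative definitions).

    judge_scores: dict mapping dimension -> this judge's scores per item
    panel_scores: dict mapping dimension -> the panel's mean scores on the same items
    receipts:     list of quoted evidence strings the judge cited
    source_text:  the artifact the receipts should be drawn from
    """
    # Harshness / leniency: average signed gap to the panel across dimensions.
    harshness = float(np.mean([
        np.mean(np.asarray(judge_scores[d]) - np.asarray(panel_scores[d]))
        for d in judge_scores
    ]))

    # Dimension emphasis: where this judge spends its strictness budget.
    emphasis = {d: float(np.mean(judge_scores[d])) for d in judge_scores}

    # Evidence validity: fraction of cited quotes that literally appear in the
    # source, a crude proxy for fabricated receipts.
    valid = sum(q in source_text for q in receipts)
    evidence_validity = valid / len(receipts) if receipts else float("nan")

    return {"harshness": harshness,
            "dimension_means": emphasis,
            "evidence_validity": evidence_validity}
```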
4. Judges are fingerprintable
From scores alone, a classifier can identify which judge produced an evaluation with ~77% accuracy. Add disposition features, and accuracy jumps to ~90%.
Even more unsettling: models from the same provider are distinguishable. GPT-4.1 and GPT-5.2 can be told apart from evaluation behavior with ~99.6% accuracy.
Evaluation style, it turns out, is a biometric.
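The attribution experiment can be approximated with an ordinary supervised classifier: feed it per-evaluation feature vectors (rubric scores, optionally disposition features) labeled with the judge that produced them. The scikit-learn pipeline below is an illustrative stand-in; the paper does not disclose its model or feature set, and accuracy on synthetic data will not match the reported numbers.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)

# Synthetic stand-in: 9 judges x 360 evaluations, 5 rubric scores each.
# Each judge gets its own per-dimension bias so that a "fingerprint" exists.
n_judges, n_evals, n_dims = 9, 360, 5
bias = rng.normal(0, 0.7, size=(n_judges, n_dims))
X = np.vstack([
    np.clip(3 + bias[j] + rng.normal(0, 1, size=(n_evals, n_dims)), 1, 5)
    for j in range(n_judges)
])
y = np.repeat(np.arange(n_judges), n_evals)

# Cross-validated attribution accuracy: can we tell which judge scored what?
clf = RandomForestClassifier(n_estimators=200, random_state=0)
acc = cross_val_score(clf, X, y, cv=5).mean()
print(f"Judge-attribution accuracy (synthetic data): {acc:.2f}")
```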
Cross-domain stress test — This isn’t SEO-specific
To rule out domain artifacts, the authors repeat the experiment on Wikipedia briefing packs with controlled errors: hallucinations, missing coverage, and structural violations.
The fingerprints persist. Attribution accuracy remains ~90%. Harsh judges stay harsh. Lenient judges stay lenient.
More importantly, dispositions predict real capability gaps. Some judges reliably penalize hallucinations. Others are effectively blind, rating fabricated content as faithful.
This is not style. It is function.
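One way to test that claim functionally is to seed known errors and compare a judge's faithfulness scores on clean versus corrupted items. The paired check below is a minimal sketch of that idea; the paper's actual error taxonomy and test statistics are not reproduced here.

```python
import numpy as np
from scipy.stats import mannwhitneyu  # assumes SciPy is available

def hallucination_sensitivity(clean_scores, corrupted_scores):
    """Does this judge actually penalize seeded hallucinations?

    clean_scores:     faithfulness scores on unmodified items
    corrupted_scores: faithfulness scores on items with injected fabrications
    Returns the mean score drop and a one-sided Mann-Whitney p-value.
    """
    clean = np.asarray(clean_scores, dtype=float)
    corrupted = np.asarray(corrupted_scores, dtype=float)
    drop = clean.mean() - corrupted.mean()
    # One-sided test: corrupted items should score lower if the judge "sees" them.
    _, p = mannwhitneyu(clean, corrupted, alternative="greater")
    return drop, p

# A judge that is blind to fabrications shows a drop near 0 and a large p-value.
```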
Implications — What breaks if we ignore this
Benchmarks
Choosing a judge is not a technical detail. It determines what “good” means. Two benchmarks using different judges may be measuring incompatible objectives.
RLHF and alignment
Reward models inherit the evaluator’s values. Training on Claude feedback versus GPT feedback is not neutral—it pushes models toward different behavioral attractors.
Ensembles
Averaging judges does not recover truth. It produces a synthetic compromise that corresponds to no judge’s actual preferences.
Governance and audit
Evaluation behavior can reveal model identity and version drift. That has implications for provenance, accountability, and detecting undisclosed changes.
Conclusion — Stop pretending judges are neutral
The headline result is uncomfortable but clean:
LLM judges disagree profoundly, but not randomly.
Each judge encodes a stable, implicit theory of quality. Using one means adopting its values. Averaging many means adopting none.
The way forward is not to search for the “right” judge. It is to understand what each judge measures, report it transparently, and treat evaluator choice as the methodological decision it has always been.
Cognaptus: Automate the Present, Incubate the Future.