The evaluator is not the scale

Evaluation looks boring until it changes the winner.

A product team compares three candidate responses. A benchmark ranks five model releases. A content workflow asks an LLM judge to score generated SEO packs. The spreadsheet fills itself politely: five rubric dimensions, an overall score, maybe a few quoted receipts. Everyone pretends the judge is just a thermometer.

Then someone swaps the judge.

The scores move. The ranking changes. The “best” model is suddenly not quite so best. The usual explanation is comforting: LLM judges are noisy, so use more judges, average them, and call the result a consensus. It sounds scientific. It also quietly assumes that all judges are noisy measurements of the same hidden thing called quality.

The paper “Evaluative Fingerprints: Stable and Systematic Differences in LLM Evaluator Behavior” tests the less comfortable possibility: maybe different LLM judges are not noisy thermometers pointed at the same temperature. Maybe each judge is a different measuring device, with its own stable theory of what “good” means. [1]

That is the mechanism-first reading of the paper. The central result is not merely that LLM judges disagree. That part is almost too easy. The important part is stranger: they disagree systematically enough that their evaluations become fingerprints.

In the main experiment, inter-judge agreement is near zero. Yet individual judges are often highly stable with themselves. A classifier can identify which judge produced an evaluation with 89.9% accuracy when using both rubric scores and disposition features. Even GPT-4.1 and GPT-5.2, two models from the same provider, are distinguishable with 99.6% accuracy.

So the business question is not “Which judge is most accurate?” The paper does not prove that.

The sharper question is: when your AI system says it has measured quality, whose quality has it measured?

What the paper actually tests

The study treats LLM judges as measurement instruments. That framing matters because an instrument has properties: calibration, stability, sensitivity, failure modes, and bias. A thermometer can be wrong by two degrees. A bathroom scale can drift. A credit model can encode a definition of risk. An LLM judge can encode a definition of quality.

The main dataset is deliberately operational rather than abstract. The author evaluates 30 YouTube videos across 15 topic categories. For each video, four SEO content packs are generated using different LLM generators. The intersection set contains 120 unique video-by-pack items. Each item is evaluated three times by each of nine judges, producing 3,240 evaluations.

The judges include two Claude models, two GPT models, Gemini-3-Pro-Preview, Grok-3, DeepSeek-R1, Llama-405B, and Mistral-Large. Each judge receives the same prompt and the same five-dimension rubric:

| Rubric dimension | What it asks the judge to score |
| --- | --- |
| Intent & Angle | Whether the generated pack captures the right user intent and framing |
| Coverage & Completeness | Whether important content is included |
| Faithfulness & Receipts | Whether the pack is grounded in the source |
| Readability & Structure | Whether the output is readable and well organized |
| SEO Mechanics | Whether it satisfies SEO-specific requirements |

The judges must output structured JSON with dimension scores, an overall score, and quoted “receipts” supporting their assessments. The paper excludes models that fall below 98% protocol compliance, which is useful: this is not mainly a study of broken output formatting. It is a study of judges that can follow the procedure and still behave differently.

The paper then asks three linked questions.

First, do the judges agree with each other?

Second, does each judge agree with itself across repeated runs?

Third, if disagreement is systematic, can we identify the judge from the evaluation output alone?

That third question turns evaluation into fingerprinting. The model is not being identified by the text it generates in ordinary conversation. It is being identified by how it judges fixed artifacts under a fixed rubric. That is a harder and more interesting test. The judge is constrained; the fingerprint still appears.

The reliability paradox: low agreement, high self-consistency

The headline number is ugly: Krippendorff’s alpha is 0.042 overall.

In practical terms, that is near-zero absolute agreement. The dimension-level numbers are not much kinder:

| Dimension | Krippendorff’s alpha |
| --- | --- |
| Intent & Angle | 0.050 |
| Coverage & Completeness | 0.132 |
| Faithfulness & Receipts | 0.090 |
| Readability & Structure | -0.064 |
| SEO Mechanics | -0.047 |

The negative values matter. A negative alpha means observed disagreement is larger than chance alone would produce. On Readability & Structure and SEO Mechanics, then, judges are not merely adding random noise around a shared interpretation. Their scoring patterns are systematically misaligned. One judge’s “good structure” may not predict another judge’s “good structure.” The rubric names are shared. The internal interpretation is not.
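For readers who want to see how alpha can dip below zero, here is a minimal sketch of Krippendorff’s alpha. It assumes interval-scaled scores and a squared-difference distance; the paper may use a different metric, and the helper name and toy data are illustrative, not taken from the paper.

```python
from itertools import permutations

def krippendorff_alpha_interval(units):
    """units: list of lists; each inner list holds the scores that
    different judges gave the same item. Assumes interval-scaled scores
    and at least two distinct values overall (otherwise d_e is zero)."""
    units = [u for u in units if len(u) >= 2]  # unpaired scores carry no information
    values = [v for u in units for v in u]
    n = len(values)
    # Observed disagreement: squared differences between scores within each item.
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: squared differences across all pooled scores.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    return 1.0 - d_o / d_e

# Toy data: three items, scored by judges who agree perfectly...
agree = krippendorff_alpha_interval([[1, 1, 1], [3, 3, 3], [5, 5, 5]])
# ...and three items where two judges consistently contradict each other.
clash = krippendorff_alpha_interval([[1, 5], [5, 1], [1, 5]])
print(agree, clash)  # agree is 1.0; clash is below zero
```

Alpha is one minus the ratio of observed to expected disagreement, so it goes negative whenever judges disagree within an item more than pooled scores would by chance.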

If the story ended there, the conclusion would be simple: LLM judges are unreliable. Use humans, shrug sadly, invoice the client.

But the paper’s next result makes that explanation too shallow.

Within-judge consistency, measured using ICC across three runs, varies widely. Some judges are unstable, but several are strongly self-consistent:

| Judge | Within-judge ICC(3,1) |
| --- | --- |
| Gemini-3-Pro | 0.872 |
| GPT-5.2 | 0.845 |
| Claude-Opus | 0.811 |
| Mistral-Large | 0.758 |
| Grok-3 | 0.537 |
| Claude-Sonnet | 0.499 |
| DeepSeek-R1 | 0.329 |
| GPT-4.1 | 0.320 |
| Llama-405B | -0.038 |

This is the paper’s reliability paradox: judges do not agree with one another, but many are consistent with themselves.

That combination is the key. If all judges were trying to measure the same latent quality and merely making random errors, low inter-judge agreement should usually come with low within-judge stability. Instead, several judges are stable. They are not simply confused. They are applying different internal standards.
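To make the self-consistency number less mysterious, here is a rough sketch of ICC(3,1) using the standard two-way, consistency, single-measurement formula. The function name and the toy score matrices are assumptions for illustration; the paper’s exact computation may differ in detail.

```python
def icc_3_1(matrix):
    """matrix[i][j] = score for item i on run j, all from the same judge.
    Returns ICC(3,1) = (MSR - MSE) / (MSR + (k - 1) * MSE)."""
    n, k = len(matrix), len(matrix[0])
    grand = sum(sum(row) for row in matrix) / (n * k)
    row_means = [sum(row) / k for row in matrix]
    col_means = [sum(row[j] for row in matrix) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in matrix for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)  # between items
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)  # between runs
    ss_err = ss_total - ss_rows - ss_cols                   # residual
    msr = ss_rows / (n - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse)

# Hypothetical judge that repeats itself exactly across three runs.
stable = icc_3_1([[4, 4, 4], [2, 2, 2], [5, 5, 5], [3, 3, 3]])
# Hypothetical judge whose runs reshuffle the items each time.
erratic = icc_3_1([[4, 2, 5], [2, 5, 3], [5, 3, 2], [3, 4, 4]])
print(stable, erratic)
```

The stable judge lands at 1.0 and the erratic one below zero, which is the same qualitative gap the table shows between the top judges and Llama-405B.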

The paper calls these standards “evaluative dispositions.” That phrase is doing real work. A disposition is not just a bias term to be subtracted. It is a pattern of judgment: harshness, dimension emphasis, evidence behavior, citation habits, and failure modes. It is how the judge interprets the rubric.

A business team should hear this as a warning. A judge model is not an implementation detail after the “real” evaluation design is done. The judge model is part of the evaluation design.

Harshness is visible, but it is not the whole fingerprint

The easiest disposition to understand is strictness.

The paper estimates each judge’s mean scoring deviation from the across-judge mean for the same item and run. Negative values mean the judge scores lower than the cross-judge average; positive values mean it scores higher.

| Judge | Mean deviation from cross-judge mean | Interpretation |
| --- | --- | --- |
| Claude-Opus | -0.429 | Strict |
| Claude-Sonnet | -0.340 | Strict |
| GPT-5.2 | -0.256 | Strict, especially on faithfulness |
| Grok-3 | +0.003 | Near average |
| DeepSeek-R1 | +0.164 | Lenient |
| Mistral-Large | +0.192 | Lenient |
| Llama-405B | +0.198 | Lenient |
| GPT-4.1 | +0.206 | Lenient |
| Gemini-3-Pro | +0.262 | Most lenient |

This already has operational consequences. If a product team switches from Claude-Opus to Gemini-3-Pro as a judge, the apparent score distribution may shift even if the evaluated product has not improved at all. Congratulations, the dashboard is happier. The system may not be.
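The strictness estimate described above is simple enough to sketch. The record layout, function name, and judge labels below are hypothetical; the logic just follows the description: compare each score with the cross-judge mean for the same item and run, then average per judge.

```python
from collections import defaultdict
from statistics import mean

def harshness(records):
    """records: list of (judge, item, run, overall_score) tuples.
    Returns each judge's mean deviation from the cross-judge mean
    computed for the same item and run."""
    by_cell = defaultdict(list)
    for judge, item, run, score in records:
        by_cell[(item, run)].append(score)
    cell_mean = {cell: mean(scores) for cell, scores in by_cell.items()}

    deviations = defaultdict(list)
    for judge, item, run, score in records:
        deviations[judge].append(score - cell_mean[(item, run)])
    return {judge: mean(devs) for judge, devs in deviations.items()}

# Toy records with two made-up judges scoring two items on one run.
records = [
    ("strict_judge", "item_a", 1, 3.0), ("lenient_judge", "item_a", 1, 4.0),
    ("strict_judge", "item_b", 1, 2.0), ("lenient_judge", "item_b", 1, 4.0),
]
profile = harshness(records)
print(profile)  # strict_judge is negative, lenient_judge is positive
```

Because the baseline is the panel itself, the numbers are relative: swapping a judge in or out of the panel moves everyone’s deviation.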

But the paper is careful not to reduce fingerprints to “strict versus lenient.” That would be too convenient, and therefore suspicious.

The more interesting signal is the shape of judgment across dimensions. For example, GPT-5.2 is especially harsh on Faithfulness & Receipts, with a reported deviation of -0.64 on that dimension, while being more moderate elsewhere. Claude models are more uniformly strict across dimensions. Other judges differ in how they treat readability, coverage, and mechanics.

This matters because two judges with the same average score can still reward different behaviors. One may punish missing source support. Another may punish poor structure. Another may be generous as long as the format looks complete. Averaging them does not reveal “true quality”; it blends incompatible preferences into a synthetic score.

That is why the paper’s row-demeaning test is important. The author subtracts each evaluation’s mean score, leaving only the pattern across dimensions. Even after removing global harshness, exact-judge attribution from scores remains 62.5%, far above chance. The fingerprint is not merely the judge’s scoring altitude. It is the contour of its judgment.
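Row-demeaning is easy to picture with a toy case. In this sketch the three score vectors are invented, with one value per rubric dimension; only the subtraction itself comes from the paper’s description.

```python
def row_demean(scores):
    """Subtract one evaluation's mean from each of its dimension scores,
    leaving only the shape of the judgment across dimensions."""
    m = sum(scores) / len(scores)
    return [s - m for s in scores]

# Invented five-dimension evaluations of the same artifact.
harsh = [2, 2, 1, 3, 2]       # low altitude
lenient = [4, 4, 3, 5, 4]     # high altitude, same contour as `harsh`
other = [1, 3, 2, 2, 2]       # same altitude as `harsh`, different contour

print(row_demean(harsh))      # [0.0, 0.0, -1.0, 1.0, 0.0]
print(row_demean(lenient))    # [0.0, 0.0, -1.0, 1.0, 0.0]
print(row_demean(other))      # [-1.0, 1.0, 0.0, 0.0, 0.0]
```

After demeaning, the harsh and lenient evaluations become indistinguishable, while the third still stands apart. That residual shape is what the 62.5% attribution result is built on.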

Receipts reveal whether evidence is being used or sprayed

The paper’s strongest practical contribution is not only about scores. It also examines evidence behavior.

Each judge provides quoted receipts. The paper checks those receipts in two stages.

First, provenance validity: does the quoted evidence actually appear in the declared source text? This is checked using normalization and fuzzy matching.

Second, semantic linkage: conditional on the receipt being present, does the quote actually support the judge’s justification? This is tested using a DeBERTa-v3 NLI model, with a calibrated human audit discussed in the paper and appendix.

This distinction is useful because a receipt can be real but weak. A quote may exist in the source and still fail to certify the judge’s broader claim. The paper’s own language describes a common mismatch as the “apple vs orchard” pattern: the quote is topically related, but it does not prove the larger statement.
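The first stage can be approximated in a few lines. This is a sketch under stated assumptions, not the paper’s pipeline: the normalization steps, the word-window search, and the 0.9 similarity threshold are all choices made here for illustration, using Python’s standard `difflib`.

```python
import re
import unicodedata
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase, fold Unicode forms, straighten quotes, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = text.replace("’", "'").replace("“", '"').replace("”", '"')
    return re.sub(r"\s+", " ", text).strip()

def receipt_present(quote, source, threshold=0.9):
    """True if the quote appears in the source, exactly or approximately.
    threshold is an assumed cutoff, not a value from the paper."""
    q, s = normalize(quote), normalize(source)
    if not q:
        return False
    if q in s:
        return True
    # Fuzzy pass: slide a window of the quote's word length over the source.
    s_words = s.split()
    width = len(q.split())
    for start in range(max(1, len(s_words) - width + 1)):
        window = " ".join(s_words[start:start + width])
        if SequenceMatcher(None, q, window).ratio() >= threshold:
            return True
    return False

source = "The   video explains how to season a cast-iron pan before first use."
exact = receipt_present("How to season a  cast-iron pan", source)
typo = receipt_present("how to seaso a cast-iron pan", source)
fabricated = receipt_present("always use olive oil at high heat", source)
print(exact, typo, fabricated)
```

This only answers the provenance question. A quote that passes can still fail the second stage, which is the “apple vs orchard” problem.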

Across the three content-grounding dimensions (Intent, Coverage, and Faithfulness), the paper analyzes 31,232 receipts. Overall presence validity is 94.9%, but the range across judges is wide. Claude-Opus reaches 98.5%. Llama-405B is at 80.3%, meaning around one in five of its receipts does not match the source under the paper’s fuzzy matching pipeline.

Semantic linkage varies even more sharply. Among presence-valid receipts, linkage ranges from 15.4% to 44.2%.

| Judge | Presence-valid rate | NLI linkage rate among valid receipts | Evidence behavior signal |
| --- | --- | --- | --- |
| GPT-4.1 | 96.4% | 43.6% | Sparse, relatively well-grounded citation |
| Mistral-Large | 94.1% | 44.2% | High linkage but high leniency elsewhere |
| Grok-3 | 97.4% | 39.7% | Moderate volume, moderate grounding |
| GPT-5.2 | 98.4% | 37.1% | Strong provenance, moderate linkage |
| Claude-Sonnet | 96.8% | 30.6% | High citation volume, moderate grounding |
| Claude-Opus | 98.5% | 25.4% | Strong provenance, lower linkage |
| Llama-405B | 80.3% | 25.9% | Low provenance validity |
| Gemini-3-Pro | 92.0% | 17.7% | Low linkage under this test |
| DeepSeek-R1 | 93.8% | 15.4% | Lowest linkage under this test |

This section should not be overread. The paper explicitly treats semantic linkage as a relative judge fingerprint, not as absolute truth. A high linkage rate does not mean the judge is correct. A low linkage rate does not prove the judge is useless. The NLI pipeline itself is a measurement device with thresholds and error.

But relative behavior is exactly the point. Some judges cite sparingly and with tighter semantic support. Some cite heavily and loosely. Some produce many receipts but do not strongly connect them to the justification. The paper combines this into a “shotgun index”: total receipts multiplied by one minus the linkage rate. In plain English, that is a measure of evidence spray.
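The shotgun index as described is one multiplication. The sketch below uses two invented judges with made-up receipt counts, since per-judge receipt totals are not quoted here.

```python
def shotgun_index(total_receipts, linkage_rate):
    """Total receipts multiplied by one minus the linkage rate.
    Higher values mean more quoted evidence that fails to support the claim."""
    if not 0.0 <= linkage_rate <= 1.0:
        raise ValueError("linkage_rate must be between 0 and 1")
    return total_receipts * (1.0 - linkage_rate)

# Hypothetical judges: one cites sparingly, one cites heavily and loosely.
sparse_judge = shotgun_index(total_receipts=1000, linkage_rate=0.5)
spray_judge = shotgun_index(total_receipts=4000, linkage_rate=0.25)
print(sparse_judge, spray_judge)  # 500.0 3000.0
```

The index rises with both volume and looseness, so a judge can score high by quoting a lot, by linking poorly, or by doing both at once.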

For business users, this is one of the most actionable ideas in the paper. If your evaluation workflow asks for citations, you should not merely count citations. Citation volume can become compliance theater. A judge that quotes many passages may look rigorous while providing weak support for its own claims. Very enterprise. Very familiar.

A better audit asks three questions:

| Evidence question | Why it matters |
| --- | --- |
| Did the quoted text actually appear in the source? | Detects fabricated or malformed receipts |
| Does the quote support the judge’s specific claim? | Separates topical evidence from certifying evidence |
| Is the judge’s evidence behavior stable across tasks? | Turns citation style into an operational fingerprint |

That last question is where the paper’s argument becomes more than a benchmark curiosity. Evidence behavior is not noise around scoring. It is part of the judge’s identity as an evaluator.

The fingerprint test turns dispositions into attribution

Once the paper has shown low agreement, self-consistency, harshness differences, and evidence behavior, it asks a direct question: can we identify the judge from the evaluation output?

The answer is yes.

Using grouped cross-validation by video, the paper reports the following attribution results:

| Attribution task | Feature set | Accuracy |
| --- | --- | --- |
| Exact judge, 9-way | Scores only | 77.1% |
| Exact judge, 9-way | Disposition only | 71.5% |
| Exact judge, 9-way | Scores + disposition | 89.9% |
| Provider lineage, 7-way | Scores + disposition | 91.5% |
| Within Claude, 2-way | Scores + disposition | 91.2% |
| Within GPT, 2-way | Scores + disposition | 99.6% |

The within-GPT result is the one that should make benchmark owners sit up. GPT-4.1 and GPT-5.2 are distinguishable with 99.6% accuracy from evaluation behavior. That is not a family-level stereotype. It is version-level fingerprinting.

The paper also runs stricter checks. In leave-one-video-out validation, accuracy drops, as expected, because the classifier must generalize to an unseen video. But it remains far above chance: 37.4% with scores only and 59.8% with combined features. Shuffled-label tests fall near chance, around 8.2%. Tokens-only probes are near chance. These tests are not the main thesis; they are there to rule out lazy explanations, such as “the classifier just learned output length” or “the split leaked video identity.”
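The leakage worry behind these checks is easiest to see in the split itself. Below is a minimal sketch of a leave-one-group-out splitter, where the group is the video. The function name and the toy video IDs are hypothetical, and the paper’s actual tooling is not specified here.

```python
def leave_one_group_out(groups):
    """Yield (train_idx, test_idx) pairs so that every group is held out
    exactly once and never appears on both sides of a split."""
    for held_out in sorted(set(groups)):
        test = [i for i, g in enumerate(groups) if g == held_out]
        train = [i for i, g in enumerate(groups) if g != held_out]
        yield train, test

# One entry per evaluation row; the value is the video the row belongs to.
video_ids = ["vid_1", "vid_1", "vid_2", "vid_2", "vid_3"]
folds = list(leave_one_group_out(video_ids))
for train, test in folds:
    print("train rows:", train, "held-out rows:", test)
```

With a plain random split, evaluations of the same video would sit in both train and test, and a classifier could score well by recognizing the video rather than the judge.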

The appendix adds another useful control: per-judge marginal stripping through z-score and quantile normalization. Attribution remains very high after these oracle-conditioned transformations. This does not provide a deployable preprocessing trick, because it conditions on judge identity. It is a control analysis. Its purpose is to show that fingerprints are not reducible to simple score-scale usage.

Here is the clean interpretation:

| Test | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Main agreement and ICC analysis | Main evidence | Judges disagree with each other but may remain stable individually | Which judge is correct |
| Harshness and dimension profiles | Main evidence | Judges have stable scoring dispositions | That strictness alone explains behavior |
| Receipt validation and linkage | Main evidence plus diagnostic extension | Evidence behavior differs systematically by judge | That NLI linkage equals factual truth |
| Grouped attribution | Main evidence | Evaluation outputs contain judge-identifying signals | That fingerprints will stay fixed forever |
| Leave-one-video-out validation | Robustness test | Fingerprints generalize beyond seen video items | Full generalization to all domains |
| Perturbation and temperature checks | Robustness and sensitivity tests | Fingerprints survive surface perturbations and are not mainly temperature artifacts | Long-term temporal stability |
| Wikipedia controlled variants | Cross-domain validation and exploratory capability test | Fingerprints persist in a second regime and relate to hallucination detection | General performance in regulated domains |

That table is the paper’s logic chain. The main result is not any single number. It is the convergence of different tests toward the same conclusion: evaluation behavior is structured, stable, and identifiable.

The Wikipedia study shows the fingerprint is not just SEO weirdness

A fair objection is that the main study uses YouTube SEO content packs. Maybe the fingerprints are an artifact of that domain, that rubric, or those generators.

The paper addresses this with a second-regime Wikipedia study. It uses 15 Wikipedia articles across diverse topics and asks judges to evaluate structured briefing packs rather than SEO packs. The artifact format is different, but the rubric keeps the same broad five-dimension structure, with SEO Mechanics renamed Task Mechanics.

The Wikipedia study also introduces controlled variants:

| Variant | Manipulation |
| --- | --- |
| Clean | Faithful, complete, well structured |
| Hallucination-poisoned | Includes 3–5 false claims |
| Coverage-poisoned | Omits 40–50% of key subtopics |
| Structure-poisoned | Violates required format |

All variants are generated by a single generator model, GPT-4.1, which helps control for generator-judge confounds. The study includes 1,066 parseable evaluations, slightly below the planned 1,080 after filtering.

The attribution results remain strong:

| Feature set | YouTube study | Wikipedia study |
| --- | --- | --- |
| Scores only | 77.1% | 80.9% |
| Disposition only | 71.5% | 77.0% |
| Scores + disposition | 89.9% | 90.3% |
| Within GPT, 2-way | 99.6% | 100% |

This is not proof that the same fingerprinting strength will appear everywhere. But it is a serious robustness result. The fingerprint survives a shift from SEO packs to structured briefing packs, from YouTube content to Wikipedia source material, and from multiple generators to one generator.

The controlled variants also reveal why fingerprints are not merely cosmetic. In the hallucination-poisoned condition, some judges sharply reduce faithfulness scores. Others barely react.

| Judge | Clean faithfulness | Hallucinated faithfulness | Drop | Paper’s verdict |
| --- | --- | --- | --- | --- |
| Gemini-3-Pro | 4.73 | 3.27 | -1.46 | Catches |
| GPT-5.2 | 4.34 | 3.22 | -1.12 | Catches |
| Claude-Sonnet | 4.13 | 3.21 | -0.92 | Catches |
| DeepSeek-R1 | 4.51 | 3.60 | -0.91 | Catches |
| Claude-Opus | 4.08 | 3.30 | -0.78 | Catches |
| GPT-4.1 | 4.41 | 4.09 | -0.32 | Weak |
| Grok-3 | 4.15 | 3.92 | -0.23 | Weak |
| Mistral-Large | 4.28 | 4.29 | +0.01 | Blind |
| Llama-405B | 4.12 | 4.39 | +0.27 | Blind |

This table is the bridge from fingerprint to capability. Mistral-Large and Llama-405B rate hallucination-poisoned content as equally or more faithful than clean content. The paper further notes that Gemini-3-Pro assigns faithfulness scores of 3 or lower to 60% of hallucinated variants, while Mistral-Large, Llama-405B, and Grok-3 never assign a failing score in that condition.

That does not mean Gemini-3-Pro is “the best judge” in general. Remember the main study also finds Gemini lenient overall and low on the strict semantic-linkage test. Different instruments can be strong on one diagnostic and weak on another. Annoying, yes. Also reality.

The practical conclusion is narrower and more useful: a judge’s evaluation fingerprint can correspond to real capability differences. If your workflow cares about hallucination detection, you should test that capability directly rather than assuming a general judge score will capture it.
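Testing that capability directly can be as plain as a clean-versus-poisoned comparison. The sketch below reuses the faithfulness means from the hallucination table. The cutoffs are an assumption chosen to be consistent with the verdicts in that table; the paper’s own decision rule is not stated here, so treat the thresholds as placeholders for your own risk tolerance.

```python
# (clean, hallucinated) mean faithfulness scores, copied from the table.
FAITHFULNESS = {
    "Gemini-3-Pro": (4.73, 3.27), "GPT-5.2": (4.34, 3.22),
    "Claude-Sonnet": (4.13, 3.21), "DeepSeek-R1": (4.51, 3.60),
    "Claude-Opus": (4.08, 3.30), "GPT-4.1": (4.41, 4.09),
    "Grok-3": (4.15, 3.92), "Mistral-Large": (4.28, 4.29),
    "Llama-405B": (4.12, 4.39),
}

def sensitivity_verdict(clean, hallucinated, catch_drop=-0.5):
    """Label a judge by how far its faithfulness score falls on poisoned input.
    catch_drop is an assumed cutoff, not a threshold taken from the paper."""
    drop = hallucinated - clean
    if drop <= catch_drop:
        return "Catches"
    if drop >= 0:
        return "Blind"
    return "Weak"

verdicts = {judge: sensitivity_verdict(*pair) for judge, pair in FAITHFULNESS.items()}
for judge, label in verdicts.items():
    clean, hallucinated = FAITHFULNESS[judge]
    print(f"{judge}: {hallucinated - clean:+.2f} {label}")
```

The same harness works for coverage-poisoned or structure-poisoned variants by swapping the dimension being compared.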

Why averaging judges is not a magic solvent

The paper is especially relevant to a common evaluation habit: use multiple LLM judges and average their scores.

Averaging is attractive because it feels like consensus. If each judge is a noisy estimate of the same latent quality variable, averaging can reduce variance. That is the clean statistical story.

But the paper’s mechanism challenges the premise. If judges encode different theories of quality, averaging does not recover ground truth. It creates a synthetic verdict that may correspond to no judge’s actual values.

Imagine three judges:

  • one heavily penalizes weak source grounding;
  • one rewards readable structure even when grounding is soft;
  • one mainly checks whether the artifact satisfies format requirements.

Their average score is not a neutral truth. It is a governance choice disguised as arithmetic.

This is not an argument against ensembles. It is an argument against unexamined ensembles. Multiple judges can be useful if their roles are explicit. For example, one judge can be a strict faithfulness auditor, another a usability reviewer, and another a task-compliance checker. Their outputs should be reported as separate diagnostic views, or combined using a policy that states the intended tradeoff.
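One way to keep a panel from turning into a blender is to report the spread next to the mean. This is a minimal sketch with invented role names and an assumed review threshold of 1.5 points; neither comes from the paper.

```python
def panel_view(scores_by_judge, flag_spread=1.5):
    """Report each judge's score alongside the mean, and flag the item
    when judges disagree by at least flag_spread (an assumed cutoff)."""
    values = list(scores_by_judge.values())
    spread = max(values) - min(values)
    return {
        "by_judge": dict(scores_by_judge),
        "mean": sum(values) / len(values),
        "spread": spread,
        "needs_review": spread >= flag_spread,
    }

# Hypothetical faithfulness scores for one artifact from three role-specific judges.
view = panel_view({
    "faithfulness_auditor": 2.0,
    "usability_reviewer": 4.5,
    "compliance_checker": 4.0,
})
print(view["mean"], view["spread"], view["needs_review"])  # 3.5 2.5 True
```

The mean of 3.5 looks unremarkable on a dashboard. The spread of 2.5 is the part that says someone should read the auditor’s receipts.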

What should not happen is the usual dashboard ritual: average the numbers, paint the cell green, and pretend methodology has happened.

What this means for business evaluation systems

The paper directly studies LLM evaluators. Cognaptus’ business inference is about the evaluation systems companies build around them.

In the tested settings, the paper shows that LLM judges can have stable, identifiable evaluative dispositions. The evidence is near-zero inter-judge agreement, high self-consistency for several judges, strong attribution from scores and disposition features, cross-domain fingerprint persistence, and meaningful differences in hallucination sensitivity.

From that, a practical evaluation architecture should change in four ways.

First, judge selection should be documented like model selection. If a benchmark, internal QA system, or RLHF pipeline uses GPT-5.2, Claude-Opus, or Gemini-3-Pro as a judge, that choice should appear in the methodology. “LLM judge” is not enough. It is like saying “we used a database” while hiding whether it was a spreadsheet, a warehouse, or a damp notebook.

Second, evaluation reports should include judge profiles, not only final scores. A minimal judge profile should include score distribution, strictness relative to a panel, dimension emphasis, self-consistency, citation validity, and task-specific diagnostic behavior. For high-stakes workflows, add drift monitoring over time.

Third, teams should separate diagnostic dimensions before combining them. Faithfulness, coverage, readability, and task mechanics are not naturally one thing. Combining them requires a policy. If hallucination risk is more costly than awkward prose, the scoring policy should say so. Otherwise the system may quietly optimize for whatever the judge happens to reward.

Fourth, judge ensembles should be treated as panels, not blenders. A panel preserves disagreement. A blender hides it. If two judges disagree sharply on faithfulness, that disagreement is information. Averaging it away may produce a cleaner number and a worse decision.
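If dimensions must be combined, the policy can at least be written down where people can argue with it. The weights and the faithfulness floor in this sketch are placeholders invented for illustration, not recommendations from the paper.

```python
# Hypothetical scoring policy: every number here is a governance choice.
POLICY = {
    "weights": {
        "faithfulness": 0.5,
        "coverage": 0.2,
        "readability": 0.15,
        "task_mechanics": 0.15,
    },
    "faithfulness_floor": 3.0,  # below this, the artifact fails regardless of aggregate
}

def apply_policy(dim_scores, policy=POLICY):
    """Combine dimension scores with explicit weights and a hard floor."""
    weights = policy["weights"]
    if set(dim_scores) != set(weights):
        raise ValueError("dimension scores must match the policy's dimensions")
    aggregate = sum(weights[d] * dim_scores[d] for d in weights)
    return {
        "aggregate": round(aggregate, 2),
        "passes_floor": dim_scores["faithfulness"] >= policy["faithfulness_floor"],
    }

# A polished but poorly grounded artifact.
result = apply_policy(
    {"faithfulness": 2.0, "coverage": 5.0, "readability": 5.0, "task_mechanics": 5.0}
)
print(result)  # {'aggregate': 3.5, 'passes_floor': False}
```

The aggregate alone would wave this artifact through. The floor encodes the decision that hallucination risk outranks nice prose.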

A simple operating model looks like this:

| Evaluation layer | Recommended practice | Business reason |
| --- | --- | --- |
| Judge identity | Record model, version, prompt, rubric, temperature, and date | Prevents silent methodology drift |
| Judge profile | Track strictness, dimension emphasis, self-consistency, evidence behavior | Makes the instrument visible |
| Task calibration | Test judges on known clean, flawed, hallucinated, incomplete, and format-broken cases | Aligns evaluator choice with actual risk |
| Reporting | Show multiple diagnostic scores before any aggregate | Avoids fake consensus |
| Governance | Re-run calibration after model upgrades or prompt changes | Detects fingerprint drift |
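The judge identity row can be captured as a small record attached to every evaluation batch. The class, field names, and hashing scheme below are one possible shape, assumed for illustration; the model name and values are placeholders.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace

@dataclass(frozen=True)
class JudgeRecord:
    model: str
    version: str
    prompt: str
    rubric: str
    temperature: float
    run_date: str  # ISO 8601 date of the evaluation run

    def config_id(self):
        """Short hash of everything except the date, so a changed ID
        signals a changed instrument rather than a new day."""
        fields = asdict(self)
        fields.pop("run_date")
        payload = json.dumps(fields, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]

january = JudgeRecord("example-judge", "v1", "prompt text", "rubric text", 0.0, "2026-01-10")
march_same = replace(january, run_date="2026-03-01")
march_upgraded = replace(march_same, version="v2")

print(january.config_id() == march_same.config_id())      # True
print(january.config_id() == march_upgraded.config_id())  # False
```

A changed `config_id` is the trigger for the governance row: re-run calibration before trusting scores from the new configuration. It cannot detect a provider silently updating a model behind an unchanged version string, which is why the calibration set still matters.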

The ROI is not “cheaper evaluation” in the lazy sense. The ROI is cheaper diagnosis. A judge that gives a number without exposing its disposition may reduce labor while increasing decision risk. A profiled judge can help teams understand what they are rewarding, what they are missing, and when the evaluation pipeline itself has changed.

The boundary: stable does not mean correct

The paper is careful about what it does not claim, and this article should be equally disciplined.

It does not prove that any judge has the correct theory of quality. High self-consistency can mean disciplined judgment. It can also mean disciplined wrongness.

It does not prove that NLI-based semantic linkage is absolute ground truth. The paper’s own pilot audit shows the difficulty of matching atomic receipts to aggregated justifications. The expanded claimlet-level audit is much stronger, with 87.0% binary agreement on presence-valid receipts, but it still supports relative fingerprinting more than final truth certification.

It does not prove long-term stability. The main runs happen over a short period. The perturbation and temperature tests support robustness against surface variation and some sampling settings, but model providers can update systems. A judge fingerprint in January may not be identical in March. Quiet model updates are where evaluation governance goes to develop a drinking problem.

It does not prove universal domain transfer. The Wikipedia validation is important, but it is still one additional regime with a related rubric structure. Regulated domains, legal review, medical summarization, financial advice, safety red-teaming, and enterprise compliance may reveal different dispositions and different failure modes.

Finally, the study uses nine judges and particular feature definitions. A different judge panel, rubric, prompt format, or receipt mechanism could produce different magnitudes. The mechanism is the durable lesson; the exact numbers are not a universal constant.

The uncomfortable replacement for “LLM-as-judge”

The common phrase “LLM-as-judge” is starting to look too singular.

This paper suggests a better mental model: LLM-as-evaluator-instrument. Instruments can be useful. Instruments can be calibrated. Instruments can be compared. Instruments can be monitored for drift. But instruments are not interchangeable simply because they output numbers.

That shift changes the practical question.

The old question was:

“Which LLM judge should we use?”

The better question is:

“What evaluative disposition do we want, how do we know this judge has it, and what failures are we willing to tolerate?”

For benchmark designers, this means reporting judge identity and possibly reporting results across multiple profiled judges.

For RLHF and reward-model workflows, it means recognizing that evaluator choice shapes what future models learn to please.

For auditors, it means evaluation behavior itself can reveal model identity and potentially detect undisclosed model changes.

For product teams, it means the green score in the dashboard is not the end of the conversation. It is the output of a particular evaluator with particular habits. Some habits are useful. Some are dangerous. Some are merely expensive-looking theater.

The paper’s central claim is simple: judges cannot agree on what is good, but they are consistent enough in how they disagree that we can identify them. That is not random noise. That is a fingerprint.

And once evaluation becomes a fingerprint, “we used an LLM judge” stops being a methodology.

It becomes an omission.

Cognaptus: Automate the Present, Incubate the Future.


  1. Wajid Nasser, Evaluative Fingerprints: Stable and Systematic Differences in LLM Evaluator Behavior, arXiv:2601.05114, January 2026. https://arxiv.org/pdf/2601.05114