The Judge Is Not Always Right: Stress‑Testing LLM Judges

A judge is useful only if it can survive the boring parts of reality.

Not the dramatic failure cases. Not the philosophical debates about machine intelligence. The boring parts: an extra blank line, a shorter answer, a paraphrased sentence, a multi-turn transcript where one message quietly changes the outcome, or a scoring rubric that asks for a number instead of a yes-or-no label.

That is where many automated evaluation systems live now. Companies use LLMs to grade support responses, rank AI-generated answers, screen safety violations, monitor agents, compare model versions, and produce internal leaderboards. Somewhere in the stack, a model is asked to judge another model. We then treat that judgment as data. Very tidy. Slightly terrifying.

The paper Judge Reliability Harness: Stress Testing the Reliability of LLM Judges introduces an open-source system for testing whether these judges are stable under controlled perturbations.¹ Its core contribution is not another benchmark score. It is a diagnostic workflow: take a judge, perturb the inputs in specific ways, measure whether the judge responds as it should, and report reliability by failure mode, task, model, and cost.

The practical lesson is equally plain: the question is no longer “Which frontier model should we use as the judge?” The better question is “Under which perturbations does this judge stop being a judge and start being a mood ring?”

The paper tests judge mechanisms, not just judge accuracy

The usual sales pitch for LLM judges is simple. Human evaluation is expensive. LLM evaluation is cheap and scalable. If the LLM judge agrees with humans on a validation set, we can use it to score outputs at scale.

The paper challenges the comfortable middle of that pitch. It does not deny that LLM judges can be useful. It asks whether a judge that looks acceptable on a point estimate remains reliable when the evaluated answer changes in ways that should, or should not, matter.

The Judge Reliability Harness, or JRH, runs a four-stage process:

load and normalize a seed benchmark dataset;
generate synthetic perturbations that target particular failure modes;
evaluate candidate LLM judges on those perturbed samples;
aggregate pass rates, confidence intervals, and cost curves.
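The four stages above can be sketched in a few lines. This is a minimal illustration, not JRH's actual code: `Sample`, `PerturbedCase`, `run_harness`, and the toy judge are hypothetical names invented here, and a real judge would be a model call rather than a string check.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical types for illustration; these are not JRH's classes.
@dataclass
class Sample:
    text: str
    label: str  # ground-truth label of the seed response, e.g. "pass" or "fail"


@dataclass
class PerturbedCase:
    family: str          # failure mode being probed, e.g. "format_invariance"
    text: str
    expected_label: str  # what a reliable judge should output for this case


Judge = Callable[[str], str]
Perturbation = Callable[[Sample], PerturbedCase]


def add_blank_lines(sample: Sample) -> PerturbedCase:
    """Layout-only change, so the expected label is unchanged."""
    doubled = sample.text.replace("\n", "\n\n")
    return PerturbedCase("format_invariance", doubled, sample.label)


def whitespace_sensitive_judge(text: str) -> str:
    """Toy stand-in for an LLM judge that is thrown off by blank lines."""
    return "fail" if "\n\n" in text else "pass"


def run_harness(seeds: List[Sample],
                perturbations: List[Perturbation],
                judge: Judge) -> Dict[str, float]:
    """Perturb each seed, query the judge, and return pass rate per family."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for sample in seeds:
        for perturb in perturbations:
            case = perturb(sample)
            total[case.family] += 1
            if judge(case.text) == case.expected_label:
                passed[case.family] += 1
    return {family: passed[family] / total[family] for family in total}


seeds = [Sample("Step one.\nStep two.", "pass"), Sample("One line only.", "pass")]
report = run_harness(seeds, [add_blank_lines], whitespace_sensitive_judge)
print(report)
```

The design choice worth copying is that every perturbed case carries its own expected label, so a failure is reported against a named failure mode rather than dissolved into one average.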

This is mechanism-first evaluation. Each perturbation is a small experiment with an intended purpose. Some perturbations should change the correct judgment. Others should leave it unchanged. If the judge fails, the failure has a meaning.

JRH test family	Likely purpose	What a reliable judge should do	What failure suggests
Label flip	Main discriminative evidence	Change its decision when the response is rewritten to invert the ground-truth label	The judge cannot reliably distinguish success from failure under the rubric
Format invariance	Robustness / sensitivity test	Preserve the score when only layout changes	The judge is reacting to surface form rather than task quality
Semantic paraphrase	Robustness / sensitivity test	Preserve the score when meaning is held constant	The judge is overly sensitive to wording
Verbosity bias	Bias probe	Avoid rewarding length or punishing concision when content is equivalent	The judge confuses elaboration with quality
Stochastic stability	Repeatability test	Give consistent outputs across duplicate inputs	The judge is unstable even before content changes
Synthetic ordinal	Calibration / ordinal scoring test	Track intended score levels across a rubric scale	The judge struggles when evaluation is not binary
Agent transcript edits	Exploratory extension for agentic evaluation	Detect subtle multi-turn improvements or violations	Free-response judging does not transfer cleanly to agent monitoring
Human-in-the-loop review	Implementation quality control	Let reviewers accept, edit, or reject generated tests	Synthetic perturbations need domain supervision before they become evidence

This is the useful part for business readers. The harness is not merely asking, “Did the judge get the answer right?” It asks, “What kind of change broke the judge?”

That distinction matters because different business uses need different forms of reliability. A customer-support QA judge must be robust to tone and verbosity. A safety judge must detect label-changing violations. An agent monitor must catch subtle harmful behavior across long transcripts. A leaderboard judge must not reshuffle rankings because one model uses bullet points and another writes paragraphs like it is paid by the comma.

The tested benchmarks separate easy judging from costly judging

The paper evaluates four judge models across four benchmarks: FORTRESS, HarmBench, PERSUADE, and AgentHarm. The judge models are GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Pro, and Llama Maverick 4 17B. For AgentHarm, the authors swapped in Claude Opus 4.5 after preliminary experiments showed structured-score inconsistency with Sonnet 4.5.

The benchmarks are not interchangeable. That is the point.

FORTRESS and HarmBench are closer to binary safety or misuse classification. PERSUADE asks for ordinal scoring of argumentative writing. AgentHarm evaluates harmfulness in multi-step agent tasks. These represent different business realities: classify a violation, grade quality, or monitor an agent over time.

The experiments are deliberately small. FORTRESS, PERSUADE, and HarmBench use ten manually selected samples, stratified by benchmark-specific categories. AgentHarm uses sixteen samples, two from each of eight harm categories. This is not a giant benchmark sweep. It is a stress-test demonstration under cost and review constraints.
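To see why small samples call for caution, it helps to put an interval around a pass rate. The sketch below uses the Wilson score interval, which is a standard choice for small binomial samples; it is an assumption on my part, since the article does not say which interval method the paper uses, and the 7-out-of-10 figure is an invented example.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical example: a judge passes 7 of 10 perturbed samples.
low, high = wilson_interval(7, 10)
print(f"70% pass rate on 10 samples: 95% interval {low:.2f} to {high:.2f}")
```

With ten samples, a 70% pass rate is compatible with anything from roughly 40% to roughly 89%, which is why point estimates from runs this size should be read as a demonstration rather than a ranking.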

That boundary matters. The right interpretation is not “Llama is always the best judge” or “Gemini is weak” or “Claude is unsafe.” The correct interpretation is narrower and more useful: a judge’s reliability profile depends heavily on task type, perturbation type, prompt, rubric, and input format. A single global judge ranking is mostly a spreadsheet-shaped illusion.

Binary classification looked easier; ordinal scoring exposed the cracks

The paper’s clearest pattern is that judges looked more robust on FORTRESS and HarmBench than on PERSUADE. The reason is not mysterious. Binary tasks ask the judge to choose between categories. PERSUADE asks the judge to assign a score on an ordinal scale from 1 to 6.

That difference sounds small until it reaches production.

A binary safety screen can be wrong, of course, but its decision surface is simpler. An ordinal score requires the judge to maintain relative calibration across multiple levels. It must know not only whether an essay is good or bad, but how good, according to a rubric, under perturbations that preserve or alter surface form.

The appendix numbers show the difference. Across non-agentic benchmarks, average judge scores on PERSUADE were much lower than on FORTRESS for all four models:

Benchmark	Best mean judge score reported	Weakest mean judge score reported	Business interpretation
PERSUADE	Gemini 2.5 Pro at 53.20%	Claude Sonnet 4.5 at 37.26%	Ordinal grading is fragile; do not treat rubric scores as clean measurement without calibration checks
HarmBench	Llama Maverick 4 at 73.92%	Claude Sonnet 4.5 at 60.50%	Binary harmfulness judging is more stable, but still not uniformly reliable
FORTRESS	Llama Maverick 4 at 78.75%	Gemini 2.5 Pro at 63.46%	Misuse/safety classification appears easier in this setup, but model ranking still varies by benchmark

There is a tempting but lazy reading here: “Use Llama.” Resist that temptation. Llama performs strongly in these experiments, but the more durable lesson is that ordinal scoring deserves separate validation. If a business workflow uses LLM judges to produce quality scores, risk scores, candidate scores, compliance severity scores, or vendor rankings, then binary validation is not enough.

A judge that can answer “pass or fail” may still be a poor measurement instrument.

Formatting was not cosmetic; it was the pressure test

The paper’s most irritating result is also one of its most useful: formatting changes can hurt judge reliability.

Irritating, because formatting should not matter. Useful, because real systems are full of formatting variation. Some models answer in Markdown. Some answer in plain text. Some use numbered lists. Some generate long paragraphs. Some add headers. Some insert odd spacing. Humans ignore most of this. LLM judges, apparently, may need a stern memo.

The authors test three kinds of format invariance: vertical spacing, intra-line spacing, and indentation. These are layout-only changes. They should not alter meaning. A reliable judge should preserve its score.
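A rough idea of what such layout-only perturbations could look like is sketched below. The three function names and the exact transformations are my own illustration, not the harness's implementation; the `same_content` check simply confirms that the sequence of words is untouched.

```python
import re


def perturb_vertical_spacing(text: str) -> str:
    """Double every line break so extra blank lines appear between lines."""
    return text.replace("\n", "\n\n")


def perturb_intra_line_spacing(text: str) -> str:
    """Turn each single space between two non-space characters into two spaces."""
    return re.sub(r"(?<=\S) (?=\S)", "  ", text)


def perturb_indentation(text: str, width: int = 4) -> str:
    """Indent every non-empty line by a fixed number of spaces."""
    return "\n".join(
        (" " * width + line) if line.strip() else line
        for line in text.split("\n")
    )


def same_content(original: str, perturbed: str) -> bool:
    """True when both strings contain the same words in the same order."""
    return original.split() == perturbed.split()


answer = "Step one: verify the account.\nStep two: reset the password."
for perturb in (perturb_vertical_spacing, perturb_intra_line_spacing, perturb_indentation):
    variant = perturb(answer)
    print(perturb.__name__, variant != answer, same_content(answer, variant))
```

Because the expected behavior is "score unchanged", each variant can be paired with the judge's score on the original text and counted as a pass only when the two scores match.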

The appendix shows that, except for FORTRESS, the lowest mean scores often appeared in format invariance tests. PERSUADE is especially revealing:

PERSUADE test	Mean score across judges
format_invariance_1	35.00%
format_invariance_2	52.50%
format_invariance_3	40.00%
semantic_paraphrase	60.00%
stochastic_stability	37.50%
synthetic_ordinal	68.02%
verbosity_bias_long	42.50%
verbosity_bias_short	47.50%

This reverses a common intuition. We often worry that paraphrasing will confuse a judge because the wording changes. In the paper’s reported results, surface formatting can be at least as disruptive as semantic paraphrase, and sometimes more so.

For business evaluation systems, that means pre-processing and output normalization are not minor engineering details. They are reliability controls.

If one model produces compact answers and another uses verbose Markdown, an LLM judge may partly evaluate format rather than quality. If a production pipeline changes line breaks after a frontend update, the judge may drift without the underlying response quality changing. Congratulations: the product team has accidentally A/B-tested whitespace.
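One cheap control is to normalize whitespace before any response reaches the judge. The function below is a minimal sketch under my own assumptions about what counts as "layout only"; `normalize_for_judge` is a hypothetical name, not part of any library or of the harness.

```python
import re


def normalize_for_judge(text: str) -> str:
    """Collapse layout noise: line endings, runs of spaces or tabs, blank-line runs."""
    # Unify Windows and old Mac line endings.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse spaces and tabs inside each line and trim both ends.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    cleaned = []
    for line in lines:
        # Keep at most one blank line in a row, and none at the very start.
        if line == "" and (not cleaned or cleaned[-1] == ""):
            continue
        cleaned.append(line)
    # Drop a trailing blank line, if any.
    while cleaned and cleaned[-1] == "":
        cleaned.pop()
    return "\n".join(cleaned)


raw = "Refund approved.  \r\n\r\n\r\n   Ticket\t closed.\n"
print(repr(normalize_for_judge(raw)))
```

One caveat: stripping indentation is only safe when indentation carries no meaning. For responses that contain code or nested lists, a gentler normalizer would be needed, and that choice should itself be tested against the judge.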

A reliability audit should therefore include format perturbations before the judge is trusted. Not because formatting is intellectually interesting. Because it is operationally everywhere.

Stochastic stability asks whether the judge can agree with itself

Another test family is even less glamorous: duplicate the same input and see whether the judge gives the same score.

This is stochastic stability. It isolates instability that does not come from content variation. If the input is identical, any score movement reflects sampling randomness, decoding settings, hidden nondeterminism, or fragility in the judge prompt.

In ordinary business terms: before asking whether the judge is fair, ask whether it is repeatable.

The PERSUADE stochastic stability mean score is reported at 37.50%, with a standard deviation of 22.17%. That is not a universal condemnation of all LLM judges; it is a warning about score-based workflows. When a judge is used to rank vendors, approve generated content, trigger escalation, or compare model versions, self-consistency is not a luxury.

A repeated evaluation system should have at least one of three protections:

deterministic or near-deterministic generation settings;
repeated judging with aggregation;
disagreement thresholds that trigger human review.

Otherwise, a workflow may produce the illusion of measurement while quietly sampling a new opinion each time.
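The second and third protections can be combined in a small wrapper: judge the same input several times, aggregate with the median, and flag the case for human review when the scores disagree too much. This is a sketch with hypothetical names and an arbitrary threshold; the cycling "judge" is a stand-in for a sampled model call.

```python
import statistics
from itertools import cycle
from typing import Callable, Sequence, Tuple

ScoreJudge = Callable[[str], float]


def judge_with_repeats(judge: ScoreJudge, text: str,
                       n_runs: int = 5,
                       max_spread: float = 1.0) -> Tuple[float, float, bool]:
    """Return (median score, max-min spread, needs_human_review)."""
    scores = [judge(text) for _ in range(n_runs)]
    spread = max(scores) - min(scores)
    return statistics.median(scores), spread, spread > max_spread


def make_cycling_judge(scores: Sequence[float]) -> ScoreJudge:
    """Toy judge that replays a fixed sequence, imitating sampling noise."""
    replay = cycle(scores)
    return lambda _text: next(replay)


stable = judge_with_repeats(make_cycling_judge([4]), "essay text")
noisy = judge_with_repeats(make_cycling_judge([2, 5, 3, 4, 6]), "essay text")
print("stable judge:", stable)
print("noisy judge:", noisy)
```

The threshold of one rubric point is illustrative only. A sensible value depends on the scale and on how costly a wrong score is in the workflow.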

Agentic evaluation breaks in more interesting ways

AgentHarm is where the paper becomes more relevant to the next wave of business AI systems.

Traditional free-response evaluation usually scores a single output. Agentic evaluation must inspect a sequence of actions, messages, and tool-use decisions. In that setting, failure may be subtle. A harmful action can be introduced in one step. A corrected transcript can look suspicious because the surrounding context remains risky. A judge must read the whole trajectory and apply the rubric consistently.

JRH’s agentic mode modifies Inspect evaluation files and generates two types of transcript changes:

agent perturbations, which degrade a transcript to induce rubric violations;
agent positives, which steer a transcript toward satisfying the rubric.

The human-in-the-loop step becomes much more important here. In AgentHarm, 14 out of 16 agent perturbation transcripts required one or more human message modifications, while only 2 out of 16 agent positives required such edits. The paper attributes the heavier intervention in perturbation mode mainly to limitations caused by safety guardrails in the model used to perturb transcripts.

That detail should not be skipped. It means agentic stress-test generation is itself hard. When the system tries to create harmful or policy-violating variants for evaluation, safety guardrails can interfere with the generation process. Human review is not bureaucratic decoration; it is part of making the test set usable.

The AgentHarm results also show asymmetric failure modes:

Model	Accuracy	False positive rate	False negative rate	Interpretation
GPT-4o	0.906	0.063	0.125	Tied for the highest reported accuracy; still misses some violations
Claude Opus 4.5	0.813	0.063	0.313	More likely to miss subtle harmful perturbations
Gemini 2.5 Pro	0.813	0.250	0.125	More likely to flag corrected transcripts as violations
Llama Maverick 4 17B	0.906	0.063	0.125	Matches GPT-4o in this setup
Best Trio Ensemble	0.906	0.063	0.125	No improvement over the best individual judges here

This is where aggregate accuracy becomes insufficient. Claude Opus 4.5 and Gemini 2.5 Pro have the same reported AgentHarm accuracy, but their errors are different. Claude’s false negative rate is higher. Gemini’s false positive rate is higher.

For a business, those are not interchangeable mistakes. A false negative in an agent safety monitor may allow a harmful action to proceed. A false positive may block a legitimate workflow and annoy users, employees, or compliance teams. Both matter, but they create different operational costs.

The useful question is therefore not “Which judge has higher accuracy?” It is “Which error profile is tolerable for this workflow?”
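The arithmetic behind that point is short. The sketch below assumes 16 violating and 16 non-violating transcripts per judge, matching the 16 agent perturbations and 16 agent positives described earlier; the raw error counts are my back-calculation from the reported rates, not figures taken from the paper.

```python
from typing import Dict


def error_profile(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy plus the two error rates, with 'positive' meaning 'violation flagged'."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }


# Inferred counts (assumption): 16 violating and 16 clean transcripts per judge.
inferred_counts = {
    "Claude Opus 4.5": dict(tp=11, fp=1, tn=15, fn=5),  # misses more violations
    "Gemini 2.5 Pro": dict(tp=14, fp=4, tn=12, fn=2),   # flags more clean transcripts
}

profiles = {name: error_profile(**counts) for name, counts in inferred_counts.items()}
for name, profile in profiles.items():
    print(name, {key: round(value, 3) for key, value in profile.items()})
```

Both judges land on the same accuracy under these counts, yet one makes five misses and one false alarm while the other makes two misses and four false alarms. Which of those is worse is a business decision, not a statistical one.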

The cost curve quietly insults the premium-model reflex

One of the paper’s more commercially uncomfortable findings is that the cheaper model often looks competitive.

The authors report cost-efficiency numbers as dollars per accuracy point. Lower is better. In the overall row, Llama Maverick 4 17B is reported at 0.0010, compared with Gemini 2.5 Pro at 0.0080, GPT-4o at 0.0196, and Claude Sonnet 4.5 at 0.0223.

The token-cost table makes the reason unsurprising. Llama Maverick 4 17B is listed at $0.24 per million input tokens and $0.97 per million output tokens, far below GPT-4o, Gemini 2.5 Pro, and Claude variants in the paper’s pricing snapshot.
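A back-of-the-envelope version of this calculation is sketched below. The per-million-token prices and the reported dollars-per-accuracy-point figures come from the text above; the token counts are invented for illustration, and the article does not spell out the paper's exact formula, so `dollars_per_accuracy_point` is my assumed reading of it.

```python
from typing import Dict


def judging_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_million: float,
                     output_price_per_million: float) -> float:
    """Token cost of a judging run at per-million-token prices."""
    return (input_tokens / 1_000_000 * input_price_per_million
            + output_tokens / 1_000_000 * output_price_per_million)


def dollars_per_accuracy_point(total_cost_usd: float, accuracy_percent: float) -> float:
    """Assumed definition: total cost divided by accuracy in percentage points."""
    return total_cost_usd / accuracy_percent


# Hypothetical run: 1M input tokens and 100k output tokens at the Llama prices above.
llama_cost = judging_cost_usd(1_000_000, 100_000, 0.24, 0.97)
print(f"hypothetical Llama run cost: ${llama_cost:.3f}")

# Reported overall cost-efficiency figures, compared against the cheapest judge.
reported: Dict[str, float] = {
    "Llama Maverick 4 17B": 0.0010,
    "Gemini 2.5 Pro": 0.0080,
    "GPT-4o": 0.0196,
    "Claude Sonnet 4.5": 0.0223,
}
cheapest = min(reported.values())
relative = {name: round(value / cheapest, 1) for name, value in reported.items()}
print(relative)
```

On the reported figures, the other three judges cost roughly 8, 20, and 22 times as much per accuracy point as the cheapest one, which is the gap the next paragraphs react to.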

The business implication is not “always use the cheapest model.” That would be another lazy conclusion, wearing a finance hat.

The better interpretation is that evaluation deserves model selection just like generation does. A premium model may be worth it for complex judgment, subtle domain expertise, legal sensitivity, or high-stakes review. But the paper shows that cost and reliability do not move in a neat upward line. A smaller or moderately sized model can match or outperform frontier judges in specific reliability tests, at much lower cost.

That changes the ROI logic. Instead of choosing one expensive judge by brand reputation, teams can test several judges against local perturbation suites and choose by reliability-per-dollar under the actual rubric.

In other words: stop buying the judge by prestige. Buy it by audited behavior. Radical concept, I know.

What this means for business evaluation systems

The paper directly shows that JRH can generate reliability tests, evaluate multiple LLM judges, and expose task-specific fragility across perturbation types. It also directly shows that no tested judge is uniformly reliable across the examined benchmarks.

Cognaptus’ business inference is broader but still bounded: any organization using LLM judges should treat judge reliability as an operational metric, not a one-time model selection decision.

A practical reliability-audit workflow would look like this:

Step	Operational action	Why it matters
Define the local rubric	Specify what the judge should score, classify, ignore, and escalate	Reliability cannot be evaluated without a target behavior
Build seed examples	Use real or representative business cases across categories	Generic benchmarks rarely match local data distribution
Generate perturbations	Include label flips, format changes, paraphrases, verbosity variants, duplicates, and agent transcript edits if relevant	Each perturbation probes a different failure mode
Review synthetic tests	Use human reviewers for edge cases and high-risk domains	Synthetic tests can be wrong, especially for agentic or safety-sensitive tasks
Test multiple judges	Compare candidate models, prompts, and rubrics	The best judge is task-specific
Report reliability by failure mode	Avoid only reporting average accuracy	Average scores hide dangerous asymmetries
Add production controls	Normalize formatting, aggregate repeated judgments, route uncertain cases to humans	Reliability testing should change system design
Re-test after updates	Re-run suites when prompts, models, rubrics, or output formats change	Evaluation drift is still drift

This is especially important for four business use cases.

First, model QA and regression testing. If an internal AI product uses a judge to decide whether a new model version is better, the judge must be robust to formatting and verbosity changes. Otherwise, a product team may ship or reject a model based on presentation artifacts.

Second, safety and compliance monitoring. If a judge is used to flag unsafe or non-compliant outputs, false negatives and false positives must be reported separately. The AgentHarm results show why. Equal accuracy can hide opposite operational risks.

Third, agent supervision. As companies deploy tool-using agents, evaluation must move from single-turn outputs to multi-turn trajectories. A judge that works on a free-response benchmark may not reliably inspect agent transcripts.

Fourth, benchmarking and leaderboards. If public or internal leaderboards use LLM judges, then judge reliability should be reported alongside model scores. Otherwise, a leaderboard may be ranking systems partly by the quirks of its evaluator. That is not evaluation. That is office politics with tokens.

The appendix is not a second thesis; it explains where the main claim is fragile

The appendix matters because it separates the headline claim from its supporting diagnostics.

The cost tables support the cost-reliability argument. The PERSUADE table shows that accuracy alone is too crude for ordinal scoring, so the authors report CCC, Pearson correlation, Spearman correlation, and MAE. The non-agentic benchmark tables show that model performance varies by benchmark and perturbation. The AgentHarm table separates accuracy from false positive and false negative rates.
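For readers who have not met these four metrics together, the sketch below computes them with the standard textbook formulas in plain Python. It is not the paper's code, and the score lists are invented: the "judge" here is off by exactly one point on every essay.

```python
import math
from statistics import fmean
from typing import List, Sequence


def mae(expected: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error between expected and predicted scores."""
    return fmean([abs(e - p) for e, p in zip(expected, predicted)])


def _moments(x: Sequence[float], y: Sequence[float]):
    """Means, population variances, and population covariance."""
    mx, my = fmean(x), fmean(y)
    var_x = fmean([(a - mx) ** 2 for a in x])
    var_y = fmean([(b - my) ** 2 for b in y])
    cov = fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return mx, my, var_x, var_y, cov


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    _, _, var_x, var_y, cov = _moments(x, y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def ccc(x: Sequence[float], y: Sequence[float]) -> float:
    """Lin's concordance correlation coefficient: penalizes shifts in level and scale."""
    mx, my, var_x, var_y, cov = _moments(x, y)
    return 2 * cov / (var_x + var_y + (mx - my) ** 2)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


expected_scores = [1, 2, 3, 4, 5]
judge_scores = [2, 3, 4, 5, 6]  # invented: always one point too generous
print("Pearson:", pearson(expected_scores, judge_scores))
print("Spearman:", spearman(expected_scores, judge_scores))
print("CCC:", ccc(expected_scores, judge_scores))
print("MAE:", mae(expected_scores, judge_scores))
```

The invented judge gets perfect Pearson and Spearman correlations because it orders the essays correctly, while CCC and MAE both register that every score is a point too high. That is the practical reason to report agreement metrics alongside correlation for ordinal rubrics.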

These are not decorative robustness checks. They tell the reader how to avoid over-reading the paper.

For example, the paper reports that GPT-4o has strong PERSUADE synthetic ordinal performance, with Pearson’s $\rho = 0.960$ and MAE $= 0.226$ in that test. But the same model’s PERSUADE stochastic stability and format invariance results are weaker. So the model can track ordinal score levels in one generated setting while still being vulnerable to other perturbations.

That is the point of a harness. It does not produce one sacred number. It produces a reliability profile.

Boundaries: this is a reliability harness, not a universal judge ranking

The paper is careful, and the business interpretation should be equally careful.

The experiments are preliminary. The benchmark subsets are down-sampled. The perturbations are synthetic. Some generated cases are validated by another LLM before evaluation. Human review helps, but the amount of review varies by benchmark, especially in AgentHarm. The tested judge prompts and rubrics matter. Pricing snapshots can change. Model versions can change. A result from safety benchmarks may not transfer to finance, law, medicine, education, customer support, or internal document review.

Also, the metric used in many tests is agreement with the expected label of the synthetic perturbation. That is suitable for stress testing, but it does not fully settle broader validity questions. A judge can be stable and still apply the wrong rubric. It can be robust to formatting and still miss domain-specific nuance. Reliability is necessary. It is not the whole kingdom.

The safe conclusion is therefore local, not universal: before using an LLM judge in a business-critical workflow, test the judge under the kinds of perturbations that workflow will actually experience.

The judge needs an audit trail

The paper’s most useful contribution is not that one model beats another in one table. Those results will age. Model names will change. Prices will change. Someone will release a new judge with a grander name and a longer context window, because apparently we deserve both progress and branding.

The durable contribution is the workflow.

LLM judges are becoming infrastructure. Once they grade outputs, trigger escalations, approve agents, or shape benchmarks, their failures become system failures. The Judge Reliability Harness gives practitioners a way to inspect those failures before they are hidden inside a dashboard metric.

The old evaluation habit was to ask whether an LLM judge agreed with humans often enough.

The new habit should be stricter:

Does it agree with itself? Does it ignore formatting when formatting is irrelevant? Does it detect label-changing edits? Does it behave differently on ordinal scoring? Does it fail through false positives or false negatives? Does the expensive judge actually buy reliability, or just a nicer invoice?

A judge that cannot answer those questions should not be quietly promoted into production.

It may still be useful. It may even be the best available option. But it should sit in the system with an audit trail, reliability profile, and escalation policy.

The judge is not always right. More importantly, now we can test how it is wrong.

Cognaptus: Automate the Present, Incubate the Future.

¹ Sunishchal Dev, Andrew Sloan, Joshua Kavner, Nicholas Kong, and Morgan Sandler, “Judge Reliability Harness: Stress Testing the Reliability of LLM Judges,” arXiv:2603.05399, 2026.
