Judging the Judges: How Bias-Bounded Evaluation Could Make LLM Referees Trustworthy

Scores look clean on dashboards. That is part of the problem.

A model gets 4.7 out of 5. A customer-support agent receives a “pass.” A generated legal summary is marked “acceptable.” A coding assistant is judged “safe to deploy.” The number is tidy, the workflow continues, and everyone pretends the judge was a neutral instrument rather than another model with its own sensitivities, habits, and small theatrical preferences.

That pretense is becoming expensive. As companies build agentic systems, automated QA loops, model leaderboards, safety evaluators, and AI-assisted review pipelines, LLM-as-a-Judge is quietly moving from experimental convenience to infrastructure. The uncomfortable question is no longer whether these judges are biased. They are. The better question is whether their bias can be measured, bounded, and reported without destroying the evaluation signal.

The paper “Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation” takes exactly that route.¹ Its central move is not to promise a perfectly fair AI referee. That would be charming, in the same way a perfectly objective performance review is charming. Instead, the paper proposes Bias-Bounded Evaluation, with Average Bias-Boundedness (A-BB) as the formal mechanism: measure how sensitive a judge is to controlled, meaning-preserving perturbations, then transform the judge’s scores so that the measured bias is reflected as uncertainty.

The business translation is simple but easy to misuse:

Do not treat an LLM judge’s raw score as truth. Treat it as a measurement instrument whose bias sensitivity must be certified.

That is the useful idea. Not magic fairness. Not “the judge is now correct.” A certificate.

The mechanism starts with a modest definition of “ideal judgment”

The paper begins with a careful assumption: for a given judge, task, and rubric, there exists an “ideal judgment” the judge would have produced without the unwanted bias. This does not mean the judgment is correct in some universal human sense. It only means the judge would have produced a stable score conditioned on the relevant content, without being pushed around by irrelevant features such as formatting, order, style, or hidden schematic weakness.

That distinction matters. If a loan-screening evaluator, hiring reviewer, benchmark judge, or agent monitor is fundamentally using the wrong rubric, Bias-Bounded Evaluation will not rescue it. The method is not a moral philosopher in a trench coat. It bounds the impact of measurable bias relative to a chosen evaluation context.

The paper defines a judgment as a vector of scores:

$$ j = (s_1, s_2, \ldots, s_k, s_{\text{overall}}) $$

where the individual $s_i$ values correspond to rubric factors and $s_{\text{overall}}$ is the final judgment. Bias is then treated as systematic deviation caused by factors not captured in the explicit rubric.

The next concept is the real hinge: neighboring judgment contexts.

Two judgment contexts are neighbors if they differ in one prompt-response pair, and that pair has been modified by a bias-introducing perturbation that preserves the germane semantic content. Formatting changes, stylistic paraphrases, changes in rubric emphasis, and subtle agreeableness tests can all be treated as neighbor-generating perturbations.

This is a practical framing. The question is no longer “Is the judge biased?” which usually produces a conference panel and no answer. The question becomes:

If we perturb the input in a way that should not change the true evaluation, how much does the judge move?

That movement is the thing the paper tries to bound.

Average-case bias is less heroic than worst-case bias, and therefore more usable

A common temptation in formal methods is to ask for the strongest possible guarantee. In LLM evaluation, that becomes awkward very quickly.

A worst-case guarantee would require protecting against any possible neighboring context, including adversarial changes that could move scores across the entire scale. That is often too conservative for actual evaluation workflows. If the bound must survive every imaginable perturbation, the mechanism may have to add so much noise that the score becomes nearly useless. Congratulations, the judge is safe because it says almost nothing.

The paper instead uses an average-case guarantee. It fixes a judgment context $D$ and a randomized neighbor generator $T$. The generator produces perturbed versions $D'$ of the evaluation context. The judge is represented as a deterministic scoring function $f$ over that context.

The key sensitivity metric is root-mean-squared sensitivity:

$$ \Delta_2^\ast(f, D) = \left( \mathbb{E}_{D' \sim T(D)} \left[ \|f(D) - f(D')\|_2^2 \right] \right)^{1/2} $$

This measures the typical magnitude of score movement under the chosen perturbation process.
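The definition above can be estimated by plain Monte Carlo sampling. The sketch below is a minimal illustration, not the paper's Algorithm 1: the function name `rms_sensitivity`, the toy judges, and the padding-based neighbor generator are all hypothetical stand-ins chosen so the result is easy to check by hand.

```python
import math
import random
from typing import Callable, List, Sequence

Context = Sequence[str]   # one response text per evaluated item
Scores = List[float]      # one judge score per item


def rms_sensitivity(
    judge: Callable[[Context], Scores],
    context: Context,
    neighbor: Callable[[Context, random.Random], Context],
    n_samples: int = 200,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of sqrt(E ||f(D) - f(D')||_2^2) over D' ~ T(D)."""
    rng = random.Random(seed)
    base = judge(context)
    total_sq = 0.0
    for _ in range(n_samples):
        shifted = judge(neighbor(context, rng))
        total_sq += sum((a - b) ** 2 for a, b in zip(base, shifted))
    return math.sqrt(total_sq / n_samples)


# Hypothetical toy judge: rewards raw length, so padding moves its score.
def length_judge(context: Context) -> Scores:
    return [len(text) / 10.0 for text in context]


# Hypothetical toy judge that ignores surrounding whitespace.
def stripped_judge(context: Context) -> Scores:
    return [len(text.strip()) / 10.0 for text in context]


# Hypothetical neighbor generator: pad one randomly chosen response with spaces.
# The padding changes presentation only, not the content being judged.
def pad_one(context: Context, rng: random.Random) -> Context:
    out = list(context)
    i = rng.randrange(len(out))
    out[i] = out[i] + " " * 10
    return out


batch = ["a" * 12, "b" * 20, "c" * 30]
print(rms_sensitivity(length_judge, batch, pad_one))    # close to 1.0
print(rms_sensitivity(stripped_judge, batch, pad_one))  # 0.0
```

The two toy judges score the unperturbed batch identically. Only the perturbation test separates them, which is the point of measuring sensitivity rather than inspecting raw scores.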

The guarantee, called $(\tau, \delta)$-average bias-boundedness, asks whether the randomized mechanism $M$ can make the probability of a large score shift small:

$$ \Pr\left[\|M(D) - M(D')\|_2 > \tau\right] \le \delta $$

Here $\tau$ is the tolerance threshold and $\delta$ is the permitted failure probability.

The most important sentence in the paper is not the formula. It is the scope statement: the certificate is local to the fixed judgment context and the specific neighbor generator. It is not a distributional generalization guarantee over every future deployment.

This is where the likely reader misconception should be killed early. “Provably unbiased” does not mean “objectively fair forever.” It means: for this evaluation batch, with this judge, this rubric, and this perturbation generator, the average impact of the measured bias is bounded under the stated tolerance and failure probability.

That is narrower than the headline. It is also far more operationally useful.
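For intuition about why an RMS quantity is a natural input to a tail-probability statement of this kind, consider the raw, un-noised judge $f$. The step below is the standard Markov inequality applied to the squared norm. It is offered as a sanity check on the notation, not as the paper's proof, and it says nothing about the randomized mechanism $M$ itself. The numbers in the example are hypothetical.

$$ \Pr_{D' \sim T(D)}\left[\|f(D) - f(D')\|_2 > \tau\right] = \Pr\left[\|f(D) - f(D')\|_2^2 > \tau^2\right] \le \frac{\mathbb{E}\left[\|f(D) - f(D')\|_2^2\right]}{\tau^2} = \left(\frac{\Delta_2^\ast(f, D)}{\tau}\right)^2 $$

With a hypothetical $\Delta_2^\ast(f, D) = 0.1$ and $\tau = 0.5$, the right-hand side is $(0.1 / 0.5)^2 = 0.04$. A judge whose typical movement is small relative to the tolerance rarely produces a large shift under the chosen generator, and the bound degrades quadratically as sensitivity approaches $\tau$.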

The paper converts hidden judge sensitivity into visible uncertainty

The mechanism itself has two main moves.

First, estimate the judge’s RMS sensitivity by sampling neighboring contexts and observing how much the judge’s scores shift. Second, add calibrated Gaussian noise to the judge’s output:

$$ M_\sigma(D) = f(D) + Z,\quad Z \sim \mathcal{N}(0, \sigma^2 I_d) $$

The noise is calibrated so that the output satisfies the average bias-bound.

At first glance, adding noise to make a judge more trustworthy sounds perverse. The judge was already uncertain, so naturally we add more uncertainty. Very modern. But the logic is not that noise improves knowledge. The logic is that the raw score had false precision. If a judge’s score changes substantially under semantically irrelevant perturbations, the original number is already contaminated by uncertainty. Bias-Bounded Evaluation makes that uncertainty explicit.
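The noise-addition step itself is mechanically simple. The sketch below assumes $\sigma$ has already been chosen by whatever calibration rule is in use. The paper's calibration formula is not reproduced here, and the value `0.3` is a placeholder, as are the function name and the example scores.

```python
import random
from typing import List, Sequence


def gaussian_mechanism(
    scores: Sequence[float], sigma: float, rng: random.Random
) -> List[float]:
    """Return f(D) + Z, where each coordinate of Z is an independent N(0, sigma^2) draw."""
    if sigma < 0:
        raise ValueError("sigma must be non-negative")
    return [s + rng.gauss(0.0, sigma) for s in scores]


rng = random.Random(0)
raw = [4.7, 4.1, 3.2]  # hypothetical raw judge scores f(D)
released = gaussian_mechanism(raw, sigma=0.3, rng=rng)  # sigma is a placeholder value
print(released)
```

Setting `sigma=0.0` returns the raw scores unchanged, which makes the trade-off visible: the released score is exactly as precise as the chosen noise level allows, and no more.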

The paper also introduces deterministic Lipschitz shrinkage before noise addition. In plain language, shrinkage compresses the score distribution around a center point. If $g$ is an $L$-Lipschitz map, then the RMS sensitivity of the composed judge $g \circ f$ is bounded by:

$$ \Delta_2^\ast(g \circ f, D) \le L\Delta_2^\ast(f, D) $$

A simple example is affine shrinkage:

$$ g(x) = \alpha x + (1-\alpha)\mu $$

where $\alpha \in (0,1]$ controls how aggressively scores are pulled toward a center $\mu$.

This is not merely aesthetic compression. It is a certifiability tool. Shrinkage reduces the effective sensitivity, so less Gaussian noise may be required to obtain the same type of certificate. The trade-off is obvious: too little shrinkage, and bias remains hard to certify; too much shrinkage, and the evaluation loses useful ranking signal.

That trade-off is the core engineering problem.
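For the affine map above, the Lipschitz constant is exactly $\alpha$, so the bound holds with equality. A minimal sketch with hypothetical scores shows the effect on a single pair of neighboring judgments. The helper names are illustrative, not taken from the paper.

```python
import math
from typing import List, Sequence


def affine_shrink(scores: Sequence[float], alpha: float, mu: float) -> List[float]:
    """Apply g(x) = alpha * x + (1 - alpha) * mu to every score."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    return [alpha * s + (1.0 - alpha) * mu for s in scores]


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


original = [4.8, 3.0, 1.5]   # hypothetical f(D)
perturbed = [4.2, 3.1, 1.9]  # hypothetical f(D') for one neighbor
alpha, mu = 0.5, 3.0         # hypothetical shrinkage settings

before = l2_distance(original, perturbed)
after = l2_distance(
    affine_shrink(original, alpha, mu), affine_shrink(perturbed, alpha, mu)
)
print(before, after)  # after equals alpha * before, up to floating-point rounding
```

The same factor $\alpha$ also compresses the gaps between models, which is why aggressive shrinkage costs ranking resolution once noise is added on top.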

What the experiments are actually testing

The empirical section uses Arena-Hard-Auto, a benchmark built from 500 challenging Chatbot Arena queries. The paper evaluates four judge models: GPT-4o-mini-0718, QwQ-32B, DeepSeek-R1-Distill-32B, and GPT-3.5-Turbo.

The authors examine several bias-related components:

| Test or component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Inherent jitter | Implementation detail / baseline variance check | Separates natural judge variability from targeted perturbation sensitivity | Does not identify a specific external bias source |
| Formatting sensitivity | Main evidence | Tests whether BBE can control superficial prompt/style sensitivity while retaining ranking signal | Does not cover every possible presentation bias |
| Schematic adherence | Main evidence / structural bias test | Measures whether overall scores are explained by stated factor scores | Does not prove the rubric itself is correct |
| Judge-and-strategy heatmap | Robustness / sensitivity comparison | Shows signal preservation varies by judge and aggregation strategy | Does not establish universal performance across domains |
| Trust-or-Escalate comparison table | Comparison with prior work | Clarifies how A-BB differs from abstention-based human-agreement guarantees | Does not show A-BB is always preferable |

This table matters because the paper’s experiments are easy to overread. The formatting and schematic tests are not “the bias problem solved.” They are demonstrations that the mechanism can preserve ranking signal while making measured bias uncertainty explicit.

The headline empirical result is that, on Arena-Hard-Auto, the method reports $(\tau = 0.5, \delta = 0.01)$ bias-bounded guarantees while retaining 61–99% correlation with original rankings across formatting and schematic bias settings, with most judge-bias combinations above 80%. The figures also show a conservative example using $\tau = 0.5$, $\delta = 0.03$, and dimension 500. One experimental paragraph lists $\tau = 0.05$ while figure captions use $\tau = 0.5$, so the safest reading is to preserve the settings as reported rather than silently harmonizing them.

That small notation wrinkle does not ruin the argument. It does, however, remind us that formal guarantees still arrive inside ordinary research papers, written by humans, with all the usual joys of notation management.

Formatting bias: the judge should not care how the answer dresses

The formatting experiment is the easiest business analogy.

Suppose two responses contain the same relevant substance, but one is formatted in a way the judge likes better. It has cleaner headings, a more familiar style, or a tone that resembles the judge’s preferred answer pattern. If the judge rewards that formatting when the rubric is supposed to evaluate substance, the score is biased.

The paper models this using a generative mechanism that produces semantically similar variants of model responses. It then measures how sensitive each judge is to those variants.

In one example, the QwQ-32B judge is evaluated under formatting sensitivity. The paper reports that after applying BBE, the debiased ranking retains strong signal. The figure caption reports 88% correlation with original judgments under a low-tolerance setting; the surrounding text reports 81% correlation to the original ranking. The more important point is not the exact correlation line in isolation. It is the shape of the transformation.

Before BBE, high-performing models can receive inflated scores with narrow-looking confidence intervals. After BBE, the score distribution compresses. Extreme certainty shrinks. The judge is not forbidden from ranking models; it is prevented from pretending that formatting-sensitive confidence is clean evidence.

For enterprise use, this is the difference between an evaluation report that says:

“Model A scored 92.”

and one that says:

“Model A ranks ahead under this judge, but the score has been compressed because the judge is sensitive to formatting perturbations.”

The second version is less glamorous and more useful. It tells the evaluation owner what kind of uncertainty is being carried forward.
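The “correlation with original rankings” figures quoted in this section can be illustrated with a rank correlation between raw and bias-bounded scores. The sketch below uses Spearman correlation as an assumption, since this article does not state which correlation statistic the paper computes. The scores, shrinkage settings, and noise levels are hypothetical.

```python
import math
import random
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_retained_signal(scores, alpha, mu, sigma, trials=300, seed=0) -> float:
    """Average Spearman correlation between raw scores and shrunk, noised scores."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        bounded = [alpha * s + (1 - alpha) * mu + rng.gauss(0.0, sigma) for s in scores]
        total += spearman(scores, bounded)
    return total / trials


model_scores = [4.7, 4.1, 3.9, 3.2, 2.5]  # hypothetical raw scores for five models
print(mean_retained_signal(model_scores, alpha=0.5, mu=3.5, sigma=0.0))  # 1.0
print(mean_retained_signal(model_scores, alpha=0.5, mu=3.5, sigma=0.2))
print(mean_retained_signal(model_scores, alpha=0.5, mu=3.5, sigma=50.0))
```

Shrinkage alone never reorders models, so the first call returns a perfect correlation. Ranking signal is lost only through the noise, and the more noise a sensitive judge requires, the closer the retained correlation drifts toward zero.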

Schematic bias: the judge may not be following its own rubric

Formatting bias is superficial. Schematic bias is more structural.

The paper defines schematic adherence as the degree to which an LLM judge’s overall score can be explained by its per-criteria factor scores. In Appendix C, the authors fit linear and polynomial models to predict the overall verdict from the factor-level scores. They define the schematic adherence score as the better $R^2$ across those models, then convert weak adherence into sensitivity:

$$ S_{\text{sch}} = \sqrt{1 - R^2_{\text{schematic}}} $$

Low schematic adherence means the judge’s final verdict is not well explained by the stated rubric factors. In business language: the judge wrote down a policy, then did something else.

The paper reports substantial variation across judges, with empirical schematic $R^2$ ranging from 0.10 for DeepSeek-R1-Distill-32B to 0.74 for GPT-4o-mini. That is not a small detail. It implies that some judges are much more faithful to their own explicit scoring structure than others.
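To make the conversion from adherence to sensitivity concrete, the sketch below applies $\sqrt{1 - R^2}$ to the two reported endpoints. It assumes the candidate $R^2$ values have already been obtained from the linear and polynomial fits. The function names are hypothetical, and clamping $R^2$ into $[0, 1]$ is an added safeguard rather than something stated in the paper.

```python
import math
from typing import Sequence


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination of predicted overall scores against actual ones."""
    mean = sum(actual) / len(actual)
    ss_tot = sum((y - mean) ** 2 for y in actual)
    if ss_tot == 0.0:
        raise ValueError("actual scores have no variance; R^2 is undefined")
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1.0 - ss_res / ss_tot


def schematic_sensitivity(r2_candidates: Sequence[float]) -> float:
    """Take the better R^2 across fitted models, then return sqrt(1 - R^2)."""
    best = max(r2_candidates)
    best = min(1.0, max(0.0, best))  # assumption: clamp so the square root is defined
    return math.sqrt(1.0 - best)


# R^2 endpoints quoted in the text above.
print(round(schematic_sensitivity([0.10]), 3))  # 0.949
print(round(schematic_sensitivity([0.74]), 3))  # 0.51
```

On this scale the least adherent judge carries nearly twice the schematic sensitivity of the most adherent one, so it would need correspondingly more shrinkage or noise to certify.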

In the schematic-bias experiment, the method compresses extreme distributions while preserving high ranking correlation. For both GPT-3.5-Turbo and GPT-4o-mini, the paper shows near-perfect correlation after debiasing in the illustrated schematic setting.

Again, the right interpretation is not “schematic bias disappears.” The better interpretation is:

If the judge’s final score is drifting away from its stated rubric, the mechanism can force that drift to appear as uncertainty rather than as clean ranking confidence.

That is a useful governance primitive. It does not tell management what the right rubric should be. It tells management when the current judge is not behaving as if the rubric fully explains its verdicts.

The heatmap says the method is not equally forgiving to every judge

Figure 4 is the part of the paper that prevents the result from becoming a slogan.

The heatmap compares signal preservation by judge and aggregation strategy. GPT-4o-mini performs extremely well across the shown settings, with correlations around 0.999. GPT-3.5-Turbo also preserves strong signal in the displayed cases. DeepSeek-R1-Distill-32B varies more by setting. QwQ-32B shows weaker preservation in some strategies, around 0.613 under formatting-only and conservative aggregation, rising to about 0.709 under RMS aggregation.

This is not a failure. It is exactly what a useful diagnostic should reveal.

A judge with lower intrinsic jitter and lower measured systematic sensitivity is easier to certify while preserving ranking signal. A judge that moves around more under perturbation needs more compression or noise. That means the score becomes less discriminating. The method does not launder a weak judge into a strong one. It makes the weakness more visible.

For buyers of AI evaluation systems, that distinction matters. Bias-Bounded Evaluation is not just a way to “fix” judge outputs. It can be used to compare judge candidates:

| Evaluation question | Operational interpretation |
| --- | --- |
| Which judge preserves ranking signal after bias bounding? | Better candidate for automated evaluation |
| Which judge needs heavy shrinkage or noise? | Higher governance cost |
| Which perturbation type causes the largest score movement? | Highest-priority failure mode to test |
| Which rubric factors fail to explain overall scores? | Rubric or judge-design problem |
| Which setting produces unstable certification? | Not ready for autonomous use |

This is where the business value appears. Not in the phrase “provably unbiased,” which is dangerous if read lazily. The value is cheaper diagnosis of judge reliability before the judge is placed inside a workflow that can damage customers, employees, models, or databases.

Trust-or-Escalate and A-BB solve different governance problems

The paper compares A-BB with Trust-or-Escalate, a framework that provides guarantees of human agreement through cascaded LLM judges and confidence-based escalation. Trust-or-Escalate abstains or escalates when confidence is low. A-BB does not abstain; it bounds bias impact directly across evaluations.

The comparison is important because both approaches sound like “making LLM judges safer,” but their operational contracts differ.

Trust-or-Escalate asks: can the system decide when an LLM judgment is reliable enough to accept rather than escalate?

A-BB asks: given an evaluation context and measured perturbation process, can the impact of judge bias be bounded?

That creates different deployment patterns.

| Governance need | Better fit |
| --- | --- |
| Human-agreement guarantee for selected cases | Trust-or-Escalate |
| Bias-impact certificate across all evaluations | A-BB |
| Pairwise preference evaluation with abstention | Trust-or-Escalate |
| General scoring beyond pairwise comparison | A-BB |
| Low-confidence cases routed to humans | Trust-or-Escalate |
| Scores retained but uncertainty exposed | A-BB |
| No human-labeled calibration data available | A-BB may be more practical |

The paper notes that A-BB can be combined with conformal prediction methods for human-agreement guarantees. That is probably the more realistic enterprise future: not one magic framework, but layered evaluation assurance. Raw judge score, bias-bounded transformation, interval estimate, escalation policy, and human review for high-stakes cases. Boring? Yes. Also how infrastructure becomes less embarrassing.

What Cognaptus infers for business use

The paper directly shows a formal mechanism and empirical results on Arena-Hard-Auto. Cognaptus’ business inference is broader but should remain disciplined.

Companies already use AI to evaluate text, code, candidates, service quality, compliance claims, sales conversations, support tickets, agent outputs, and internal documents. In many of these settings, ground truth is sparse, slow, contested, or expensive. That is exactly why LLM judges are attractive.

The risk is that organizations will use these judges as if they were cheaper humans. They are not. They are scoring functions with measurable sensitivities. That makes them governable, but only if the organization performs the measurement.

A practical enterprise workflow could look like this:

| Step | What the company does | Output |
| --- | --- | --- |
| Define the judgment context | Fix the task, rubric, evaluated batch, and judge model | Evaluation scope |
| Build neighbor generators | Create formatting, order, paraphrase, rubric-emphasis, or error-injection perturbations | Bias test suite |
| Estimate RMS sensitivity | Measure how much scores move under each perturbation | Bias sensitivity profile |
| Apply shrinkage/noise | Transform scores so measured bias appears as uncertainty | Bias-bounded score output |
| Report retained signal | Check whether ranking correlation remains useful | Utility assessment |
| Decide routing | Use bounded scores directly, escalate, or reject the judge | Governance decision |

This shifts evaluation governance from prompt folklore to measurement.
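The final routing step in the workflow above could be written down as an explicit policy rather than left to judgment calls. The sketch below is one hypothetical shape for such a policy. The threshold values are illustrative placeholders that an evaluation owner would set for their own risk level. They are not recommendations from the paper.

```python
def decide_routing(
    certified: bool,
    retained_correlation: float,
    use_threshold: float = 0.8,       # hypothetical policy value
    escalate_threshold: float = 0.6,  # hypothetical policy value
) -> str:
    """Map a certification result and retained ranking signal to a governance decision."""
    if not certified:
        return "reject_judge"            # no certificate under the chosen tolerance
    if retained_correlation >= use_threshold:
        return "use_bounded_scores"      # enough ranking signal survives
    if retained_correlation >= escalate_threshold:
        return "escalate_to_humans"      # usable only with human review
    return "reject_judge"                # too little signal left after bounding


print(decide_routing(certified=True, retained_correlation=0.93))   # use_bounded_scores
print(decide_routing(certified=True, retained_correlation=0.65))   # escalate_to_humans
print(decide_routing(certified=False, retained_correlation=0.99))  # reject_judge
```

Writing the policy as code has one practical benefit: the thresholds become reviewable artifacts instead of unstated habits.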

For low-stakes content ranking, a lightweight version may be enough. For regulated or high-impact domains, the method should be one layer in a broader assurance stack. If a bank, insurer, school, hospital, or hiring platform uses an LLM judge, “we prompted it carefully” is not a control. It is a diary entry.

A bias-bounded certificate is closer to a control.

The boundary: local, measured, finite-sample, and still dependent on the judge

The limitation section is unusually important here because the title can tempt readers into overclaiming.

First, A-BB does not guarantee absolute accuracy. A judge can be consistently wrong and still become bias-bounded relative to measured perturbations. If the rubric is bad, or the judge misses a key aspect of the input, the certificate does not repair the evaluation goal.

Second, the guarantee depends on the neighbor generator. If the organization measures formatting sensitivity but misses a larger source of bias, the certificate does not cover that unmeasured bias. Appendix B discusses conservative calibration across a family of measured neighbor generators: choose a bound large enough to cover the worst measured generator and any unknown bias whose RMS sensitivity is no larger. That is useful, but it is still bounded by what the measurement design can represent.

Third, the guarantee is local to the fixed judgment context. It applies to the certified batch and perturbation process, not automatically to future business contexts, new products, new users, new languages, or new adversarial behavior.

Fourth, the sensitivity estimate is empirical. The theory assumes access to the true RMS sensitivity. In practice, Algorithm 1 estimates it from sampled neighbors. With finite samples, the estimate can understate true sensitivity. The authors suggest using sufficiently large samples or adding confidence margins, such as upper confidence bounds. That is not administrative detail. It is the difference between a certificate and a nice spreadsheet.

Finally, the underlying judge still matters. The paper explicitly notes that A-BB retains more true signal when intrinsic jitter and systematic bias are weaker. A chaotic judge can be made more honest about its uncertainty, but not magically wise. There is no refund counter for ontology.
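The second and fourth limitations above suggest two defensive habits: put an upper confidence margin on each sensitivity estimate, then calibrate against the worst measured generator. The sketch below uses a percentile bootstrap for the margin. That is an assumption made here for illustration, because the text only says the authors suggest confidence margins such as upper confidence bounds. The generator names and squared-shift samples are hypothetical.

```python
import math
import random
from typing import Dict, Sequence, Tuple


def rms(squared_shifts: Sequence[float]) -> float:
    """Point estimate of RMS sensitivity from sampled squared score shifts."""
    return math.sqrt(sum(squared_shifts) / len(squared_shifts))


def bootstrap_upper_bound(
    squared_shifts: Sequence[float],
    confidence: float = 0.95,
    n_boot: int = 2000,
    seed: int = 0,
) -> float:
    """Percentile-bootstrap upper bound on the RMS sensitivity (a heuristic margin)."""
    rng = random.Random(seed)
    n = len(squared_shifts)
    estimates = sorted(rms(rng.choices(squared_shifts, k=n)) for _ in range(n_boot))
    index = min(n_boot - 1, math.ceil(confidence * n_boot) - 1)
    return estimates[index]


def conservative_bound(per_generator: Dict[str, Sequence[float]]) -> Tuple[str, float]:
    """Return the generator with the largest upper bound, and that bound."""
    bounds = {name: bootstrap_upper_bound(sq) for name, sq in per_generator.items()}
    worst = max(bounds, key=bounds.get)
    return worst, bounds[worst]


# Hypothetical squared shifts ||f(D) - f(D')||^2 sampled per neighbor generator.
measured = {
    "formatting": [0.01, 0.04, 0.02, 0.09, 0.03, 0.05],
    "paraphrase": [0.16, 0.25, 0.10, 0.36, 0.20, 0.12],
}
worst_name, worst_bound = conservative_bound(measured)
print(worst_name, worst_bound)
```

A percentile bootstrap on six samples is far too thin for a real certificate. The example is sized for readability, and the margin is only as good as the number and diversity of neighbors behind it.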

The real contribution is not debiasing; it is auditability

The title of the paper leans toward “provably unbiased.” The more precise contribution is “provably bias-bounded under specified measurement conditions.” That may sound less dramatic. It is also the part businesses can use.

LLM judges are not going away. Human evaluation is too slow, too expensive, and sometimes too inconsistent for the scale of AI systems now being built. Automated evaluation will keep expanding into agent monitoring, workflow QA, model selection, compliance screening, and synthetic feedback loops.

The question is whether those judges remain opaque scoring oracles or become auditable components.

Bias-Bounded Evaluation offers a mechanism-first answer:

Define the evaluation context.
Generate meaning-preserving biased neighbors.
Measure how much the judge moves.
Compress and noise the score so measured bias becomes visible uncertainty.
Report what signal remains.
State exactly what the guarantee does and does not cover.

That is not a glamorous story. It is better than glamorous. It is operational.

The strongest business lesson is not that every company should immediately deploy A-BB exactly as written. The lesson is that automated judgment must stop being treated as a cheap substitute for human judgment and start being treated as a measurement system. Measurement systems need calibration, uncertainty reporting, perturbation tests, and failure boundaries.

In other words, if machines are going to judge machines, the judge needs a judge too.

Cognaptus: Automate the Present, Incubate the Future.

¹ Benjamin Feuer, Lucas Rosenblatt, and Oussama Elachqar, “Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation,” arXiv:2603.05485, 2026. https://arxiv.org/abs/2603.05485
