Judging the Judges: How Bias-Bounded Evaluation Could Make LLM Referees Trustworthy
Scores look clean on dashboards. That is part of the problem. A model gets 4.7 out of 5. A customer-support agent receives a “pass.” A generated legal summary is marked “acceptable.” A coding assistant is judged “safe to deploy.” The number is tidy, the workflow continues, and everyone pretends the judge was a neutral instrument rather than another model with its own sensitivities, habits, and small theatrical preferences. ...