LLM-as-a-Judge

Judge, Jury, and Benchmark: The Metanym Game Grades the Graders

TL;DR for operators: The Metanym Game asks models to invent structured analogies across unrelated domains, grade one another’s submissions, and reveal which graders deserve to be trusted.1 Because the test material is produced during the run, there is no fixed question bank waiting to appear in tomorrow’s training corpus. The clever part is not merely letting models vote. The paper separates two problems that most automated evaluation systems casually blend together: ...

Judge Math-Not by Its Parser

Opening: Why this matters now. The AI industry has discovered a wonderfully pedestrian way to misread progress: build models that can solve harder math problems, then grade them with evaluators that panic when 2040 minutes is not written as 34 hours. That is not a joke. It is the central irritation behind “Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity”, an arXiv paper that examines how mathematical reasoning benchmarks can be distorted by rigid symbolic verification.1 ...

When the Referee Wants to Be Nice: Hidden Bias in AI Judges

Audit. That is the word companies use when they want something to sound objective, disciplined, and preferably immune to politics. A model produces an answer. Another model evaluates it. The evaluator gives a verdict. Everyone gets a dashboard. The dashboard gets shown to management. Management nods, because dashboards have a calming effect on adults in conference rooms. ...

When AI Grades Itself: The Quiet Failure of LLM-as-a-Judge in Clinical Translation

Translation is one of those AI use cases that sounds almost too reasonable to argue with. English medical data exist in large quantities. Many healthcare systems, researchers, and educators need non-English clinical text. Large language models are fluent, cheap, and obedient enough to produce thousands of translated reports before lunch. The spreadsheet smiles. The budget owner relaxes. The governance team is told that quality will be checked by another LLM. ...

Judging the Judges: How Bias-Bounded Evaluation Could Make LLM Referees Trustworthy

Scores look clean on dashboards. That is part of the problem. A model gets 4.7 out of 5. A customer-support agent receives a “pass.” A generated legal summary is marked “acceptable.” A coding assistant is judged “safe to deploy.” The number is tidy, the workflow continues, and everyone pretends the judge was a neutral instrument rather than another model with its own sensitivities, habits, and small theatrical preferences. ...

Cheap Signals, Expensive Insights: Rethinking AI Evaluation with Tensor Factorization

Budget is where evaluation systems usually lose their innocence. A team wants to compare several models across hundreds or thousands of prompts. The obvious answer is human evaluation. The less obvious invoice arrives later: annotator time, reviewer fatigue, prompt coverage gaps, inconsistent judgments, and the slow realization that “we evaluated the model” often means “we averaged away the only differences that mattered.” ...