The Judge Is Not Always Right: Stress‑Testing LLM Judges

A judge is useful only if it can survive the boring parts of reality. Not the dramatic failure cases. Not the philosophical debates about machine intelligence. The boring parts: an extra blank line, a shorter answer, a paraphrased sentence, a multi-turn transcript where one message quietly changes the outcome, or a scoring rubric that asks for a number instead of a yes-or-no label. ...

March 6, 2026 · 16 min · Zelina