When the Referee Wants to Be Nice: Hidden Bias in AI Judges
Opening: Why this matters now
Everyone wants AI that can evaluate AI. It is cheaper than humans, faster than humans, and, according to many slide decks, more scalable than reality itself. Modern AI pipelines increasingly rely on LLM-as-a-judge systems to rate safety, quality, policy compliance, and readiness for deployment. These judges decide whether a model is helpful, harmful, safe, or suspect. Conveniently, they do so without lunch breaks. ...