Judging the Judges: When AI Evaluation Becomes a Fingerprint
The evaluator is not the scale

Evaluation looks boring until it changes the winner. A product team compares three candidate responses. A benchmark ranks five model releases. A content workflow asks an LLM judge to score generated SEO packs. The spreadsheet fills itself politely: five rubric dimensions, an overall score, maybe a few quoted receipts. Everyone pretends the judge is just a thermometer. ...