The Sealed Score: Why AI Evaluation Needs an Exam Day
A leaderboard score is useful until everyone starts treating it as a target. That is the uncomfortable business problem behind the paper "LLM Olympiad: Why Model Evaluation Needs a Sealed Exam" [1]. The paper is not arguing that benchmarks are useless. That would be theatrical, and not especially true. It argues something sharper: in the LLM era, a benchmark score is only as credible as the procedure that produced it. ...