Opening: Why this matters now

Benchmarks were supposed to be neutral referees. Instead, they’ve become unreliable narrators. Over the past two years, the gap between benchmark leadership and real-world usefulness has widened into something awkwardly visible. Models that dominate leaderboards frequently underperform in deployment. Smaller, specialized models sometimes beat generalist giants where it actually counts. Yet our evaluation rituals have barely changed. ...