Benchmarks Without Borders: Inside the Moduli Space of AI Psychometrics
Opening — Why this matters now

The AI industry is obsessed with benchmarks. Every model launch arrives with an arsenal of charts—MMLU, GSM8K, HumanEval—paraded as proof of competence. Unfortunately, the real world has an annoying habit of not looking like a benchmark suite. As AI systems become multi-modal, agentic, tool-using, and deployed in mission-critical workflows, the industry faces a structural question: How do you evaluate general intelligence when the space of possible tasks is effectively infinite? ...