Competency Gaps: When Benchmarks Lie by Omission
Opening: Why this matters now

Large Language Models are scoring higher than ever, yet complaints from real users keep piling up: over-politeness, brittle refusals, confused temporal reasoning, shaky boundaries. This disconnect is not accidental; it is statistical. The paper Uncovering Competency Gaps in Large Language Models and Their Benchmarks argues that our dominant evaluation regime is structurally incapable of seeing certain failures. Aggregate benchmark scores smooth away exactly the competencies that matter in production systems: refusal behavior, meta-cognition, boundary-setting, and nuanced reasoning. The result is a comforting number, and a misleading one. ...