Opening — Why this matters now
Large Language Models are scoring higher than ever, yet complaints from real users keep piling up: over-politeness, brittle refusals, confused temporal reasoning, shaky boundaries. This disconnect is not accidental; it is statistical.
The paper *Uncovering Competency Gaps in Large Language Models and Their Benchmarks* argues that our dominant evaluation regime is structurally incapable of seeing certain failures. Aggregate benchmark scores smooth away exactly the competencies that matter in production systems: refusal behavior, meta-cognition, boundary-setting, and nuanced reasoning. The result is a comforting number, and a misleading one.
Background — What benchmarks get wrong
Benchmarks work by compression. Thousands of behaviors are flattened into a single accuracy or win-rate. This works well when failures are uniformly distributed. Unfortunately, LLM failures are not.
Prior work has already shown that topic-level breakdowns (e.g., math subdomains) reveal wide variance hidden by averages. But topic labels are coarse, manually curated, and detached from how models internally represent language. Worse, benchmarks themselves are uneven: they over-test what is easy to measure and under-test what is socially or operationally hard.
In other words, we have model gaps (where the model underperforms on specific competencies) and benchmark gaps (where evaluations fail to test those competencies at all). Until now, we lacked a scalable way to see either clearly.
Analysis — What the paper actually does
The paper introduces Competency Gaps (CG), a representation-grounded evaluation method built on Sparse Autoencoders (SAEs).
Instead of asking "Did the model get this question right?", CG asks:
- Which internal concepts were activated?
- How well does the model perform when those concepts matter?
- How often do benchmarks even test those concepts?
The core mechanism
- SAE concept extraction: Dense hidden states are decomposed into thousands of sparse, human-interpretable concepts.
- Concept activation scoring: Each benchmark datapoint is mapped into concept space.
- Coverage metrics: Measure how well benchmarks cover each concept.
- Performance metrics: Measure how well the model performs conditional on concept activation.
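To make the first two steps concrete, here is a minimal sketch in PyTorch. The SAE architecture, layer choice, shapes, and the `threshold` for counting a concept as "activated" are placeholders for illustration, not the paper's exact setup:

```python
import torch

def sae_concepts(hidden: torch.Tensor, W_enc: torch.Tensor, b_enc: torch.Tensor) -> torch.Tensor:
    """Decompose dense hidden states into sparse, non-negative concept activations.
    hidden: (tokens, d_model); W_enc: (d_model, n_concepts); b_enc: (n_concepts,)."""
    return torch.relu(hidden @ W_enc + b_enc)  # one column per human-interpretable concept

def datapoint_concepts(hidden: torch.Tensor, W_enc: torch.Tensor, b_enc: torch.Tensor,
                       threshold: float = 0.0) -> torch.Tensor:
    """Map one benchmark datapoint into concept space: a concept counts as
    activated if any token's code exceeds the (assumed) threshold."""
    codes = sae_concepts(hidden, W_enc, b_enc)      # (tokens, n_concepts)
    return codes.max(dim=0).values > threshold      # (n_concepts,) boolean mask
```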
Together, the coverage and performance metrics yield two orthogonal lenses:
| Dimension | What it reveals |
|---|---|
| Benchmark Gaps | Concepts benchmarks systematically ignore |
| Model Gaps | Concepts the model consistently fails |
Crucially, this works across benchmarks, enabling apples-to-apples comparisons that traditional evaluations cannot support.
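A minimal sketch of both lenses, assuming each datapoint carries a binary correctness label and the boolean concept mask from above (the paper's metrics may weight activations continuously; this is an illustration, not its implementation):

```python
import numpy as np

def gap_metrics(activations: np.ndarray, correct: np.ndarray):
    """activations: (n_datapoints, n_concepts) boolean, which concepts each item activates.
    correct: (n_datapoints,) boolean, whether the model answered each item correctly.
    Returns per-concept coverage (benchmark lens) and conditional accuracy (model lens)."""
    coverage = activations.mean(axis=0)                    # fraction of items touching each concept
    hits = (activations & correct[:, None]).sum(axis=0)    # correct items per concept
    tested = activations.sum(axis=0)                       # tested items per concept
    accuracy = np.divide(hits, tested,
                         out=np.full_like(hits, np.nan, dtype=float),
                         where=tested > 0)                 # NaN where a concept is never tested
    return coverage, accuracy   # low coverage => benchmark gap; low accuracy => model gap
```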
Findings — What falls through the cracks
The results are uncomfortable—and consistent.
1. Benchmarks are heavily skewed
Across ten widely used benchmarks, concept coverage follows a long-tail distribution:
- A small set of concepts (instruction-following, obedience, formatting) dominates.
- Hundreds of concepts are barely tested.
- ~1% of concepts are completely missing.
Missing concepts disproportionately involve:
- Meta-cognition (explaining limitations, referencing user intent)
- Safety and refusal reasoning
- Professional boundary maintenance
- Legal and regulatory awareness
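One hypothetical way to summarize that skew from a coverage vector (the size of the "head" and the zero-coverage test for missing concepts are choices of this sketch, not the paper's):

```python
import numpy as np

def coverage_profile(coverage: np.ndarray, head_k: int = 20) -> dict:
    """Summarize how skewed benchmark coverage is across concepts.
    coverage: (n_concepts,) fraction of datapoints activating each concept."""
    order = np.argsort(coverage)[::-1]
    head_share = coverage[order[:head_k]].sum() / coverage.sum()  # mass held by the top-k concepts
    missing = np.flatnonzero(coverage == 0)                       # concepts the benchmark never tests
    return {"head_share": head_share,
            "n_missing": missing.size,
            "missing_fraction": missing.size / coverage.size}
```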
2. Models fail where benchmarks are thin
When performance is projected into concept space, the pattern mirrors benchmark coverage almost perfectly.
Best-performing concepts:
- Coding patterns
- Step-by-step explanations
- Positive, compliant user-facing behavior
Worst-performing concepts:
- Polite refusal and redirection
- Boundary enforcement
- Temporal reasoning
- Intuitive reasoning without explicit scaffolding
This is no coincidence: models optimize for what is rewarded, and benchmarks rarely reward saying "no" well.
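One hypothetical way to quantify that mirroring is a rank correlation between per-concept coverage and per-concept accuracy (the paper presents the alignment qualitatively; Spearman's rho is this sketch's choice):

```python
import numpy as np
from scipy.stats import spearmanr

def coverage_vs_performance(coverage: np.ndarray, accuracy: np.ndarray) -> float:
    """Rank-correlate how often a concept is tested with how well the model handles it.
    A strongly positive value means the model does best exactly where benchmarks look most."""
    mask = ~np.isnan(accuracy)            # drop concepts with no tested datapoints
    rho, _ = spearmanr(coverage[mask], accuracy[mask])
    return rho
```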
3. Aggregate scores exaggerate competence
Because overrepresented concepts also tend to be easy and well-trained, aggregate scores are dominated by a narrow behavioral slice. The headline metric mostly measures obedience fluency, not reasoning breadth.
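A toy calculation with made-up numbers shows the size of the distortion: weighting by coverage (the headline number) versus weighting each concept equally.

```python
# Toy illustration, invented numbers: micro (headline) vs. concept-balanced (macro) accuracy.
coverage = [0.90, 0.10]   # share of benchmark datapoints activating each concept
accuracy = [0.95, 0.40]   # model accuracy conditional on that concept

micro = sum(c * a for c, a in zip(coverage, accuracy))  # 0.895 -- what the leaderboard shows
macro = sum(accuracy) / len(accuracy)                   # 0.675 -- equal weight per concept
print(f"headline: {micro:.3f}  concept-balanced: {macro:.3f}")
```

The same model, the same answers, a twenty-point gap: the only difference is whether rarely tested concepts count.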
Implications — What this means for practice
For model developers
- High benchmark scores do not imply robust real-world behavior.
- Safety and refusal failures are not edge cases—they are under-measured defaults.
- Instruction tuning without balanced evaluation creates behavioral monocultures.
For benchmark designers
- Adding more questions is not the solution; adding missing concepts is.
- CG provides a feedback loop for targeted data generation.
- Benchmarks should declare their conceptual coverage explicitly, not implicitly.
For regulators and enterprise buyers
- Vendor benchmarks are incomplete risk disclosures.
- Representation-grounded audits offer a more defensible evaluation layer.
- “Model quality” must be decomposed before it can be governed.
Conclusion — Evaluation without illusions
Competency Gaps does not replace benchmarks. It exposes their blind spots.
The uncomfortable truth is that we have been grading language models on a curved exam written by ourselves. CG forces us to confront what we forgot to test—and, more importantly, what we taught models they never had to learn.
Cognaptus: Automate the Present, Incubate the Future.