Opening — Why this matters now
Large Language Models are scoring higher than ever, yet complaints from real users keep piling up: over-politeness, brittle refusals, confused temporal reasoning, shaky boundaries. This disconnect is not accidental; it is statistical.
The paper *Uncovering Competency Gaps in Large Language Models and Their Benchmarks* argues that our dominant evaluation regime is structurally incapable of seeing certain failures. Aggregate benchmark scores smooth away exactly the competencies that matter in production systems: refusal behavior, meta-cognition, boundary-setting, and nuanced reasoning. The result is a comforting number, and a misleading one.
Background — What benchmarks get wrong
Benchmarks work by compression. Thousands of behaviors are flattened into a single accuracy or win-rate. This works well when failures are uniformly distributed. Unfortunately, LLM failures are not.
Prior work has already shown that topic-level breakdowns (e.g., math subdomains) reveal wide variance hidden by averages. But topic labels are coarse, manually curated, and detached from how models internally represent language. Worse, benchmarks themselves are uneven: they over-test what is easy to measure and under-test what is socially or operationally hard.
In other words, we have model gaps (where the model underperforms on specific competencies) and benchmark gaps (where evaluations fail to test those competencies at all). Until now, we lacked a scalable way to see either clearly.
Analysis — What the paper actually does
The paper introduces Competency Gaps (CG), a representation-grounded evaluation method built on Sparse Autoencoders (SAEs).
Instead of asking "Did the model get this question right?", CG asks:
- Which internal concepts were activated?
- How well does the model perform when those concepts matter?
- How often do benchmarks even test those concepts?
The core mechanism
- SAE concept extraction: Dense hidden states are decomposed into thousands of sparse, human-interpretable concepts.
- Concept activation scoring: Each benchmark datapoint is mapped into concept space.
- Coverage metrics: Measure how well benchmarks cover each concept.
- Performance metrics: Measure how well the model performs conditional on concept activation.
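To make the first two steps concrete, here is a minimal sketch in PyTorch. The SAE architecture, layer choice, shapes, and the `threshold` for counting a concept as "activated" are placeholders for illustration, not the paper's exact setup:

```python
import torch

def sae_concepts(hidden: torch.Tensor, W_enc: torch.Tensor, b_enc: torch.Tensor) -> torch.Tensor:
    """Decompose dense hidden states into sparse, non-negative concept activations.
    hidden: (tokens, d_model); W_enc: (d_model, n_concepts); b_enc: (n_concepts,)."""
    return torch.relu(hidden @ W_enc + b_enc)  # one column per human-interpretable concept

def datapoint_concepts(hidden: torch.Tensor, W_enc: torch.Tensor, b_enc: torch.Tensor,
                       threshold: float = 0.0) -> torch.Tensor:
    """Map one benchmark datapoint into concept space: a concept counts as
    activated if any token's code exceeds the (assumed) threshold."""
    codes = sae_concepts(hidden, W_enc, b_enc)      # (tokens, n_concepts)
    return codes.max(dim=0).values > threshold      # (n_concepts,) boolean mask
```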
Together, the coverage and performance metrics yield two orthogonal lenses:
| Dimension | What it reveals |
|---|---|
| Benchmark Gaps | Concepts benchmarks systematically ignore |
| Model Gaps | Concepts the model consistently fails |
Crucially, this works across benchmarks, enabling apples-to-apples comparisons that traditional evaluations cannot support.
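A minimal sketch of both lenses, assuming each datapoint carries a binary correctness label and the boolean concept mask from above (the paper's metrics may weight activations continuously; this is an illustration, not its implementation):

```python
import numpy as np

def gap_metrics(activations: np.ndarray, correct: np.ndarray):
    """activations: (n_datapoints, n_concepts) boolean, which concepts each item activates.
    correct: (n_datapoints,) boolean, whether the model answered each item correctly.
    Returns per-concept coverage (benchmark lens) and conditional accuracy (model lens)."""
    coverage = activations.mean(axis=0)                    # fraction of items touching each concept
    hits = (activations & correct[:, None]).sum(axis=0)    # correct items per concept
    tested = activations.sum(axis=0)                       # tested items per concept
    accuracy = np.divide(hits, tested,
                         out=np.full_like(hits, np.nan, dtype=float),
                         where=tested > 0)                 # NaN where a concept is never tested
    return coverage, accuracy   # low coverage => benchmark gap; low accuracy => model gap
```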
Findings — What falls through the cracks
The results are uncomfortable—and consistent.
1. Benchmarks are heavily skewed
Across ten widely used benchmarks, concept coverage follows a long-tail distribution:
- A small set of concepts (instruction-following, obedience, formatting) dominates.
- Hundreds of concepts are barely tested.
- ~1% of concepts are completely missing.
Missing concepts disproportionately involve:
- Meta-cognition (explaining limitations, referencing user intent)
- Safety and refusal reasoning
- Professional boundary maintenance
- Legal and regulatory awareness
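One hypothetical way to summarize that skew from a coverage vector (the size of the "head" and the zero-coverage test for missing concepts are choices of this sketch, not the paper's):

```python
import numpy as np

def coverage_profile(coverage: np.ndarray, head_k: int = 20) -> dict:
    """Summarize how skewed benchmark coverage is across concepts.
    coverage: (n_concepts,) fraction of datapoints activating each concept."""
    order = np.argsort(coverage)[::-1]
    head_share = coverage[order[:head_k]].sum() / coverage.sum()  # mass held by the top-k concepts
    missing = np.flatnonzero(coverage == 0)                       # concepts the benchmark never tests
    return {"head_share": head_share,
            "n_missing": missing.size,
            "missing_fraction": missing.size / coverage.size}
```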
2. Models fail where benchmarks are thin
When performance is projected into concept space, the pattern mirrors benchmark coverage almost perfectly.
Best-performing concepts:
- Coding patterns
- Step-by-step explanations
- Positive, compliant user-facing behavior
Worst-performing concepts:
- Polite refusal and redirection
- Boundary enforcement
- Temporal reasoning
- Intuitive reasoning without explicit scaffolding
This is no coincidence: models optimize for what is rewarded, and benchmarks rarely reward saying "no" well.
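One hypothetical way to quantify that mirroring is a rank correlation between per-concept coverage and per-concept accuracy (the paper presents the alignment qualitatively; Spearman's rho is this sketch's choice):

```python
import numpy as np
from scipy.stats import spearmanr

def coverage_vs_performance(coverage: np.ndarray, accuracy: np.ndarray) -> float:
    """Rank-correlate how often a concept is tested with how well the model handles it.
    A strongly positive value means the model does best exactly where benchmarks look most."""
    mask = ~np.isnan(accuracy)            # drop concepts with no tested datapoints
    rho, _ = spearmanr(coverage[mask], accuracy[mask])
    return rho
```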
3. Aggregate scores exaggerate competence
Because overrepresented concepts also tend to be easy and well-trained, aggregate scores are dominated by a narrow behavioral slice. The headline metric mostly measures obedience fluency, not reasoning breadth.
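A toy calculation with made-up numbers shows the size of the distortion: weighting by coverage (the headline number) versus weighting each concept equally.

```python
# Toy illustration, invented numbers: micro (headline) vs. concept-balanced (macro) accuracy.
coverage = [0.90, 0.10]   # share of benchmark datapoints activating each concept
accuracy = [0.95, 0.40]   # model accuracy conditional on that concept

micro = sum(c * a for c, a in zip(coverage, accuracy))  # 0.895 -- what the leaderboard shows
macro = sum(accuracy) / len(accuracy)                   # 0.675 -- equal weight per concept
print(f"headline: {micro:.3f}  concept-balanced: {macro:.3f}")
```

The same model, the same answers, a twenty-point gap: the only difference is whether rarely tested concepts count.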
Implications — What this means for practice
For model developers
- High benchmark scores do not imply robust real-world behavior.
- Safety and refusal failures are not edge cases—they are under-measured defaults.
- Instruction tuning without balanced evaluation creates behavioral monocultures.
For benchmark designers
- Adding more questions is not the solution; adding missing concepts is.
- CG provides a feedback loop for targeted data generation.
- Benchmarks should declare their conceptual coverage explicitly, not implicitly.
For regulators and enterprise buyers
- Vendor benchmarks are incomplete risk disclosures.
- Representation-grounded audits offer a more defensible evaluation layer.
- “Model quality” must be decomposed before it can be governed.
Conclusion — Evaluation without illusions
Competency Gaps does not replace benchmarks. It exposes their blind spots.
The uncomfortable truth is that we have been grading language models on a curved exam written by ourselves. CG forces us to confront what we forgot to test—and, more importantly, what we taught models they never had to learn.
Cognaptus: Automate the Present, Incubate the Future.