Scores are comforting. That is their main commercial advantage.

A vendor can say its model reaches a certain accuracy on a benchmark, a leaderboard can rank systems neatly, and an internal AI team can report that the new model is “better” than the old one. Everyone gets a number. The procurement slide looks tidy. The risk committee, if mercifully sleepy, moves on.

Then the model fails in production on something the benchmark barely touched: refusing a bad request politely, explaining uncertainty, maintaining boundaries, handling time, or recognizing that a user is asking for more than cheerful compliance. The problem is not simply that the model is weak. The more awkward problem is that the evaluation may never have looked properly.

The paper “Uncovering Competency Gaps in Large Language Models and Their Benchmarks” introduces Competency Gaps, or CG, as a way to inspect both sides of that failure: where the model underperforms and where the benchmark itself has blind spots.[1] The paper’s useful contribution is not another leaderboard result. It is a mechanism for asking whether the scoreboard is looking in the right places.

The phrase “benchmark gap” deserves to become part of the enterprise AI vocabulary. A model gap says the model performs badly on a concept. A benchmark gap says the test set barely covers that concept in the first place. The first is a capability problem. The second is an audit problem. In practice, the second often helps create the first, because what is not tested is rarely optimized. A model, like a student with access to past exam papers, learns what the institution rewards. This is not moral failure. It is incentives, wearing a GPU bill.

The real failure is compression, not measurement

Benchmarks compress many individual behaviors into a single aggregate score. That compression is not useless. It lets teams compare systems, track improvements, and avoid drowning in thousands of examples. But compression becomes dangerous when the thing being compressed is not evenly distributed.

The paper starts from a familiar evaluation problem: aggregate metrics hide sub-trends. A model can look competent overall while failing badly on specific topics, formats, or reasoning patterns. Existing benchmarks sometimes add topic labels, such as math subdomains or broad capability categories, but these labels are usually coarse, manually curated, and not easily comparable across benchmark families.

CG changes the unit of inspection. Instead of relying only on human-written categories, it maps benchmark examples and model performance into a fine-grained concept space derived from sparse autoencoders, or SAEs.

An SAE, in this setting, decomposes a model’s internal representations into sparse, labeled concept dimensions. Each concept is a direction in the SAE dictionary, with an autointerpretability label attached to it. The method then asks two questions:

  1. How much does each benchmark activate each concept?
  2. How well does the model perform on examples where that concept is activated?

That creates two diagnostic lenses.

| Lens | Question answered | Practical meaning |
|---|---|---|
| Benchmark gap | Which concepts are missing or underrepresented in the benchmark suite? | The evaluation may be blind to important behaviors. |
| Model gap | Which concepts are associated with poor model performance? | The model may be weak where that concept appears. |
| Cross-benchmark view | Which patterns repeat across benchmarks? | Teams can distinguish isolated dataset quirks from broader evaluation structure. |

This is the paper’s central move. CG does not merely disaggregate by human topic. It uses model-internal concept activations as the map. That makes the method more granular than ordinary topic labels and more comparable across different benchmarks than one-off manual taxonomies.

The mechanism is simple enough to state without pretending the math is the article. For each benchmark datapoint, CG computes how strongly each SAE concept activates across the token sequence, normalized so long examples do not dominate just because they are long. For benchmark coverage, it compares a concept’s activation within a benchmark to the average activation across all concepts in the dictionary on that same benchmark. For model performance, it uses the benchmark’s own scoring function, then weights performance by concept activation. In plain English: if a concept lights up often in examples the model gets wrong, that concept becomes suspicious. If a concept barely lights up in the benchmark at all, the benchmark may not be testing it.
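The three steps in that paragraph can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the plain mean over tokens as the length normalization, and the ratio form of the coverage score are all choices made here for clarity, and the toy activations are invented.

```python
from typing import List, Optional

TokenActs = List[List[float]]  # one row per token, one column per SAE concept


def datapoint_activation(token_acts: TokenActs) -> List[float]:
    """Mean activation per concept over the tokens of one example.

    Averaging over tokens is the (assumed) length normalization: a long
    example does not score higher just because it has more tokens.
    """
    n_tokens = len(token_acts)
    n_concepts = len(token_acts[0])
    return [sum(tok[c] for tok in token_acts) / n_tokens for c in range(n_concepts)]


def benchmark_coverage(datapoints: List[TokenActs]) -> List[float]:
    """Concept activation in a benchmark relative to the average concept.

    Assumed form: values above 1 mean heavier-than-average coverage,
    0 means the concept never activates in this benchmark.
    """
    per_point = [datapoint_activation(d) for d in datapoints]
    n_concepts = len(per_point[0])
    concept_mean = [
        sum(p[c] for p in per_point) / len(per_point) for c in range(n_concepts)
    ]
    overall = sum(concept_mean) / n_concepts
    return [m / overall if overall > 0 else 0.0 for m in concept_mean]


def concept_performance(
    datapoints: List[TokenActs], scores: List[float]
) -> List[Optional[float]]:
    """Activation-weighted average of the benchmark's own scores per concept.

    Returns None for a concept that never activates: no coverage, no estimate.
    """
    per_point = [datapoint_activation(d) for d in datapoints]
    n_concepts = len(per_point[0])
    result: List[Optional[float]] = []
    for c in range(n_concepts):
        weight = sum(p[c] for p in per_point)
        if weight == 0:
            result.append(None)
        else:
            weighted = sum(p[c] * s for p, s in zip(per_point, scores))
            result.append(weighted / weight)
    return result


# Toy data (invented): three examples, three concepts, scores from the benchmark.
toy_benchmark = [
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],  # short example, concept 0 only
    [[0.0, 2.0, 0.0]],                   # one token, concept 1 only
    [[1.0, 1.0, 0.0]] * 4,               # longer example, concepts 0 and 1
]
toy_scores = [1.0, 0.0, 1.0]

coverage = benchmark_coverage(toy_benchmark)
performance = concept_performance(toy_benchmark, toy_scores)
```

In this toy case concept 2 never activates, so its coverage is zero and its performance is undefined rather than low. That is the distinction between a benchmark gap and a model gap in miniature.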

The authors apply this method primarily to Llama 3.1 8B Instruct across ten static benchmarks, with additional demonstrations across Gemma 2 2B Instruct, DeepSeek-R1-Distill-Llama-8B, Mistral-7B-Instruct-v0.1, and Qwen3-4B, plus arena-style and conventional capability benchmarks in the appendices. The paper is not claiming these are the only models that matter. It is showing what the method reveals when applied to a representative evaluation suite.

The benchmark is not neutral; it has a concept diet

The first major result is about benchmark coverage. Across the ten-benchmark suite, concept coverage is strongly right-skewed. A small set of concepts receives heavy representation; many others are lightly covered or absent.

This matters because aggregate scores are not democratic. If a benchmark repeatedly tests a narrow group of concepts, then the aggregate score mostly reflects that group. The long tail may technically exist, but it contributes little to the headline number. The scoreboard is therefore not only measuring the model. It is also expressing the benchmark’s diet.

The paper’s examples are revealing. Among the top coverage concepts in the main Llama analysis are items such as English Premier League football discussions and conversation boundary markers. That does not mean football is secretly the soul of intelligence, though some fans may insist otherwise. It means benchmark text distributions can contain strange recurring semantic concentrations. Those concentrations then shape what the aggregate score sees.

More importantly, the authors identify 314 concepts, about 1% of the concept dictionary in that main suite, as entirely missing. These missing concepts include AI metacognition, legal concepts, explanations of limitations, and regulatory classification or compliance requirements. The result is not “all benchmarks should cover everything.” That would be silly. A math benchmark does not need to test employment law. But a benchmark suite used to make broad claims about model competence should not systematically miss concepts that matter for real-world deployment.
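For teams that already have per-benchmark coverage vectors, the two checks in this section (how skewed is coverage, and what is missing from the whole suite) are easy to approximate. The sketch below is an assumption-laden illustration: the function names, the zero threshold for "missing," and every number are invented here, and only the concept labels are borrowed from the paper's examples.

```python
from typing import Dict, List


def missing_concepts(
    suite_coverage: Dict[str, List[float]], labels: List[str], eps: float = 0.0
) -> List[str]:
    """Concepts whose coverage is at or below eps in every benchmark of the suite."""
    return [
        labels[c]
        for c in range(len(labels))
        if all(cov[c] <= eps for cov in suite_coverage.values())
    ]


def top_share(coverage: List[float], k: int) -> float:
    """Fraction of total coverage held by the k most-covered concepts."""
    total = sum(coverage)
    if total == 0:
        return 0.0
    return sum(sorted(coverage, reverse=True)[:k]) / total


# Labels echo examples from the paper; the numbers are toy values.
labels = [
    "football discussions",
    "conversation boundary markers",
    "regulatory compliance requirements",
    "AI metacognition",
]
suite = {
    "bench_a": [5.0, 3.0, 0.0, 0.0],
    "bench_b": [4.0, 0.0, 0.0, 0.5],
}

absent = missing_concepts(suite, labels)
suite_total = [sum(cov[c] for cov in suite.values()) for c in range(len(labels))]
skew = top_share(suite_total, k=1)
```

A concept that is absent from one benchmark is a design choice. A concept that is absent from the whole suite, as `absent` reports here, is the case worth escalating.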

The individual benchmark analysis sharpens the point. Every benchmark except Vectara misses at least 30% of all concepts in the SAE dictionary. Again, that fact alone is not an indictment; specialized benchmarks are supposed to specialize. The stronger evidence comes when the authors ask whether a benchmark misses concepts that appear relevant to its own stated purpose.

Their examples are useful because they are not abstract. AGI Eval is missing concepts such as the need for thorough and objective assessment of evidence and careful qualification of complex topics. LogicBench misses concepts related to explaining how different elements or factors relate to each other and mathematical/logical concepts across multiple languages. Social IQA misses concepts around defending planned actions against expectations and instructions about how someone should behave.

That is where the paper moves from “coverage is uneven” to “coverage can be misaligned.” A benchmark can be narrow by design. It should not be narrow by accident.

The model looks best where compliance is easy to reward

The second major result concerns model gaps. When Llama 3.1 8B’s benchmark performance is projected into concept space, the model’s best-performing concepts tend to involve coding, data handling, instruction following, illustrative examples, and positive commitment to help.

That is not surprising. These are the behaviors modern instruction-tuned systems are trained and evaluated to produce fluently. The model knows how to be useful, structured, and eager. It can iterate through a programming sequence. It can provide an example. It can say, in effect, “I will do my best.” Corporate demos adore this personality type.

The more revealing finding is on the other side. The worst-performing concepts include polite rejection, professional boundaries, time, image-manipulation-related metadata, palindrome and letter reasoning, mathematical operations, and appeals to intuition. The paper highlights concepts such as “the assistant needs to politely reject or redirect inappropriate requests” and “the assistant maintaining professional boundaries while offering appropriate help.”

This is the uncomfortable symmetry: helpfulness concepts score well, while concepts that constrain helpfulness score poorly.

That does not prove the model is “bad at safety” in a universal sense. CG is not a moral thermometer. It is showing that, in the evaluated benchmark suite, performance associated with boundary-setting and refusal-related concepts is weak relative to other concepts. The business interpretation is narrower and more useful: if your deployment depends on calibrated refusal, escalation, uncertainty, or professional restraint, a high aggregate benchmark score is weak evidence. You need a coverage and performance audit at the concept level.

The paper also connects these model gaps to known weaknesses. Time-related concepts appear as poor-performing areas. So do letter-level or palindrome-related reasoning patterns. In the Qwen3-4B appendix, time-related words appear again among worst-performing concepts, echoing the main finding. The paper’s DeepSeek-R1-Distill-Llama appendix is especially instructive: the model performs well on surface forms of mathematical reasoning, such as variables, LaTeX math code, and reasoning-step transitions, but poorly on substance-oriented concepts such as boxed numerical answers and multi-digit numbers.

That distinction is operationally important. A model can look like it is reasoning because it produces the external costume of reasoning: variables, equations, transitions, and “therefore.” CG can help separate that costume from the parts of the task where failure actually occurs. In other words, it can catch the difference between mathematical stage lighting and mathematical competence. The stage lighting has had a very good few years.

The paper’s tests are not all doing the same job

A useful reading of this paper needs to separate the main evidence from the support checks. Otherwise, the article becomes a pile of examples, which is how insight goes to die.

| Paper component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Main Llama 3.1 8B results across ten static benchmarks | Main evidence | CG can expose skewed benchmark coverage and concept-level model weaknesses. | That every model has the exact same gap profile. |
| Additional model appendices | Robustness / generalization across model families | Similar analyses can be run across different models and SAE variants. | That all SAE labels are equally clean or equally interpretable. |
| Cross-SAE comparison using Llama and Gemma SAEs | Robustness / sensitivity test | Similar themes can emerge even when using a different SAE, useful when a model-specific SAE is unavailable. | Exact concept-by-concept equivalence across SAE dictionaries. |
| Random 20% example dropping, repeated 100 times | Stability test | CG scores are relatively stable under random subsampling. | That the method is immune to dataset design flaws. |
| Adversarial removal of salient datapoints | Sensitivity / directional validation | Removing examples tied to high- or low-performing concepts moves median performance in expected directions. | That each surfaced concept is a causal mechanism by itself. |
| Qwen foreign-name experiment | Exploratory causal validation | A CG-surfaced weak concept can guide targeted behavioral tests. | That every CG concept automatically corresponds to a causal failure mode. |
| Comparisons with garak, AutoDetect, Arena-Hard-Auto, and EvalTree | Comparison with prior work | CG offers finer-grained and cross-benchmark concept decomposition. | That CG replaces red-teaming, autoraters, or benchmark-specific taxonomies. |

The robustness tests are particularly important because SAE-based methods invite skepticism. Sparse autoencoders can learn different feature dictionaries, and autointerpretability labels can be noisy. The authors address this by comparing Llama-specific SAE results with Gemma-based SAE results. The correspondence is thematic rather than exact: one SAE may slice “time requirements and duration” while another represents a broader “dates and numeric sequences” concept. That is enough for CG’s intended use as discovery, not enough for pretending the concepts are canonical atoms of intelligence.

The perturbation tests serve a different role. Randomly dropping 20% of examples per benchmark over 100 reruns yields low standard deviations in the reported scores. That supports stability under subsampling. The adversarial ablation is more diagnostic: removing less than 1% of testing data associated with top-performing concepts lowers median performance, while removing data associated with worst-performing concepts raises it. The direction of change suggests that the concepts are not decorative labels pasted onto arbitrary clusters. They are tied to performance-relevant slices of the data.
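Both perturbation checks are simple to reproduce in spirit on any scored benchmark. The sketch below is an approximation of the idea, not the authors' protocol: it operates on a plain mean score rather than on CG concept scores, the helper names are hypothetical, and the toy data removes 5% of examples rather than the paper's less than 1%.

```python
import random
import statistics
from typing import List


def subsample_means(
    scores: List[float], drop_frac: float = 0.2, runs: int = 100, seed: int = 0
) -> List[float]:
    """Mean score after randomly dropping drop_frac of examples, repeated runs times."""
    rng = random.Random(seed)
    keep = len(scores) - int(len(scores) * drop_frac)
    return [statistics.mean(rng.sample(scores, keep)) for _ in range(runs)]


def drop_most_activated(
    scores: List[float], concept_acts: List[float], n_drop: int
) -> List[float]:
    """Remove the n_drop examples where one concept activates most strongly."""
    order = sorted(range(len(scores)), key=lambda i: concept_acts[i], reverse=True)
    dropped = set(order[:n_drop])
    return [s for i, s in enumerate(scores) if i not in dropped]


# Toy benchmark (invented): 70 correct answers, 30 failures.
scores = [1.0] * 70 + [0.0] * 30
# Toy weak concept that activates only on the failures.
weak_concept_acts = [0.0] * 70 + [1.0] * 30

means = subsample_means(scores)
spread = statistics.pstdev(means)  # small spread = stable under subsampling

baseline = statistics.mean(scores)
after_removal = statistics.mean(drop_most_activated(scores, weak_concept_acts, n_drop=5))
```

Random removal barely moves the score, while removing a handful of examples tied to a weak concept raises it. That asymmetry is what the paper's adversarial ablation relies on, here reduced to its simplest form.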

The foreign-name experiment in the Qwen appendix is more concrete. CG identifies “Foreign languages, names” as a low-scoring concept. The authors then construct a LogicBench-style prompt with identical structure, comparing unnamed protagonists against protagonists named “Omar” and “Mahmoud.” The model’s correctness drops in the named setting. The exact percentages are not visible in the HTML rendering, so the responsible interpretation is qualitative: CG can generate hypotheses that targeted experiments can test. That is already valuable. Evaluation systems should not only report scores; they should suggest where to look next.
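A paired-prompt test of this kind needs very little scaffolding. The harness below is a hypothetical sketch: the template is invented rather than taken from LogicBench or the paper, `stub_model` is a placeholder that always answers "yes" so the code runs offline, and only the names “Omar” and “Mahmoud” come from the source. It reports no real model behavior.

```python
from typing import Callable, Dict, List

# Hypothetical modus-ponens template; the correct answer is always "yes".
TEMPLATE = (
    "If {who} finishes the report, then {who} will attend the meeting. "
    "{Who} finished the report. Will {who} attend the meeting? Answer yes or no."
)


def build_prompt(protagonist: str) -> str:
    """Fill the template, capitalizing the protagonist at the sentence start."""
    capitalized = protagonist[0].upper() + protagonist[1:]
    return TEMPLATE.format(who=protagonist, Who=capitalized)


def accuracy_by_condition(
    model: Callable[[str], str],
    conditions: Dict[str, List[str]],
    expected: str = "yes",
) -> Dict[str, float]:
    """Fraction of correct answers per condition, with identical prompt structure."""
    results: Dict[str, float] = {}
    for name, protagonists in conditions.items():
        answers = [model(build_prompt(p)).strip().lower() for p in protagonists]
        results[name] = sum(a == expected for a in answers) / len(answers)
    return results


def stub_model(prompt: str) -> str:
    """Placeholder: swap in a real model call. Always answers 'yes'."""
    return "yes"


conditions = {
    "unnamed": ["the analyst", "the manager"],
    "named": ["Omar", "Mahmoud"],
}
results = accuracy_by_condition(stub_model, conditions)
```

With a real model behind `stub_model`, a drop in the "named" condition relative to "unnamed" would be the kind of evidence the Qwen appendix describes. With the stub, both conditions trivially score the same.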

The benchmark-gap result matters more than the model-gap result

The obvious reading of the paper is: “Here are model weaknesses.” That reading is correct but shallow.

The more important reading is: “Here is a way to evaluate the evaluation system.”

In enterprise AI, teams often treat benchmarks as evidence that a model is safe enough, reliable enough, or suitable enough. But a benchmark suite is itself a product with coverage assumptions, distributional biases, and blind spots. CG exposes those assumptions. It asks whether the benchmark is testing the concepts that matter for the deployment profile.

For example, a customer-support assistant needs more than correctness on factual QA. It needs escalation judgment, refusal discipline, privacy awareness, uncertainty expression, and conversational repair. A coding assistant needs more than patch-generation performance on Python repositories. The SWE-Bench appendix illustrates this clearly: CG finds strong representation of web development code, data plots, analysis, and code testing, consistent with the repositories included in SWE-Bench. It also identifies compilation and computer security as underrepresented, consistent with the absence of compiler or security repositories among the benchmark’s twelve projects.

That is a quiet but powerful point. The benchmark may be good at what it covers. The mistake is to treat “what it covers” as “software engineering.”

The MMLU appendix offers a similar lesson. CG surfaces historical coverage gaps involving pairings suggestive of the Arab Spring and the Cuban Missile Crisis, and the authors report that keyword search and manual review found no MMLU questions on either topic. The issue is not that every modern history curriculum must worship the same canon. The issue is that a broad benchmark can look comprehensive while missing recognizable chunks of the domain.

For business use, the actionable question becomes:

| Business decision | Old benchmark question | CG-style replacement question |
|---|---|---|
| Model selection | Which model has the best aggregate score? | Which model performs well on the concepts our workflow actually activates? |
| Vendor evaluation | Did the vendor test safety and reliability? | Which refusal, boundary, uncertainty, and compliance concepts were covered? |
| Internal benchmark design | Do we have enough test cases? | Which relevant concept regions are missing or overrepresented? |
| Fine-tuning | Did accuracy improve after training? | Which weak concept clusters improved, and which stayed untouched? |
| Governance | Can we document model quality? | Can we document both performance and coverage by concept family? |

This is why the mechanism-first structure matters. If we start with findings, CG looks like another diagnostic dashboard. If we start with the mechanism, its business value becomes clearer: it is a way to audit the shape of evidence.

Concept coverage is not capability, and that distinction is not optional

The paper is careful about a limitation that practitioners must not ignore: concept activation is not capability.

If a benchmark example activates a concept, that does not mean the model understands the concept or can handle it competently. CG uses activation to project benchmark items and model performance into concept space. Capability still comes from the benchmark’s scoring function. In other words, activation tells us where to look; performance tells us whether the model succeeded there.

This distinction prevents a common misuse. A dashboard could show that a benchmark “covers” legal concepts because legal concepts activate in the text. That does not mean the benchmark tests legal reasoning well. It may contain legal words in shallow contexts. It may test recognition rather than judgment. It may reward formulaic answers. CG can surface the region; human and task-specific evaluation still need to inspect the examples.

The second limitation is the hard-question confound. A concept may receive a low model score because it appears in difficult items, not because the model has a specific weakness on that concept. If “advanced geometry” activates mostly in the hardest geometry problems, low performance may reflect item difficulty. The authors identify item-level difficulty calibration as future work. For practical use, teams should treat CG as a triage tool, not as a final causal diagnosis.
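One way to see the confound is to compare a concept's score against the baseline within difficulty strata instead of across the pooled benchmark. The sketch below is an illustrative triage check, not the item-level calibration the authors leave to future work: the difficulty labels, function names, and data are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

Item = Tuple[str, float, float]  # (difficulty label, score, concept activation)


def weighted_score(scores: List[float], weights: List[float]) -> Optional[float]:
    """Activation-weighted mean score, or None if the concept never activates."""
    total = sum(weights)
    if total == 0:
        return None
    return sum(s * w for s, w in zip(scores, weights)) / total


def concept_gap(items: List[Item]) -> Optional[float]:
    """Concept-weighted score minus the plain mean score over the same items."""
    scores = [s for _, s, _ in items]
    acts = [a for _, _, a in items]
    concept = weighted_score(scores, acts)
    if concept is None:
        return None
    return concept - sum(scores) / len(scores)


def gap_by_stratum(items: List[Item]) -> Dict[str, Optional[float]]:
    """The same gap, computed separately inside each difficulty stratum."""
    labels = sorted({d for d, _, _ in items})
    return {lab: concept_gap([it for it in items if it[0] == lab]) for lab in labels}


# Toy data (invented): the concept activates only on hard items.
items: List[Item] = [
    ("easy", 1.0, 0.0), ("easy", 1.0, 0.0), ("easy", 1.0, 0.0), ("easy", 1.0, 0.0),
    ("hard", 1.0, 1.0), ("hard", 0.0, 1.0), ("hard", 0.0, 1.0), ("hard", 0.0, 1.0),
]

pooled_gap = concept_gap(items)       # looks like a concept-specific weakness
stratified_gap = gap_by_stratum(items)  # gap disappears among hard items
```

In this toy case the pooled gap is clearly negative, yet within the hard stratum the concept scores no worse than any other hard item. The apparent weakness is item difficulty, which is exactly why a surfaced concept should be treated as a lead to investigate.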

The third limitation is SAE coverage. CG can only detect gaps for concepts represented in the SAE space. If a concept is absent from the SAE dictionary, the method cannot surface it. Larger and more representative SAEs improve the chance of coverage, but they do not create omniscience. The machine cannot audit a dimension it does not have.

Finally, cross-SAE consistency should be interpreted thematically. The paper’s Llama/Gemma comparison suggests that useful patterns can survive a change in SAE, but the concepts are not one-to-one objects. This is acceptable for discovery and prioritization. It is not sufficient for legalistic claims like “we certified this exact concept.” Evaluation governance has enough ritual theater already. We need not add more.

What Cognaptus infers for business use

The paper directly shows that CG can map benchmark coverage and model performance into SAE concept space; that popular benchmark suites can be skewed; that models may perform better on helpful, coding, and instruction-following concepts than on refusal, boundary, time, and metacognitive concepts; and that the method has some robustness under cross-SAE comparison and perturbation tests.

Cognaptus infers three practical uses.

First, CG can become a benchmark selection layer. Before an enterprise accepts a model evaluation report, it should ask what concepts the benchmark suite covers. This is especially important for regulated, customer-facing, or safety-sensitive workflows. A benchmark suite that is strong on factual QA but thin on refusal and escalation is not “general.” It is a factual QA benchmark with a nicer suit.

Second, CG can guide targeted test-data creation. If an internal benchmark underrepresents boundary-setting, temporal reasoning, or domain-specific compliance concepts, teams can generate or collect new evaluation items for those regions. This is cheaper than blindly adding thousands of examples. More data is not the same as more coverage. Sometimes it is just the same blind spot with better formatting.

Third, CG can support post-deployment monitoring. As workflows evolve, the concept footprint of real user interactions may shift. A support assistant may start receiving more regulatory questions; a coding assistant may move from web patches to security-sensitive infrastructure work; a research assistant may face more citation-verification tasks. CG-style concept mapping could help detect that the benchmark suite no longer matches the deployed workload.
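A monitoring check along these lines could be as small as comparing two normalized concept footprints. The sketch below reflects Cognaptus's inference rather than anything the paper implements: the function names, the use of total variation distance, the 0.1 alert margin, and all numbers are assumptions chosen for illustration.

```python
from typing import Dict, List


def normalize(activations: Dict[str, float]) -> Dict[str, float]:
    """Turn raw concept activation totals into shares that sum to 1."""
    total = sum(activations.values())
    if total <= 0:
        raise ValueError("footprint has no activation mass")
    return {k: v / total for k, v in activations.items()}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Half the summed absolute difference between two footprints (0 = identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def undercovered(
    benchmark: Dict[str, float], production: Dict[str, float], min_excess: float = 0.1
) -> List[str]:
    """Concepts whose production share exceeds their benchmark share by min_excess."""
    return sorted(
        k for k in production if production[k] - benchmark.get(k, 0.0) >= min_excess
    )


# Toy footprints (invented): a support assistant drifting toward regulatory questions.
benchmark_footprint = normalize(
    {"factual QA": 8.0, "refusal": 1.0, "regulatory questions": 1.0}
)
production_footprint = normalize(
    {"factual QA": 5.0, "refusal": 1.0, "regulatory questions": 4.0}
)

drift = total_variation(benchmark_footprint, production_footprint)
alerts = undercovered(benchmark_footprint, production_footprint)
```

The distance gives one number to trend over time, and the alert list names the concept regions where the test suite has fallen behind the workload.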

What remains uncertain is how well the method scales as a formal assurance tool. SAE labels vary in quality. Some concepts are syntactic rather than semantic. Item difficulty can distort interpretation. Domain experts still need to validate whether a surfaced concept is operationally relevant. CG gives better searchlights, not automatic judgment.

That is still a serious advance. In evaluation, better searchlights are often exactly what was missing.

The better question is not “What score did the model get?”

The old evaluation habit asks: what did the model score?

The better question is: what did the benchmark look at, what did it ignore, and where did the model fail when the ignored regions were finally named?

Competency Gaps is useful because it makes that question operational. It decomposes evaluation into model-internal concept space, measures benchmark coverage, maps performance onto activated concepts, and gives teams a way to inspect both the test and the tested system. The result is less comfortable than a leaderboard, but more honest.

For model builders, the lesson is that instruction-following strength can coexist with weak boundary-setting. For benchmark designers, the lesson is that coverage should be audited rather than assumed. For enterprise buyers, the lesson is blunt: a high benchmark score is not a risk disclosure. It is a compressed summary of a particular test distribution. Compression is where omissions go to hide.

Benchmarks do not only measure competence. They define what competence is allowed to mean. CG gives us a way to inspect that definition before we mistake it for truth.

Cognaptus: Automate the Present, Incubate the Future.


  1. Maty Bohacek, Nino Scherrer, Nicholas Dufour, Thomas Leung, Chris Bregler, and Stephanie C. Y. Chan, “Uncovering Competency Gaps in Large Language Models and Their Benchmarks,” arXiv:2512.20638v2, 2026, https://arxiv.org/abs/2512.20638