Procurement meetings have a familiar ritual now. Someone opens a leaderboard, sorts by average score, points at a model near the top, and asks why the company is not using that one.
It feels empirical. It is neatly ranked. It has decimals. Very scientific-looking decimals, the most seductive species of decimal.
The problem is not that leaderboards are useless. The problem is that we often treat them as scoreboards when they are closer to measurement instruments. A scoreboard tells us who won under agreed rules. A measurement instrument first has to prove that it measures the thing it claims to measure. If the instrument mixes model size, benchmark difficulty, contributor practices, post-training choices, item redundancy, and residual artifacts into one number, then the number may still be useful. It is just not self-explanatory.
That is the central value of AI Cartography: Mapping the Latent Landscape of AI Benchmark Ecosystems [1]. The paper does not merely complain that LLM leaderboards are noisy. That would be a small contribution, and frankly an unsurprising one. Its more useful move is methodological: it treats a leaderboard ecosystem as a measurable landscape, then asks three linked questions.
First, what latent structure explains the benchmark responses? Second, how much variance comes from ecosystem facets such as benchmark identity, contributor/provenance, architecture, and deployment choices? Third, after accounting for structure and noise, what actually scales with model size?
This mechanism-first reading matters because the obvious summary—“leaderboards are noisy, be careful”—is too cheap. The paper’s real message is sharper: leaderboard scores can be decomposed into different kinds of signal, and different business decisions need different decompositions.
The reader mistake is treating benchmark averages as named capabilities
A common interpretation goes like this: if a model has a higher benchmark average, it is simply “better”; if it scores higher on a reasoning benchmark, it has more reasoning ability; if a scaling curve rises, bigger models are reliably improving the target capability.
That interpretation is convenient. It is also doing several unpaid jobs at once.
A benchmark score is usually an aggregate of item-level responses. The reported benchmark label—math, knowledge, reasoning, instruction following—suggests a construct. But the total score only preserves that construct if the items behave as a valid measurement scale. In more formal terms, the score should reflect a latent ability plus tolerable measurement error. If instead the score absorbs a general model-quality factor, benchmark-specific residue, item dependence, contributor-specific practices, and benchmark difficulty, then the label becomes a polite guess attached to a statistical mixture.
The paper’s replacement view is not “ignore the leaderboard.” It is: stop reading the leaderboard as a direct capability map. Read it as an instrument output that needs calibration.
That difference is operational. A bank choosing between open models for internal assistants, a SaaS company benchmarking code agents, or a consulting firm deciding whether a fine-tuning recipe improved reasoning should not ask only, “Which score is higher?” They should ask, “Which part of the score moved, and which part of that movement is reliable for our decision?”
The paper’s mechanism has three layers, and the order matters
The paper combines Confirmatory Factor Analysis, Generalizability Theory, and mixed-effects latent regression on more than 4,000 submissions from the Hugging Face Open LLM Leaderboard. The analyzed tasks are IFEval, BBH, MATH Level 5, GPQA, MuSR, and MMLU-Pro.
The three methods are not interchangeable tools placed side by side. They form a sequence.
| Layer | Question | Method | What it prevents |
|---|---|---|---|
| Latent structure | What is the leaderboard actually measuring? | Confirmatory Factor Analysis / SEM across competing factor structures | Treating benchmark labels as validated constructs |
| Ecosystem variance | Where does observed score variation come from? | Generalizability Theory with crossed random effects | Confusing architecture, contributor, deployment, benchmark, and residual effects |
| Denoised scaling | What does size explain after structure and noise controls? | Bifactor IRT plus mixed-effects latent regression | Turning one aggregate scaling curve into a universal capability law |
This order is the article’s organizing spine. Structure comes first because there is no point decomposing the noise around a construct before asking whether the construct is being measured coherently. Variance decomposition comes second because even a good latent structure can still be distorted by ecosystem facets. Latent regression comes third because scaling laws estimated on raw averages inherit all the previous confusion.
A shorter version: first identify the instrument, then estimate its noise, then infer what drives the cleaned signal. Very unfashionable. Also very useful.
Layer one: the leaderboard looks more like a bifactor system than six independent tests
The first method tests competing latent structures. This sounds abstract, but the business question is plain: when a benchmark suite reports six scores and one average, are those six scores measuring six mostly independent abilities, one general ability, or something in between?
The paper compares several candidate structures, including independent benchmark factors, one general factor, hierarchical factors, correlated benchmark factors, a bifactor model, and a correlated bifactor model. The bifactor family is the important one. In a bifactor structure, every item loads on a general factor, while also loading on a benchmark-specific factor. In business language, part of performance is broad cross-domain model competence, and part is whatever remains specific to a benchmark after that broad competence is removed.
That is a more subtle view than “reasoning benchmark equals reasoning.” It says: a reasoning benchmark score may partly reflect general model strength, and only the residual part tells us whether the model has something benchmark-specific beyond the general factor.
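To make the two-level idea concrete, here is a minimal toy simulation, not the paper's model. It works at the level of two benchmark scores rather than individual items, and every loading, noise level, and function name is an assumption chosen for illustration. It shows how a shared general factor alone can make two differently labeled benchmarks correlate.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def simulate_two_benchmarks(general_loading, specific_loading=0.4,
                            noise_sd=0.3, n_models=2000, seed=0):
    """Toy bifactor-style scores: general part + benchmark-specific part + noise.

    All loadings are hypothetical illustration values, not estimates from the paper.
    """
    rng = random.Random(seed)
    bench_a, bench_b = [], []
    for _ in range(n_models):
        g = rng.gauss(0, 1)    # general factor, shared by both benchmarks
        s_a = rng.gauss(0, 1)  # factor specific to benchmark A
        s_b = rng.gauss(0, 1)  # factor specific to benchmark B
        bench_a.append(general_loading * g + specific_loading * s_a
                       + rng.gauss(0, noise_sd))
        bench_b.append(general_loading * g + specific_loading * s_b
                       + rng.gauss(0, noise_sd))
    return bench_a, bench_b


with_general = pearson(*simulate_two_benchmarks(general_loading=0.7))
without_general = pearson(*simulate_two_benchmarks(general_loading=0.0))
print(f"correlation with a shared general factor: {with_general:.2f}")
print(f"correlation without one:                  {without_general:.2f}")
```

In this toy setup the two benchmarks correlate substantially only when the general loading is nonzero, even though their specific factors are independent. That is the sense in which a benchmark label can describe the residual part of a score rather than the whole of it.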
The results support this two-level view. Bifactor-family models best explain the observed dependence structure across item-resampled comparisons. A strong general factor appears stable, while benchmark-specific factors are largely residual. The paper also finds that benchmark labels do not fully organize residual dependence: even flexible structural models retain notable fit under randomized item-to-benchmark assignment, and local diagnostics reveal residual item-pair couplings that the tested latent structures do not absorb.
This is where the paper becomes more than leaderboard criticism. It distinguishes between three possibilities that raw scores collapse:
| Raw score interpretation | What the paper asks instead | Practical consequence |
|---|---|---|
| “This model is better at reasoning.” | Is the gain on a reasoning benchmark driven by the general factor or by the benchmark-specific residual factor? | Do not buy “reasoning improvement” if the result is mostly general capability movement. |
| “This benchmark measures its label.” | Do item dependencies remain after the latent factors are modeled? | Treat residual dependence as a measurement warning, not a small technical footnote. |
| “Average score is a fair summary.” | Does the average preserve the latent profile or hide trade-offs? | Use factor-level reporting when decisions depend on specific abilities. |
The important phrase is “construct-preserving.” A benchmark total score is construct-preserving only if it summarizes the intended ability without smuggling in too much else. The paper argues that current reporting often fails that standard. Not because benchmarks are fraudulent. Because aggregation is doing more work than the labels admit.
The local-dependence diagnostics are a measurement failure detector, not a semantic gossip machine
The paper uses modification indices and standardized expected parameter changes to detect residual dependence among items after fitting the latent structures. This part is easy to misuse, so it is worth slowing down.
A high residual association does not automatically mean two questions are semantically redundant. It means the fitted latent structure still under-explains their relationship. That leftover relationship may come from item overlap, shared formatting artifacts, model-specific shortcut behavior, unmodeled heterogeneity, or another method effect. The paper’s appendix is careful about this: the diagnostic identifies where the constraint is wrong; it does not prove the causal reason.
For business readers, this is an unusually useful distinction. A benchmark audit should not jump from “these items are locally dependent” to “delete these duplicate questions.” Sometimes deletion is right. Sometimes the more important finding is that a class of models is exploiting a hidden artifact. Sometimes the benchmark is measuring a second construct that the designer did not name.
The diagnostic is therefore a map of suspicious measurement pressure. It tells an evaluation team where to inspect, not what story to write before inspection. A small mercy in an industry that often writes the story first.
Layer two: benchmark difficulty dominates, but contributor/provenance is the more interesting business signal
After asking what the leaderboard measures, the paper asks where the observed score variance comes from. Here it uses Generalizability Theory, modeling benchmark score observations across four facets: architecture, benchmark, contributor/provenance, and deployment.
The headline result is that benchmark identity explains the largest share of variance. In the no-covariate variance decomposition, the benchmark facet accounts for about 43.0% of total variance. In the covariate model, the benchmark share is about 46.13%. That is not shocking. Different benchmarks have different difficulty levels. Still, it matters because it means a large part of the leaderboard landscape is predictable nuisance variation.
For business practice, nuisance does not mean irrelevant. It means controllable. Benchmark difficulty can be handled through stratified reporting, difficulty-normalized scoring, or explicit modeling. The paper’s point is not that benchmark difficulty contaminates everything hopelessly. It is that pretending it is not there makes the composite score look cleaner than it is.
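One simple way to handle benchmark-level differences is to standardize scores within each benchmark before averaging. The sketch below is only one possible reading of "difficulty-normalized scoring"; the paper does not prescribe this exact procedure, and the benchmark names, model names, and scores are hypothetical.

```python
import statistics


def standardize_within_benchmark(scores):
    """Convert {benchmark: {model: raw_score}} to z-scores within each benchmark."""
    standardized = {}
    for bench, by_model in scores.items():
        values = list(by_model.values())
        mean = statistics.fmean(values)
        sd = statistics.pstdev(values)
        standardized[bench] = {
            model: (x - mean) / sd if sd > 0 else 0.0
            for model, x in by_model.items()
        }
    return standardized


def average_by_model(scores):
    """Average each model's score across all benchmarks."""
    models = next(iter(scores.values())).keys()
    return {m: statistics.fmean(scores[b][m] for b in scores) for m in models}


# Hypothetical scores: one benchmark with a wide spread, one with a narrow spread.
raw = {
    "easy_bench": {"A": 80.0, "B": 60.0, "C": 40.0},
    "hard_bench": {"A": 20.0, "B": 22.0, "C": 24.0},
}

raw_avg = average_by_model(raw)
z_avg = average_by_model(standardize_within_benchmark(raw))
print("raw average:         ", raw_avg)
print("standardized average:", z_avg)
```

In this made-up example the raw average ranks A above C only because the wide-spread benchmark dominates the mean. After standardization the two opposite orderings cancel out. Neither view is automatically correct, but the comparison shows how much of a composite score can come from benchmark scale rather than model differences.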
The more interesting result is contributor/provenance. In the no-covariate model, contributor explains about 8.6% of total variance, while architecture explains about 4.17% and deployment type about 1.91%. In the covariate model, contributor remains around 7.82%, architecture around 6.34%, and deployment around 1.10%.
This should make model buyers slightly uncomfortable. Contributor/provenance is not the same thing as “the person uploaded a model.” The paper’s operational grouping includes Hugging Face account and availability-related metadata; it is a proxy for submission source, documentation, model availability, and unobserved training/provenance differences. It should not be interpreted causally as a person-level effect.
But as a signal, it is hard to ignore. Contributor-linked variation can exceed architecture and deployment main effects in this ecosystem. That means some rank-relevant performance differences may reflect recipes, data mixtures, tuning practices, documentation practices, release behavior, or other provenance-linked factors more than the base architecture label that procurement slides love to compare.
This creates three distinct evaluation modes.
| Decision mode | Treat contributor/provenance as… | Evaluation implication |
|---|---|---|
| Architecture-centric evaluation | A nuisance/confounder | Adjust for provenance before claiming one architecture is inherently better. |
| Method-centric evaluation | A signal | Study provenance-linked variation as evidence of training or implementation practice differences. |
| Procurement evaluation | A risk factor | Prefer model cards, reproducibility, availability, and evaluation traceability over naked rank. |
This is the part of the paper that feels most immediately business-relevant. If a firm is choosing a model vendor, “which contributor ecosystem produced this result?” is not a footnote. It may be part of the product.
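For readers who want to see what a variance share means mechanically, the following is a simplified sketch. It uses a classical two-facet ANOVA estimator on a balanced benchmark-by-contributor table with one score per cell, which is a much simpler design than the paper's four-facet crossed random-effects model. The data are simulated, the effect sizes are assumptions, and the toy uses 60 levels per facet (far more than the six benchmarks in the paper) only so that the estimates are stable.

```python
import random


def variance_shares(table):
    """Estimate variance shares from a balanced table[benchmark][contributor].

    Uses expected mean squares for a crossed design with one observation per
    cell, so the residual also absorbs any benchmark-by-contributor interaction.
    """
    nb, nc = len(table), len(table[0])
    grand = sum(map(sum, table)) / (nb * nc)
    row_means = [sum(row) / nc for row in table]
    col_means = [sum(table[i][j] for i in range(nb)) / nb for j in range(nc)]

    ss_b = nc * sum((m - grand) ** 2 for m in row_means)
    ss_c = nb * sum((m - grand) ** 2 for m in col_means)
    ss_res = sum((table[i][j] - row_means[i] - col_means[j] + grand) ** 2
                 for i in range(nb) for j in range(nc))

    ms_b = ss_b / (nb - 1)
    ms_c = ss_c / (nc - 1)
    ms_res = ss_res / ((nb - 1) * (nc - 1))

    components = {
        "benchmark": max((ms_b - ms_res) / nc, 0.0),    # truncate negatives at 0
        "contributor": max((ms_c - ms_res) / nb, 0.0),
        "residual": ms_res,
    }
    total = sum(components.values())
    return {name: value / total for name, value in components.items()}


# Simulated data with assumed standard deviations: benchmark 4, contributor 1, noise 2.
rng = random.Random(0)
n_bench, n_contrib = 60, 60
bench_effect = [rng.gauss(0, 4) for _ in range(n_bench)]
contrib_effect = [rng.gauss(0, 1) for _ in range(n_contrib)]
table = [[50 + bench_effect[i] + contrib_effect[j] + rng.gauss(0, 2)
          for j in range(n_contrib)] for i in range(n_bench)]

shares = variance_shares(table)
for name, share in shares.items():
    print(f"{name:12s} {share:.1%}")
```

The output is a set of shares that sum to one, in the same spirit as the percentages quoted above. A real analysis would need unbalanced data handling, more facets, and uncertainty estimates, which is where mixed-effects software comes in.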
Layer three: scaling is a vector, not a single heroic curve
The third layer moves from leaderboard scores to scaling claims. The standard habit is to regress score on model size and infer a scaling law. The paper argues that this is too blunt because raw benchmark scores mix the general factor, benchmark-specific residuals, benchmark difficulty, contributor effects, deployment effects, and measurement noise.
The latent regression model therefore estimates size effects in latent space. Instead of asking whether the average score scales with model size, it asks how log-parameter count relates to the general factor and to each benchmark-specific residual factor after controls.
The difference is not cosmetic. The paper reports that the general factor has the strongest and most stable size relationship: the log-parameter coefficient for the general factor is 1.05 with slope reliability 0.97 in the latent regression table. That supports a reasonable claim: increasing model size robustly improves broad cross-domain capability in this leaderboard ecosystem after controls.
But the benchmark-specific slopes look very different. The same table reports much weaker or less reliable size effects for several specific factors: BBH has a size coefficient of -0.05 with reliability 0.05; GPQA has 0.07 with reliability 0.20; IFEval has -0.08 with reliability 0.39; MuSR has 0.04 with reliability 0.07; MATH Level 5 has 0.13 with reliability 0.18. MMLU-Pro is the notable exception, with a size coefficient of 0.52 and reliability 0.57.
So the clean business interpretation is not “scale works” or “scale is over.” Both are lazy. The interpretation is: scale appears robust for a broad general factor and for some knowledge-heavy residual structure, but it is much less convincing as a universal explanation for benchmark-specific reasoning or instruction-following gains.
That distinction matters when companies interpret model roadmaps. If a provider says, “Our bigger model improved the benchmark average,” the right follow-up is not applause. It is: which latent dimension improved? If the gain is mostly the general factor, the model may be broadly stronger but not necessarily better at the specific workflow risk you care about. If a customer-service automation system needs reliable instruction following, or a research assistant needs robust multi-step reasoning, the aggregate scaling curve is not enough evidence.
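The coefficients quoted in this section can be collected into a small lookup to show how differently the latent dimensions behave. The numbers below are the ones reported in the text above; the reliability threshold of 0.5 and the function name are assumptions made purely for illustration, not a cutoff taken from the paper.

```python
# (size coefficient, slope reliability) for each latent factor, as quoted above.
REPORTED_SIZE_EFFECTS = {
    "general": (1.05, 0.97),
    "BBH": (-0.05, 0.05),
    "GPQA": (0.07, 0.20),
    "IFEval": (-0.08, 0.39),
    "MuSR": (0.04, 0.07),
    "MATH Level 5": (0.13, 0.18),
    "MMLU-Pro": (0.52, 0.57),
}


def reliably_scaling_factors(effects, min_reliability=0.5):
    """Return factors with a positive size coefficient and reliability above a cutoff.

    The cutoff is a hypothetical screening value, not a threshold from the paper.
    """
    kept = [(name, coef) for name, (coef, reliability) in effects.items()
            if coef > 0 and reliability >= min_reliability]
    # Strongest size relationship first.
    return [name for name, coef in sorted(kept, key=lambda item: -item[1])]


scaling = reliably_scaling_factors(REPORTED_SIZE_EFFECTS)
print("factors with a reliable positive size effect:", scaling)
```

Under this assumed cutoff only the general factor and MMLU-Pro pass, which matches the reading that scale is a vector of effects rather than one curve.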
The paper’s benchmark-specific results are useful, but only if read as residual claims
The paper offers several benchmark-specific observations. They are interesting precisely because they are not raw-score claims.
MMLU-Pro, interpreted as general knowledge, retains a meaningful denoised scaling effect beyond the general factor. That suggests knowledge-intensive performance is strongly linked to scale in this ecosystem.
By contrast, reasoning-focused factors such as GPQA and BBH show trivial denoised scaling effects after controls. This does not prove that larger models cannot reason better. It says that, once the paper removes the general factor and ecosystem effects, these benchmark-specific residual factors do not show a strong size-driven relationship in this dataset.
IFEval is also revealing. The paper finds no significant size effect for the IFEval-specific factor after controls. Gains there appear driven more by tuning and deployment choices than by raw scale. That is exactly the kind of result enterprises should care about. If instruction following is the target, buying more parameters may be less efficient than selecting or creating the right post-training recipe.
MuSR is more delicate. The paper reports a small negative relationship with model size in the fully controlled model and notes that several covariates associated with improved instruction following show negative effects on the MuSR factor. It suggests a possible trade-off between instruction-following optimization and “soft reasoning.” The word “possible” is doing real work here. This is not a universal law of alignment tax. It is ecosystem-level evidence that current optimization practices do not uniformly improve all latent dimensions.
That is the business lesson: post-training can move different abilities in different directions. A single average can hide capability substitution. A model that becomes more obedient may not become more flexible in every reasoning mode. The dashboard says “up”; the latent profile may say “up here, flat there, slightly worse over there.” Annoying, yes. Also the point of measurement.
The top of the leaderboard is where noise becomes most expensive
The paper’s ranking analysis is the most procurement-friendly result. After conditioning on the latent general factor and using precision-weighted ranks, the broad ranking distribution is largely preserved. Around 90% of models in the original top decile remain within the top decile. This means the paper is not saying the leaderboard is random.
But the top 1% is fragile. Only one model in the original top 1% remains in the top 1% after conditioning. The bottom 1% behaves differently: 64% remain in the bottom 1%.
This asymmetry is exactly where business decisions get expensive. Most firms are not deciding whether to use a clearly terrible model. They are deciding among models clustered near the top, where small differences become marketing claims, vendor negotiation points, and architecture-selection arguments. That is also where measurement noise can reshuffle winners.
The lesson is not “ignore the top models.” It is: the closer the decision margin, the more the evaluation must move from rank reading to uncertainty-aware diagnosis.
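The retention figures above are a simple overlap statistic: of the models in the top slice before adjustment, what fraction is still there afterwards? A minimal sketch follows, using hypothetical model names and scores rather than the paper's data or its precision-weighted ranking procedure.

```python
def top_fraction(scores, frac):
    """Return the set of model names in the top `frac` of a {model: score} dict."""
    k = max(1, round(len(scores) * frac))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def retention(original, adjusted, frac):
    """Share of the original top slice that stays in the top slice after adjustment."""
    before = top_fraction(original, frac)
    after = top_fraction(adjusted, frac)
    return len(before & after) / len(before)


# Hypothetical leaderboard: 100 models with scores 0..99.
original = {f"m{i}": float(i) for i in range(100)}

# Hypothetical adjustment: the former leader drops to mid-table, nothing else moves.
adjusted = dict(original)
adjusted["m99"] = 50.0

top_decile_retention = retention(original, adjusted, 0.10)
top_percent_retention = retention(original, adjusted, 0.01)
print(f"top 10% retention: {top_decile_retention:.0%}")
print(f"top 1% retention:  {top_percent_retention:.0%}")
```

A single reshuffled model barely dents the top decile but wipes out the top 1% in this toy case. Narrow slices are mechanically more fragile, which is one reason the very top of a leaderboard deserves extra scrutiny.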
A practical policy would look like this:
| Decision context | Acceptable evidence | What to avoid |
|---|---|---|
| Broad screening | Raw leaderboard decile plus basic capability filters | Over-interpreting tiny score gaps |
| Vendor shortlist | Task-specific evaluation plus provenance and deployment metadata | Treating average rank as procurement proof |
| Final model selection | Internal benchmark, uncertainty intervals, failure-mode analysis, and cost/security constraints | Selecting the top public rank because it looks objective |
| Training intervention claim | Latent or at least stratified analysis separating general and specific gains | Claiming “reasoning improved” from aggregate score movement |
The top rank is not worthless. It is just not a warranty.
What the evidence directly shows, and what Cognaptus infers
The paper directly shows several things within the studied Open LLM Leaderboard ecosystem.
First, benchmark scores are better understood through latent structure than through naive independent benchmark labels. A strong general factor dominates much cross-benchmark covariance, while benchmark-specific factors carry residual information.
Second, ecosystem facets matter. Benchmark identity accounts for a large share of variance. Contributor/provenance explains a nontrivial and rank-relevant share, larger than architecture and deployment main effects in the reported decompositions.
Third, scaling is not one scalar law. The general factor scales strongly and reliably with model size, while many benchmark-specific residual factors show weak, negligible, or context-dependent scaling effects.
Fourth, rankings are broadly stable at coarse levels but unstable at the very top. That combination is important: the leaderboard is directionally informative, yet too noisy for fine-grained winner selection without additional controls.
Cognaptus infers three operational principles from this.
The first is evaluation separation. Companies should separate general capability screening from specific workflow validation. A leaderboard average is useful for the former. It is inadequate for the latter.
The second is provenance-aware procurement. Model selection should treat contributor, documentation, availability, tuning recipe, and deployment metadata as decision variables, not decoration. If provenance-linked variance is large, then provenance is part of model quality.
The third is scaling-law humility. Bigger models may improve broad capability, but business workflows buy specific reliability under constraints. The relevant question is not whether scale improves the average. It is whether scale improves the failure mode you cannot afford.
The appendix tests robustness, not a second thesis
The paper’s appendices are unusually important because several main claims depend on whether the measurement procedure itself is stable. The robustness material should be read as support for the mechanism, not as a separate story.
| Appendix component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Item-resampled bootstrapping | Robustness and computational tractability | Structural conclusions are not driven by one item subset | That the sampled leaderboard represents all LLM evaluation contexts |
| Out-of-sample AUC/MAE for CFA structures | Overfitting check | Bifactor is not chosen only because it is flexible | That bifactor factors have fixed semantic meanings across ecosystems |
| Permutation controls | Sensitivity test against benchmark-label overinterpretation | Benchmark labels only partially organize residual structure | That benchmark labels carry no information |
| Modification-index / SEPC maps | Local-dependence diagnostic | Some item relationships remain unexplained after latent modeling | The causal reason for each residual coupling |
| Granularity and Bayesian variance checks | Robustness of variance decomposition | Contributor > architecture ordering is not just one coding artifact | Causal attribution to contributors or training practices |
| Inverse-variance aggregation | Stability of bootstrapped estimates | More precise bootstrap estimates get more weight | That all unobserved metadata noise has been removed |
This matters because a weak reading of the paper would overclaim the appendices. A stronger reading is more useful: the appendices make the main cartographic mechanism credible enough to use as an evaluation template.
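The inverse-variance aggregation row in the table above refers to a standard pooling rule: each estimate is weighted by the reciprocal of its variance, so more precise estimates count for more. The sketch below shows the generic formula with made-up numbers. How the paper applies it to its bootstrap estimates is not reproduced here.

```python
def inverse_variance_mean(estimates, variances):
    """Pool estimates with weights 1/variance; return (pooled mean, pooled variance)."""
    if len(estimates) != len(variances) or not estimates:
        raise ValueError("need one positive variance per estimate")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    pooled = sum(w * x for w, x in zip(weights, estimates)) / total_weight
    return pooled, 1.0 / total_weight


# Hypothetical bootstrap estimates: the first is four times more precise than the second.
pooled, pooled_var = inverse_variance_mean([0.0, 10.0], [1.0, 4.0])
print(f"pooled estimate: {pooled:.2f}, pooled variance: {pooled_var:.2f}")
```

With equal variances this reduces to a plain average. With unequal variances the pooled value moves toward the more precise estimate.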
Boundaries: useful map, not universal geography
The paper’s limitations are not ceremonial. They define how the results should be used.
The empirical findings come from one leaderboard snapshot and six benchmarks. The exact variance shares and latent structure are ecosystem-dependent. A corporate benchmark suite for customer-support agents, code repair, medical summarization, or financial research may have a different factor structure.
The metadata is observational and noisy. Contributor/provenance should not be read as a causal explanation. It is a useful grouping that captures structured differences associated with submission source and availability metadata, not a clean experimental treatment.
The Open LLM Leaderboard is not the population of all LLMs. It is an open-submission ecosystem. Models submitted there may reflect particular incentives, release practices, and optimization patterns.
There is also a deeper psychometric boundary. A bifactor model is a statistical structure, not an ontology of intelligence. The general factor is not “intelligence itself,” wearing a tiny lab coat. It is the dominant shared variance pattern in this item pool under this model family. That is valuable, but it is not metaphysics.
Finally, benchmark analysis can itself strengthen leaderboard-centric optimization. Better measurement can help organizations avoid overconfident claims, but it can also make hill-climbing more efficient. The responsible use is not to worship a cleaner score. It is to connect measurement to real failure modes, task requirements, and human oversight.
From leaderboard watching to evaluation due diligence
The business value of this paper is cheaper diagnosis before expensive commitment.
A leaderboard can still be the first filter. It can tell an evaluation team which models deserve attention. But the next step should be due diligence: identify whether the score reflects general capability, benchmark-specific residual ability, benchmark difficulty, provenance-linked implementation choices, deployment effects, or local-dependence artifacts.
That due diligence does not require every company to reproduce the full paper. Few procurement teams are going to run mixed-effects bifactor IRT before lunch, and that is probably good for morale. But the logic can be translated into an evaluation checklist:
- Report benchmark results separately before averaging them.
- Compare broad capability scores with task-specific internal tests.
- Track metadata: base model, contributor, training recipe, tuning style, quantization, availability, and documentation quality.
- Treat tiny differences among top-ranked models as unstable until tested under company-specific workloads.
- Distinguish scale-driven improvement from post-training-driven improvement.
- Audit residual failure patterns instead of celebrating aggregate gains.
This is the shift from leaderboard watching to evaluation governance. It is less glamorous than sorting a table and declaring a winner. It is also less likely to make a firm buy the wrong model because the decimal looked confident.
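The checklist item about treating tiny differences among top-ranked models as unstable can be made concrete with a paired bootstrap on an internal test set. This is a minimal sketch under stated assumptions: per-item results are 0/1 correctness flags, both models are scored on the same items, and the data below are invented. It is a generic resampling technique, not a procedure taken from the paper.

```python
import random


def bootstrap_gap_interval(correct_a, correct_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy(A) - accuracy(B) on shared items."""
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("both models must be scored on the same non-empty item set")
    rng = random.Random(seed)
    n = len(correct_a)
    gaps = []
    for _ in range(n_boot):
        # Resample items with replacement, keeping each item's A/B pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_boot)]
    upper = gaps[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical internal test: model A gets 52 of 100 items right, model B gets 50.
model_a = [1] * 52 + [0] * 48
model_b = [0] * 25 + [1] * 50 + [0] * 25

low, high = bootstrap_gap_interval(model_a, model_b)
print(f"observed gap: {(sum(model_a) - sum(model_b)) / 100:+.2f}")
print(f"95% bootstrap interval: [{low:+.2f}, {high:+.2f}]")
```

If the interval comfortably includes zero, as it does for a two-point gap on 100 items in this example, the ranking between the two models is not something to build a procurement argument on without more data.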
Conclusion: the rank is a clue, not the case
AI Cartography gives us a better way to read LLM benchmark ecosystems. It says that leaderboards are not meaningless, but they are not transparent either. They are compressed outputs of a measurement system whose latent structure, ecosystem variance, and scaling behavior must be modeled before fine-grained claims become trustworthy.
For AI builders, the paper warns against claiming that a training intervention improved a named capability when the movement may belong mostly to a general factor or provenance-linked artifact. For model buyers, it warns against selecting near-top models by raw rank when the decision margin is exactly where noise is most damaging. For benchmark designers, it offers a route toward reporting uncertainty, residual structure, and latent profiles rather than pretending the average score has done all the intellectual labor.
The leaderboard rank is a clue. The measurement model is the case file.
And as every competent investigator knows, the clue that looks cleanest is often the one that needs dusting for fingerprints.
[1] Michael Hardy, Anka Reuel, Lijin Zhang, Jodi M. Casabianca, Sang Truong, Yash Dave, Hansol Lee, Benjamin Domingue, and Sanmi Koyejo, “AI Cartography: Mapping the Latent Landscape of AI Benchmark Ecosystems,” arXiv:2605.25272v1, 24 May 2026, https://arxiv.org/abs/2605.25272.