TL;DR for operators
The usual AI strategy story is simple: whoever spends the most on compute owns the future. The paper behind this article makes a more awkward claim: under current language-model scaling assumptions, massive compute advantage may be a temporary lead, not a permanent moat.[1]
The mechanism is not magic. It is diminishing returns. Chinchilla-like scaling laws imply that each additional unit of training compute buys a smaller reduction in loss. Meanwhile, hardware improvement and algorithmic progress are shared forces. They do not only help the largest labs. They also make yesterday’s “small” budget more capable. The result is a curve where frontier models pull ahead, peak in relative advantage, and then become less distinguishable from cheaper models.
The paper does not claim small models already equal frontier systems. Nor does it claim all capabilities converge everywhere. Its argument is conditional: if models keep improving mainly by learning a fixed distribution of human-generated text, then raw compute scaling becomes a weaker source of long-term inequality. That condition matters. Reinforcement learning, synthetic data, agentic environments, adversarial competition, and specialised domains can break the clean version of the story.
For businesses, the practical implication is blunt. Buying access to the most expensive model may still be useful, but it is a lousy foundation for durable advantage. If frontier capability diffuses downward through cheaper inference, distillation, sparse attention, speculative decoding, better open models, and software wrappers, then the moat moves elsewhere: proprietary workflows, data rights, evaluation systems, regulated deployment, latency-sensitive integration, auditability, and customer distribution.
So the operating question is not “Should we use the biggest model?” Sometimes, yes. The better question is: “Which parts of our AI stack will still matter when good-enough intelligence is cheap?”
Budget is the obvious moat. Diminishing returns are the quiet demolition crew.
The AI market has spent years treating compute as the new oil, the new electricity, and occasionally the new feudal landholding system. This is understandable. Frontier models are expensive. Training them requires enormous capital, scarce chips, engineering depth, and the patience to watch invoices become geological formations.
But the paper’s central move is to separate absolute performance from relative advantage. A frontier model can keep improving while its advantage over modest models shrinks. That sounds contradictory only if one assumes progress is linear. Scaling laws say it is not.
The paper models the training-loss advantage between a state-of-the-art model and a “meek” model: a model trained or run under a fixed, limited compute budget. Its starting point is a Chinchilla-style relationship between optimal compute and loss:
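In generic form, such a relationship treats loss as an irreducible floor plus a term that shrinks as a power of compute. The symbols $L_\infty$, $A$, and $\alpha$ below are placeholders assumed for this sketch, not the paper's fitted constants; for orientation, the parametric fit published with Chinchilla implies a compute-optimal exponent of roughly $\alpha \approx 0.15$.

$$
L(C) = L_\infty + A\,C^{-\alpha}, \qquad \alpha > 0
$$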
The important part is the exponent. As compute $C$ rises, loss falls, but the incremental gain gets smaller. That is the entire mechanism, and also the part executives tend to politely ignore because it ruins simple graphs.
The authors combine this with three growth forces:
| Force | How the paper treats it | Why it matters |
|---|---|---|
| Hardware improvement | Approximately $g_h = 1.4$ per year | Makes each dollar buy more computation over time |
| Algorithmic progress | Approximately $g_{\text{alg}} = 2.8$ per year | Makes models learn more efficiently from the same effective compute |
| Frontier compute investment | Approximately $g_i = 5 / 1.4 \approx 3.57$ per year | Lets large labs scale faster than ordinary users |
The frontier lab still has an enormous advantage. The paper is not pretending that a hobbyist with a laptop is secretly OpenAI in a hoodie. Rather, it asks what happens over time when both sides benefit from shared hardware and algorithmic progress, while the frontier side also grows investment exponentially.
The answer is the paper’s first major result: the loss gap rises, peaks, then declines. In the paper’s Figure 1, a model with roughly $3.6\times$ yearly compute scaling initially opens a gap over a meek model with a fixed \$1,000 budget. The gap then peaks at around three to four years and starts falling. The expensive model keeps improving, but each extra investment buys less relative separation.
This is the argument’s most business-relevant sentence: a company can keep spending more and still get less moat per dollar.
That does not mean the spending is irrational. During the period of widening advantage, frontier access can be strategically valuable. It may support better products, faster research, stronger safety testing, more data capture, and brand credibility. But the paper’s model suggests that this period may be a window, not an empire.
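The rise-and-fall shape can be reproduced with a few lines of arithmetic. The sketch below is not the paper's code. It uses the three growth factors from the table, and it assumes two things the article does not state: a compute-scaling exponent of about 0.154 (the range implied by the published Chinchilla fit) and an equal starting budget for both sides at year zero.

```python
# Growth factors per year, as quoted in the table of growth forces.
G_HARDWARE = 1.4
G_ALGORITHM = 2.8
G_INVESTMENT = 5 / 1.4

# Assumption: scaling exponent of roughly 0.154. The paper's constant may differ.
ALPHA = 0.154


def loss_gap(years: float, alpha: float = ALPHA) -> float:
    """Meek-minus-frontier reducible loss, in units of the year-0 reducible loss.

    Assumption: both sides start from the same effective compute at year 0.
    """
    shared = (G_HARDWARE * G_ALGORITHM) ** years        # progress everyone receives
    meek_compute = shared                               # fixed budget, shared gains only
    frontier_compute = shared * G_INVESTMENT ** years   # shared gains plus investment
    return meek_compute ** -alpha - frontier_compute ** -alpha


def peak_year(horizon: float = 20.0, step: float = 0.01) -> float:
    """Grid-search the year at which the gap is largest."""
    grid = [i * step for i in range(int(horizon / step) + 1)]
    return max(grid, key=loss_gap)


if __name__ == "__main__":
    for year in (0, 1, 2, 3, 4, 6, 8, 10):
        print(f"year {year:2d}: gap = {loss_gap(year):.3f}")
    print(f"gap peaks near year {peak_year():.2f}")
```

Under these assumptions the gap peaks a little after year three, consistent with the three-to-four-year range quoted from Figure 1. Changing the starting budget ratio or the exponent moves the peak, which is why the date should be read as illustrative rather than as a forecast.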
The paper is about relative capability, not romantic small-model populism
A likely bad reading of the paper is: “Small models will beat frontier models.” That is not what the authors show.
A better reading is: “Under certain scaling assumptions, the frontier advantage narrows because the frontier approaches the irreducible parts of the task distribution.” That sentence is less tweetable. It is also more useful.
The fixed-distribution assumption is doing real work. The paper’s convergence story depends on models learning a stable distribution of human text. If there are only so many statistical regularities in that distribution, then larger models eventually spend more compute learning rarer and narrower patterns. Smaller models, helped by better algorithms and cheaper hardware, catch up on the broader patterns that matter for many everyday tasks.
This distinction matters for business adoption. The paper does not imply that every organisation should abandon frontier models and deploy tiny local systems immediately. It implies that model selection should be treated as a moving efficiency frontier, not as a permanent hierarchy.
Today’s frontier model may be best for difficult reasoning, complex tool use, high-stakes drafting, long-context synthesis, or domains where reliability gaps are costly. But today’s cheaper model may become good enough for large categories of classification, extraction, customer support, internal search, summarisation, routing, report generation, and low-risk automation. The moat erodes task by task, not by press release.
The paper’s mechanism therefore changes how model portfolios should be managed. Instead of asking which single provider is “best,” operators should track where frontier advantage still exceeds its cost:
| Use case type | Likely model strategy if convergence continues |
|---|---|
| Routine language work | Move down-market quickly as cheaper models improve |
| High-volume inference | Optimise for cost, latency, reliability, and batching |
| Regulated or auditable workflows | Prioritise controllability, logs, evaluation, and governance |
| Frontier reasoning or agentic planning | Keep premium-model access, but benchmark frequently |
| Proprietary domain workflows | Invest in data, retrieval, evaluation, and integration rather than model worship |
The paper’s point is not that the meek model is always better. It is that the definition of “good enough” keeps moving upward.
Loss is not capability, but the paper gives three bridges
A fair objection arrives quickly: training loss is not what users buy. Nobody purchases “0.03 fewer nats per token,” except perhaps someone with a suspiciously exciting procurement process. Businesses buy accuracy, reliability, reasoning, code quality, workflow completion, lower cost, faster turnaround, and fewer embarrassing hallucinations.
The paper knows this. It spends substantial effort arguing that loss difference is not merely an abstract training metric. It offers three bridges from loss to practical capability.
First, it notes that language-model loss has historically tracked model progress. The paper cites GPT-3 davinci loss at 4.36 and GPT-2 large loss at 5.16, using this as one example of loss movement aligning with broader capability improvement. This is not proof that loss equals intelligence. It is evidence that loss is not irrelevant spreadsheet decoration.
Second, it connects loss to benchmark performance. The authors fit MMLU performance as a sigmoid function of inferred loss. The sigmoid shape matters. It says progress may look slow, then rapid, then saturated. In Figure 2, MMLU performance rises sharply over a particular loss range and then approaches an upper plateau around 80% in their fitted data. Figure 3 then translates the loss-gap model into benchmark-gap dynamics: the SOTA model’s benchmark advantage grows, peaks, and then declines.
This is the main evidence, but not a universal law. MMLU is a benchmark, not a full theory of useful intelligence. The paper acknowledges that benchmark saturation can make models appear more similar than they are. When a test has a ceiling, convergence on the test may simply mean the test has run out of headroom. Anyone who has watched vendors boast about benchmark deltas already knows this ritual. The paper’s useful move is to show that even under a benchmark translation, the same rise-and-fall inequality pattern appears.
Third, the paper uses an information-theoretic distinguishability argument. If two models differ in loss by $\Delta L$, then the expected number of tokens needed by an ideal observer to distinguish the better predictor grows as $\Delta L$ shrinks. In the paper’s sequential probability ratio test framing, smaller loss differences require more evidence to tell the models apart. Figure 8 shows the token threshold rising as loss difference falls. Figure 9 applies this over time: as the SOTA-meek loss gap narrows, distinguishing the models on ordinary text requires longer samples.
This is elegant because it changes the question from “Is model A better?” to “How much interaction does it take before the difference matters?” In a production system, that is exactly the question. If users cannot reliably observe the difference across normal tasks, the premium model may still be technically superior but commercially overqualified.
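The distinguishability argument can be given rough numbers. The sketch below is not the paper's Equation 8. It applies Wald's standard approximation for a sequential probability ratio test, with symmetric error rates and a per-token log-likelihood advantage equal to the loss gap in nats. The function name and the default error rate are hypothetical choices made for illustration.

```python
import math


def tokens_to_distinguish(delta_loss_nats: float, error_rate: float = 0.05) -> float:
    """Approximate expected tokens for a sequential test to identify the better model.

    Assumptions (not taken from the paper): symmetric error rates, Wald's
    approximation for the SPRT, and an average per-token log-likelihood
    advantage equal to the loss gap measured in nats.
    """
    if delta_loss_nats <= 0:
        raise ValueError("delta_loss_nats must be positive")
    if not 0 < error_rate < 0.5:
        raise ValueError("error_rate must be strictly between 0 and 0.5")
    log_odds = math.log((1 - error_rate) / error_rate)
    return (1 - 2 * error_rate) * log_odds / delta_loss_nats


if __name__ == "__main__":
    for gap in (0.5, 0.1, 0.03, 0.01):
        tokens = tokens_to_distinguish(gap)
        print(f"loss gap {gap:>4} nats/token -> about {tokens:,.0f} tokens")
```

The required sample is inversely proportional to the gap: halve the loss difference and the observer needs twice as much text. That is the operational meaning of "harder to distinguish".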
What the figures are actually doing
The paper’s figures are not all serving the same purpose. Some are the main argument. Some are translations. Some are sensitivity checks. Treating them all as equal “evidence” would be lazy, and we are professionals here, allegedly.
| Paper component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Figure 1: training loss difference over time | Main mechanism | Diminishing returns can make SOTA-meek loss advantage peak and decline | That every capability gap closes on the same schedule |
| Figures 2–3: MMLU sigmoid and benchmark-gap translation | Main evidence / proxy translation | Loss-gap dynamics can map onto benchmark performance differences | That MMLU captures all economically valuable capability |
| Equation 8 and Figures 8–9: distinguishability via SPRT | Theoretical robustness / interpretation | Smaller loss gaps require more tokens to distinguish under the model assumptions | That specialised adversarial prompts cannot expose differences faster |
| Figure 4: inference versus training loss gap | Main extension | Inference-cost improvements may shrink practical access gaps faster than training convergence | That all inference-time scaling behaves like training-time scaling |
| Figure 5: Artificial Analysis trends | Exploratory empirical check | Commercial leaderboard data show suggestive inference-price convergence | That historical data conclusively validate the model |
| Figures 6–7: varying investment growth and initial capital | Robustness / sensitivity | Compute-investment differences affect the peak but do not eliminate convergence in the model | That parameter choices are settled or future scaling regimes are stable |
| Figure 10: investment stagnation scenarios | Robustness / sensitivity | Even large changes in future investment trajectories may have limited effect once diminishing returns dominate | That energy, data, or hardware bottlenecks are irrelevant |
This distinction is important because the paper is partly theoretical. Its strongest contribution is not an empirical proof that model inequality has already vanished. It is a mechanism showing why, under specific assumptions, raw compute advantage becomes less durable than expected.
The empirical section is appropriately more cautious. Using Artificial Analysis leaderboard data, the authors compare the best overall model with the best model available in a fixed inference-price range of roughly \$0.50 to \$1 per million tokens. They report suggestive convergence. Yet for training inequality, they rely on parameter count as a proxy for training compute, which they openly call rough. Parameter count is especially awkward now because modern models may be overtrained, distilled, quantised, sparsified, or wrapped in inference-time reasoning systems. The parameter number on the tin is not the capability mechanism inside the machine.
That does not weaken the paper’s conceptual contribution. It does limit how aggressively one should forecast dates.
Inference economics may compress the gap faster than training economics
Most companies will never train a frontier model. They will rent, host, fine-tune, distil, quantise, route, cache, and occasionally pretend this constitutes “AI transformation.” So the paper’s inference section may matter more operationally than the training model.
The authors model a meek user with a fixed inference budget, such as $10^{-8}$ dollars per token. They decompose inference affordability into factors such as FLOPs per dollar, parameters per FLOP, and effective parameters per actual parameter. This bundles improvements from hardware, KV caching, sparse attention, speculative sampling, distillation, overtraining, and broader algorithmic progress.
For a rough estimate, the paper uses a conservative inference-price improvement trend of $9\times$ per year. The key insight is nonlinear. If inference cost falls by a factor of 10, a user can run a model 10 times larger at the same token budget. Under Chinchilla-style assumptions, that can correspond to a model trained with roughly $100\times$ the compute of the smaller model, because compute-optimal training scales with both model size and data.
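The step from "10 times larger" to "$100\times$ the compute" is short arithmetic. As a sketch, assume the approximate Chinchilla result that compute-optimal parameter count $N_{\text{opt}}$ and token count $D_{\text{opt}}$ each grow with the square root of training compute $C$; the paper's exact exponents may differ slightly.

$$
N_{\text{opt}} \propto C^{1/2} \;\Longrightarrow\; C \propto N_{\text{opt}}^{2}, \qquad \frac{C_2}{C_1} = \left(\frac{N_2}{N_1}\right)^{2} = 10^{2} = 100
$$

Under the same assumption, a $9\times$ yearly fall in inference price would let a fixed budget reach a model equivalent to about $9^2 = 81\times$ more training compute each year.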
Figure 4 shows the result: the inference-based loss difference falls much faster than the training-based loss difference. This is exactly what businesses have already started to feel in less formal language: last year’s expensive capability becomes this year’s API tier, and then next year’s open-weight deployment option. The frontier moves, but the usable middle improves faster than procurement cycles.
The appendix adds another practical clue. The authors connect their distinguishability model to speculative sampling. Speculative sampling uses a cheap draft model to generate many tokens, while a larger target model verifies or corrects them. This works better when the draft model is usually close to the target. If meek and frontier models become harder to distinguish on many ordinary tokens, speculative decoding becomes not just a clever optimisation, but a structural consequence of convergence.
That has a direct business reading. The future AI stack may not be “one giant model answers everything.” It may be a routing and verification system where cheap models handle most tokens, expensive models intervene only where the loss gap still matters, and the system learns to spend intelligence selectively. Less glamorous, more profitable. Tragic for keynote slides, good for margins.
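The link between convergence and speculative sampling can be made concrete. In standard speculative sampling, a drafted token is accepted with probability $\min(1, p/q)$, where $p$ is the target model's probability for that token and $q$ is the draft model's, so the overall acceptance rate is the overlap between the two distributions. The sketch below computes that overlap for toy next-token distributions; the vocabularies and probabilities are hypothetical and chosen only to illustrate the effect.

```python
def acceptance_rate(target: dict[str, float], draft: dict[str, float]) -> float:
    """Probability that one drafted token is accepted under speculative sampling.

    Each drafted token x is accepted with probability min(1, target[x] / draft[x]),
    so the expected acceptance rate is the sum over tokens of min(target, draft).
    """
    tokens = set(target) | set(draft)
    return sum(min(target.get(t, 0.0), draft.get(t, 0.0)) for t in tokens)


if __name__ == "__main__":
    # Hypothetical next-token distributions for a single context.
    target = {"the": 0.60, "a": 0.30, "an": 0.10}
    close_draft = {"the": 0.55, "a": 0.33, "an": 0.12}
    far_draft = {"the": 0.20, "a": 0.20, "an": 0.60}

    print(f"close draft accepted: {acceptance_rate(target, close_draft):.2f}")
    print(f"far draft accepted:   {acceptance_rate(target, far_draft):.2f}")
```

The closer the cheap model's predictions sit to the expensive model's, the more drafted tokens survive verification and the less often the large model has to do real work.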
The moat moves from model access to operating advantage
If the paper’s mechanism is directionally correct, the commercial implication is not that model companies are doomed. It is that model access alone becomes less defensible.
The expensive frontier model still matters. It may define new capabilities, generate training data, serve as a teacher model, enable high-end reasoning, and set expectations for what downstream products should do. But if those capabilities diffuse into cheaper models, then durable enterprise value shifts toward the parts of the system that do not automatically become cheaper with global algorithmic progress.
That includes:
| Business layer | Why it becomes more important if model capability commoditises |
|---|---|
| Proprietary workflow integration | The model is easy to replace; the process map is not |
| Domain data and permissions | Useful private context remains scarce even when intelligence is cheap |
| Evaluation and monitoring | Good-enough models still fail; knowing when they fail becomes the product |
| Compliance and audit trails | Regulated buyers care about accountability, not just benchmark scores |
| Latency and cost engineering | High-volume AI margins depend on routing, caching, batching, and fallback design |
| User distribution | Cheap intelligence still needs a channel to reach real demand |
| Trust and change management | Organisations adopt systems, not loss curves |
This is the part many AI vendors quietly dislike. If model quality converges for ordinary tasks, then the market stops rewarding generic “powered by frontier AI” claims and starts asking impolite questions about implementation. Does it reduce headcount hours? Does it integrate with the ERP? Does it preserve audit logs? Can it handle the customer’s messy documents? Does it know when to escalate? Does it lower cost per resolved case? Can legal approve it without needing a small ceremony?
The paper’s argument therefore supports a more disciplined AI investment thesis. Frontier access is a capability input. It is not a business model. The business model comes from turning capability into repeatable, measurable, defensible operations.
Governance has a window problem
The paper also draws governance implications, and they are less comforting than the democratic title might imply.
If only a few large actors can access powerful models, governance can focus on frontier labs, compute thresholds, export controls, and licensing obligations. That is administratively convenient. It is also fragile if capability diffuses downward.
The authors describe a “governance window”: a period when large organisations have a meaningful capability advantage before powerful models become widely accessible. During this window, regulators and trusted organisations may be able to study risks, build safety practices, and impose targeted oversight. But the same window also allows concentrated actors to accumulate influence before access broadens. Centralisation is useful right until it becomes the problem. A tidy little policy paradox, because apparently AI governance needed more of those.
The paper points out that US and EU policy discussions often use training-compute thresholds, such as $10^{26}$ and $10^{25}$ FLOPs, as regulatory anchors. Its argument challenges the durability of that approach. If algorithmic progress, distillation, inference scaling, and cheaper hardware allow lower-budget systems to approach frontier capabilities, then regulating only the original training run misses the diffusion pathway.
For businesses, the governance lesson is operational. Compliance cannot be built only around vendor selection. It must follow capability through the system: model routing, fine-tuning, retrieval sources, tool permissions, logging, human escalation, and output controls. The risk may not sit in one giant model. It may sit in the workflow assembled around several increasingly capable cheap ones.
Where the convergence argument can break
The most important boundary is that the paper’s core model is strongest under fixed-distribution next-token learning. That is a serious boundary, not a decorative disclaimer.
Reinforcement learning and synthetic data change the question from “How well does the model learn the human-text distribution?” to “What distribution is the model being trained to master?” If models learn from self-generated tasks, tool-using environments, simulations, games, automated theorem proving, code execution, robotics feedback, or adversarial interaction, the fixed-distribution picture may no longer hold.
The paper is explicit about this. In adversarial settings, small loss differences can produce large outcome differences. A model that is only slightly better may find edge cases, exploit unfamiliar situations, or compound small advantages across many steps. Competitive domains are not polite benchmark exams. They are closer to games, markets, cybersecurity, negotiation, and scientific search, where tiny differences can matter because opponents adapt.
Multi-step tasks create another complication. The paper models tasks requiring multiple correct benchmark answers by raising benchmark performance to a power $p$. As $p$ increases, the advantage period for larger models lengthens. This is intuitive. If a workflow requires ten consecutive correct reasoning steps, a small per-step error gap can become operationally large. In production, that means convergence for single-turn general tasks does not automatically imply convergence for long-horizon agents.
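A small calculation shows how the power $p$ stretches a modest per-step difference. The per-step accuracies below are hypothetical, and the sketch assumes steps succeed or fail independently, which real workflows only approximate.

```python
def workflow_success(per_step_accuracy: float, steps: int) -> float:
    """Chance that every one of `steps` independent steps succeeds."""
    return per_step_accuracy ** steps


if __name__ == "__main__":
    frontier, cheap = 0.95, 0.90  # hypothetical per-step accuracies
    for p in (1, 5, 10, 20):
        strong = workflow_success(frontier, p)
        weak = workflow_success(cheap, p)
        print(f"p={p:2d}: frontier {strong:.3f}, cheap {weak:.3f}, ratio {strong / weak:.2f}")
```

A five-point difference on a single step becomes roughly 60% versus 35% end-to-end success at ten steps, and the ratio between the two keeps widening as chains lengthen.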
Specialised knowledge also weakens the distinguishability argument. The paper notes that fewer tokens may be needed to distinguish models if the test is narrowed to specialised knowledge. Ordinary text may make models look similar; targeted expert prompts may reveal gaps quickly. This matters for law, medicine, engineering, finance, cyber operations, and any domain where the cost of subtle error is not subtle.
So the right conclusion is not “frontier models stop mattering.” It is: frontier advantage becomes more concentrated in harder, longer-horizon, more adversarial, more specialised, or newly trained capability regimes. For everything else, cheaper models keep eating the middle.
How operators should use this paper
The practical response is not to bet the company on a single model tier. The practical response is to build an AI stack that assumes capability will diffuse.
A useful operating framework looks like this:
| Decision | Bad question | Better question |
|---|---|---|
| Model procurement | Which model is the smartest? | Where does extra capability still change the outcome? |
| Automation design | Can AI do this task? | What is the cheapest reliable model-route for each subtask? |
| Data strategy | Can we fine-tune a model? | What proprietary context improves decisions after models commoditise? |
| Risk control | Is this a safe vendor? | Where can the workflow fail, and how do we detect it? |
| ROI tracking | Are we using frontier AI? | Are we reducing cost, latency, error, or cycle time? |
| Governance | Does the model pass policy? | Does the whole system preserve auditability and escalation? |
The most robust architecture is likely heterogeneous. Use smaller or cheaper models for extraction, classification, drafting, routing, summarisation, and routine dialogue. Reserve frontier models for difficult reasoning, ambiguous cases, high-value synthesis, high-risk outputs, or teacher-model roles. Add evaluation, logging, retrieval controls, and fallback paths. Then revisit the routing policy often, because the cost-performance frontier will keep moving.
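One way to picture such a heterogeneous stack is a router that tries the cheap tier first and escalates when an evaluation check fails. The sketch below is illustrative only: `Task`, `choose_tier`, `answer`, the list of routine task kinds, and the callables passed in are all hypothetical names, not part of the paper or of any vendor's API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical task kinds considered safe for the cheaper tier.
ROUTINE_KINDS = {
    "extraction", "classification", "drafting",
    "routing", "summarisation", "dialogue",
}


@dataclass(frozen=True)
class Task:
    kind: str
    high_risk: bool = False


def choose_tier(task: Task) -> str:
    """Send high-risk or non-routine work straight to the frontier tier."""
    if task.high_risk or task.kind not in ROUTINE_KINDS:
        return "frontier"
    return "cheap"


def answer(
    task: Task,
    cheap_model: Callable[[Task], str],
    frontier_model: Callable[[Task], str],
    is_acceptable: Callable[[Task, str], bool],
) -> tuple[str, str]:
    """Return (tier_used, output), escalating if the cheap output fails the check."""
    if choose_tier(task) == "cheap":
        output = cheap_model(task)
        if is_acceptable(task, output):
            return "cheap", output
    # Fallback path: non-routine, high-risk, or a rejected cheap answer.
    return "frontier", frontier_model(task)
```

In this framing, the policy inside `choose_tier` and the check behind `is_acceptable` are the parts worth revisiting as cheaper models improve; the model callables themselves are interchangeable.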
This is also the right way to read model benchmarks. The question is not whether Model A beats Model B by a few points on a leaderboard. The question is whether that gap survives contact with your distribution, your latency requirements, your error tolerance, your compliance obligations, and your budget. Many benchmark gaps will be worth paying for. Many will not. The invoice will help clarify which is which.
The meek do not inherit the earth. They inherit the profitable middle first.
The paper’s title is deliberately provocative. Its business message is more precise: diminishing returns make raw compute a weaker long-term source of separation under the current scaling paradigm. The frontier can keep advancing while its advantage over modest systems becomes harder to monetise across ordinary tasks.
That is not a small claim. It undermines the lazy assumption that AI markets must remain permanently dominated by whoever trains the largest model. It also undermines the equally lazy counterclaim that open or cheap models will automatically win. Both miss the mechanism.
The future implied by this paper is more uneven. Frontier labs may still matter enormously for discovering new capabilities. But once a capability becomes part of the general modelling frontier, efficiency work can push it downward into cheaper inference, smaller deployments, distilled systems, speculative decoding, and open-weight ecosystems. Power does not disappear. It migrates.
For Cognaptus readers, the strategic lesson is clean enough to be useful: do not build AI advantage on a model-size hierarchy that scaling laws are already compressing. Build it where convergence helps you rather than hurts you. Own the workflow. Own the data interface. Own the evaluation loop. Own the trust layer. Let the model market fight its expensive theological wars in the background.
The meek may not inherit everything. But they may inherit enough to make yesterday’s compute moat look like a very large, very hot, very depreciating asset.
Cognaptus: Automate the Present, Incubate the Future.
[1] Hans Gundlach, Jayson Lynch, and Neil Thompson, “Meek Models Shall Inherit the Earth,” arXiv:2507.07931v1, 10 July 2025, https://arxiv.org/abs/2507.07931.