Small Gains, Long Games: Why Tiny Accuracy Bumps Explode into Big Execution Wins

A workflow does not fail because the first step is hard.

It fails because the seventeenth step is boring, the twenty-third step depends on a slightly wrong state, and by the thirty-first step the agent is confidently building on its own rubbish. Very enterprise. Very scalable. Very expensive.

The paper behind this article, The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs, makes a deceptively simple point: judging LLM progress by short-task accuracy can badly understate the value of reliability gains over long workflows.¹ A model that improves only slightly on a single step may become dramatically better at completing long sequences without failure. That is not motivational poster mathematics. It is compounding.

The sharper contribution is that the authors do not merely say “long tasks are hard.” Everyone has seen an agent wander off in the middle of a workflow like an intern sent to find a stapler in 2009. Instead, they isolate a specific capability: execution. They remove planning and world knowledge from the equation, give the model the data and the plan, and then ask whether it can keep applying the plan correctly over time.

The uncomfortable answer: often, no.

The mechanism: a tiny slip rate becomes a task-length ceiling

The paper begins with a simple model. Suppose a model has single-step accuracy $p$, that this accuracy stays constant across steps, that errors are independent, and that they are unrecoverable: one error means task failure. Under those assumptions, the probability of finishing all $H$ steps is:

$$ p^H $$

For a target success probability $s$, the horizon length, meaning the longest task the model still completes with probability at least $s$, is:

$$ H_s = \frac{\ln s}{\ln p} $$
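
The step between the two expressions is short enough to write out. This is only algebra on the formulas above, plus the standard approximation $\ln p \approx -(1-p)$ for $p$ close to 1; it is a reading aid, not a derivation quoted from the paper.

$$ p^{H} \ge s \;\iff\; H \ln p \ge \ln s \;\iff\; H \le \frac{\ln s}{\ln p} \qquad (\text{the inequality flips because } \ln p < 0) $$

$$ \ln p \approx -(1-p) \;\implies\; H_{0.5} \approx \frac{\ln 2}{1-p} $$

Read that way, the horizon is inversely proportional to the per-step error rate $1-p$: under this idealised model, halving the error rate roughly doubles the horizon.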

This is intentionally idealised. The authors later show real models violate the constant-error assumption, usually in the annoying direction. But the formula gives the business intuition.

At low accuracy, small improvements are not very exciting. At high accuracy, they become explosive. Moving from “pretty reliable” to “slightly more reliable” can add many more consecutive steps before the expected failure point. That is why “diminishing returns” on short benchmarks can be misleading. A one-point improvement on a one-shot task may look boring; the same improvement inside a 50-step automation pipeline may be the difference between a demo and a deployable process.
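
To make the compounding concrete, here is a minimal Python sketch of the idealised model above. The function names `run_success` and `horizon_length` are ours, and the accuracies in the loop are illustrative values, not measurements from the paper.

```python
import math


def run_success(p: float, steps: int) -> float:
    """Probability of finishing `steps` steps with no error, assuming
    constant per-step accuracy p and independent, unrecoverable errors."""
    return p ** steps


def horizon_length(p: float, s: float = 0.5) -> float:
    """Real-valued H that solves p**H == s under the same idealised model."""
    if not (0.0 < p < 1.0 and 0.0 < s < 1.0):
        raise ValueError("p and s must lie strictly between 0 and 1")
    return math.log(s) / math.log(p)


# Illustrative accuracies only; none of these are measurements from the paper.
for p in (0.90, 0.95, 0.98, 0.99, 0.999):
    print(f"p={p:.3f}  H_0.5={horizon_length(p):7.1f}  "
          f"P(50 steps ok)={run_success(p, 50):.3f}")
```

Under these assumptions, moving from 98% to 99% step accuracy roughly doubles the 50% horizon (about 34 steps to about 69) and lifts the chance of a clean 50-step run from about 36% to about 61%.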

The paper’s economic framing is not subtle, and it should not be. If human labour is paid partly because humans can keep doing things over time, then an agent’s useful unit is not just pass@1 on a prompt. It is reliable execution horizon: how long it can keep updating state, retrieving facts, calling tools, and composing results before a human has to intervene.

That distinction matters because most enterprise AI evaluation still behaves as if the unit of work is a question. In operations, the unit of work is a run.

The test removes the usual excuses: no missing plan, no missing knowledge

The authors design a controlled retrieve-then-compose task. The model receives a dictionary mapping five-letter English words to integers. Each turn gives the model one or more keys. The model must retrieve the corresponding values and maintain a running sum.

This sounds trivial because it is meant to sound trivial. The point is not to test whether the model can solve a rich business problem. The point is to remove the confounders that usually muddy agent evaluation.
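
The task is simple enough to sketch in a few lines of Python. This is our reconstruction from the description above, not the authors' code: the word list, the value range, the seeds, and the simplified scoring rule are all assumptions made for illustration.

```python
import random

# Hypothetical five-letter keys; the paper's actual vocabulary is not reproduced here.
WORDS = ["apple", "brick", "chair", "delta", "eagle", "flame", "grape", "house"]


def make_dictionary(words, low=-99, high=99, seed=0):
    """Map each word to a random integer (the value range is an assumption)."""
    rng = random.Random(seed)
    return {word: rng.randint(low, high) for word in words}


def make_turns(dictionary, num_turns, keys_per_turn=1, seed=1):
    """Each turn is a list of keys the model must look up and add."""
    rng = random.Random(seed)
    keys = sorted(dictionary)
    return [[rng.choice(keys) for _ in range(keys_per_turn)]
            for _ in range(num_turns)]


def reference_sums(dictionary, turns):
    """Ground-truth running sum after every turn."""
    total, sums = 0, []
    for keys in turns:
        total += sum(dictionary[key] for key in keys)
        sums.append(total)
    return sums


def score(predicted, expected):
    """Simplified scoring: turn accuracy is the share of turns whose reported
    running sum is right; the task succeeds only if every turn is right."""
    matches = [p == e for p, e in zip(predicted, expected)]
    task_ok = len(predicted) == len(expected) and all(matches)
    return sum(matches) / len(expected), task_ok
```

A model's per-turn answers would be passed to `score` alongside `reference_sums(...)`; a single wrong sum leaves turn accuracy high but flips task success to false, which is exactly the gap the paper is measuring.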

Component	In real agent tasks	In the paper’s controlled task
Planning	Decide what to do next	Provided as keys to retrieve
Knowledge	Know facts, rules, or tool outputs	Provided in the in-context dictionary
Execution	Carry out the state update correctly	The actual capability being tested
State management	Track what has happened so far	Required through the running sum

This is why the paper is more useful than another “agents fail at long tasks” leaderboard. In realistic benchmarks, failure can mean bad planning, missing information, bad tool use, ambiguous goals, or brittle evaluation. Here, if the model fails, it is much harder to blame strategy. The strategy is handed to it. The facts are handed to it. The model still has to execute.

That is the first managerial translation: long-horizon failure is not always a reasoning failure. Sometimes the agent knows what to do and still cannot keep doing it.

The main evidence: models ace the first step, then decay anyway

In the first main experiment, the authors test Qwen3 and Gemma3 model families across sizes. They set turn complexity to the simplest case: one key per turn. This makes each turn easy. The question is whether easy remains easy after many repetitions.

Most models achieve near-perfect accuracy on the first step, except the smallest 4B models. So the task is not exposing a basic inability to retrieve a number or add an integer. Yet task accuracy drops quickly as the number of turns grows. Even the best model in that experiment, Qwen3-32B, falls below 50% task accuracy within 15 turns.

That is the paper’s cleanest evidence for execution as a separate bottleneck. The model is not failing because it never understood the operation. It is failing because the operation has to be performed reliably over a horizon.

Scaling helps. Larger models sustain accuracy for longer, and the horizon length improves clearly with model size. The authors avoid claiming a formal scaling law because each family only provides a few model sizes. Good. We can all take a moment to appreciate a paper that does not slap “law” on a line segment. Still, the trend is important: even when knowledge and planning are controlled away, larger models are more reliable executors.

The business reading is not “always buy the biggest model.” That would be wonderfully convenient for vendor pricing pages. The reading is narrower: for workflows where each step must be correct and errors are costly to recover, model scale can buy execution horizon even when the individual step looks solved.

The real trap is self-conditioning, not just long context

One might expect per-turn accuracy to stay roughly constant. If the model has a 99% chance of doing each turn correctly, perhaps it should remain around 99% across turns, with the cumulative task success falling because errors compound.

The paper finds something nastier. Turn accuracy itself degrades as the run progresses.

The authors test two explanations. First, maybe this is plain long-context degradation: as the transcript gets longer, performance worsens regardless of content. Second, maybe the model is self-conditioning: it sees its own previous mistakes in the conversation history and becomes more likely to continue the error pattern.

To separate these, they manipulate the history shown to the model. They inject artificial histories with controlled error rates, then measure the model’s accuracy on a later turn. With an error-free history, later-turn performance is still below first-turn performance, which supports a long-context effect. But as the induced error rate in the prior history increases, later accuracy degrades further. That is the self-conditioning effect.
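
The manipulation can be sketched as follows. This is a hedged illustration of the idea, not the paper's procedure: how the authors construct their corrupted histories is not reproduced here, and the choice to perturb each displayed sum independently by a small offset, as well as the function names, are our assumptions.

```python
import random


def corrupted_history(true_sums, error_rate, seed=0):
    """Return the running sums to *show* the model, where each displayed sum
    is replaced by a wrong value with probability `error_rate`.
    Assumption: errors are injected independently per turn."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    rng = random.Random(seed)
    shown = []
    for value in true_sums:
        if rng.random() < error_rate:
            shown.append(value + rng.choice([-3, -2, -1, 1, 2, 3]))  # never a zero offset
        else:
            shown.append(value)
    return shown


def observed_error_rate(shown, true_sums):
    """Share of displayed sums that differ from the ground truth."""
    return sum(s != t for s, t in zip(shown, true_sums)) / len(true_sums)


def render_history(turn_keys, shown_sums):
    """Format the fabricated prior turns as plain text for the prompt."""
    return "\n".join(f"Keys: {' '.join(keys)} | Sum: {total}"
                     for keys, total in zip(turn_keys, shown_sums))
```

The experiment then holds the target turn fixed, sweeps `error_rate` from zero upward, and compares the model's accuracy on that turn across the different fabricated histories.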

This is the most business-relevant mechanism in the paper. A bad agent trajectory is not just a sequence with one unlucky failure. It can become a feedback loop. The model observes its own flawed state, treats that state as part of the task context, and makes future errors more likely.
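
A toy simulation shows why a feedback loop is worse than independent slips. To be clear about what is assumed: the linear rule below (error probability rises with the share of earlier turns that were wrong) and the parameter values are ours, chosen only to illustrate the shape of the effect. They are not the paper's model and are not fitted to its data.

```python
import random


def simulate_run(num_turns, base_error, feedback, rng):
    """Return one run as a list of booleans (True = turn correct).
    Assumed rule: P(error) = base_error + feedback * (share of past turns wrong)."""
    outcomes, errors = [], 0
    for turn in range(num_turns):
        past_error_share = errors / turn if turn else 0.0
        p_error = min(1.0, base_error + feedback * past_error_share)
        correct = rng.random() >= p_error
        outcomes.append(correct)
        errors += not correct
    return outcomes


def turn_accuracy_curve(num_runs, num_turns, base_error, feedback, seed=0):
    """Average correctness at each turn position across many simulated runs."""
    rng = random.Random(seed)
    totals = [0] * num_turns
    for _ in range(num_runs):
        for i, ok in enumerate(simulate_run(num_turns, base_error, feedback, rng)):
            totals[i] += ok
    return [t / num_runs for t in totals]


if __name__ == "__main__":
    flat = turn_accuracy_curve(2000, 100, base_error=0.02, feedback=0.0)
    loop = turn_accuracy_curve(2000, 100, base_error=0.02, feedback=0.5)
    print("late-turn accuracy, no feedback:  ", sum(flat[-20:]) / 20)
    print("late-turn accuracy, with feedback:", sum(loop[-20:]) / 20)
```

With `feedback=0.0` the per-turn accuracy stays flat and only the cumulative success decays. With positive feedback the per-turn accuracy itself drifts down over the run, which is the qualitative pattern the paper reports.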

Humans also anchor on prior mistakes, of course. The difference is that we do not usually market ourselves as stateless reasoning engines with 128k-token windows.

Bigger models clean up long context, but not their own contaminated history

The paper then makes a useful distinction. Scaling model size appears to reduce the long-context part of the problem. Frontier non-thinking models such as Kimi-K2, DeepSeek-V3, and Qwen3-235B perform near-perfectly at turn 100 when the prior history is error-free.

But scaling does not eliminate self-conditioning. When the history contains induced errors, even those larger models degrade.

That distinction should change how teams diagnose failed agent runs. If the problem is long-context degradation, upgrading the model or reducing irrelevant context may help. If the problem is self-conditioning, a larger model may simply become a more fluent participant in its own bad transcript.

The authors’ interpretation is plausible: in-context learning and self-conditioning are cousins. A model is trained to continue from context. Correct demonstrations help; contaminated histories hurt. The same sensitivity that lets models adapt to examples can also make them adapt to their own mistakes. Useful feature, meet your evil twin.

Thinking helps because execution needs sequential computation, not just more samples

The paper’s next finding is that “thinking” changes the picture.

With Qwen3 thinking models, accuracy at turn 100 remains stable even when the prior history contains wrong answers. The authors suggest two possible reasons. Reinforcement learning may make the model more oriented toward task success rather than merely continuing the transcript. Also, thinking models may re-derive the correct state instead of trusting earlier outputs.

This matters because the paper also tests single-turn execution length: how many retrieve-then-compose steps a model can perform within a single reply. Without chain-of-thought or thinking, even large non-thinking models struggle to chain more than a handful of steps in one turn. With sequential test-time compute, execution length improves sharply.

The frontier benchmark is striking. In the authors’ single-turn task, GPT-5 thinking, codenamed “Horizon” in the paper, reaches 2,176 steps. Claude-4 Sonnet reaches 432, Grok 4 reaches 384, and Gemini 2.5 Pro reaches 120. Those numbers should not be read as universal agent rankings. They are task-specific, and frontier evaluations use fewer rollouts because of cost. But they show the benchmark can separate models that all look impressive in ordinary demos.

The more durable lesson is not “model X wins.” It is that long-horizon execution rewards sequential computation. The model needs time to work through the sequence, maintain state, and resist being dragged along by its transcript.

The appendix is not decoration; it tells you what not to overclaim

The appendices are useful because they stop the paper from becoming a one-trick benchmark. They test several alternative explanations and mitigation ideas.

Test or appendix result	Likely role	What it supports	What it does not prove
Decomposing retrieval-only, addition-only, and prefix-sum tasks	Diagnostic ablation	Errors come from composing retrieval with state tracking, not from isolated lookup or addition	That all business workflows fail for the same reason
Manipulated error histories	Causal-style ablation	Past model errors in context increase future error probability	That every agent failure is self-conditioning
Qwen3 thinking models under induced error histories	Main mechanism test	Thinking can break the observed self-conditioning loop in this setup	That any “think step by step” prompt solves production reliability
Self-verification prompting	Mitigation test	Prompted checking gives mixed results and can bloat context or create new errors	That verification is useless when external tools or ground truth are available
Sliding context window	Mitigation and sensitivity test	Reducing exposure to error-laden history can improve horizon	That sliding windows work for tasks with long-range dependencies
Majority voting / parallel sampling	Comparison with prior test-time compute strategies	Parallel sampling gives only marginal gains compared with sequential thinking here	That majority voting is weak in all domains
Matrix multiplication task	Robustness test for single-turn sequentiality	Thinking still helps on a more inherently sequential task	That the original addition task is future-proof
Manual analysis of GAIA, ALFWorld, and WebShop failures	Exploratory extension	Self-conditioning-like patterns appear in realistic agent failures	Precise quantitative prevalence, because annotations are subjective

This evidence map matters for practitioners because the obvious fixes are not automatically good fixes.

Self-verification sounds attractive: ask the model to re-check its previous state before each step. The paper finds mixed results. For Gemma3 with chain-of-thought, it helps early but consumes more tokens and collapses sooner as context fills. For Qwen3 thinking models, it gives negligible improvement and can induce overthinking, including arithmetic errors during the verification process itself. There is a grim elegance to an agent failing while checking whether it failed.

Context engineering performs better in the controlled setup. A sliding context window reduces exposure to error-laden history and improves performance. But the authors are careful: their task is Markovian, meaning the next correct state can be represented compactly. Real workflows often have long-range dependencies. You cannot always throw away the past and expect the future to behave.
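
For the running-sum task, a sliding window amounts to rebuilding the prompt each turn from the dictionary plus only the most recent turns. The sketch below assumes a plain-text prompt layout and a `build_context` helper of our own invention; the paper's prompt format is not reproduced here.

```python
def build_context(dictionary, history, new_keys, window=1):
    """Build the next prompt from the dictionary and only the last `window`
    turns. `history` is a list of (keys, reported_sum) pairs.
    Works here because the task is Markovian: the latest running sum is all
    the state the next step needs."""
    if window < 1:
        raise ValueError("window must be at least 1 so the latest sum survives")
    lines = ["Dictionary: " + ", ".join(f"{k}={v}"
                                        for k, v in sorted(dictionary.items()))]
    for keys, reported_sum in history[-window:]:
        lines.append(f"Keys: {' '.join(keys)} | Sum: {reported_sum}")
    lines.append(f"Keys: {' '.join(new_keys)} | Sum:")
    return "\n".join(lines)


# Hypothetical values, for illustration only.
table = {"apple": 3, "brick": -1, "chair": 7}
past = [(["apple"], 3), (["brick"], 2), (["chair"], 9)]
print(build_context(table, past, ["apple"], window=1))
```

The window hides older mistakes from the model, but it does not repair them: if the latest reported sum is already wrong, the error is carried forward. That is one reason the same trick cannot be assumed to work when a task needs facts from far back in the run.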

The practical lesson is therefore not “use a sliding window.” It is “do not let the model’s free-text history become the system of record.”

The enterprise KPI is run length to failure

For business users, the paper shifts evaluation from answer quality to execution durability.

A support triage agent, claims-processing agent, KYC onboarding assistant, finance reconciliation bot, or procurement workflow agent does not create value by answering one question nicely. It creates value by completing a run: gather information, update state, call tools, check constraints, produce an action, and leave behind an auditable trail.

That suggests a different evaluation stack.

Operational question	Better metric	Why this paper points there
Can the model answer this task?	Step accuracy	Necessary but insufficient
How long can it run without intervention?	Horizon length at a chosen success threshold	Captures compounding reliability
Does it degrade after its own mistakes?	Error-conditioned turn accuracy	Detects self-conditioning
Does context help or poison the run?	Accuracy under clean, noisy, and compressed histories	Separates long-context load from transcript contamination
Should we pay for thinking tokens?	Accuracy per unit cost across execution horizon	Measures whether sequential compute buys real reliability
Where should tools take over?	Failure rate by operation type	Reveals whether retrieval, arithmetic, state update, or composition is cracking

This is the part many AI roadmaps skip. They benchmark the model, then design the workflow. For long-horizon automation, the order should often be reversed: define the execution horizon required by the workflow, then benchmark model and architecture choices against that horizon.
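
Three of the metrics in the table above can be computed directly from per-turn correctness logs. The sketch below assumes each run is logged as a list of booleans (one per turn, true when the turn was correct); that log format and the function names are our assumptions, not a format the paper defines.

```python
def survival_curve(runs):
    """Fraction of runs still error-free after each turn."""
    num_turns = min(len(run) for run in runs)
    alive = [True] * len(runs)
    curve = []
    for t in range(num_turns):
        alive = [a and run[t] for a, run in zip(alive, runs)]
        curve.append(sum(alive) / len(runs))
    return curve


def horizon_at_threshold(curve, threshold=0.5):
    """Largest number of turns for which the error-free fraction stays
    at or above `threshold`."""
    horizon = 0
    for turn, fraction in enumerate(curve, start=1):
        if fraction < threshold:
            break
        horizon = turn
    return horizon


def error_conditioned_accuracy(runs):
    """Turn accuracy split by whether an earlier turn in the same run was wrong."""
    clean, dirty = [0, 0], [0, 0]          # [correct, total]
    for run in runs:
        seen_error = False
        for ok in run:
            bucket = dirty if seen_error else clean
            bucket[0] += ok
            bucket[1] += 1
            seen_error = seen_error or not ok
    rate = lambda b: b[0] / b[1] if b[1] else None
    return {"clean_history": rate(clean), "after_error": rate(dirty)}
```

A large gap between `clean_history` and `after_error` is a hint of self-conditioning, though in real workflows it can also reflect errors that simply make later steps harder, so it is a screening metric rather than a diagnosis.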

A model with slightly lower one-shot accuracy but better long-run stability may be the better production choice. A slower thinking model may be cheaper if it prevents human rework. A compact state ledger may outperform a larger context window. Boring architecture, as usual, defeats heroic prompting. Tragic for the prompt poets, helpful for everyone else.

The architecture implication: externalise state before the transcript becomes folklore

The paper does not directly prescribe enterprise architecture, but the inference is clear.

If self-conditioning arises because models consume their own flawed prior outputs, then production systems should minimise the amount of untrusted model-generated history used as future context. That means externalising canonical state.

A more robust agent loop looks less like:

Let the model read the whole conversation.
Ask it what happened.
Hope it remembers correctly.
Continue.

And more like:

Store canonical state in a database, ledger, workflow engine, or typed object.
Use tools to update state through constrained operations.
Pass the model only the minimal trusted state needed for the next decision.
Treat model free-text as explanation, not authority.
Reconcile against external checks when state changes matter.

This does not eliminate model error. It changes the blast radius. The transcript can still be messy, but it is no longer the sole memory substrate. The model may narrate; the system records.
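
The second loop can be sketched with a small typed ledger. Everything here is illustrative: `Ledger`, `run_workflow`, and the `echo_model` stub are hypothetical names, the stub stands in for a real LLM call, and a production system would back the ledger with a database or workflow engine rather than an in-memory object.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Ledger:
    """Canonical state. Only the constrained operation below may change it."""
    total: int = 0
    entries: list = field(default_factory=list)   # auditable record of every update
    notes: list = field(default_factory=list)     # model free text: explanation only

    def add_value(self, key: str, table: dict) -> None:
        if key not in table:
            raise KeyError(f"unknown key: {key!r}")
        self.total += table[key]
        self.entries.append({"key": key, "value": table[key], "total": self.total})

    def reconcile(self) -> bool:
        """External check: recompute the total from the recorded entries."""
        return self.total == sum(entry["value"] for entry in self.entries)


def run_workflow(steps, table, model: Callable[[dict], dict]) -> Ledger:
    ledger = Ledger()
    for pending_keys in steps:
        # The model sees minimal trusted state, never its own earlier free text.
        view = {"pending_keys": list(pending_keys), "current_total": ledger.total}
        decision = model(view)
        for key in decision["keys_to_apply"]:
            ledger.add_value(key, table)          # the tool, not the model, updates state
        ledger.notes.append(decision.get("note", ""))
        if not ledger.reconcile():
            raise RuntimeError("ledger failed reconciliation")
    return ledger


def echo_model(view: dict) -> dict:
    """Stand-in for an LLM call: applies exactly the keys it was given."""
    return {"keys_to_apply": view["pending_keys"], "note": "applied pending keys"}
```

The design choice is that the model only proposes which operations to run. Lookup, arithmetic, and the record of what happened all live outside the transcript, so a wrong note cannot silently become the next turn's starting state.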

That distinction is especially important in regulated workflows. A long context window is not an audit trail. It is a large bag of tokens with confidence issues.

What the paper directly shows, and what Cognaptus infers

The paper directly shows that, in a controlled retrieve-then-compose task, LLMs can fail at long-horizon execution even when the plan and knowledge are provided. It shows that small single-step gains can imply large horizon gains under a simple compounding model. It shows that model scale improves execution horizon in the tested families. It identifies self-conditioning as a degradation mechanism beyond long-context effects. It shows thinking and sequential test-time compute can substantially improve long-horizon execution in this setup.

Cognaptus infers that enterprise agent evaluation should include run length to failure, not just task-level prompt accuracy. We also infer that context management, state externalisation, and selective thinking modes are likely to matter more than marginal prompt rewrites when workflows require many consecutive correct updates.

What remains uncertain is the size of these effects in real deployments. Real workflows include ambiguous instructions, recoverable mistakes, tool failures, policy constraints, multiple valid plans, human interruptions, changing data, and cost-latency limits. Some tasks also allow self-correction; a wrong intermediate step may not be fatal if the system can detect and reverse it. The paper’s current task accuracy metric intentionally treats mistakes harshly, which is right for isolating execution but not universal for operations.

So the responsible conclusion is not that this synthetic benchmark is the new enterprise oracle. It is that it reveals a failure mode many teams currently do not measure.

The bottom line: agents need execution design

The paper’s title argues that diminishing returns may be partly an illusion. That is true, but the more useful lesson is operational.

Small accuracy gains matter when work is sequential. Execution is a capability distinct from planning and knowledge. Model history can become contaminated. Bigger models help with some degradation but do not automatically fix self-conditioning. Thinking helps because long execution needs sequential computation. Context engineering helps when it prevents the model from marinating in its own mistakes.

For enterprises, the question is not “Can the model do the step?”

The question is: how many steps can it do before the system becomes expensive again?

That is where long-horizon execution becomes a business metric. Not glamorous. Not demo-friendly. Very much the point.

Cognaptus: Automate the Present, Incubate the Future.

¹ Akshit Sinha, Arvindh Arun, Shashwat Goel, Steffen Staab, and Jonas Geiping, “The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs,” arXiv:2509.09677, https://arxiv.org/abs/2509.09677.
