Lost in the Long Game: What UltraHorizon Reveals About Agent Failure at Scale

Budget is the most comforting word in enterprise AI.

Give the agent a bigger context window. Give it more tool calls. Give it more time. Give it a notebook, a browser, a Python interpreter, a reminder to “think step by step,” and perhaps a small motivational speech about being thorough. Surely the system will become more reliable.

UltraHorizon is a useful little insult to that assumption.¹ The benchmark does not ask whether an agent can answer one hard question, retrieve one hidden fact, or survive a tidy five-turn workflow. It asks whether the agent can keep investigating when the rules are hidden, the environment is only partially observable, and success depends on building and revising hypotheses over many interactions.

That distinction matters. A short-horizon model can look competent because the task structure carries it. A long-horizon agent has to carry the structure itself. It must decide what to test, remember what it learned, avoid confounding its own experiments, update its theory, and know when confidence is earned rather than merely felt. This is where current agents begin to look less like tireless analysts and more like interns with infinite stationery.

The paper’s central message is not simply that agents score badly. That would be boring, and there are already enough benchmark leaderboards in the world performing that particular ritual. The more important finding is mechanical: long-horizon failure emerges from exploration drift, early assumption lock-in, memory decay, tool misuse, and poor calibration of when to continue versus stop. Bigger budgets help only when the agent has a process capable of using them. Otherwise the extra context becomes a larger attic in which to misplace the evidence.

UltraHorizon tests investigation, not answer production

UltraHorizon introduces three synthetic environments designed around hidden-rule discovery. The environments are not meant to mimic one business domain directly. They are stress tests for a shared operating pattern: investigate an opaque system, run experiments, accumulate evidence, and submit a mechanistic explanation.

Environment	What the agent must discover	Capability being stressed	Why it resembles business work
Mystery Grid	The hidden effects of letters A–E in a 10×10 grid, including dependencies on step count, position, visit count, and energy	Spatial exploration, controlled resets, memory of observations, rule induction	Operational diagnosis, process mining, incident investigation
Sequence Exploration	Five hidden transformation rules applied to paired letter sequences	Symbolic experimentation, hypothesis isolation, systematic variation	Data pipeline debugging, fraud-pattern discovery, code or workflow analysis
Alien Genetics Laboratory	Inheritance rules for triploid organisms, including dosage effects, dominance, and lethal combinations	Scientific reasoning, controlled crosses, long-term evidence management	R&D automation, biomedical analysis, structured investigative research

The useful design choice is that these are not ordinary puzzle tasks with all the facts laid out. The agent begins without the rules. It has to interact with the environment, use tools, write and consult notes, and eventually commit a final explanation.

That makes UltraHorizon closer to the kind of enterprise agent people keep promising: the one that does not merely answer a question, but works through a messy investigation. Investment research, compliance review, software maintenance, incident response, procurement anomaly detection, scientific literature exploration — all of these are partially observable, evidence-heavy, and vulnerable to early false theories. Not coincidentally, they are also places where “just let the agent run longer” sounds seductive in a procurement meeting.

The benchmark’s scale is deliberately uncomfortable. The authors report that standard configurations still exceed 35,000 tokens and more than 60 tool calls on average, while the heaviest setting reaches more than 200,000 tokens and more than 400 tool calls. This is not the usual toy interaction with a polite tool call or two. It is a long enough trace for the agent’s working process to become the object under examination.

The main result is a process gap, not merely a score gap

The headline evidence is straightforward: tested LLM agents underperform humans across UltraHorizon. The paper evaluates Gemini-2.5-Pro, GLM-4.5, DeepSeek-V3, Kimi-K2, and Qwen3-235b under fixed-step and free-step settings.

In the fixed-step setting, the agents receive a defined exploration budget: 50 steps for Mystery Grid and Sequence Exploration, and 25 for Alien Genetics Laboratory. Human participants score substantially higher. The appendix reports human averages of 25.88 for Mystery Grid, 24.29 for Sequence Exploration, and 47.50 for Genetics Laboratory, with an overall human average of 26.52 compared with the best LLM average of 14.33.

Those numbers are not best read as “humans are magical.” They say something more operationally useful: humans are better at maintaining an investigation state. They notice when an experiment is redundant. They abandon weak hypotheses faster. They are less likely to keep poking the same corner of the problem because an early pattern felt plausible.

The model results also vary by environment. Sequence Exploration is especially hard for all models, even though its world is deterministic. That is an important detail. The agents are not failing because the environment is noisy or because the rules randomly change behind their backs. They are failing because deterministic evidence still has to be gathered in a disciplined way. Determinism is not a cure for bad experiment design. Painful, yes, but apparently necessary.

The free-step setting adds a second lesson. When agents are allowed to decide when they have explored enough, performance does not simply improve. Gemini-2.5-Pro improves in Mystery Grid and Genetics Laboratory, suggesting that fixed budgets can be too restrictive. But Qwen3-235b drops sharply in Sequence Exploration, which the authors interpret as a sign of overconfidence or premature stopping. In other words, autonomy over the exploration budget is itself a capability. It is not a free feature bundled with tool access.

That point travels well into business deployments. An agent that decides when to stop investigating needs evidence thresholds, not vibes. Otherwise it will either quit early with a charmingly confident wrong answer or keep wandering through logs, documents, and APIs until the bill becomes the most accurate output.

The ablation shows horizon length becoming the bottleneck

The paper’s cleanest diagnostic result is the horizon-level ablation in Mystery Grid. The authors vary the number of hidden rules from one to five and normalize the score. The point is to separate “this individual rule is hard” from “sustaining the investigation across many hidden rules is hard.”

The result is ugly in a useful way. GLM-4.5’s normalized score drops from 34.4 at one hidden rule to 5.62 at five hidden rules. Average tool calls rise from 45.53 to 87.97. More interaction is happening, but the quality of inference is not keeping up.

That is the long-horizon problem in miniature. The difficulty does not scale linearly with the number of facts. It compounds through state management. Each new hidden rule increases the burden on the agent’s memory, experimental design, hypothesis tracking, and contradiction handling. By the time the agent is juggling several possible mechanisms, it is no longer just solving a rule. It is managing a research programme. Some agents appear to manage it about as well as a committee with no minutes.

This matters because many business workflows have the same compounding structure. A compliance agent may need to connect policy language, transaction records, user explanations, exception rules, historical precedents, and missing documents. A software agent may need to track dependency changes, failing tests, architectural constraints, and partial patches. A research agent may need to reconcile papers that use different methods, datasets, and definitions. The hard part is not one observation. It is keeping the whole inferential state alive without letting yesterday’s guess become today’s doctrine.

Simple scaling fails because the agent cannot digest the extra evidence

The scaling experiments test another tempting belief: if the agent fails, give it more steps.

The authors vary maximum exploration steps for GLM-4.5 across the three environments. Naive scaling does not reliably improve performance. Mystery Grid improves up to a point, peaking at 125 steps before dipping. Alien Genetics performs best at the shortest tested budget, then deteriorates before later partial recovery under the authors’ context-refresh strategy. Sequence Exploration stays weak across most naive budgets.

The paper’s proposed mitigation is Context Refresh with Notes Recall, or CRNR. When the interaction history approaches the context limit, prior dialogue is cleared except for the system prompt, and the agent is instructed to review its own notes. This is a lightweight context-management intervention: throw away the clutter, keep the notebook, and ask the agent to reconstruct the relevant state.

CRNR improves some scaling outcomes, especially where long context appears to become a liability. But the interesting business lesson is not “CRNR solves agents.” It does not. The stronger lesson is that context must be curated. Raw history is not memory. It is sediment. Sometimes useful, often heavy, and rarely self-organising.

Test in the paper	Likely purpose	What it supports	What it does not prove
Fixed-step and free-step model evaluations	Main evidence	Current agents underperform humans and struggle with autonomous exploration calibration	Exact production failure rates for enterprise agents
Trace statistics	Main evidence on scale	UltraHorizon creates genuinely long interaction traces with substantial tool use	That longer traces are automatically more productive
Horizon-level ablation	Ablation	Performance degrades as hidden-rule burden increases	That every domain will show the same slope
Step-scaling experiments	Sensitivity test	More steps can help, hurt, or waste resources depending on strategy	That budget size alone is a reliable improvement lever
CRNR scaling	Implementation-oriented intervention	External notes plus context refresh can reduce context overload	That note recall is sufficient for robust long-horizon autonomy
score@32 appendix results	Robustness-style extension	Best-case aggregation confirms broad hierarchy and strategy differences	That average single-run behaviour is fixed by cherry-picking best trials
Failure taxonomy and case studies	Exploratory diagnosis	Failures cluster around lock-in, memory, planning, tool use, and experimental control	A complete causal decomposition of all agent failures

The table matters because the paper contains several kinds of evidence. The main experiments establish the performance gap. The ablation sharpens the horizon argument. The scaling tests attack the “more budget” intuition. The appendix extends and diagnoses rather than replacing the main thesis. Treating all of this as one undifferentiated blob of “benchmark results” would be the traditional academic-summary mistake. We can do slightly better. On good days.

In-context locking is the paper’s most useful failure mechanism

The authors identify two root causes behind agent breakdown: in-context locking and foundational capability gaps.

In-context locking is the more distinctive concept. It describes the process by which agents become anchored to early patterns, assumptions, or habits, then continue exploring within that narrowed frame. The agent is still active. It may still call tools, write notes, and produce elaborate plans. But the search has lost strategic elasticity. It is moving, not learning.

The paper uses token entropy dynamics as one signal for this phenomenon. Across the three environments, GLM-4.5’s median token entropy tends to decline as sequences progress, with a late-stage uptick near final answer production. The authors interpret the decline as evidence that the agent’s behaviour narrows over time. That is plausible, though not definitive. Entropy is a proxy, not a mind-reading device. Still, it matches the trajectory case studies: agents repeat habits, converge too early, misuse tools after feedback, and fail to revise stale internal models.

The second root cause, foundational capability gaps, is broader. It covers weaknesses in logical reasoning, memory management, tool use, and planning. This is less elegant but no less important. Some failures are not caused by early lock-in. The agent simply lacks a reliable internal routine for controlled experimentation or world-model maintenance.

The concrete taxonomy in the paper lists eight failure manifestations:

Failure manifestation	Operational reading
Premature convergence	The agent decides too early that a weak hypothesis is “good enough.”
Repetitive looping	It repeats actions that no longer add information.
Error propagation	A mistake survives feedback and contaminates later steps.
Environment mis-modeling	Its internal model of the environment becomes inconsistent or outdated.
Misaligned tool usage	It chooses or calls tools in ways that do not fit the task.
Memory issues	It forgets constraints, observations, or its own plan.
Incoherent planning	Its sequence of actions becomes contradictory or poorly ordered.
Uncontrolled experiments	It changes too many variables at once and cannot interpret the result.

There is a small textual wrinkle: the introduction says “nine” recurring error patterns, while the concrete taxonomy and distribution show eight. That does not change the substance, but it is worth noting because precision is free and benchmark papers should spend more of it.

The most frequent category is premature convergence, at 23.2% in the reported failure distribution. Repetitive looping appears at 15.6%. Error propagation and environment mis-modeling each appear at 13.4%. Misaligned tool usage is 11.0%, memory issues and incoherent planning are 10.0% each, and uncontrolled experiments appear at 3.6%. The percentages should not be treated as universal rates. They are a diagnostic breakdown from this benchmark’s traces. But the pattern is recognisable: agents do not just make isolated wrong moves. They develop bad investigative posture.

The business lesson is to manage the investigation, not worship the context window

The naive enterprise response to this paper would be to ask which model won. That is the least interesting procurement question available, though admittedly not the least expensive.

The better question is: what operating controls would prevent these failure modes from silently accumulating inside a business workflow?

Agent failure mode	Business risk	Useful control
Premature convergence	False confidence in an incomplete review, weak investment thesis, or shallow root-cause analysis	Evidence thresholds before finalisation; forced alternative hypotheses
Repetitive looping	Token burn with little information gain	Information-gain tracking; duplicate-action detection
Memory issues	Lost constraints, repeated requests, inconsistent recommendations	External state store; structured notes; retrieval audits
Uncontrolled experiments	Confounded conclusions in research, testing, or analytics	Experiment templates; one-variable-change discipline
Misaligned tool usage	Wrong API calls, invalid queries, unnecessary computation	Tool-call validation; tool-use policy; error recovery rules
Environment mis-modeling	Persistent mismatch between observations and assumed process	Periodic world-model review; contradiction logs
Error propagation	Early mistake becomes embedded in final output	Checkpoints, backtracking, and explicit correction routines
Incoherent planning	Workflows become busy but directionless	Plan review gates; task decomposition; human escalation triggers

The paper directly shows that current agents struggle in synthetic long-horizon, partially observable rule-discovery environments. Cognaptus’ inference is that enterprise agents should be designed less like single super-prompts and more like managed investigative systems.

That means external memory should not be an afterthought. Notes need structure, timestamps, confidence levels, and links to evidence. Context refresh should be deliberate, not just what happens when the window fills and the oldest messages fall off the cliff. Tool calls should be evaluated for purpose, not merely counted. Exploration should have stopping criteria and continuation criteria. Human checkpoints should occur at moments where the cost of being wrong compounds: after hypothesis formation, before final commitment, and after contradictions emerge.

This is less glamorous than “fully autonomous agent.” It is also more likely to work. A long-horizon agent without investigative controls is not autonomous. It is unsupervised drift with JSON.

Long context is useful only when paired with compression and review

UltraHorizon also clarifies a common misunderstanding about long context. A large context window is not the same as long-horizon competence.

Long context gives the model access to more prior material. Long-horizon competence requires deciding which prior material matters, what it implies, what has been superseded, and which uncertainty remains unresolved. Those are different capabilities. Storing everything is not the same as knowing anything. Your email inbox has already demonstrated this theorem at planetary scale.

The CRNR result is therefore more interesting than it may first appear. Clearing accumulated dialogue while forcing note recall improved some outcomes because it reduced context clutter and made the agent reconstruct a more compact working state. This resembles what competent human analysts do naturally: they do not reread every raw observation before every decision. They maintain a working theory, update it when contradicted, and preserve enough evidence to audit the path.

For enterprise agent design, the implication is not “summarise more.” Bad summaries are just compressed mistakes. The implication is to separate raw trace, working memory, validated findings, unresolved hypotheses, and final claims. Each layer should have a different update rule.

A practical long-horizon agent architecture should therefore ask:

What facts have been observed?
Which facts are verified versus inferred?
Which hypotheses are active?
Which hypotheses were rejected, and why?
What evidence would change the current conclusion?
What tools have already been used, and what did they add?
Has the agent repeated an action without increasing information?
Is the final answer supported by controlled tests or just accumulated confidence?

UltraHorizon does not give a full architecture for this. But it makes the missing architecture visible.

Boundaries: this is a stress test, not a production incident report

The benchmark is synthetic. That is a strength and a limitation.

It is a strength because synthetic environments allow hidden rules, controlled variation, and clear scoring. The authors can vary horizon length, compare fixed and free exploration, and observe failure dynamics without the noise of messy real-world evaluation. If the question is “can agents sustain investigation under controlled hidden rules?”, synthetic design is appropriate.

It is a limitation because business environments contain open-ended objectives, shifting priorities, ambiguous documents, adversarial incentives, privacy constraints, and humans who write requirements as if punctuation were a scarce mineral. UltraHorizon does not tell us exactly how an agent will perform in a live compliance department, investment desk, or software team.

The scoring also uses LLM-as-judge, specifically DeepSeek-R1, with point-wise scoring rubrics. The rubrics are explicit and strict, which helps. But judge-model evaluation is still not the same as independent formal verification. The failure classification combines manual trajectory reading with Gemini-based classification. Useful, yes. Final metaphysics of agent failure, no.

The human comparison is informative but bounded: 33 participants across the environments, not a universal human baseline. The model lineup is also time-specific. Better models may improve scores. Better scaffolding may improve them more. The paper’s own CRNR result suggests that system design can matter substantially.

So the correct interpretation is not “agents cannot do long-horizon work.” It is narrower and more useful: present agents, under these benchmark conditions, struggle to sustain disciplined exploration, and naive increases in context or step budget do not reliably solve the problem. That is enough to change how serious deployments should be designed.

The future agent stack will look more like an operating procedure than a prompt

UltraHorizon is valuable because it shifts attention from answer quality to investigation quality. That is where many enterprise agent failures will live.

The next generation of useful agents will not be defined only by model size, context length, or tool count. Those are inputs. The durable advantage will come from process architecture: memory that distinguishes raw evidence from validated findings, tools governed by purpose and feedback, hypothesis management that resists early lock-in, and escalation points that know when autonomy has become theatre.

This is also where buyers should become more demanding. A vendor demo that shows an agent completing a polished workflow is not enough. Ask how it detects redundant exploration. Ask how it handles contradicted hypotheses. Ask what happens when the context window fills. Ask whether the agent can explain which evidence changed its conclusion. Ask how often it repeats invalid tool calls after receiving an error. Watch carefully. The silence after those questions may be the most accurate benchmark in the room.

The long game is not won by giving the agent a longer leash. It is won by teaching the system how not to trip over it.

Cognaptus: Automate the Present, Incubate the Future.

Haotian Luo et al., “UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios,” arXiv:2509.21766, 2025, https://arxiv.org/pdf/2509.21766. ↩︎

UltraHorizon tests investigation, not answer production#

The main result is a process gap, not merely a score gap#

The ablation shows horizon length becoming the bottleneck#

Simple scaling fails because the agent cannot digest the extra evidence#

In-context locking is the paper’s most useful failure mechanism#

The business lesson is to manage the investigation, not worship the context window#

Long context is useful only when paired with compression and review#

Boundaries: this is a stress test, not a production incident report#

The future agent stack will look more like an operating procedure than a prompt#