Long-Horizon Reasoning

The Chain of Thought Needs a Chain of Custody

TL;DR for operators: Two new papers point to the same operational lesson from different sides: long reasoning becomes useful only when its intermediate steps are made explicit, scoped, and checkable. HIPIF tackles the training side of long-horizon agents: it teaches an LLM agent to break tasks into subgoals, fold completed progress into compact memory, reflect on whether a subgoal is done, and use local process rewards to reduce repeated or ungrounded behavior [1]. Mask-Proof tackles the evaluation side: it turns research-level mathematical proofs into masked-step tasks where a model must reconstruct a critical formula from self-contained context, then uses a semantic-equivalence judge with repeated voting to grade the result [2]. ...
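To make "a judge with repeated voting" concrete, here is a minimal sketch of the general idea, not Mask-Proof's actual grader. The function name `majority_vote_grade`, the `judge` callable, and the default of five votes are all assumptions for illustration; in practice the judge would be an LLM call asked whether two formulas are semantically equivalent, and the toy judge below only compares strings after stripping whitespace.

```python
from typing import Callable, Tuple

# Hypothetical judge signature: (reference, candidate) -> True if judged equivalent.
Judge = Callable[[str, str], bool]


def majority_vote_grade(
    judge: Judge, reference: str, candidate: str, n_votes: int = 5
) -> Tuple[bool, float]:
    """Ask the judge n_votes times and accept on a strict majority.

    Returns (accepted, agreement), where agreement is the share of "yes" votes.
    An odd n_votes is required so that ties cannot occur.
    """
    if n_votes < 1 or n_votes % 2 == 0:
        raise ValueError("n_votes must be a positive odd number")
    yes_votes = sum(bool(judge(reference, candidate)) for _ in range(n_votes))
    return yes_votes > n_votes // 2, yes_votes / n_votes


def toy_judge(reference: str, candidate: str) -> bool:
    """Stand-in for a real semantic-equivalence judge (assumption):
    treats two formulas as equal if they match after removing whitespace."""
    def squash(text: str) -> str:
        return "".join(text.split())
    return squash(reference) == squash(candidate)


if __name__ == "__main__":
    accepted, agreement = majority_vote_grade(toy_judge, "a + b", "a+b", n_votes=3)
    print(accepted, agreement)
```

Repeated voting only helps when the judge is noisy, as a sampled LLM judge usually is; with a deterministic judge like the toy one, every vote is identical and the agreement share is always 0 or 1.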

Infinite Tasks, Finite Minds: Why Agents Keep Forgetting—and How InfiAgent Cheats Time

A report is not finished because the model “understands” the assignment. It is finished because the system still knows, two hundred actions later, which documents were read, which notes were trustworthy, which sections remain unfinished, and which half-baked intermediate answer should not accidentally become the final one. That is the boring part of agentic AI. Naturally, it is also the part most systems quietly fail at. ...

Prints Charming: How Reward Models Finally Got Serious About Long-Horizon Reasoning

Search looks simple until it becomes a workflow. A human analyst can open ten tabs, notice which source contradicts which, remember that one earlier search result changed the meaning of the question, and decide whether the next move should be another search, a calculation, or a final answer. An LLM agent can also open tabs, call tools, browse pages, run code, and produce a final answer. The difference is that the agent often does all of this with the discipline of a caffeinated intern who has been told that “more context” is the same thing as “better memory.” ...

Forget Me Not: How IterResearch Rebuilt Long-Horizon Thinking for AI Agents

A research workflow usually starts clean. The first search is sensible. The first source is relevant. The first reasoning step looks promising. Then the agent opens five webpages, follows a few tangents, remembers an early mistake too faithfully, and keeps dragging the whole mess forward like a consultant who refuses to delete old slides. By the time the problem actually becomes difficult, the model is no longer short of information. It is drowning in it. ...

Recursive Minds: How ReCAP Turns LLMs into Self-Correcting Planners

A stuck workflow rarely looks intelligent. It looks like a support agent asking for the same invoice twice, a coding agent editing the wrong file for the third time, or an operations bot patiently repeating an invalid action because, apparently, persistence is cheaper than understanding. This is the unglamorous failure mode of many LLM agents. They do not collapse because they cannot produce a plan. They collapse because the plan becomes stale, buried, or locally contradicted by new observations. The agent remembers the latest step and forgets the job. ...

Lost in the Long Game: What UltraHorizon Reveals About Agent Failure at Scale

Budget is the most comforting word in enterprise AI. Give the agent a bigger context window. Give it more tool calls. Give it more time. Give it a notebook, a browser, a Python interpreter, a reminder to “think step by step,” and perhaps a small motivational speech about being thorough. Surely the system will become more reliable. ...