Forget Me Not: How IterResearch Rebuilt Long-Horizon Thinking for AI Agents

A research workflow usually starts clean.

The first search is sensible. The first source is relevant. The first reasoning step looks promising. Then the agent opens five webpages, follows a few tangents, remembers an early mistake too faithfully, and keeps dragging the whole mess forward like a consultant who refuses to delete old slides. By the time the problem actually becomes difficult, the model is no longer short of information. It is drowning in it.

That is the central problem behind IterResearch, a paper from researchers at Renmin University of China, Tongyi Lab at Alibaba Group, and OpenRLHF.¹ The work is not really about giving AI agents more memory. It is about teaching them to stop treating memory as a landfill.

The paper’s target is “deep research” agents: systems that search, browse, compute, reason, and synthesize over many rounds. These agents are increasingly expected to handle due diligence, technical research, market intelligence, literature review, policy analysis, and other workflows where the answer is not waiting politely in the first document. The obvious engineering instinct is to keep everything: every query, every page, every tool response, every intermediate thought. After all, forgetting sounds risky.

IterResearch argues the opposite. In long-horizon work, remembering everything becomes a failure mode. The agent needs a bounded workspace, a disciplined report, and a way to rebuild its state after each interaction. In business language: the agent needs a research brief, not a transcript dump. A small mercy for anyone who has ever read a 200-page meeting log masquerading as insight.

The real enemy is context accumulation, not task length

The paper names the dominant approach the “mono-contextual paradigm.” The pattern is simple: append all previous reasoning and all retrieved information into one growing context window. This feels natural because chat systems are built around conversational accumulation. More conversation means more context; more context should mean better continuity.

That intuition fails when the task requires many tool interactions.

IterResearch identifies two linked breakdowns. The first is context suffocation. As previous interactions fill the context window, the space available for fresh reasoning shrinks. The agent may still have a long window in theory, but its effective working room is increasingly consumed by old material. The second is noise contamination. Bad leads, irrelevant pages, and early mistaken assumptions do not disappear. They remain in the context and can interfere with later reasoning.

The misconception worth killing early is that longer context windows automatically solve long-horizon agent work. They help, but only up to a point. A larger bin is still a bin. The question is not only how much information can be carried. It is whether the agent can separate durable findings from temporary clutter.

This distinction matters because the long-horizon problem is not simply “the task requires 200 steps.” It is “the task requires 200 steps while preserving enough cognitive freshness to make step 199 useful.” That is a different engineering problem.

IterResearch rebuilds the workspace instead of extending it

IterResearch’s main contribution is architectural. Rather than appending everything into one ever-expanding state, it reconstructs the agent’s workspace at each round.

Each round contains three core elements:

Workspace element	Role in the agent	Business translation
Question	The stable objective	The research mandate
Evolving report	Compressed memory of validated findings	The live analyst brief
Immediate context	The latest action and tool response	The newest evidence to process

The agent generates a structured decision with three parts: thinking, report update, and action. The action may be another tool call or a final answer. After the tool returns, the workspace is rebuilt from the original question, the updated report, and the latest interaction. The historical transcript is not carried forward wholesale.

That small design choice is the paper’s centre of gravity.

The evolving report is not a decorative summary. It is the agent’s memory. It must preserve what remains decision-relevant and discard what no longer helps. The immediate context lets the agent react to the latest evidence. The original question keeps the work anchored. Everything else has to earn its place through synthesis.
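The round structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the names `Workspace`, `Decision`, `run`, `toy_policy`, and `toy_tool` are all hypothetical, and the toy policy stands in for a trained model.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Workspace:
    question: str                 # stable objective, never rewritten
    report: str                   # evolving report: the compressed memory
    last_action: Optional[str]    # most recent tool call, if any
    last_response: Optional[str]  # most recent tool response, if any


@dataclass(frozen=True)
class Decision:
    thinking: str
    report: str                   # full updated report written this round
    action: str                   # a tool call, or the final answer text
    is_final: bool


def run(question: str,
        policy: Callable[[Workspace], Decision],
        tool: Callable[[str], str],
        max_rounds: int = 16) -> Tuple[str, int]:
    """Iterate until the policy emits a final answer or the budget runs out."""
    ws = Workspace(question, report="", last_action=None, last_response=None)
    for round_no in range(1, max_rounds + 1):
        decision = policy(ws)
        if decision.is_final:
            return decision.action, round_no
        response = tool(decision.action)
        # Rebuild, do not append: only the question, the new report,
        # and the latest exchange survive into the next round.
        ws = Workspace(question, decision.report, decision.action, response)
    return ws.report, max_rounds


def toy_policy(ws: Workspace) -> Decision:
    # Hypothetical stand-in for the model: gathers two facts, then answers.
    facts = [line for line in ws.report.splitlines() if line]
    if ws.last_response:
        facts.append(ws.last_response)
    report = "\n".join(facts)
    if len(facts) >= 2:
        return Decision("enough evidence", report, " + ".join(facts), True)
    return Decision("need more", report, f"search:{len(facts)}", False)


def toy_tool(action: str) -> str:
    return f"fact from {action}"


answer, rounds = run("toy question", toy_policy, toy_tool)
```

Note that the first tool response reaches later rounds only because the policy wrote it into the report. Anything the report omits is gone from the model's view, which is exactly why report quality matters so much.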

A simplified view:

Mono-contextual agent:
Question + all prior thoughts + all prior actions + all prior tool responses + new step

IterResearch:
Question + evolving report + latest action/tool response → new report + next action

The difference is not cosmetic. In the mono-contextual design, state size grows with the number of rounds. In IterResearch, the state is intended to stay bounded because the report compresses what matters. The paper’s appendix frames this as a computational complexity distinction: the mono-contextual context grows roughly in proportion to the number of rounds times the average tool response size, while IterResearch keeps a roughly constant workspace set by the report size plus the latest tool response.

This is the mechanism-first lesson. IterResearch is not impressive because it has a catchy memory trick. It is impressive because it changes the scaling behaviour of the research loop.
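A back-of-envelope sketch makes the scaling difference concrete. The token counts below are hypothetical round numbers chosen for illustration, not figures from the paper, and both functions are simplified models that assume every round produces the same amount of text.

```python
def mono_context_tokens(rounds: int, question: int, step: int, tool_response: int) -> int:
    """Tokens carried into round `rounds` when every earlier step is appended."""
    return question + (rounds - 1) * (step + tool_response)


def iter_workspace_tokens(question: int, report_cap: int, step: int, tool_response: int) -> int:
    """Tokens carried into any later round when only report + latest exchange survive."""
    return question + report_cap + step + tool_response


# Hypothetical sizes: 200-token question, 300-token thought/action,
# 2,000-token tool response, report capped at 4,000 tokens.
for rounds in (10, 100, 1000):
    mono = mono_context_tokens(rounds, 200, 300, 2000)
    bounded = iter_workspace_tokens(200, 4000, 300, 2000)
    print(f"round {rounds:>4}: mono={mono:>9,}  rebuilt={bounded:>6,}")
```

Under these assumed sizes, the appended context at round 100 is 227,900 tokens, while the rebuilt workspace stays at 6,500 regardless of round number. The bound holds only if the report itself is kept to its cap, which is a training outcome rather than a guarantee.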

Strategic forgetting is not amnesia

“Forget” is a dangerous word in AI systems. Businesses do not want agents that forget contractual terms, regulatory constraints, client requirements, or source provenance. IterResearch is not advocating careless deletion. It is closer to disciplined compression.

The agent does not discard the past randomly. It forces the past to pass through the evolving report. If an early source contains a useful fact, the report should preserve it. If an early page was irrelevant, the report should let it disappear. If an early hypothesis was wrong, the report can record that correction without carrying the entire path of confusion.

That is closer to how competent analysts work. They do not keep every browser tab open forever because “maybe it matters.” They maintain notes, revise conclusions, and separate live findings from dead ends. The paper turns that human workflow into an agent state design.

The risk, of course, is that compression can lose information. IterResearch’s answer is not a guaranteed theorem of perfect retention. It is end-to-end training: the model learns to update the report so that it preserves task-relevant information while filtering noise. This is important. The report is only as useful as the model’s ability to write it. A badly trained evolving report would be a beautiful filing cabinet full of missing documents.

EAPO rewards correct answers that arrive without wandering

The second contribution is Efficiency-Aware Policy Optimization, or EAPO. The core problem is that deep research tasks often have sparse rewards. You can judge whether the final answer is correct, but it is much harder to assign reliable credit to every intermediate search query or browsing decision.

A naive terminal reward treats two successful trajectories as equal: one that finds the answer in five focused steps and another that stumbles into it after twenty. In production, those are not equal. The second costs more, takes longer, and has more chances to absorb irrelevant noise. It is also exactly the kind of “autonomous research” behaviour that looks impressive in demos and expensive in invoices.

EAPO adapts geometric discounting so that successful shorter trajectories receive stronger learning signals than successful longer ones. Conceptually, the reward at an earlier step is shaped by how far it is from the terminal correct answer:

$$ r_t = \gamma^{T-t} R_T $$

Here, $t$ indexes the current round, $R_T$ is the final correctness reward, $T$ is the terminal step, and $\gamma$ is the discount factor. The paper reports using $\gamma = 0.995$ in its analysis, creating a pressure toward efficient exploration without adding a separate explicit length penalty.
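The formula is easy to evaluate directly. The sketch below applies it to two successful trajectories, one of 5 rounds and one of 20, taking $t$ as the round index from 1 to $T$ and a final reward of 1.0. The function name `discounted_rewards` is hypothetical, and this shows only the reward shaping, not the full policy-optimization objective.

```python
from typing import List


def discounted_rewards(num_rounds: int, final_reward: float, gamma: float = 0.995) -> List[float]:
    """Return r_t = gamma**(T - t) * R_T for t = 1..T, where T = num_rounds."""
    T = num_rounds
    return [gamma ** (T - t) * final_reward for t in range(1, T + 1)]


short = discounted_rewards(5, 1.0)    # answer found in 5 rounds
long = discounted_rewards(20, 1.0)    # same answer, found in 20 rounds

# The first round of the short trajectory is discounted 4 times,
# the first round of the long trajectory 19 times.
print(f"first-round reward, short: {short[0]:.4f}")
print(f"first-round reward, long:  {long[0]:.4f}")
print(f"mean reward, short: {sum(short) / len(short):.4f}")
print(f"mean reward, long:  {sum(long) / len(long):.4f}")
```

Two properties follow from the formula itself: the terminal round always receives the full reward, and a trajectory with a final reward of zero receives zero at every round, so the discount only differentiates among successful trajectories.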

The second EAPO detail is adaptive downsampling. IterResearch generates training samples at each round of a trajectory, not just one sample per completed trajectory. That creates variable sample counts across questions, which complicates distributed training. Adaptive downsampling trims the training corpus to fit data-parallel requirements while preserving more than 99% of the data, according to the paper.
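One plausible reading of that trimming step is to round the per-round sample count down to the nearest multiple of the data-parallel size, so every worker receives an equal share. The sketch below assumes that reading. The function name, the random selection of which samples to keep, and the seed are all illustrative assumptions rather than details taken from the paper.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def downsample_to_multiple(samples: Sequence[T], dp_size: int, seed: int = 0) -> List[T]:
    """Keep the largest subset whose size is a multiple of dp_size.

    Fewer than dp_size samples are ever dropped, so the loss is small
    whenever the corpus is much larger than the number of workers.
    """
    keep = (len(samples) // dp_size) * dp_size
    rng = random.Random(seed)           # fixed seed for reproducibility
    return rng.sample(list(samples), keep)


corpus = list(range(1003))              # 1,003 hypothetical round-level samples
trimmed = downsample_to_multiple(corpus, dp_size=8)
retained = len(trimmed) / len(corpus)
print(len(trimmed), f"{retained:.2%}")
```

With 1,003 samples and 8 workers, 3 samples are dropped and 1,000 remain, which is consistent with the retention above 99% that the paper reports for its own corpus.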

This is not the glamorous part of the work. It is the plumbing. But plumbing is why systems do not flood the basement.

The evidence is about architecture, not just scores

The paper evaluates IterResearch across six benchmarks: Humanity’s Last Exam, BrowseComp, BrowseComp-zh, GAIA, Xbench-DeepSearch, and SEAL-0. These cover multi-step tool use, web navigation, complex reasoning, long-horizon information seeking, and cross-lingual synthesis.

Before treating the numbers as a scoreboard, it is useful to separate what each experiment is trying to prove.

Evidence component	Likely purpose	What it supports	What it does not prove
Main benchmark table	Main evidence	IterResearch outperforms listed open-source agents across six benchmarks and approaches some proprietary systems	General production reliability across all enterprise workflows
EAPO / GSPO / SFT ablation	Ablation	Reinforcement learning improves over supervised fine-tuning; EAPO maintains accuracy while improving efficiency	That EAPO is the only possible reward design
Mono-Agent comparison	Ablation / paradigm test	Workspace reconstruction is materially stronger than mono-contextual training under controlled conditions	That every mono-contextual system fails in every setting
Cross-paradigm data transfer	Exploratory extension	IterResearch trajectories can improve a mono-contextual agent when added to training data	That mono-contextual architecture becomes equivalent to IterResearch
Interaction scaling to 2048 turns	Robustness / scaling test	Bounded workspace allows much longer exploration than conventional accumulation	That longer interaction budgets always help every task
IterResearch as prompting	Comparison with prior prompting style	The same iterative structure improves o3 and DeepSeek-V3.1 over ReAct on tested tasks	That prompting alone replaces trained agent architecture

This distinction matters. A weaker article would simply list the numbers and declare a revolution. A better reading asks what each test isolates. The strongest claim is not “IterResearch wins benchmarks.” It is that the architecture changes the failure mode of long-horizon agents by making the workspace bounded and reconstructive.

The benchmark results are strong, but their meaning is narrower than the headline

In the main results, IterResearch-30B-A3B reports the following accuracies: 28.8 on HLE, 37.3 on BrowseComp, 45.2 on BrowseComp-zh, 72.8 on GAIA, 71.0 on Xbench-DeepSearch, and 39.6 on SEAL-0. The authors report an average improvement of +14.5 percentage points over existing open-source agents across the six benchmarks.

That is a substantial result. It is also worth interpreting by task type.

On information-seeking benchmarks such as BrowseComp, BrowseComp-zh, and SEAL-0, the advantage is easy to explain through the mechanism. These tasks require repeated search, browsing, filtering, and synthesis. A mono-contextual agent accumulates more and more raw material. IterResearch keeps converting the process into an updated report. That directly targets the context bloat problem.

On complex reasoning benchmarks such as HLE, GAIA, and Xbench-DeepSearch, the paper’s interpretation is slightly different. The benefit is less about raw browsing volume and more about noise control. If irrelevant or erroneous information remains permanently in the context, later reasoning can be distorted. Iterative report reconstruction creates natural checkpoints where the agent can filter what survives.

The paper also compares against proprietary deep-research systems where public benchmark numbers are available. IterResearch surpasses OpenAI DeepResearch on HLE and BrowseComp-zh in the reported table, while achieving comparable results on BrowseComp and GAIA. That comparison is useful but should not be overread. The proprietary rows are incomplete across several benchmarks, and product systems differ in hidden tools, prompts, retrieval infrastructure, and evaluation settings. The fair business takeaway is not “open-source has beaten commercial deep research.” It is “a 30B-class trained agent with a better state design can narrow parts of the gap.”

That is still quite enough to be interesting. No need to add fireworks.

The ablations show that the workspace design carries most of the argument

The ablation table is where the paper becomes more convincing.

First, EAPO lifts the average score to 49.1, compared with 48.3 for IterResearch trained with standard GSPO and 45.5 for the supervised fine-tuned version. The accuracy gap between EAPO and GSPO is not dramatic, but the paper reports that EAPO reduces average interactions by 5.7% compared with GSPO while maintaining or improving accuracy. That is the right kind of efficiency gain: not “do less and get worse,” but “wander less and keep performance.”

Second, the paradigm ablation is more important. The paper compares against a mono-contextual agent using identical training data and external environment. IterResearch outperforms that Mono-Agent by an average of 12.6 percentage points across all six benchmarks. This is the cleaner test of the paper’s thesis: the workspace reconstruction mechanism is not just a formatting preference. It changes performance under controlled conditions.

Third, the cross-paradigm transfer result adds an interesting wrinkle. A Mono-Agent trained with IterResearch-generated trajectories improves from an average of 36.5 to 41.9, a +5.4 point gain. That suggests IterResearch is not merely producing different context management. It may also produce better exploration behaviour: better searches, better intermediate decisions, better paths through the evidence space.

Still, the transfer result is not the main thesis. It is a useful side signal. The primary story remains the architecture.
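The averages and gaps quoted in this section can be cross-checked against the per-benchmark scores reported earlier. The snippet below is only a consistency check on numbers already stated in this article; it adds no new data.

```python
# Per-benchmark accuracies for IterResearch-30B-A3B, as quoted in the main results.
scores = {
    "HLE": 28.8,
    "BrowseComp": 37.3,
    "BrowseComp-zh": 45.2,
    "GAIA": 72.8,
    "Xbench-DeepSearch": 71.0,
    "SEAL-0": 39.6,
}

eapo_avg = round(sum(scores.values()) / len(scores), 1)

# Averages quoted in the ablation discussion.
gspo_avg, sft_avg = 48.3, 45.5
mono_avg, mono_transfer_avg = 36.5, 41.9

gaps = {
    "EAPO vs GSPO": round(eapo_avg - gspo_avg, 1),
    "EAPO vs SFT": round(eapo_avg - sft_avg, 1),
    "IterResearch vs Mono-Agent": round(eapo_avg - mono_avg, 1),
    "Mono-Agent transfer gain": round(mono_transfer_avg - mono_avg, 1),
}
print(eapo_avg, gaps)
```

The six scores average to 49.1, and the differences come out to 0.8, 3.6, 12.6, and 5.4 points, matching the figures quoted above. Set side by side, the gap attributable to the paradigm is far larger than the gap attributable to the reward design.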

Interaction scaling is the paper’s sharpest business clue

The interaction scaling experiment is the most business-relevant part of the paper because it addresses a practical question: can an agent keep working when the task takes far longer than expected?

The authors test BrowseComp on a 200-question subset while increasing the maximum allowed turns from 2 to 2048. The figure labels show accuracy rising from 3.5% at 2 turns to 42.5% at 2048 turns, using a constant 40K-token workspace. The paper’s surrounding prose quotes a higher figure, but the abstract, conclusion, and plotted figure labels agree on the 3.5% to 42.5% range, which is the safer reading.

The second result is just as important: even with a 2048-turn budget, the agent uses only 80.1 turns on average. In other words, the budget is a ceiling, not a command. The agent is not simply consuming every possible interaction because it can. It appears to terminate when it has gathered enough information.

For enterprise use, that is the difference between autonomy and expensive flailing.

A long-horizon research agent deployed in legal discovery, procurement analysis, customer incident investigation, or technical due diligence will not know in advance how much searching is enough. Some questions need three sources. Some need thirty. Some need a careful chain through obscure documentation. A bounded workspace design allows the system to grant a high maximum budget without forcing each task to pay the full cost.

This is where IterResearch becomes operationally interesting. The practical value is not only higher benchmark accuracy. It is giving agents room to explore while making each round’s context cost more predictable.
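The “ceiling, not a command” idea can be shown with a toy model. The sketch assumes, purely for illustration, that each task has a fixed number of turns it needs, that the agent stops as soon as it has them, and that a task is solved only if the budget covers that need. The task sizes are invented, and the model ignores everything that makes real termination decisions hard.

```python
from typing import Sequence


def turns_used(turns_needed: int, max_turns: int) -> int:
    """An agent that stops when done uses what it needs, capped by the budget."""
    return min(turns_needed, max_turns)


def average_turns(needs: Sequence[int], max_turns: int) -> float:
    return sum(turns_used(n, max_turns) for n in needs) / len(needs)


def solved_fraction(needs: Sequence[int], max_turns: int) -> float:
    return sum(n <= max_turns for n in needs) / len(needs)


# Hypothetical workload: most tasks are short, a few are very long.
needs = [3, 8, 20, 60, 300]

for budget in (16, 2048):
    print(budget, average_turns(needs, budget), solved_fraction(needs, budget))
```

In this toy workload, raising the budget from 16 to 2048 lifts the solved fraction from 0.4 to 1.0 while average usage rises only from 11.8 to 78.2 turns. The long tail pays for its own depth, and the short tasks do not subsidize it.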

Prompting results suggest the pattern is portable

The paper also tests IterResearch as a prompting strategy without training. It compares the iterative report-and-reconstruct pattern against ReAct-style mono-contextual prompting using o3 and DeepSeek-V3.1.

The results are consistently positive across HLE, BrowseComp, BrowseComp-zh, and GAIA. On BrowseComp, the reported improvements are especially large: o3 improves from 49.7 to 62.4, a +12.7 point gain, while DeepSeek-V3.1 improves from 30.0 to 49.2, a +19.2 point gain.

This matters because it separates two layers of the contribution.

The trained IterResearch agent is one contribution. But the underlying cognitive structure—maintain a report, process the latest evidence, rebuild the workspace—also appears to help frontier models when used as a prompt pattern. That makes the paper relevant beyond teams training their own agents from scratch.

For businesses, this is the near-term pathway. Most firms will not immediately reproduce the full training pipeline, synthesize 110K trajectories, run supervised fine-tuning, and then perform reinforcement learning on selected medium-difficulty questions. They may, however, redesign agent prompts and orchestration layers around iterative report reconstruction.

That does not make prompting equivalent to training. It does make the architecture testable.
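As a starting point for such a test, the prompt can be rebuilt from scratch each round and the model’s reply parsed into its three parts. The template wording and the tag names `<think>`, `<report>`, `<action>`, and `<answer>` below are assumptions made for this sketch; they are not the paper’s exact prompt format.

```python
import re
from typing import Dict, Optional

TEMPLATE = """You are a research agent working in rounds.

Question:
{question}

Report so far:
{report}

Latest action and tool response:
{latest}

Reply with <think>...</think>, then <report>...</report> containing the full
updated report, then either <action>...</action> or <answer>...</answer>."""


def build_prompt(question: str, report: str,
                 last_action: Optional[str] = None,
                 last_response: Optional[str] = None) -> str:
    """Rebuild the prompt each round; no earlier transcript is included."""
    if last_action is None:
        latest = "(none yet)"
    else:
        latest = f"Action: {last_action}\nResponse: {last_response}"
    return TEMPLATE.format(question=question, report=report or "(empty)", latest=latest)


def parse_reply(text: str) -> Dict[str, Optional[str]]:
    """Extract the tagged sections; missing sections come back as None."""
    parsed: Dict[str, Optional[str]] = {}
    for tag in ("think", "report", "action", "answer"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        parsed[tag] = match.group(1).strip() if match else None
    return parsed


first_prompt = build_prompt("Who maintains project Y?", report="")
reply = "<think>start broad</think><report>No findings yet.</report><action>search: project Y maintainers</action>"
decision = parse_reply(reply)
```

The orchestration layer then runs the action, calls `build_prompt` again with `decision["report"]` and the new tool response, and repeats until an `<answer>` appears. Nothing else from the previous round is passed along.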

What Cognaptus would infer for enterprise agent design

The paper directly shows benchmark gains, ablation improvements, interaction scaling, and prompting benefits under the authors’ evaluation setup. The business interpretation is an inference, not a result the paper itself proves.

The inference is straightforward: enterprise research agents should treat the live context window as a working desk, not an archive.

A practical implementation would separate four layers:

Layer	Purpose	Why it matters
Raw evidence store	Preserve full tool outputs, URLs, documents, and intermediate logs outside the model context	Auditability and source recovery
Evolving report	Maintain the compressed working memory inside the model context	Focus and continuity
Immediate context	Present only the latest interaction for local reasoning	Responsiveness to new evidence
Governance layer	Track confidence, unresolved questions, source quality, and termination criteria	Reliability and reviewability

This distinction is crucial. IterResearch does not imply businesses should delete evidence. Regulated workflows still need logs, citations, provenance, and review trails. The point is that the model should not be forced to reason over the entire archive at every step.
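The four layers can be expressed as a small state object in which raw evidence is stored by ID and only the report and the latest exchange are rendered into the model context. Every class, field, and example value here is hypothetical. It is a sketch of the separation, not a reference design.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EvidenceStore:
    """Raw evidence layer: full tool outputs live here, outside the model context."""
    items: Dict[str, str] = field(default_factory=dict)

    def put(self, content: str) -> str:
        evidence_id = f"ev-{len(self.items) + 1}"
        self.items[evidence_id] = content
        return evidence_id

    def get(self, evidence_id: str) -> str:
        return self.items[evidence_id]


@dataclass
class Finding:
    claim: str
    evidence_ids: List[str]        # provenance pointers into the store


@dataclass
class Governance:
    open_questions: List[str] = field(default_factory=list)
    max_rounds: int = 32           # illustrative termination criterion


@dataclass
class AgentState:
    question: str
    store: EvidenceStore = field(default_factory=EvidenceStore)
    findings: List[Finding] = field(default_factory=list)   # evolving report
    latest: str = ""                                        # immediate context
    governance: Governance = field(default_factory=Governance)

    def context(self) -> str:
        """Only the question, the report, and the latest exchange reach the model."""
        report = "\n".join(
            f"- {f.claim} [{', '.join(f.evidence_ids)}]" for f in self.findings
        )
        return f"Question: {self.question}\nReport:\n{report}\nLatest:\n{self.latest}"


state = AgentState("Does the vendor contract allow early termination?")
eid = state.store.put("FULL TEXT OF A LONG CONTRACT PDF ...")
state.findings.append(Finding("Clause on termination located in section 12", [eid]))
state.latest = "search: termination clause -> 3 hits"
prompt_context = state.context()
```

The model sees the claim and its evidence ID, while a reviewer can still pull the full document from the store. Retention and reasoning context are decoupled, which is the point of the table above.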

For due diligence, that means the agent keeps a concise risk brief while the full document dump remains retrievable. For technical support, it means preserving the current diagnosis rather than dragging every failed hypothesis into each new reasoning step. For market intelligence, it means updating a living synthesis as new sources arrive. For legal or compliance work, it means separating evidence retention from active reasoning context.

That separation may sound obvious. In many agent prototypes, it is not. The prototype just appends. Then it appends again. Then everyone acts surprised when the agent becomes confused, verbose, and weirdly loyal to a bad early assumption. Machines, it turns out, can also suffer from meeting fatigue.

The limitations are practical, not fatal

The paper is strong, but it is not a blank cheque for enterprise deployment.

First, the results are benchmark-based. HLE, BrowseComp, BrowseComp-zh, GAIA, Xbench-DeepSearch, and SEAL-0 are useful stress tests, but they are not the same as messy internal company environments with permissions, stale documents, political incentives, and adversarial vendor PDFs. Real organisations are rarely as clean as benchmark suites. This will shock absolutely nobody who has opened a shared drive.

Second, the main trained system uses Qwen3-30B-A3B as the backbone, with trajectory synthesis involving Qwen3-235B-A22B. The architecture may generalise, and the prompting experiments support portability, but production behaviour will still depend on the base model, tool quality, orchestration, retrieval environment, and evaluation policy.

Third, answer correctness is evaluated partly through LLM-as-judge methods using Qwen3-235B-A22B. That is common in agent research, but it matters. LLM judges are useful, scalable, and imperfect. They can miss subtle errors, especially in domains where correctness depends on precise legal, medical, financial, or scientific interpretation.

Fourth, the token budget table excludes tool-response tokens. The paper reports average generation tokens for the agent’s own thinking and report steps: about 31K for HLE, 376K for BrowseComp, 81K for BrowseComp-zh, 33K for GAIA, 31K for Xbench-DeepSearch, and 28K for SEAL-0. That helps isolate internal reasoning cost, but it does not represent total end-to-end runtime or API expense in a deployed web-research system. Tool calls, page extraction, search APIs, and storage still count in real budgets, even if they politely stand outside the reported column.

Fifth, strategic forgetting introduces a new quality-control problem: report faithfulness. If the evolving report drops a crucial caveat, overcompresses a source, or carries forward a mistaken conclusion, later rounds inherit that failure. IterResearch reduces raw context noise, but it increases the importance of summary quality. The cure for drowning is not dehydration.

These boundaries do not weaken the paper’s core idea. They define where the idea has to be engineered carefully.

The future agent stack will look less like chat and more like research operations

IterResearch points toward a broader shift in agent design. The first generation of tool-using agents looked like chatbots with plugins. The next generation will look more like workflow systems with explicit state management.

That matters because enterprise AI problems are usually not solved by a single clever answer. They are solved through managed investigation: define the question, gather evidence, update the brief, identify uncertainty, decide whether more search is needed, and produce a conclusion with traceable support.

IterResearch formalises that rhythm. It gives the agent a stable objective, a compressed evolving memory, and a bounded immediate workspace. It trains the agent not only to be correct, but to avoid unnecessary wandering. It demonstrates that the same structure can improve prompting performance on frontier models. And it shows that long interaction budgets become more useful when the agent is not forced to carry its entire past in active context.

The strategic lesson is clean: long-horizon reasoning is not won by hoarding tokens. It is won by controlling what gets promoted from evidence to memory.

For businesses building research agents, the paper’s message is refreshingly unsentimental. Keep the archive. Keep the audit trail. Keep the sources. But do not make the model reread the entire attic every time it needs to think.

Sometimes intelligence is knowing what to remember.

Sometimes it is knowing what to leave out of the next prompt.

Cognaptus: Automate the Present, Incubate the Future.

¹ Guoxin Chen et al., “IterResearch: Rethinking Long-Horizon Agents with Interaction Scaling,” arXiv:2511.07327, https://arxiv.org/abs/2511.07327.
