Opening — Why this matters now
The AI world has become obsessed with “long-horizon” reasoning—the ability for agents to sustain coherent thought over hundreds or even thousands of interactions. Yet most large language model (LLM) agents, despite their size, collapse under their own memory. The context window fills, noise piles up, and coherence suffocates. Alibaba’s IterResearch tackles this problem not by extending memory—but by redesigning it.
This work reframes AI research agents as strategic forgetters—entities that periodically summarize, compress, and rebuild their cognitive workspace. The result is a system that not only reasons longer, but reasons cleaner.
Background — The limits of infinite memory
Traditional deep-research agents, from OpenAI’s DeepResearch to Perplexity and Gemini, share a simple habit: they dump every retrieved fact and every reasoning step into one ever-expanding context window. The approach, called the mono-contextual paradigm, is intuitive—until it isn’t. The paper calls the consequences context suffocation and noise contamination: too much memory, too little clarity.
Alibaba’s team realized that long-horizon tasks—like complex scientific exploration or multi-step web reasoning—don’t just need more memory; they need better forgetting. The insight: treat reasoning like a Markov Decision Process (MDP), where each step depends only on the current “state,” not the full history.
Analysis — How IterResearch redefines reasoning
At the heart of IterResearch lies a Markovian workspace reconstruction. Instead of hoarding the entire conversation, the agent keeps only three elements per step:
| Component | Function |
|---|---|
| Question ($q$) | The core goal or query guiding reasoning |
| Evolving Report ($M_t$) | A compressed “memory” summarizing critical findings |
| Immediate Context ($\{a_{t-1}, \mathrm{TR}_{t-1}\}$) | The last action and its tool response |
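In MDP terms, this workspace is the agent’s entire state (our notation, matching the table above):

$$ s_t = \big(q,\; M_t,\; \{a_{t-1}, \mathrm{TR}_{t-1}\}\big), \qquad \pi(a_t \mid s_t) \approx \pi(a_t \mid q, h_{1:t-1}) $$

The policy conditions on this compact state alone; the full interaction history $h_{1:t-1}$ matters only through whatever the evolving report has distilled from it.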
After each interaction, the workspace is rebuilt rather than appended to. Irrelevant fragments are discarded, while distilled insights persist through the evolving report. Reasoning can therefore deepen round after round while the workspace itself stays essentially constant in size.
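A minimal sketch of that rebuild loop in Python (the `Workspace` fields mirror the table above; `decide`, `execute`, and `synthesize` are hypothetical stand-ins for the agent’s reasoning and tool calls, not the paper’s API):

```python
from dataclasses import dataclass

@dataclass
class Workspace:
    question: str       # q: the fixed goal guiding reasoning
    report: str         # M_t: compressed memory of critical findings
    last_action: str    # a_{t-1}: the most recent action
    last_response: str  # TR_{t-1}: the tool response to that action

def step(agent, tools, ws: Workspace) -> Workspace:
    """One Markovian round: act from the compact state, then rebuild it."""
    # The next action is chosen from the bounded workspace only,
    # never from the full interaction history.
    action = agent.decide(ws.question, ws.report, ws.last_action, ws.last_response)
    response = tools.execute(action)

    # Reconstruct rather than append: distill this round's evidence into
    # the evolving report and discard everything else.
    new_report = agent.synthesize(ws.report, action, response)
    return Workspace(ws.question, new_report, action, response)
```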
To train this process, the authors devised Efficiency-Aware Policy Optimization (EAPO), a reinforcement learning method that rewards not only correctness but efficiency. Using geometric discounting, $r_t = \gamma^{T-t} R_T$, where $R_T$ is the terminal reward and $\gamma \in (0, 1)$, trajectories that reach a correct answer in fewer steps incur less discounting and so earn higher per-step rewards. This subtle tweak shifts learning incentives away from brute-force exploration toward focused reasoning, a rare instance of elegance in reinforcement learning.
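A quick numerical sketch of that incentive (assuming a single terminal reward $R_T$ as in the formula above; $\gamma = 0.95$ is our illustrative choice, not the paper’s):

```python
def eapo_rewards(T: int, R_T: float, gamma: float = 0.95) -> list[float]:
    """Per-step rewards r_t = gamma**(T - t) * R_T for steps t = 1..T."""
    return [gamma ** (T - t) * R_T for t in range(1, T + 1)]

# The same correct answer earns more per-step credit when reached in
# 10 rounds than in 40, so efficiency is rewarded directly.
fast = eapo_rewards(T=10, R_T=1.0)
slow = eapo_rewards(T=40, R_T=1.0)
print(round(fast[0], 2), round(slow[0], 2))  # 0.63 vs 0.14
```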
Findings — The power of strategic forgetting
Across six benchmarks, IterResearch outperformed all open-source long-horizon agents by an average of +14.5 percentage points and began closing the gap to proprietary systems like OpenAI’s DeepResearch and Google’s Gemini DeepResearch. More strikingly, it demonstrated interaction scaling up to 2048 rounds within a fixed 40K-token context window, something no mono-contextual design can sustain.
| Benchmark | Best Open Source | IterResearch (30B-A3B) | Gain (pp) |
|---|---|---|---|
| Humanity’s Last Exam | 20.0 | 28.8 | +8.8 |
| BrowseComp | 17.2 | 37.3 | +20.1 |
| BrowseComp-zh | 29.4 | 45.2 | +15.8 |
| GAIA | 64.1 | 72.8 | +8.7 |
| Xbench-DeepSearch | 56.0 | 71.0 | +15.0 |
| SEAL-0 | 20.7 | 39.6 | +18.9 |
These are not mere statistical bumps; they reflect a structural victory. IterResearch’s constant-size workspace (O(1) context per round) sidesteps the quadratic attention blow-up of an ever-growing context, maintaining reasoning quality even as task depth increases.
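A back-of-the-envelope illustration of that asymmetry (all token counts here are our own assumptions, not figures from the paper):

```python
# A mono-contextual agent re-reads an ever-growing history each round;
# a Markovian agent re-reads only a bounded workspace.
ROUNDS, TOKENS_PER_ROUND, WORKSPACE = 2048, 1_000, 40_000

mono = sum(t * TOKENS_PER_ROUND for t in range(1, ROUNDS + 1))  # ~O(T^2) total
markov = ROUNDS * WORKSPACE                                     # ~O(T) total

print(f"mono-contextual tokens processed: {mono:,}")   # 2,098,176,000
print(f"Markovian tokens processed:       {markov:,}")  # 81,920,000
# Worse, at 1,000 tokens/round the mono context would overflow a 40K
# window after only 40 rounds, long before round 2048.
```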
Even more intriguing, when IterResearch-generated trajectories were used to retrain mono-contextual agents, their performance also rose. In other words, good forgetting produces transferable intelligence.
Implications — What this means for the AI industry
For AI labs, IterResearch represents a philosophical shift: intelligence isn’t about remembering more, but remembering better. Forgetting becomes computation. This matters deeply for three sectors:
- Enterprise AI automation — Agents handling long investigative workflows (due diligence, legal discovery, market analysis) can stay focused without ballooning context costs.
- Scientific and R&D automation — Iterative knowledge synthesis aligns with human research cycles: gather, distill, refocus.
- Agentic infrastructure design — The Markovian paradigm implies new architectures for distributed multi-agent systems, where each sub-agent maintains bounded cognition but shares distilled knowledge.
And the training innovation—EAPO—shows a practical path toward balancing accuracy with resource efficiency, crucial for scaling autonomous systems economically.
Conclusion — The elegance of bounded reasoning
IterResearch’s genius is not in memory expansion, but in memory discipline. By constraining the cognitive horizon to what’s essential, it achieves what human researchers have long practiced: summarization as survival. The study’s lesson is timeless—progress in reasoning comes not from hoarding information, but from knowing what to discard.
Cognaptus: Automate the Present, Incubate the Future.