Tools are where agent demos go to die.
The pitch is usually elegant. Give the model a goal, attach a few APIs, let it reason, and watch the automation glide across systems like a tiny consultant with no calendar conflicts. Then the real world appears: too many tools, unclear documentation, stale context, partial failures, long interaction histories, and the occasional API response that seems to have been designed by someone settling a personal score.
Most agent frameworks respond to this mess with choreography. They force the model into loops: reason, act, observe; plan, execute, revise; call one of the tools already placed in front of it. These loops are useful, but they quietly assume that the system designer already knows the shape of the task. That assumption is a luxury. Real work rarely sends a memo in advance.
The paper behind DeepAgent, “DeepAgent: A General Reasoning Agent with Scalable Toolsets,” attacks that assumption directly [1]. Its central claim is not simply that another agent scores higher on another benchmark. We have enough leaderboard confetti, thank you. The more interesting move is architectural: DeepAgent lets a large reasoning model think, search for tools, call them, and compress its own memory inside one continuous reasoning process.
That distinction matters. The paper is not just asking whether an agent can use a tool. It asks whether the agent can decide which tool world to enter while the task is already unfolding.
The real mechanism is not “more autonomy”; it is fewer fixed handrails
The easiest way to misunderstand DeepAgent is to file it under “autonomous agents are getting better.” That is true, but too vague to be useful.
DeepAgent reorganises the agent loop around four possible actions. The model can produce internal reasoning. It can issue a tool search. It can call a selected tool. It can trigger memory folding, which compresses the prior interaction into structured memory and restarts the working context from that compressed state.
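The four actions can be pictured as a small dispatch loop. The sketch below is illustrative only: `ActionKind`, `Action`, `AgentState`, `step`, and `run` are hypothetical names, the handlers are placeholders, and the paper's actual action format is not reproduced here. It simply shows how a fold action replaces the accumulated history with a compressed state while the other three actions append to it.

```python
from dataclasses import dataclass, field
from enum import Enum


class ActionKind(Enum):
    THINK = "think"
    TOOL_SEARCH = "tool_search"
    TOOL_CALL = "tool_call"
    FOLD = "fold"


@dataclass
class Action:
    kind: ActionKind
    payload: str = ""


@dataclass
class AgentState:
    context: list[str] = field(default_factory=list)
    folds: int = 0


def step(state: AgentState, action: Action) -> AgentState:
    """Apply one action to the working context (placeholder handlers)."""
    if action.kind is ActionKind.THINK:
        state.context.append(f"thought: {action.payload}")
    elif action.kind is ActionKind.TOOL_SEARCH:
        # A real system would query the tool documentation index here.
        state.context.append(f"search results for: {action.payload}")
    elif action.kind is ActionKind.TOOL_CALL:
        # A real system would parse and execute the structured call here.
        state.context.append(f"observation from: {action.payload}")
    elif action.kind is ActionKind.FOLD:
        # Replace the whole history with a compressed summary, then continue.
        summary = f"folded {len(state.context)} entries"
        state.context = [summary]
        state.folds += 1
    return state


def run(actions: list[Action]) -> AgentState:
    """Feed a sequence of actions through the loop, starting from empty state."""
    state = AgentState()
    for action in actions:
        state = step(state, action)
    return state
```

The design point is that folding is just another action the model can choose, not a scheduler event imposed from outside.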
That gives the agent a different operating rhythm from conventional ReAct-style systems. In a standard workflow, the developer often defines the cycle and exposes a fixed or pre-selected set of tools. In DeepAgent, the model can search for relevant tools during the reasoning process, using natural-language queries that are matched against an indexed tool documentation base. Retrieved tool documentation is then filtered or summarised before being returned to the main reasoning model.
The result is not magic. It is still infrastructure. Tool documentation must be prepared. Tool retrieval depends on embeddings. Calls must be parsed, executed, and returned. A separate auxiliary model helps summarise long tool outputs, filter documentation, simulate APIs during training, and fold memory. The wizard, as usual, has a backend.
But the mechanism changes where flexibility lives.
| Design choice | What DeepAgent changes | Operational consequence |
|---|---|---|
| Tool access | Tools are dynamically searched during reasoning rather than only supplied upfront | Large tool libraries become usable without stuffing every API into the prompt |
| Action execution | Tool calls are generated as structured actions inside the reasoning stream | The model can interleave investigation, execution, and correction |
| Memory management | The agent can fold history into episodic, working, and tool memory | Long tasks can reset context without simply forgetting the plot |
| Training signal | ToolPO rewards both final success and intermediate tool actions | The model is trained not only to be right, but to call the right things on the way |
This is why a mechanism-first reading is more useful than a benchmark-first reading. The benchmarks show whether the machine moves. The mechanism explains why it does not immediately trip over its own toolbelt.
Dynamic tool discovery is the paper’s most practical idea
Enterprise AI teams usually face one of two bad options. They either expose a small curated toolset and limit what the agent can do, or they expose a larger ecosystem and watch the model drown in options. DeepAgent’s tool search mechanism is a response to that second problem.
The system builds an index of tool documentation. When the main model decides it needs help, it emits a tool-search query. The retrieval system returns candidate tools based on similarity between the query and tool documentation. If the documentation is too long, the auxiliary model condenses it before handing it back. The main model then decides whether and how to call a tool using a structured call format.
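A minimal sketch of that retrieval step follows. It makes loud assumptions: word-count vectors stand in for the dense embeddings a real system would use, plain truncation stands in for the auxiliary model's condensing step, and `ToolIndex`, `embed`, and the tool names are invented for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a dense embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToolIndex:
    """Index of tool documentation, built once before any task runs."""

    def __init__(self, docs: dict[str, str]):
        self.docs = docs
        self.vectors = {name: embed(doc) for name, doc in docs.items()}

    def search(self, query: str, k: int = 2, max_chars: int = 80) -> list[tuple[str, str]]:
        """Return the top-k tools with (possibly shortened) documentation."""
        q = embed(query)
        ranked = sorted(
            self.vectors,
            key=lambda name: cosine(q, self.vectors[name]),
            reverse=True,
        )
        # Truncation is a placeholder for summarising over-long documentation.
        return [(name, self.docs[name][:max_chars]) for name in ranked[:k]]
```

For example, an index holding a weather tool, a music search tool, and a movie lookup tool would rank the weather tool first for the query "weather forecast for Paris", because only its documentation shares words with the query.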
This is not the same as true open-ended discovery across the internet or a corporate software estate. The tools are still represented in advance through documentation. The agent is not wandering through procurement systems signing SaaS contracts on your behalf, which is merciful. But it does move beyond the brittle assumption that all necessary tools must be selected before the task begins.
The paper’s open-set experiments are the key evidence here. In the open-set setting, the agent must retrieve tools from the full toolset rather than receiving the ground-truth tools upfront. On ToolBench, with over 16,000 APIs, DeepAgent-32B-RL reaches 64.0 Pass@1, compared with 54.0 for the strongest 32B workflow baseline reported in the table. On ToolHop, which uses 3,912 executable tools and requires multi-hop tool use, DeepAgent-32B-RL reaches 40.6, compared with 29.0 for the best 32B workflow baseline.
The more revealing comparison is Table 4, where the authors test input-level pre-retrieval against autonomous retrieval during reasoning. DeepAgent with input-retrieved tools averages 42.0 across ToolBench, ToolHop, TMDB, and Spotify. With autonomous retrieval, it rises to 52.6. The best workflow-based method with autonomous retrieval reaches 28.5.
That 10.6-point gap between pre-retrieval and autonomous retrieval is not a small formatting preference. It says retrieval timing matters. Tool discovery works better when the model can ask for tools after it has understood more of the task.
For businesses, the implication is straightforward: the long-term asset is not just an agent prompt. It is a governed, searchable tool registry with clean documentation, permission boundaries, observability, and retrieval quality controls. The prompt is the visible tip. The tool infrastructure is the iceberg, and unfortunately the iceberg has a Jira board.
Memory folding is less about token saving than strategic recovery
The paper’s most memorable phrase is that memory folding lets the agent “take a breath.” The phrase is slightly anthropomorphic, but the mechanism is useful.
Long-horizon agents accumulate history. That history contains useful facts, dead ends, intermediate outputs, tool errors, partial plans, and increasingly large amounts of context that the model must keep mentally stepping over. A longer context window helps, but it does not automatically produce better judgement. Sometimes it just gives the model a larger attic in which to misplace things.
DeepAgent’s memory folding compresses prior interaction into three structured components:
| Memory type | What it preserves | Why it matters |
|---|---|---|
| Episodic memory | High-level task progress, major events, decisions, outcomes | Maintains the story of the task |
| Working memory | Immediate goal, current challenges, next actions | Keeps the agent oriented after compression |
| Tool memory | Tools used, parameters, success patterns, errors, derived rules | Helps avoid repeating bad calls or forgetting effective ones |
The appendix is important here because it shows that memory is not an informal summary. The authors define fixed JSON-style schemas for each memory type. That design choice matters more than the “brain-inspired” label. Structured memory is easier for the agent to parse and reuse than a poetic paragraph about how the journey has been difficult but meaningful.
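To make the idea of schema-bound memory concrete, here is a sketch of the three components as typed records that serialise to JSON. The field names are assumptions chosen to mirror the table above; they are not the paper's exact schemas, which live in its appendix.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EpisodicMemory:
    task: str
    key_events: list[str] = field(default_factory=list)  # decisions and outcomes


@dataclass
class WorkingMemory:
    immediate_goal: str
    current_challenges: list[str] = field(default_factory=list)
    next_actions: list[str] = field(default_factory=list)


@dataclass
class ToolMemory:
    # Per-tool notes, e.g. {"get_weather": {"successes": 2, "errors": ["bad city"]}}
    tools_used: dict[str, dict] = field(default_factory=dict)
    derived_rules: list[str] = field(default_factory=list)


@dataclass
class FoldedMemory:
    episodic: EpisodicMemory
    working: WorkingMemory
    tool: ToolMemory

    def to_context(self) -> str:
        """Serialise the folded state so it can seed a fresh working context."""
        return json.dumps(asdict(self), indent=2)
```

Because every fold yields the same keys, downstream code (and the model itself) can look up "what failed last time" in a predictable place instead of re-reading a free-form summary.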
The ablation study gives memory folding its strongest support. Removing ToolPO training drops the average score across the selected tasks from 48.1 to 44.3. Removing memory folding produces a similar decline, to an average of 44.2, with a particularly visible fall on GAIA from 53.3 to 44.7. The likely purpose of this ablation is component attribution: it asks whether memory folding is decorative or doing real work. The answer is that, at least on these benchmark tasks, it is not decorative.
For enterprise use, the lesson is not “copy this exact memory schema and call it a brain.” The lesson is that agent memory needs operational structure. A useful production agent should know what has been tried, what failed, what state changed, which tool outputs are reliable, and what the next constrained action should be. Without that, long-running automation becomes a very expensive form of amnesia.
ToolPO trains the agent on the action, not only the answer
DeepAgent’s training method, Tool Policy Optimization, is the paper’s third major contribution. The problem it addresses is familiar to anyone who has evaluated agents: final answer rewards are too blunt.
If an agent completes a task successfully, many intermediate actions may have contributed. Some may have been lucky. Some may have been unnecessary. Some may even have been wrong but rescued later. If the training signal only rewards final success, the model may learn that a messy trajectory is acceptable as long as the ending looks good. That is not a comforting philosophy for enterprise systems, unless your compliance department enjoys jazz improvisation.
ToolPO separates the learning signal into global task success and local action quality. The global reward applies to the trajectory. The local reward is targeted at tool calls and memory-fold actions. The authors then attribute the action-level advantage only to the tokens that constitute those actions, rather than spreading it across the whole generated sequence.
In plain English: the model gets credit for solving the task, but it also receives more precise feedback for making good tool decisions along the way.
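The attribution idea can be sketched in a few lines. This is a simplified illustration, not the paper's exact objective: the group normalisation shown is the common centre-and-scale form, and `weight` is a hypothetical knob for balancing the two signals. The point is only the masking: the global advantage reaches every token, while the action-level advantage reaches only tokens that belong to a tool call or memory fold.

```python
from statistics import mean, pstdev


def group_normalise(rewards: list[float]) -> list[float]:
    """Centre and scale rewards within a group of sampled trajectories."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / sigma if sigma else 0.0 for r in rewards]


def token_advantages(
    global_adv: float,
    action_adv: float,
    action_mask: list[int],
    weight: float = 1.0,  # assumed balancing factor, not taken from the paper
) -> list[float]:
    """Global advantage on every token; action advantage only where mask == 1."""
    return [global_adv + weight * action_adv * m for m in action_mask]
```

With a mask of `[0, 1, 1, 0]`, only the two middle tokens, those forming the tool call, receive the extra action-level credit; the surrounding reasoning tokens see the task-level signal alone.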
The method also uses LLM-simulated APIs during training. That choice is not a minor implementation convenience. Training against thousands of real APIs would be slow, unstable, costly, and dependent on services that may rate-limit, fail, change, or return inconsistent outputs. Simulated APIs provide a more controlled RL environment.
The boundary is equally important. A simulator can stabilise training, but it may also simplify the nastiness of real integrations. Production APIs have authentication, permissions, latency spikes, pagination quirks, undocumented edge cases, business rules, and failure modes that benchmark environments usually treat with enviable politeness. ToolPO is promising because it gives tool use a finer training signal. It does not prove that the trained agent is ready to operate unchecked inside a live ERP system. Please do not give the benchmark a purchase order.
The experiments support the architecture, not a universal autonomy claim
The paper evaluates DeepAgent across general tool-use tasks and downstream applications. The main evidence is broad: ToolBench, API-Bank, TMDB, Spotify, ToolHop, ALFWorld, WebShop, GAIA, and Humanity’s Last Exam.
The numbers are most persuasive when compared like for like, against baselines built on backbones of the same size. DeepAgent-32B-RL generally beats 32B workflow-based baselines across the reported tool-use settings. In labelled-tool scenarios, it reaches 89.0 on TMDB and 75.4 on Spotify, ahead of the strongest 32B baselines in those columns. In open-set retrieval, the advantage becomes more strategically interesting because the system must discover tools from larger pools.
On downstream tasks, DeepAgent-32B-RL reports 91.8 success on ALFWorld, 34.4 success and a 56.3 score on WebShop, 53.3 overall on GAIA, and 20.2 overall on HLE. It performs strongly among the 32B systems in the table. However, the reference results from larger or closed systems are not uniformly beaten. OpenAI’s o3-based Deep Research, where reported, scores higher on GAIA overall and HLE overall. Claude-4 is also competitive or stronger on several individual downstream metrics.
That does not weaken the paper. It clarifies it. DeepAgent is not presented most usefully as “the best agent now exists.” It is better read as evidence that changing the agent architecture can produce substantial gains without relying solely on a larger frontier model.
| Evidence item | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Main tool-use tables | Main evidence | DeepAgent improves over workflow baselines, especially when tool retrieval is open-set | Universal superiority over all closed or larger systems |
| Downstream task table | Main evidence / transfer test | The architecture transfers beyond API benchmarks into embodied, shopping, research, and exam-style tasks | Readiness for messy enterprise deployment |
| Training dynamics figure | Comparison with prior RL training | ToolPO appears more stable than GRPO in reward and validation curves | That simulated API training captures all real API failures |
| Ablation table | Component attribution | ToolPO, memory folding, simulator use, and tool-call attribution each contribute | That each component is optimally designed |
| Retrieval strategy table | Mechanism test | On-demand tool retrieval beats pre-retrieval, especially for larger toolsets | That retrieval will work without high-quality tool documentation |
| Action-limit scaling figure | Robustness / sensitivity test | DeepAgent benefits more from longer interaction horizons than ReAct | That more actions are always economically worthwhile |
| Backbone comparison table | Generalisation check | The DeepAgent pattern works with different reasoning backbones | That architecture removes dependence on model quality |
The action-limit experiment deserves special attention. As maximum action limits increase, both DeepAgent and ReAct generally improve, but the performance gap widens, especially on WebShop. The likely purpose is a sensitivity test: does the architecture still help when the agent is allowed to take more steps? The result suggests DeepAgent uses longer horizons more productively. Business translation: autonomy is not just about permitting more actions. It is about whether the agent spends those actions intelligently.
The business value is not “replace workflows”; it is redesigning where workflows live
A lazy reading of DeepAgent would conclude that fixed workflows are obsolete. That would be convenient, dramatic, and mostly wrong.
Businesses do not need agents that ignore workflows. They need agents whose workflows are partly internal, partly governed, and partly observable. DeepAgent points toward that balance. The reasoning model can decide when to search and act, but the available tools, documentation, permissions, schemas, and execution environment remain controlled by the system.
This suggests a practical architecture for enterprise adoption:
- Build a governed tool registry, not a random API drawer.
- Standardise tool documentation so retrieval has something reliable to retrieve.
- Log tool searches, tool calls, observations, failures, and memory folds as first-class events.
- Evaluate intermediate action quality, not only final task completion.
- Use structured memory for long-horizon tasks instead of dumping everything into context.
- Keep human approval gates around irreversible, financial, legal, or externally visible actions.
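Two items on that list, first-class event logging and approval gates, are easy to sketch together. Everything here is hypothetical: the `IRREVERSIBLE` tool names, the event kinds, and the `guarded_call` helper are illustrative assumptions, not part of DeepAgent.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Callable

# Hypothetical names of tools that must never run without human sign-off.
IRREVERSIBLE = {"send_payment", "sign_contract", "send_external_email"}


@dataclass
class AgentEvent:
    kind: str  # e.g. "tool_search", "tool_call", "observation", "approval_required"
    detail: dict
    timestamp: float = field(default_factory=time.time)


class EventLog:
    def __init__(self) -> None:
        self.events: list[AgentEvent] = []

    def record(self, kind: str, **detail: Any) -> AgentEvent:
        event = AgentEvent(kind, detail)
        self.events.append(event)
        return event

    def to_jsonl(self) -> str:
        """One JSON object per line, ready for an audit store."""
        return "\n".join(json.dumps(asdict(e)) for e in self.events)


def guarded_call(
    log: EventLog,
    tool: str,
    args: dict,
    execute: Callable[[str, dict], Any],
    approved: bool = False,
) -> Any:
    """Run a tool call, but block irreversible tools until a human approves."""
    if tool in IRREVERSIBLE and not approved:
        log.record("approval_required", tool=tool, args=args)
        return None
    log.record("tool_call", tool=tool, args=args)
    result = execute(tool, args)
    log.record("observation", tool=tool, result=result)
    return result
```

The gate sits in the execution layer rather than in the prompt, so a blocked action leaves an audit record whether or not the model behaves.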
The ROI pathway is therefore indirect but concrete. DeepAgent-style systems could reduce the cost of building bespoke workflow automations because the agent can discover and combine tools dynamically. They could improve robustness in multi-step tasks because memory folding reduces context clutter and supports recovery. They could improve training and evaluation because tool-call attribution reveals which actions are actually helping.
But the uncertainty boundary is large. The paper’s evidence comes from benchmark environments and controlled toolsets. Production settings add adversarial users, permission boundaries, PII handling, API contracts, audit requirements, service-level objectives, and cost constraints. A system that performs well on ToolBench has not automatically passed your regulator’s audit. Strange, I know.
What Cognaptus would actually take from this
The most actionable idea in DeepAgent is not to wait for a fully general digital employee. It is to change how firms prepare for agents.
If agents are becoming better at dynamic tool discovery, then businesses should invest in making their tools discoverable, safe, and machine-usable. That means API descriptions written for models as well as humans. It means schemas that are consistent. It means retrieval indexes that are tested. It means clear action permissions. It means memory logs that distinguish what the agent knows, what it tried, what failed, and what remains uncertain.
The firm that benefits from this kind of research will not be the one that simply buys the newest model and whispers “be autonomous” into the prompt. It will be the firm that treats tool infrastructure as an AI product layer.
DeepAgent also shifts how agent evaluation should be designed. Final-answer accuracy is not enough. Organisations need to measure tool-selection accuracy, unnecessary action rates, recovery from failed calls, memory compression quality, escalation behaviour, and cost per successful task. Those are less glamorous than a single benchmark score. They are also what determines whether automation survives contact with Monday morning.
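Several of those measures can be computed directly from logged episodes. The sketch below assumes a hypothetical `Episode` record in which each run has been annotated with the tools it genuinely needed; real evaluations would need a labelling process to produce that field.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    succeeded: bool
    cost_usd: float
    tool_calls: list[str]
    needed_tools: set[str]  # assumed ground-truth annotation


def evaluate(episodes: list[Episode]) -> dict[str, float]:
    """Aggregate process-aware metrics over a batch of agent runs."""
    if not episodes:
        raise ValueError("need at least one episode")
    total_calls = sum(len(e.tool_calls) for e in episodes)
    useful = sum(1 for e in episodes for c in e.tool_calls if c in e.needed_tools)
    successes = sum(e.succeeded for e in episodes)
    total_cost = sum(e.cost_usd for e in episodes)
    accuracy = useful / total_calls if total_calls else 0.0
    return {
        "success_rate": successes / len(episodes),
        "tool_selection_accuracy": accuracy,
        "unnecessary_action_rate": 1 - accuracy if total_calls else 0.0,
        # Failed runs still cost money, so their spend is charged to successes.
        "cost_per_success": total_cost / successes if successes else float("inf"),
    }
```

Charging the cost of failed runs to the successful ones is a deliberate choice here: it keeps an agent that fails cheaply and often from looking economical.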
The boundary: benchmark autonomy is not enterprise autonomy
DeepAgent is impressive, but it does not dissolve the hard parts of deployment.
First, tool documentation is indexed in advance. The agent dynamically retrieves from a prepared tool universe; it does not validate arbitrary tools discovered in the wild. Second, ToolPO uses LLM-simulated APIs during training. That improves stability, but simulation can hide the operational weirdness that makes real integrations expensive. Third, the benchmark suite is broad, yet it is still a suite. It does not fully test adversarial behaviour, fraud attempts, permission leakage, contractual obligations, irreversible actions, or multi-user organisational politics. Humanity’s Last Exam may be difficult, but it does not ask whether the procurement system should approve a vendor with missing tax documents.
The compute profile is also non-trivial. The reported training setup uses 64 NVIDIA H20-141GB GPUs. This is not exactly “run it beside the office printer.” Firms can still learn from the architecture without reproducing the full training regime, but that distinction should remain visible.
Finally, the paper does not eliminate the need for workflow design. It relocates some of it. Instead of designing every step in advance, system builders design the tool universe, retrieval layer, memory schema, execution constraints, and training/evaluation signals. Less choreography, more stage management.
The agent stops being a script and starts becoming an operating process
DeepAgent’s contribution is best understood as a step toward agentic coherence. Thinking, acting, tool discovery, and memory are no longer bolted together as separate procedural blocks. They are integrated into a single reasoning process, with auxiliary mechanisms supporting retrieval, compression, simulation, and training.
That matters because real tasks do not arrive as neat chains of pre-labelled actions. They unfold. The agent must discover what it needs, recover from wrong turns, remember selectively, and learn which intermediate actions deserve credit. DeepAgent shows that this reorganisation can improve performance across a wide range of tool-use and downstream benchmarks.
For business readers, the conclusion is neither “agents are solved” nor “wait for the next model.” The conclusion is more useful: if agent systems are moving toward dynamic action, then organisations need to prepare the action environment. Tools must become searchable. Memory must become structured. Evaluation must become process-aware. Governance must sit inside the operating layer, not in a slide deck reviewed quarterly by people who still call every model “the algorithm.”
Deep thinking is useful. Dynamic acting is useful. The serious work begins when both are allowed into the same system without pretending the guardrails are optional.
Cognaptus: Automate the Present, Incubate the Future.
[1] Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Guanting Dong, Jiajie Jin, Yinuo Wang, Hao Wang, Yutao Zhu, Ji-Rong Wen, Yuan Lu, and Zhicheng Dou, “DeepAgent: A General Reasoning Agent with Scalable Toolsets,” arXiv:2510.21618v3, 2026. https://arxiv.org/abs/2510.21618