Search is easy. Knowing when to go back is harder.

That is the useful irritation inside GSM-Agent, a new benchmark for studying agentic reasoning under controlled conditions [1]. The paper takes grade-school maths problems from GSM8K, removes the premises from the prompt, hides those premises in a searchable document database, and asks an LLM agent to recover the facts before solving the problem. The arithmetic is not supposed to be impressive. That is the point. If a model fails here, we cannot calmly blame differential geometry, PhD-level law, or some mysteriously adversarial enterprise workflow. The agent simply did not find and use the facts.

The paper’s central result is therefore not “models are bad at maths”. We already had enough theatrical evidence for that. The more interesting result is behavioural: strong agents do not merely search more. They return to earlier search territory after learning something new. They revisit.

That small distinction matters for business systems. Many production agents are evaluated as if the final answer were the whole story: right invoice classification, right support reply, right compliance flag, right CRM match. GSM-Agent argues for a more diagnostic view. The path matters. An agent that never reopens a promising lead is not being efficient. It may simply be giving up with better typography.

GSM-Agent turns solved maths into missing-evidence work

Most reasoning benchmarks give the model the problem and all facts needed to solve it. GSM-Agent changes only one thing: the model sees the question but not the premises. The missing premises are transformed into context-rich documents and placed inside a vector-search database. The agent receives tools such as Search(query) and NextPage(), then must decide what information is missing, search for it, inspect retrieved documents, continue if needed, and finally produce the numerical answer.
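To make the tool interface concrete, here is a minimal sketch of a search environment with a paged result list. Everything in it is an assumption for illustration: the class name `ToySearchEnv`, the page size, and especially the word-overlap scoring, which stands in for the vector search the benchmark actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class ToySearchEnv:
    """Hypothetical stand-in for a searchable document database."""
    documents: dict[str, str]  # doc_id -> document text
    page_size: int = 2
    _ranked: list[str] = field(default_factory=list, init=False)
    _page: int = field(default=0, init=False)

    def _score(self, query: str, text: str) -> int:
        # Word overlap replaces embedding similarity (an assumption).
        return len(set(query.lower().split()) & set(text.lower().split()))

    def search(self, query: str) -> list[str]:
        """Rank every document for a new query and return the first page."""
        self._ranked = sorted(
            self.documents,
            key=lambda doc_id: self._score(query, self.documents[doc_id]),
            reverse=True,
        )
        self._page = 0
        return self._ranked[: self.page_size]

    def next_page(self) -> list[str]:
        """Return the next page of results for the most recent query."""
        self._page += 1
        start = self._page * self.page_size
        return self._ranked[start : start + self.page_size]


env = ToySearchEnv({
    "price_label": "apples cost 2 dollars each",
    "order_slip": "kelly ordered 5 apples",
    "weather": "it rained in may",
    "discount": "becky received a discount on apples",
})
first = env.search("kelly apples")   # best matches first
second = env.next_page()             # remaining documents
```

Note that the discount document only surfaces on the second page. An agent that never pages, or never searches with different wording, would not see it.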

This is a neat experimental trick because it separates two capabilities that are often bundled together:

| Capability | Static benchmark version | GSM-Agent version |
| --- | --- | --- |
| Reasoning | Solve once all facts are visible | Solve only after collecting the facts |
| Retrieval | Not really tested | Decide what to search for |
| Workflow control | Mostly irrelevant | Decide whether to continue, branch, revisit, or stop |
| Failure diagnosis | “Wrong answer” | Missing evidence, bad query, premature stop, poor recovery, or arithmetic error |

The benchmark is also controllable. The authors preprocess GSM8K problems to avoid entity ambiguity, assign names to generic entities, use timestamps to separate repeated names, decompose each problem into self-contained premises, convert those premises into documents, check document independence, anonymise some documents so the agent cannot simply search the protagonist’s name, and build databases of different sizes.

That construction matters. Without it, an agent benchmark can easily become a soup of confounders: maybe the task is too hard, maybe the retrieval system is noisy, maybe the model lacks domain knowledge, maybe the answer format broke, maybe the dataset is haunted. GSM-Agent is not free of design choices, but it is unusually clean about the question it wants to ask: when the underlying reasoning problem is easy, how much does agentic information-gathering still hurt?

Quite a lot, apparently.

The performance gap is too large to dismiss as arithmetic

In the main zero-shot ReAct evaluation, the top models still leave a large amount of performance on the table. o3 reaches 68.46% accuracy, GPT-5 reaches 66.78%, Claude-4-Sonnet reaches 56.00%, and Grok-4 reaches 53.00% in the reported run. Below that, performance drops sharply: Gemini-2.5-Pro is at 38.33%, Kimi-K2-Instruct at 37.42%, GPT-4o at 22.67%, DeepSeek-V3 at 19.42%, Qwen3-235B at 19.30%, and Llama-4-Scout at 12.54%.

Those numbers are uncomfortable because the source tasks are grade-school maths. The difficulty has been moved from calculation into evidence acquisition. The agent is not asked to invent a proof. It is asked to find the right bits of information and not stop too early. Apparently that is enough to turn respectable models into interns who close the ticket after reading one paragraph.

The paper also reports search-complete rate: the proportion of tasks where the agent finds all relevant documents. GPT-5 reaches 52%; o3 reaches 53%; Grok-4 and Claude-4-Sonnet reach 42%. Several lower-performing models are far below that. This gives the accuracy drop a more operational interpretation. The issue is not merely that models reason incorrectly after retrieving the evidence. Often, they never assemble the evidence in the first place.
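As defined above, search-complete rate is a simple proportion. A minimal sketch, assuming each run is logged as a pair of sets (the documents the task requires and the documents the agent retrieved); the function name and log shape are illustrative, not taken from the paper.

```python
def search_complete_rate(runs):
    """Fraction of runs in which every required document was retrieved.

    runs: iterable of (required_doc_ids, retrieved_doc_ids) pairs.
    """
    runs = list(runs)
    if not runs:
        return 0.0
    complete = sum(
        1 for required, retrieved in runs if set(required) <= set(retrieved)
    )
    return complete / len(runs)


example_runs = [
    ({"a", "b"}, {"a", "b", "c"}),  # complete: extra documents do not hurt
    ({"a"}, {"b"}),                 # incomplete: required document missed
    ({"x", "y"}, {"x"}),            # incomplete: partial evidence only
    ({"z"}, {"z"}),                 # complete
]
rate = search_complete_rate(example_runs)
```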

A tempting response is to say: fine, give the agent more turns. More search rounds, more tokens, more chances. The usual test-time scaling instinct. Add budget and pray. Enterprise AI has tried worse strategies, usually with invoices attached.

GSM-Agent tests that instinct. The authors examine interaction-time scaling by prompting selected open models to continue searching when they try to stop, and compare that behaviour with GPT-5. GPT-5 shows stronger scaling with more interaction rounds. The open models improve only weakly or inconsistently. In the appendix, forced interaction can become expensive without becoming wise: more rounds may increase search-complete rates, but can also produce long traces, premature attempts, and inefficient wandering.

This is the first important mechanism in the paper: time is not the same as control. Longer traces can help only if the extra steps are directed toward unresolved evidence. Otherwise, the agent is just paying rent in the search engine.
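The gap between time and control is easy to reproduce in a toy harness. The sketch below is an assumption about how forced continuation might be wired, not the paper's harness: `run_with_forced_rounds`, the action names, and the deliberately stubborn `lazy_agent` are all hypothetical.

```python
def run_with_forced_rounds(agent_step, min_rounds, max_rounds):
    """Run an agent loop that refuses early answers (hypothetical harness).

    agent_step(trace) returns the next action name, e.g. "search" or "answer".
    """
    trace = []
    for round_no in range(max_rounds):
        action = agent_step(trace)
        if action == "answer" and round_no < min_rounds:
            # Reject the early stop and nudge the agent to keep searching.
            trace.append("continue_prompt")
            continue
        trace.append(action)
        if action == "answer":
            break
    return trace


def lazy_agent(trace):
    # Searches once, then wants to answer no matter how often it is nudged.
    return "answer" if "search" in trace else "search"


trace = run_with_forced_rounds(lazy_agent, min_rounds=3, max_rounds=6)
wasted = trace.count("continue_prompt")
```

The forced rounds lengthen the trace but add no new searches, because nothing in the loop points the agent at unresolved evidence.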

The agentic reasoning graph turns wandering into a measurable path

The paper’s most useful contribution is not the benchmark leaderboard. Leaderboards are fine, in the same way airport departure boards are fine: informative, but not a theory of aviation.

The better contribution is the agentic reasoning graph. The authors cluster document embeddings in the environment into semantic nodes. Then they map each tool call onto the nearest node. A search query maps to the node closest to the query embedding; a NextPage() call stays associated with the current query’s node. The agent’s sequence of tool calls becomes a path through this graph.
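A rough sketch of that mapping, assuming the cluster centroids already exist and that queries arrive as embedding vectors. The two-dimensional vectors, the Euclidean distance, and the function names are illustrative choices; the paper's clustering and embedding details may differ.

```python
import math


def nearest_node(embedding, centroids):
    """Index of the centroid closest to the embedding (Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: math.dist(embedding, centroids[i]),
    )


def calls_to_path(calls, centroids):
    """Convert tool calls into a sequence of node indices.

    calls: list of ("search", embedding) or ("next_page", None) tuples.
    A next_page call stays on the node of the current query.
    """
    path = []
    current = None
    for kind, embedding in calls:
        if kind == "search":
            current = nearest_node(embedding, centroids)
        elif current is None:
            raise ValueError("next_page called before any search")
        path.append(current)
    return path


centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]  # toy semantic nodes
calls = [
    ("search", (1.0, 1.0)),
    ("next_page", None),
    ("search", (9.0, 1.0)),
    ("search", (0.5, 0.2)),
    ("search", (1.0, 9.0)),
]
path = calls_to_path(calls, centroids)
```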

Once the path exists, each step can be classified:

| Path behaviour | Meaning | Operational analogue |
| --- | --- | --- |
| Exploration | Visiting a node for the first time | Trying a new topic, entity, document family, or hypothesis |
| Exploitation | Staying in the same node as the previous step | Paging or digging within the current topic |
| Revisit | Returning to a previously visited node after leaving it | Coming back to a promising topic with new context or a refined query |

This is simple, and that is why it is useful. It converts a messy tool-use transcript into a compact behavioural signature. Instead of reading hundreds of agent logs and muttering “hmm, vibes”, one can compute ratios for exploration, exploitation, and revisit.
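Computing those ratios takes only a few lines once the path is a list of node identifiers. The sketch below follows the three definitions above. Counting the first step as exploration and dividing by the total number of steps are assumptions about normalisation, not details confirmed by the paper.

```python
from collections import Counter


def classify_path(path):
    """Label each step as exploration, exploitation, or revisit."""
    labels = []
    seen = set()
    previous = None
    for node in path:
        if node not in seen:
            labels.append("exploration")    # first visit to this node
        elif node == previous:
            labels.append("exploitation")   # stayed on the same node
        else:
            labels.append("revisit")        # returned after leaving
        seen.add(node)
        previous = node
    return labels


def behaviour_ratios(path):
    """Share of steps in each category (zero for an empty path)."""
    counts = Counter(classify_path(path))
    total = len(path) or 1
    return {
        name: counts[name] / total
        for name in ("exploration", "exploitation", "revisit")
    }


labels = classify_path([0, 0, 1, 0, 2])
ratios = behaviour_ratios([0, 0, 1, 0, 2])
```

For the path `[0, 0, 1, 0, 2]`, the fourth step is the only revisit: the agent left node 0, visited node 1, then came back.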

The paper finds that accuracy has weak correlation with exploration ratio, strong correlation with revisit ratio, and negative correlation with exploitation ratio. The exact lesson is not that exploration is bad. A model must discover new regions. The lesson is that strong agents appear to do something more specific: they leave, learn, and return.

That is a very different behaviour from just paging endlessly in the same area. Exploitation can look diligent in logs because the agent is doing more work. But if it is stuck inside the wrong local neighbourhood, diligence becomes a very professional form of being lost.

Revisit is not repetition; it is return with context

The word “revisit” can sound like a polite synonym for looping. It is not. A loop repeats because it has lost state. A useful revisit returns because state has improved.

The appendix example makes this concrete. In one o3 trace, the question asks how much more Kelly paid than Becky in May 1990. The relevant documents include purchase logs, price labels, order slips, discount authorisations, and discount records. The agent first searches around payments and ledgers, touches a financial-record node, leaves it, and later returns with a better angle. A query about Becky’s May 1990 pay finds beckys_discount_record. Later, after discovering more of the apple-purchase context, a query about Kelly’s coupon finds kellys_discount_authorization and kellys_apple_order_slip.

The mechanism is subtle. The agent does not merely “search longer”. It accumulates partial structure: there are apples, there are two buyers, there are price labels, there are discounts, there is a date, and some documents are hidden behind different wording. Revisit lets the agent use later discoveries to improve earlier searches.

This is why the finding is relevant beyond maths. Enterprise workflows often have exactly this shape:

  • A support agent finds the product family, then later learns the error code.
  • A finance agent finds the supplier invoice, then later discovers the purchase-order exception.
  • A compliance agent finds the regulation category, then later sees a clause that changes which evidence matters.
  • A sales agent finds the account record, then later learns the subsidiary name under which the contract was filed.

In each case, the first search may be incomplete not because it was foolish, but because the agent did not yet know what would make it precise. Revisit is how the system converts partial discovery into better retrieval.

The tool experiment tests the mechanism, not a magic button

After identifying revisit as a behavioural correlate of success, the authors test whether tools can encourage better agentic reasoning. They introduce three tool variants: a thinking tool that forces additional reasoning, an exploration tool that encourages different search queries, and a revisit tool that encourages returning to previous queries. They compare tool-augmented strategies against zero-shot and chain-of-thought prompting across several models.

The broad result supports the mechanism: tool-augmented methods often match or outperform prompt-only chain-of-thought, and increases in revisit ratio correlate with increases in accuracy. For Qwen3-235B, for example, zero-shot accuracy is 19.30%, while the revisit-tool setting reaches 45.68%. Kimi-K2-Instruct improves from 37.42% zero-shot to the mid-40s under several tool or CoT variants. Llama-4-Scout improves from 12.54% zero-shot to 19.39% with the explore-revisit tool combination and 25.51% under interaction scaling, though the latter requires far more search rounds.

The important reading is not “add a revisit tool and all agents become reliable”. The detailed tables are more nuanced. Some variants underperform on some models. Llama-4-Maverick, for instance, does not become impressive simply because tools exist; some tool settings are worse than its zero-shot baseline, while the thinking-tool variant gives only modest improvement. DeepSeek-R1 struggles with tool formatting. Claude-Opus is excluded from the fair comparison because it tends to ask the user for more information and needs special prompting.

So the tool experiment is best read as mechanism evidence. It shows that shaping the action space around revisit can improve behaviour, especially when the model is capable of using the extra structure. It does not prove that a generic revisit() button is a universal product feature. Buttons are cheap. State management is not.

What businesses should copy: telemetry before theatrics

The business implication is not “use GSM-Agent as your procurement benchmark”. It is too synthetic for that. The useful move is to copy the instrumentation pattern.

Most enterprise agent evaluations still overemphasise final-answer accuracy. That metric is necessary but insufficient. When an agent fails, the organisation needs to know whether the failure came from missing evidence, bad retrieval, premature stopping, over-exploitation of one source, poor use of context, or a genuine reasoning error after complete evidence was available.

GSM-Agent suggests a practical telemetry layer:

| Production metric | What it reveals | Why it matters |
| --- | --- | --- |
| Evidence coverage | Whether required source types were found before answer | Separates retrieval failure from reasoning failure |
| Exploration ratio | Whether the agent searches across enough distinct topics | Detects narrow search |
| Exploitation ratio | Whether the agent stays too long in one topic | Detects local looping |
| Revisit ratio | Whether the agent returns to earlier promising topics | Detects recovery and context-aware search |
| Useful revisit rate | Whether revisits actually add new evidence | Separates productive backtracking from thrashing |
| Premature-answer rate | Whether the agent finalises before coverage is adequate | Catches confident incompleteness |

This is not glamorous. Good instrumentation rarely is. But it is the difference between “the agent got it wrong” and “the agent never reopened the supplier-record branch after learning the invoice used an alternate vendor name”. The second diagnosis can be fixed. The first just starts a meeting.
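One way to turn a run log into a diagnosis rather than a meeting is sketched below. The `RunRecord` fields and the decision rules are assumptions for illustration: in particular, treating "incomplete evidence with budget left" as a premature answer and "incomplete evidence with budget exhausted" as a retrieval failure is one reasonable convention, not the paper's.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """Hypothetical per-run telemetry record."""
    required: set[str]     # evidence categories the task needs
    found: set[str]        # categories retrieved before answering
    answer_correct: bool
    rounds_used: int
    round_budget: int


def evidence_coverage(run: RunRecord) -> float:
    """Share of required evidence categories that were found."""
    if not run.required:
        return 1.0
    return len(run.required & run.found) / len(run.required)


def diagnose(run: RunRecord) -> str:
    """Assign a coarse failure label to a single run."""
    if run.answer_correct:
        return "ok"  # may still hide lucky guesses on partial evidence
    if run.required <= run.found:
        return "reasoning_error"      # all evidence present, answer wrong
    if run.rounds_used < run.round_budget:
        return "premature_answer"     # stopped with budget remaining
    return "retrieval_failure"        # searched to the limit, still missing


run = RunRecord(
    required={"invoice", "po_exception"},
    found={"invoice"},
    answer_correct=False,
    rounds_used=3,
    round_budget=10,
)
coverage = evidence_coverage(run)
label = diagnose(run)
```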

How to design revisit into a real agent workflow

A production implementation should not merely tell the model, “Remember to revisit.” That is the agentic equivalent of putting “be careful” in a policy manual and calling it governance.

A better design treats revisit as an explicit workflow capability.

First, maintain a topic ledger. Every query should be mapped to a topic, source cluster, entity, or document family. The system does not need the exact graph machinery from the paper in every case; even coarse clustering can help. The agent should know what it has touched, what it found, and what remains unresolved.
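A topic ledger can be very small. The sketch below assumes topics are identified by plain strings assigned upstream (by a clusterer, a classifier, or a rule); `TopicLedger`, `TopicEntry`, and their fields are hypothetical names, not an existing library.

```python
from dataclasses import dataclass, field


@dataclass
class TopicEntry:
    """What the agent has done and still needs within one topic."""
    queries: list[str] = field(default_factory=list)
    found_docs: set[str] = field(default_factory=set)
    open_questions: set[str] = field(default_factory=set)


class TopicLedger:
    """Tracks touched topics, retrieved documents, and unresolved items."""

    def __init__(self):
        self.topics: dict[str, TopicEntry] = {}

    def record(self, topic_id, query, docs=(), open_questions=()):
        entry = self.topics.setdefault(topic_id, TopicEntry())
        entry.queries.append(query)
        entry.found_docs.update(docs)
        entry.open_questions.update(open_questions)

    def resolve(self, topic_id, question):
        self.topics[topic_id].open_questions.discard(question)

    def unresolved(self):
        """Topics that still carry open questions, with those questions."""
        return {
            topic_id: sorted(entry.open_questions)
            for topic_id, entry in self.topics.items()
            if entry.open_questions
        }


ledger = TopicLedger()
ledger.record("supplier", "invoice ACME", docs=["inv_17"],
              open_questions=["vendor alias?"])
ledger.record("purchase_orders", "PO exception", docs=["po_3"])
pending = ledger.unresolved()
```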

Second, add a coverage check before final answer. The check should not ask, “Are you confident?” Models are famously generous to themselves. It should ask: “Which required evidence categories are still unverified? Which earlier topic became more relevant after later discoveries?” If the answer identifies a gap, the agent should revisit before finalising.
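The two questions in that check can be answered from logged state rather than from the model's self-assessment. A minimal sketch, assuming the workflow knows its required evidence categories and tags each discovery with the topic it bears on; all names and the log shape are illustrative.

```python
def missing_evidence(required, verified):
    """Required evidence categories that are still unverified."""
    return sorted(set(required) - set(verified))


def topics_to_revisit(last_queried, discoveries):
    """Topics that gained relevant information after their last query.

    last_queried: topic_id -> step index of the most recent query there.
    discoveries: list of (step, topic_id, term) tuples, where term is new
    information relevant to topic_id (tagging is assumed to happen upstream).
    """
    return sorted({
        topic_id
        for step, topic_id, _term in discoveries
        if topic_id in last_queried and step > last_queried[topic_id]
    })


def pre_answer_check(required, verified, last_queried, discoveries):
    """Gate the final answer on coverage and on stale topics."""
    gaps = missing_evidence(required, verified)
    stale = topics_to_revisit(last_queried, discoveries)
    return {"may_finalise": not gaps and not stale,
            "missing": gaps, "revisit": stale}


check = pre_answer_check(
    required={"price", "quantity", "discount"},
    verified={"price", "quantity"},
    last_queried={"supplier": 2, "orders": 5},
    discoveries=[(4, "supplier", "alternate vendor name")],
)
```

Here the gate blocks the answer for two separate reasons: the discount is unverified, and the supplier topic was last queried before the alternate vendor name turned up.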

Third, separate revisit from generic search in logs and tools. A distinct operation such as revisit(topic_id, refined_query, reason) gives evaluators a handle. It also forces the agent to state why it is going back. The reason matters: “new alias discovered”, “date range narrowed”, “contradictory evidence found”, “missing price term”, “source cluster likely incomplete”. This is how a trace becomes auditable rather than merely long.
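A sketch of what such an operation might look like follows. The signature extends the `revisit(topic_id, refined_query, reason)` shape from the text with three extra arguments (`visited_topics`, `search_fn`, `log`) that are assumptions needed to make the example self-contained; the reason vocabulary is taken from the examples above.

```python
from dataclasses import dataclass
from enum import Enum


class RevisitReason(Enum):
    NEW_ALIAS = "new alias discovered"
    DATE_NARROWED = "date range narrowed"
    CONTRADICTION = "contradictory evidence found"
    MISSING_TERM = "missing price term"
    INCOMPLETE_CLUSTER = "source cluster likely incomplete"


@dataclass(frozen=True)
class RevisitCall:
    """Auditable record of one revisit."""
    topic_id: str
    refined_query: str
    reason: RevisitReason


def revisit(topic_id, refined_query, reason, visited_topics, search_fn, log):
    """Re-query a previously visited topic and log why.

    Raises ValueError for unvisited topics (that would be exploration) and
    for reasons outside the controlled vocabulary.
    """
    if topic_id not in visited_topics:
        raise ValueError(f"topic {topic_id!r} has not been visited yet")
    call = RevisitCall(topic_id, refined_query, RevisitReason(reason))
    log.append(call)
    return search_fn(refined_query)


log = []
results = revisit(
    "supplier",
    "ACME Holdings invoice",
    "new alias discovered",
    visited_topics={"supplier"},
    search_fn=lambda query: [query.upper()],  # placeholder search backend
    log=log,
)
```

Rejecting free-text reasons is a design choice: a closed vocabulary makes revisit logs countable, at the cost of occasionally forcing a reason into the nearest category.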

Fourth, reward useful revisit, not raw revisit. Otherwise the agent may learn to bounce between topics like a caffeinated spreadsheet. A revisit is useful when it retrieves new relevant evidence, resolves a contradiction, verifies a necessary premise, or prevents a premature answer. Revisit ratio alone is a diagnostic signal; useful revisit is closer to an optimisation target.
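The narrowest version of "useful" is measurable directly: did the revisit retrieve relevant evidence the agent had not already seen? The sketch below covers only that criterion; resolving contradictions or preventing premature answers would need richer labels. The step format and function name are assumptions.

```python
def useful_revisit_rate(steps, relevant_docs):
    """Share of revisit steps that surfaced new relevant documents.

    steps: list of (label, retrieved_doc_ids), with label one of
    "exploration", "exploitation", or "revisit".
    relevant_docs: set of document ids the task actually needs.
    """
    seen_relevant = set()
    revisits = 0
    useful = 0
    for label, docs in steps:
        new_relevant = (set(docs) & relevant_docs) - seen_relevant
        if label == "revisit":
            revisits += 1
            if new_relevant:
                useful += 1
        seen_relevant |= new_relevant
    return useful / revisits if revisits else 0.0


steps = [
    ("exploration", ["a", "x"]),
    ("exploration", ["y"]),
    ("revisit", ["b"]),        # useful: b is relevant and new
    ("revisit", ["a"]),        # not useful: a was already retrieved
    ("exploitation", ["c"]),
]
rate = useful_revisit_rate(steps, relevant_docs={"a", "b", "c"})
```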

Finally, keep a failure library. Store traces where the agent failed because it did not return to a promising topic. These traces are gold for prompt examples, evaluation cases, and workflow guardrails. They show the exact moment where a recoverable miss became a wrong answer.

The paper’s evidence map

The paper is strongest when read as a sequence of evidence types, not as one undifferentiated result.

| Paper component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| GSM-Agent construction | Main benchmark contribution | Static reasoning and agentic evidence-gathering can be compared more cleanly | That GSM-Agent covers all real-world agent failures |
| Overall model table | Main evidence | Strong models still fail substantially when premises must be found | That the ranking will transfer to every enterprise stack |
| Interaction-time scaling | Main diagnostic test | More rounds alone are not a reliable answer | That longer search is never useful |
| Agentic reasoning graph | Mechanism contribution | Tool traces can be converted into measurable search topology | That k-means embedding clusters are the only right graph |
| Revisit correlation | Main mechanism finding | Revisit behaviour is strongly associated with better performance | That revisit causally explains all gains in every setting |
| Tool-augmented variants | Intervention / exploratory extension | Encouraging revisit can improve some models and settings | That a generic revisit tool is production-ready by itself |
| Embedding-model ablation | Robustness test | Results are not entirely dependent on one embedding model | That retrieval details never matter |
| Database-size ablation | Sensitivity test | Smaller databases generally make the task easier and raise search-complete rates | That production databases can be simplified without cost |

This distinction is important because the paper can otherwise be overread. The revisit result is promising. It is not a license to staple a “backtrack” command onto every agent and announce metacognition. A magnificent amount of AI product design consists of renaming ordinary logging as cognition. Let us not assist.

Boundaries: synthetic documents are not the whole enterprise mess

GSM-Agent is deliberately clean. That is its strength and its limitation.

The documents are generated from decomposed GSM8K premises. The database is controlled. The answers are numerical. The required evidence is knowable. The environment is search-centric. Real enterprise environments are less polite. Documents conflict, permissions interfere, APIs fail, evidence may be missing, and sometimes the correct action is to ask a human rather than search again. Also, business tasks often involve judgement, policy trade-offs, or risk tolerance rather than a single numeric answer.

The paper’s finding should therefore be treated as a diagnostic principle, not a universal law. Revisit matters most when the task has multi-step evidence gathering, when later discoveries can refine earlier searches, and when premature stopping is a common failure mode. It may matter less when the environment is small, the relevant record is directly keyed, or the agent’s main challenge is not retrieval but domain judgement.

Even within the paper, tool augmentation is uneven across models. Some models benefit more than others. Some fail due to tool-use conventions rather than reasoning. This is another useful business lesson: agent capability is not just model intelligence. It is model behaviour inside a particular tool protocol. Change the protocol, and you may change the apparent intelligence. Annoying, but operationally true.

The real lesson: do not just grade the answer; inspect the route

GSM-Agent’s best contribution is to make agent failure less mystical. It shows that when all premises are visible, grade-school maths is easy for strong models. Hide the premises in a searchable environment, and the task becomes a test of workflow control. The agent must decide what to search, when to keep digging, when to branch, when to return, and when it has enough evidence to answer.

That is exactly the shape of many useful business agents. They are not magical answer machines. They are evidence-gathering systems with a language interface attached. If their search path is shallow, brittle, or unable to return to earlier leads, their final answers will look confident long before they are complete.

The practical takeaway is simple: build agents that can backtrack intelligently, then measure whether they actually do. Revisit is not wasted motion. In the right workflow, it is the difference between a system that merely wanders and one that learns where to look again.

And yes, this means the path matters. Terrible news for anyone hoping to evaluate agents with a single green tick.



  1. Hanlin Zhu, Tianyu Guo, Song Mei, Stuart Russell, Nikhil Ghosh, Alberto Bietti, and Jiantao Jiao, “GSM-Agent: Understanding Agentic Reasoning Using Controllable Environments,” arXiv:2509.21998, 2025. https://arxiv.org/abs/2509.21998