Every enterprise AI team eventually meets the same annoying bill: the agent that thinks too much.
It calls tools when a direct answer would do. It loops through evaluator prompts for tasks that need one clean instruction. It drags a code interpreter into a problem that is mostly reading comprehension. Then, after all that expensive theatre, it may still be wrong. Very impressive. Very modern. Very invoiceable.
The paper behind EGuR, short for Experience-Guided Reasoner, is interesting because it does not treat this as a prompt-writing problem. It treats it as a strategy-selection problem.[1] The question is not simply, “What should the model say?” It is, “What computational procedure should the system run for this kind of problem, given what it has already learned?”
That shift sounds small until you look at what most “memory” systems actually do. They remember text. They retrieve notes. They prepend context. They nudge the same underlying agent to behave slightly differently. EGuR goes further: it uses experience to generate a complete executable reasoning strategy, including prompts, tool availability, sampling parameters, control flow, and whether the system should behave like a simple workflow or a more agentic loop.
That is the paper’s central contribution. Not memory as a bigger notebook. Memory as a strategy compiler.
The wrong comparison is “agent with memory” versus “agent without memory”
A tempting reading of EGuR is that it is another entry in the long-term-memory-for-agents category. That is close enough to be misleading, which is the most dangerous kind of close.
The paper’s actual comparison is sharper. It separates three regimes:
| Regime | What adapts | What stays fixed | Business consequence |
|---|---|---|---|
| Fixed strategies | Nothing meaningful after deployment | The workflow, tools, prompts, and control flow | Predictable architecture, repeated mistakes, repeated costs |
| Prompt-steered memory | Textual input to a fixed strategy | The underlying agent or workflow | Better context, but limited ability to reduce structural waste |
| EGuR-style strategy adaptation | The complete computational procedure | The meta-process that generates and updates strategies | Potentially better accuracy-cost trade-offs when feedback is reliable |
This distinction matters because many enterprise AI deployments are already full of fixed scaffolds: chain-of-thought prompts, retrieval steps, code execution loops, evaluator-optimizer patterns, self-consistency sampling, and tool-calling agents. Each scaffold is useful somewhere. None is universally appropriate.
The paper makes this point explicitly with its strategy landscape. Code-based strategies perform well on some algorithmic tasks, such as 3-SAT and word sorting, but can perform poorly on AIME and movie recommendation. Eval-Opt can reach accuracy comparable to self-consistency while often costing less. CodeAct, the most general agentic strategy in the set, is not automatically the best one.
There is a lesson here that should be tattooed onto a few procurement decks: more agentic is not the same as more effective. A general agent can, in theory, emulate simpler workflows. In practice, it may fail to choose the right behaviour and charge you for the privilege.
EGuR treats a reasoning method as an object, not a vibe
The paper formalises a strategy as a composition of stateful processes. In plain language, a strategy is the procedure an AI system runs to turn an input into an output: an LLM call, a tool invocation, a branch, a loop, a parallel sample, a verifier, or some composition of those pieces.
This formalism is not decorative mathematics. It gives the authors a way to talk about workflows, pipelines, agents, tool-use systems, costs, and traces inside one common frame.
A pipeline is relatively static. A workflow may branch. An agent is recursive: it can repeatedly decide, call tools, observe results, and continue. Parallelisation can be used for majority voting or candidate generation. Tools can be present or absent. Sampling parameters can be conservative or exploratory.
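To make "composable objects" concrete, here is a minimal sketch of the building blocks named above. It is an illustration, not the paper's formalism: the names `pipeline`, `branch`, `majority_vote`, and `agent_loop` are hypothetical, strategies are modelled as plain text-to-text functions, and the stateful part of the paper's definition is deliberately left out.

```python
from collections import Counter
from typing import Callable, List

# Simplification: a strategy is a pure function from input text to output text.
# The paper's processes are stateful; state is omitted to keep the sketch short.
Strategy = Callable[[str], str]


def pipeline(*steps: Strategy) -> Strategy:
    """Static composition: each step consumes the previous step's output."""
    def run(x: str) -> str:
        for step in steps:
            x = step(x)
        return x
    return run


def branch(test: Callable[[str], bool], if_true: Strategy, if_false: Strategy) -> Strategy:
    """Workflow-style branching on a property of the input."""
    def run(x: str) -> str:
        return if_true(x) if test(x) else if_false(x)
    return run


def majority_vote(samplers: List[Strategy]) -> Strategy:
    """Parallel sampling: run every sampler and return the most common answer."""
    def run(x: str) -> str:
        answers = [sample(x) for sample in samplers]
        return Counter(answers).most_common(1)[0][0]
    return run


def agent_loop(step: Strategy, is_done: Callable[[str], bool], max_turns: int = 5) -> Strategy:
    """Agent-style recursion: keep acting until done or out of turns."""
    def run(x: str) -> str:
        for _ in range(max_turns):
            if is_done(x):
                break
            x = step(x)
        return x
    return run


# Stand-ins for LLM and tool calls, so the example runs offline.
draft = lambda q: q.strip()
shout = lambda q: q.upper()
one_call = pipeline(draft, shout)
print(one_call("  which strategy?  "))  # WHICH STRATEGY?
```

Because every combinator returns another `Strategy`, a pipeline can sit inside a branch, and a branch inside an agent loop. That closure property is what makes it plausible to generate, compare, and cache strategies as values.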
Once strategies are treated as composable objects, they can be generated, compared, cached, and reused. That is where EGuR enters.
EGuR has two main components.
The first is the Guide. Given the current problem and accumulated experience, it generates candidate strategies. These are not just suggestions like “be careful with arithmetic.” They are complete strategy specifications: which prompts to use, which tools are available, which parameters apply, and what control flow should govern execution.
The second is the Consolidator. After strategies run, EGuR collects the answer, trace, cost, and verifier feedback. The Consolidator updates a structured memory that includes a strategy library and general notes about what works, what fails, and under which problem characteristics.
That memory then conditions future strategy generation. The system is not merely remembering facts. It is remembering operational lessons: when code helps, when code hurts, when a single call beats an agent loop, when deterministic settings are safer, and when extra exploration is worth its cost.
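The Guide and Consolidator loop can be sketched in a few dozen lines. Everything here is an assumption made for illustration: `StrategySpec`, `Memory`, `consolidate`, and `solve` are hypothetical names, the Guide, executor, and verifier are passed in as plain callables, and the "keep the cheapest correct strategy" rule is a simple stand-in rather than the paper's actual consolidation logic.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class StrategySpec:
    """A complete, executable description of how to answer (hypothetical schema)."""
    name: str
    prompt: str
    tools: Tuple[str, ...] = ()
    temperature: float = 0.0
    max_turns: int = 1


@dataclass
class Outcome:
    spec: StrategySpec
    answer: str
    cost: float
    correct: bool


@dataclass
class Memory:
    library: Dict[str, StrategySpec] = field(default_factory=dict)  # task family -> spec
    notes: List[str] = field(default_factory=list)                  # operational lessons


def consolidate(memory: Memory, family: str, outcomes: List[Outcome]) -> None:
    """Stand-in rule: cache the cheapest correct strategy, note every failure."""
    winners = [o for o in outcomes if o.correct]
    if winners:
        memory.library[family] = min(winners, key=lambda o: o.cost).spec
    for o in outcomes:
        if not o.correct:
            memory.notes.append(f"{family}: {o.spec.name} failed at cost {o.cost:.3f}")


def solve(problem: str, family: str, memory: Memory,
          guide: Callable, execute: Callable, verify: Callable, k: int = 5) -> List[Outcome]:
    """One episode: generate k strategies, run them, verify, update memory."""
    outcomes = []
    for spec in guide(problem, family, memory, k):
        answer, cost = execute(spec, problem)
        outcomes.append(Outcome(spec, answer, cost, verify(problem, answer)))
    consolidate(memory, family, outcomes)
    return outcomes


# Offline stubs standing in for the LLM-backed Guide, the executor, and the verifier.
def stub_guide(problem, family, memory, k):
    specs = [StrategySpec("codeact", "Use code.", tools=("python",), max_turns=5),
             StrategySpec("single_call", "Count carefully, step by step.")]
    return specs[:k]

def stub_execute(spec, problem):
    return ("7", 0.20) if spec.name == "codeact" else ("7", 0.05)

memory = Memory()
solve("How many fruits?", "object_counting", memory, stub_guide, stub_execute,
      verify=lambda p, a: a == "7", k=2)
print(memory.library["object_counting"].name)  # single_call
```

Setting `k=1` gives only absolute feedback on one strategy per problem; a larger `k` lets the memory update compare candidates against each other on the same input.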
This is why the paper’s “strategy as a service” implication is more than a cute phrase. The service is not an answer. The service is an evolving decision about how to produce the answer.
The real fight: fixed CodeAct, prompt memory, or generated strategy
The baseline comparison is useful because it exposes where different adaptation methods run out of room.
CodeAct is the stateless agent baseline. It can use a code interpreter in an iterative loop. It is flexible within a single episode, but it does not learn across episodes.
Dynamic Cheatsheet adds memory by appending accumulated notes to the agent’s input. This can help, but it also grows the context and can inflate cost. The paper reports that Dynamic Cheatsheet often exceeds $1.00 per sample in Claude experiments and reaches much higher per-sample costs after training on several tasks.
Mem0 adds a vector-database memory layer to CodeAct. It retrieves past information more selectively than a growing cheatsheet, but the underlying structure is still CodeAct. It can steer the same agent; it cannot decide that the agent itself is the wrong instrument.
EGuR changes the object of adaptation. It can remove the code interpreter when it harms performance. It can alter sampling parameters. It can switch from a multi-turn agent to a one-call workflow. It can cache a successful procedure and reuse it instead of rebuilding the same expensive process.
That difference is architectural, not cosmetic.
A prompt-memory system says: “Here is what we learned; please behave accordingly.”
EGuR says: “Here is what we learned; generate a different machine.”
The main evidence supports an accuracy-cost trade-off, not magic self-improvement
The paper evaluates EGuR on five benchmarks: AIME, 3-SAT, and three Big Bench Extra Hard (BBEH) tasks (movie recommendation, word sorting, and object counting). The models tested are Claude 3.7 Sonnet, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B. The evaluation reports two metrics. The first is prequential accuracy, meaning performance on each sample is measured before the system updates on that sample. The second is execution cost, which excludes system feedback and update costs.
That last exclusion matters. The paper’s reported execution costs focus on running the strategies, not the full accounting cost of Guide generation, Consolidator updates, and infrastructure around the learning loop. This does not invalidate the findings, but it narrows how directly they translate into production ROI. An enterprise version would need full-system cost accounting. The CFO, annoyingly, will insist.
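Prequential evaluation is easy to misread, so here is a minimal test-then-train sketch. The function name `prequential_eval` and the toy learner are hypothetical; the one deliberate parallel with the paper's setup is that only the cost returned by `predict` is counted, while whatever `update` spends is left out of the total.

```python
from typing import Callable, Iterable, Tuple


def prequential_eval(
    stream: Iterable[Tuple[str, str]],
    predict: Callable[[str], Tuple[str, float]],
    update: Callable[[str, str, str], None],
) -> Tuple[float, float]:
    """Return (accuracy, mean execution cost) under test-then-train ordering.

    stream yields (problem, truth); predict returns (answer, execution_cost).
    """
    n = 0
    correct = 0
    exec_cost = 0.0
    for problem, truth in stream:
        answer, cost = predict(problem)   # score the sample first ...
        correct += int(answer == truth)
        exec_cost += cost                 # only execution cost is counted
        n += 1
        update(problem, answer, truth)    # ... then learn from it
    if n == 0:
        raise ValueError("empty stream")
    return correct / n, exec_cost / n


# Toy learner: expensive and wrong on unseen problems, cheap and right afterwards.
seen = {}

def predict(problem):
    if problem in seen:
        return seen[problem], 0.1
    return "", 1.0

def update(problem, answer, truth):
    seen[problem] = truth

stream = [("a", "1"), ("a", "1"), ("b", "2"), ("a", "1")]
accuracy, mean_cost = prequential_eval(stream, predict, update)
print(accuracy)  # 0.5
```

A production version would add a second accumulator for update-side spend, which is exactly the figure the reported numbers leave out.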
The headline results are still notable. Across the evaluated benchmarks, EGuR reports up to 14% accuracy improvement and up to 111× cost reduction relative to the strongest baselines in particular comparisons. The Claude results are especially strong.
On Claude 3.7 Sonnet, the appendix reports EGuR-5 reaching 96.0% prequential accuracy on 3-SAT at a cost of $0.152, compared with CodeAct at 77.0% and $0.257, and Dynamic Cheatsheet at 89.9% and $76.353. On object counting, EGuR-5 reaches 96.0% accuracy at $0.075, compared with CodeAct at 80.0% and $0.098, Mem0 at 87.0% and $0.174, and Dynamic Cheatsheet at 87.0% and $24.895.
Those numbers illustrate the point beautifully: the biggest business win is not always “more reasoning.” Sometimes it is learning when to stop doing the expensive thing.
But the evidence is not uniform across all models and tasks. GPT-OSS results are more mixed. In Table 3, GPT-OSS CodeAct and Mem0 outperform EGuR-3 on several accuracy measures, including AIME and movie recommendation, while EGuR-3 is cheaper on some tasks. Qwen results also vary: EGuR-3 improves AIME and object counting in Table 3, but underperforms CodeAct or Mem0 on some other tasks.
So the honest reading is this: EGuR demonstrates a promising mechanism for improving the accuracy-cost frontier, especially in the Claude setting and especially where strategy choice strongly determines cost. It does not prove that every model, every benchmark, or every enterprise workflow will benefit automatically.
That is not a flaw. It is the interesting part. Strategy adaptation is only valuable when strategy actually matters.
The ablation tests whether comparison teaches better strategy choice
The paper’s exploration-level experiment is best read as an ablation, not as a second thesis.
EGuR-ZS is a zero-shot version without memory updates. EGuR-1 generates one strategy per problem and learns from absolute feedback. EGuR-5 generates five strategies per problem, enabling comparative evaluation.
The purpose of this test is to separate three possible explanations:
| Test variant | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| EGuR-ZS | Checks whether the Guide alone is already strong | Strategy generation can be useful even without learning | That experience is responsible for all gains |
| EGuR-1 | Tests memory updates without comparison | Learning from feedback helps across episodes | That one strategy gives enough information about alternatives |
| EGuR-5 | Tests comparative strategy evaluation | Multiple candidates help identify better and cheaper procedures | That larger exploration is always cost-effective in production |
The result: memory helps, and comparison helps more on most tasks. EGuR-5 generally outperforms EGuR-1 in the Claude experiments, with particularly large gains on 3-SAT and object counting. The paper also reports that higher exploration can reduce later costs because the system discovers efficient strategies faster.
This is an elegant result because it turns extra upfront compute into a learning instrument. Running multiple strategies is not just an ensemble trick. In EGuR, it creates comparative evidence for future decisions.
Still, this is not a free lunch. In a production environment, the correct exploration factor would depend on task recurrence, error cost, latency tolerance, and how reliable the feedback signal is. If a task appears once, five strategies may be theatre. If a task family appears ten thousand times, exploration may be cheap tuition.
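The recurrence argument reduces to simple break-even arithmetic. The sketch below is a back-of-the-envelope model with made-up numbers, not anything from the paper: `break_even_runs` is a hypothetical helper that assumes a one-off exploration spend and a constant per-run saving afterwards.

```python
import math


def break_even_runs(explore_cost: float, baseline_cost: float, learned_cost: float) -> float:
    """Future runs needed before per-run savings repay a one-off exploration spend."""
    saving = baseline_cost - learned_cost
    if saving <= 0:
        return math.inf  # the learned strategy is no cheaper, so exploration never pays back
    return explore_cost / saving


# Hypothetical figures: $10.00 spent exploring, $0.50 per run before, $0.25 after.
print(break_even_runs(10.0, 0.5, 0.25))  # 40.0
```

A task family that recurs ten thousand times clears a 40-run threshold almost immediately; a one-off request never does. Error cost, latency, and feedback reliability are left out of this model on purpose and would shift the threshold in practice.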
The object-counting example is the paper in miniature
The most revealing qualitative result is object counting.
A naïve engineer might assume that object counting benefits from code. The task involves quantities. There are additions and changes. Surely a code interpreter should help.
EGuR learns otherwise. For BBEH object counting, it converges on a single LLM call with detailed instructions, rather than a CodeAct loop. The learned strategy gives guidance on parsing text, categorising items, and handling quantity changes. In the appendix example, a pre-training strategy answers incorrectly and costs $0.166, while the post-training strategy answers correctly and costs $0.062.
That example captures the whole argument.
The winning move was not “use the most powerful agent.” It was “remove the unnecessary machinery and write the right procedure.” This is the sort of adaptation that prompt memory alone struggles to enforce. A fixed CodeAct agent can be told that code may be harmful. But it is still a CodeAct agent. EGuR can simply stop being one for that task.
The same pattern appears in word sorting. EGuR learns to distinguish algorithmic sorting problems, where Python sorting can help, from reasoning problems about logical mistakes in explanations, where chain-of-thought with a fallback is more appropriate.
This is the operational intelligence businesses actually need: not a universal agent, but a system that learns which kind of procedure a task deserves.
What Cognaptus infers for enterprise systems
The paper directly shows that EGuR can generate and adapt inference-time strategies across benchmark tasks using verifier feedback. It reports strong Claude results, meaningful cost reductions in several settings, and qualitative evidence that the system learns useful tool-selection and workflow-selection heuristics.
The business inference is broader but should be kept disciplined.
For enterprise AI, EGuR points toward a layer above today’s agent orchestration platforms: a strategy-management layer. That layer would monitor task families, compare workflows, cache successful procedures, and decide whether the next request needs a tool-heavy agent, a deterministic single call, a code run, a majority vote, or a retrieval-plus-answer workflow.
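As a rough picture of what such a layer might look like, here is a minimal routing sketch. It is this article's extrapolation, not a component described in the paper: `StrategyRouter`, `CachedStrategy`, the keyword classifier, and the 0.9 success threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class CachedStrategy:
    name: str
    run: Callable[[str], str]
    mean_cost: float
    success_rate: float


class StrategyRouter:
    """Routes a request to a cached procedure for its task family, else a fallback."""

    def __init__(self, classify: Callable[[str], str], fallback: CachedStrategy,
                 min_success: float = 0.9):
        self.classify = classify
        self.fallback = fallback
        self.min_success = min_success
        self.cache: Dict[str, CachedStrategy] = {}

    def register(self, family: str, candidate: CachedStrategy) -> None:
        """Cache a candidate only if it is reliable enough and cheaper than the incumbent."""
        if candidate.success_rate < self.min_success:
            return
        incumbent = self.cache.get(family)
        if incumbent is None or candidate.mean_cost < incumbent.mean_cost:
            self.cache[family] = candidate

    def route(self, request: str) -> CachedStrategy:
        return self.cache.get(self.classify(request), self.fallback)


# Offline stand-ins: a keyword classifier and two fake strategies.
agent = CachedStrategy("tool_agent", lambda r: "agent answer", mean_cost=0.30, success_rate=0.95)
single = CachedStrategy("single_call", lambda r: "direct answer", mean_cost=0.05, success_rate=0.97)

router = StrategyRouter(classify=lambda r: "invoice" if "invoice" in r.lower() else "other",
                        fallback=agent)
router.register("invoice", single)

print(router.route("Summarise invoice 118").name)  # single_call
print(router.route("Plan the migration").name)     # tool_agent
```

The expensive agent stays available as the default; it just stops being the answer to every request.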
The ROI case would not come from mystical self-improvement. It would come from three operational effects.
First, repeated task families could become cheaper. If a system learns that a simple workflow solves a recurring class of requests, it can stop launching expensive agent loops. This matters for customer support, document analysis, compliance screening, internal analytics, and any other domain where the same types of requests arrive repeatedly under slightly different wording.
Second, failure modes could become less repetitive. Instead of recording that “the answer was wrong,” the system records which strategy failed, what it cost, and what traces preceded the failure. That is a more useful memory for operations teams because it links errors to procedure design.
Third, governance becomes more concrete. EGuR generates explicit strategies. That means the organisation can inspect, log, compare, and potentially approve the procedures being used. A black-box answer is hard to govern. A generated workflow with tool settings, prompts, and control flow is at least something compliance can argue with. Progress.
There is also a product implication. “Agent builder” platforms often focus on assembling workflows manually. EGuR suggests a future where the platform itself learns a portfolio of workflows and routes tasks into generated or cached procedures. That is closer to process optimisation than chatbot deployment.
The boundary: EGuR needs feedback, recurrence, and a capable Guide
The paper’s limitations are not footnotes to be politely ignored. They define where the method can work.
The first boundary is feedback. EGuR learns from verifier signals. In the experiments, most tasks use ground-truth comparison, and 3-SAT uses a satisfiability checker. That is a clean learning environment. Many enterprise tasks are messier: “Was this legal memo good?”, “Did this sales recommendation help?”, “Was this procurement risk assessment complete?” Those answers may require human review, delayed business outcomes, or unreliable LLM-as-judge signals.
The second boundary is recurrence. Strategy learning has value when task families repeat. If every request is unique, there is less opportunity to amortise exploration and memory over future use. EGuR is most compelling where a business sees patterned variation: many invoices, many support tickets, many compliance questions, many market summaries, many coding tasks of related form.
The third boundary is Guide competence. The Guide must be able to generate viable strategy code zero-shot from context. If the problem type is unfamiliar or the available tools are complex, the generated strategies may be poor. The authors note that training or optimising the Guide could be needed in such cases.
The fourth boundary is memory management. The Consolidator decides what to retain and what to delete. That is powerful, but also a new failure surface. Keep too much and memory becomes bloated. Keep too little and the system forgets valuable procedures. Keep the wrong abstraction and the Guide inherits bad operational folklore. Anyone who has seen a corporate wiki decay into archaeological sediment will recognise the risk.
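One way to picture the retention problem is as an explicit pruning policy. The sketch below is hypothetical and much cruder than anything a real Consolidator would do: `Entry`, `prune`, and the thresholds are assumptions, and success rate plus recency are simply the easiest signals to illustrate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    name: str
    uses: int
    successes: int
    last_used: int  # episode index of the most recent use


def success_rate(entry: Entry) -> float:
    return entry.successes / entry.uses if entry.uses else 0.0


def prune(entries: List[Entry], now: int, max_entries: int,
          min_success: float = 0.5, max_idle: int = 1000) -> List[Entry]:
    """Drop stale or unreliable strategies, then keep the best max_entries."""
    keep = [
        e for e in entries
        if (e.uses == 0 or success_rate(e) >= min_success)  # untried entries get a chance
        and now - e.last_used <= max_idle                    # stale entries are dropped
    ]
    keep.sort(key=lambda e: (success_rate(e), e.last_used), reverse=True)
    return keep[:max_entries]


library = [
    Entry("single_call_counting", uses=10, successes=9, last_used=100),
    Entry("codeact_counting", uses=10, successes=2, last_used=100),
    Entry("old_sorting_trick", uses=5, successes=5, last_used=1),
    Entry("cot_with_fallback", uses=4, successes=3, last_used=99),
]
kept = prune(library, now=100, max_entries=2, max_idle=50)
print([e.name for e in kept])  # ['single_call_counting', 'cot_with_fallback']
```

Each threshold encodes one of the failure modes above: set `max_entries` too high and memory bloats, set `max_idle` too low and the system forgets procedures that were merely seasonal.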
Finally, the reported costs exclude some system-level overhead. Strategy execution becomes cheaper in many cases, but full deployment economics must include Guide calls, Consolidator calls, storage, monitoring, validation, and safety review.
None of these boundaries kills the idea. They make it deployable only with adult supervision, which is generally where enterprise AI begins.
The strategic lesson is not “build smarter agents”
The easy conclusion is that EGuR is another step toward self-improving agents. That is true, but too broad to be useful.
The sharper conclusion is that enterprises should stop treating “agent” as the default unit of intelligence. Sometimes the right unit is a prompt. Sometimes it is a workflow. Sometimes it is a tool call. Sometimes it is a recursive agent loop. Sometimes it is a cached strategy that worked last time and does not need to rediscover itself with twenty thousand tokens of existential wandering.
EGuR’s value is that it makes this choice explicit and learnable.
That is why the comparison-based framing matters. Fixed agents are brittle because they do not learn. Prompt-memory agents are limited because they can only steer what already exists. Offline strategy optimisation is powerful but static after deployment. EGuR covers the gap the other three leave open: it is online, stateful, and able to generate full procedures at inference time.
For businesses, the promise is not that AI will suddenly “think like a strategist.” Spare us. The useful promise is narrower and better: AI systems may learn which strategy to use, when to use it, and when the expensive clever thing should be replaced by the cheap boring thing that works.
Boring, in production, is often another word for margin.
Cognaptus: Automate the Present, Incubate the Future.
[1] Adam Stein, Matthew Trager, Benjamin Bowman, Michael Kleinman, Aditya Chattopadhyay, Wei Xia, and Stefano Soatto, “Experience-Guided Adaptation of Inference-Time Reasoning Strategies,” arXiv:2511.11519, 2025, https://arxiv.org/abs/2511.11519.