Memory Has to Earn Its Keep

TL;DR for operators

Memory is not valuable because an agent writes something down. That is called logging. Sometimes it is called “reflection,” if the logging has better branding.

The paper “Enhancing Software Engineering Through Closed-Loop Memory Optimization” introduces MemOp, a framework for software-engineering agents that defines memory utility by downstream impact: a memory is useful only if it improves the agent’s later performance on software tasks.¹ The important move is not the existence of Memory.md, nor the idea that past trajectories can be summarized. The important move is the loop: generate memory from an agent trajectory, validate whether that memory improves task performance, reject harmful or redundant memories, and train a memory model using the resulting accepted and rejected examples.

The paper evaluates MemOp on SWE-Bench-Verified, using software-engineering agents powered by Devstral-Small and Qwen3-Coder variants, and memory models based on Qwen, DeepSeek, and Qwen3 thinking variants. The authors test both single-episode memory, where memory from one trajectory is immediately reused, and cross-episode memory, where memory evolves across a sequence of tasks. The reported gains are not absurdly large, which is comforting. In single-episode settings, MemOp reaches absolute gains of up to 5.25 percentage points in success rate and 4.63 points in resolve efficiency. In cross-episode settings, it reaches gains of up to 3.00 points in success rate and 3.17 points in localization accuracy. The paper also reports lower evaluation-time API cost: a relative reduction of at least 9.79%, by the authors’ summary.

The sharp business lesson is that memory for coding agents should be treated like a governed operating component, not a vibes-based folder of “lessons learned.” Enterprise teams should ask whether memory improves resolution rate, localization, cost, variance, and time-to-fix on their own repositories. If it does not, it is not memory. It is sediment.

The boundary is equally important. The evidence is mainly on Python GitHub issue-style tasks from SWE-Bench-Verified, with sampled repositories, specific agent backbones, and expensive rollout-based curation. The paper supports a design principle and a promising mechanism. It does not prove that any vendor’s persistent memory feature will automatically improve your enterprise codebase. That would be too convenient, and therefore suspicious.

The familiar failure: every issue starts from zero

A software agent working inside a repository sees plenty. It reads files, executes commands, searches functions, edits code, observes failures, and tries again. By the end of a trajectory, it may have learned quite a lot about the repository: where tests live, which conventions matter, which paths are misleading, which modules are coupled, and which execution errors are not worth repeating.

Then the next issue begins, and the agent often behaves as if it has just arrived from a monastery with no internet.

That is the practical problem behind this paper. Modern coding agents can navigate codebases and solve real issues, but they remain largely episodic. They reconstruct context from scratch. They rediscover repository structure. They repeat failed test commands. They wander into the same local traps. More context does not automatically fix this. Long context can preserve more text, but not necessarily the right operational lesson.

The authors’ preliminary failure analysis is useful here because it keeps the paper grounded in actual agent behavior. They manually inspect 50 failed software-engineering trajectories from a GPT-4o-mini-powered agent and identify seven failure patterns: repository-structure misunderstanding, repetition, reasoning errors, coding errors, execution errors, inconsistency, and hallucination. The most important category is not exotic model stupidity. It is the agent misunderstanding the repository structure. The second major contributor, as actions and episodes increase, is repetition in long-context reasoning.

That diagnosis matters. If the failure were mostly missing syntax knowledge, memory would be a strange cure. If the failure were mostly hallucinated APIs, better retrieval might be enough. But if the agent repeatedly fails to retain repository structure, execution conventions, and task-solving patterns, then memory becomes plausible. Not guaranteed. Plausible.

And plausibility is where most agent-memory products stop. MemOp tries to go further.

The mechanism: memory utility must be measured after use

The paper’s central sentence could be rewritten as follows: memory is useful if it helps the agent solve the next task better.

This sounds obvious. It is not how many memory systems are evaluated.

A common memory pipeline stores prior interactions, summarizes them, retrieves semantically similar items, and injects them into the next prompt. The evaluation often checks whether the memory is coherent, relevant-looking, retrievable, nicely structured, or aligned with human intuition. These are not useless properties. They are also not utility.

MemOp defines utility by downstream performance. Given a candidate memory and a no-memory baseline, the framework measures performance differences across multiple software-engineering metrics. The paper uses ten metrics derived from success, localization, and efficiency:

Metric family	What it asks	Why it matters operationally
Success rate	Did the agent resolve the task?	The blunt metric management will ask for first, because management is not always wrong.
Localization accuracy	Did the agent find the right files and functions?	A coding agent that edits without locating the problem is not autonomous; it is decorative risk.
Resolve efficiency	How many problem-solving iterations were saved?	Cost and latency matter when agents run at scale.
Localization efficiency	How quickly did the agent find the right files or functions?	Early correct localization reduces wandering, redundant reads, and tool spend.

The paper represents performance change as a difference between the memory-augmented agent and the baseline:

$$ \Delta_i(M) = Q_{i,\mathrm{MemOp}} - Q_{i,\mathrm{baseline}} $$

where $Q_i$ is one of the evaluation metrics. A candidate memory is accepted only when it does not reduce any metric and improves at least one. That is stricter than merely asking whether the memory sounds useful. It is also stricter than asking a model to judge the memory. The memory has to survive contact with the task.
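The acceptance rule is simple enough to write down. The following is a minimal sketch, not the authors’ implementation: the metric names and numbers are hypothetical, and it assumes every metric is oriented so that higher is better.

```python
from typing import Mapping


def metric_deltas(with_memory: Mapping[str, float],
                  baseline: Mapping[str, float]) -> dict[str, float]:
    """Per-metric difference: Q_i with memory minus Q_i for the baseline."""
    return {name: with_memory[name] - baseline[name] for name in baseline}


def accept_memory(deltas: Mapping[str, float], tol: float = 1e-9) -> bool:
    """Accept only if no metric regresses and at least one improves.

    Assumes every metric is oriented so that higher is better.
    """
    no_regression = all(d >= -tol for d in deltas.values())
    some_gain = any(d > tol for d in deltas.values())
    return no_regression and some_gain


# Hypothetical numbers, for illustration only.
baseline = {"success": 0.40, "localization": 0.55, "resolve_efficiency": 0.20}
helpful = {"success": 0.45, "localization": 0.55, "resolve_efficiency": 0.22}
mixed = {"success": 0.45, "localization": 0.50, "resolve_efficiency": 0.22}
redundant = dict(baseline)

for label, run in [("helpful", helpful), ("mixed", mixed), ("redundant", redundant)]:
    print(label, accept_memory(metric_deltas(run, baseline)))
```

Note that the “mixed” memory is rejected even though it raises success rate, because it lowers localization. A memory that changes nothing is rejected too: redundancy does not count as utility.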

This definition does two jobs at once. First, it becomes an evaluation benchmark: compare memory systems by whether they improve downstream software-engineering performance. Second, it becomes an optimization signal: use validated memories as training data and rejected memories as contrastive examples. That dual use is the paper’s real contribution.

The phrase “closed loop” earns its rent here. The system does not just generate memories. It grades them through downstream execution, then trains the memory generator to produce better ones.

The loop: from trajectory to accepted memory to trained memory model

MemOp’s mechanism has four steps.

First, an agent completes a software-engineering task and produces a trajectory: actions, observations, commands, file reads, edits, failures, and successes. This is the raw experience.

Second, a memory model reflects on that trajectory and proposes candidate memories. These memories are intended to capture reusable knowledge: repository structure, effective workflows, debugging patterns, implementation constraints, and lessons that may help future tasks.

Third, the framework validates each memory by running the software agent with and without that memory and comparing downstream metrics. Candidate memories that improve performance without harming other measured dimensions enter an accepted set. Others enter a rejected set. This is trajectory-based rejection sampling.

Fourth, MemOp trains the memory model in two stages. Stage I uses supervised fine-tuning on accepted memories so the model learns the basic shape of performance-enhancing reflections. Stage II uses reinforcement learning with preference-style optimization, where rewards are grounded in measured performance differences. The model is rewarded for generating memories that improve the agent and penalized for memories that are redundant or harmful.

A simplified version looks like this:

Agent trajectory
      ↓
Candidate memory generation
      ↓
Run agent with and without memory
      ↓
Measure downstream performance difference
      ↓
Accept useful memories, reject harmful ones
      ↓
Train memory model with SFT + RL
      ↓
Generate better memories for future tasks
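The validation step in the loop above can be sketched as follows. This is an illustration under stated assumptions, not the paper’s code: `run_agent` is a hypothetical callable standing in for a real agent rollout, `stub_agent` is a deterministic fake used only so the example runs, and averaging over `n_rollouts` is an assumed way to smooth noisy runs.

```python
from statistics import mean
from typing import Callable, Optional

Metrics = dict[str, float]
# Hypothetical interface: (task, memory or None) -> metrics, higher is better.
RunAgent = Callable[[str, Optional[str]], Metrics]


def average_metrics(runs: list[Metrics]) -> Metrics:
    """Average each metric over repeated rollouts."""
    return {name: mean(r[name] for r in runs) for name in runs[0]}


def curate(task: str, candidates: list[str], run_agent: RunAgent,
           n_rollouts: int = 3) -> tuple[list[str], list[str]]:
    """Split candidate memories into accepted and rejected sets."""
    baseline = average_metrics([run_agent(task, None) for _ in range(n_rollouts)])
    accepted, rejected = [], []
    for memory in candidates:
        augmented = average_metrics(
            [run_agent(task, memory) for _ in range(n_rollouts)])
        deltas = {k: augmented[k] - baseline[k] for k in baseline}
        improves = any(d > 0 for d in deltas.values())
        harms = any(d < 0 for d in deltas.values())
        (accepted if improves and not harms else rejected).append(memory)
    return accepted, rejected


def stub_agent(task: str, memory: Optional[str]) -> Metrics:
    """Deterministic stand-in for a real agent run (illustration only)."""
    if memory is not None and "tests live in" in memory:
        return {"success": 1.0, "localization": 0.5}
    if memory is not None and "always edit" in memory:
        return {"success": 0.0, "localization": 0.25}
    return {"success": 0.0, "localization": 0.5}


candidates = [
    "tests live in tests/unit and run with pytest",
    "always edit utils.py first",
    "the repository contains Python files",
]
accepted, rejected = curate("fix-issue", candidates, stub_agent)
print("accepted:", accepted)
print("rejected:", rejected)
```

The rejected set is not waste. In the loop described above, it becomes the contrastive half of the training data.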

The point is not that SFT plus RL is fashionable. Fashionable training recipes are not in short supply. The point is that the reward is tied to the operational job memory is supposed to do.
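To make “reward tied to the operational job” concrete, here is one hypothetical way to turn measured metric differences into a scalar reward and into preference pairs. The reward shape, the `redundancy_penalty` value, and the pairing scheme are all assumptions for illustration; the paper’s exact formulation may differ.

```python
from itertools import combinations
from typing import Mapping


def reward(deltas: Mapping[str, float], redundancy_penalty: float = 0.1) -> float:
    """Scalar reward from per-metric deltas (hypothetical shaping).

    Harmful memory: the summed regressions (negative).
    Redundant memory (no change anywhere): a small fixed penalty.
    Useful memory: the summed gains (positive).
    """
    losses = sum(d for d in deltas.values() if d < 0)
    gains = sum(d for d in deltas.values() if d > 0)
    if losses < 0:
        return losses
    if gains == 0:
        return -redundancy_penalty
    return gains


def preference_pairs(scored: list[tuple[str, float]]) -> list[tuple[str, str]]:
    """Return (preferred, dispreferred) pairs; ties produce no pair."""
    pairs = []
    for (mem_a, r_a), (mem_b, r_b) in combinations(scored, 2):
        if r_a > r_b:
            pairs.append((mem_a, mem_b))
        elif r_b > r_a:
            pairs.append((mem_b, mem_a))
    return pairs


# Hypothetical measured deltas for three candidate memories.
measured = {
    "structure": {"success": 0.5, "localization": 0.0},
    "restated": {"success": 0.0, "localization": 0.0},
    "stale": {"success": -0.25, "localization": 0.125},
}
scored = [(name, reward(deltas)) for name, deltas in measured.items()]
pairs = preference_pairs(scored)
print(scored)
print(pairs)
```

In this sketch a memory with any regression is scored by its losses alone, so a localization gain cannot buy back a drop in success rate. That mirrors the strict acceptance rule, but it is a design choice, not a requirement.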

In business language, MemOp turns memory from a knowledge-management artifact into a performance-managed subsystem.

The misconception: longer memory is not better memory

The tempting reader misconception is simple: agents fail because they forget, so give them more memory.

This paper is an antidote to that. Non-finetuned memory models often make performance worse. In the single-episode table, several non-finetuned Qwen and DeepSeek memory models reduce success rate, localization, and efficiency relative to the no-memory baseline. In cross-episode settings, the degradation can be even more severe. That should not surprise anyone who has deployed retrieval systems, but apparently the industry enjoys relearning the same lesson in more expensive ways.

Memory can harm an agent for several reasons. It can overfit to the previous task. It can preserve the wrong detail. It can add irrelevant instructions that compete with the issue description. It can make the agent prematurely confident about repository structure. It can nudge exploration toward a path that used to matter but no longer does. In software engineering, stale or misdirected context is not neutral. It consumes steps, attention, and budget.

The paper’s qualitative appendix supports this interpretation. Effective memories operate at a useful abstraction level: repository structure, workflows, conventions, and reusable patterns. Ineffective memories become over-specific, task-bound, repetitive, or focused on suboptimal information. That distinction is obvious after someone says it. Before measurement, it is just another prompt-engineering debate with nicer fonts.

The replacement belief should be: memory is not storage. Memory is validated compression.

What the main evidence actually shows

The paper’s main evidence comes from two evaluation regimes.

Single-episode memory generation tests whether memory distilled from one completed trajectory can immediately improve a later agent run. This isolates reflection quality. It asks: can the memory model extract something useful from a trajectory without relying on long-term accumulation?

Cross-episode memory evolution tests whether memory remains useful as it evolves across tasks in a repository. This is harder. The model must decide what to retain, what to update, and what to avoid carrying forward. In a real engineering environment, this is the more interesting setting. Teams do not want an agent that writes a diary after one issue. They want a repository-aware assistant that becomes less foolish over time.

The results are directionally consistent:

Evidence block	Likely purpose	What it supports	What it does not prove
Single-episode results	Main evidence	Finetuned memory models can improve success, localization, and efficiency from trajectory-derived memory.	That memory will remain useful across long-lived enterprise repositories.
Cross-episode results	Main evidence	Evolving memory can improve performance across ordered task sequences.	That the system handles arbitrary repo drift, private code, or non-Python ecosystems.
Non-finetuned memory comparisons	Main evidence and warning	Generic memory generation can degrade performance.	That every untrained memory feature is harmful; only that it must be tested.
Stage I versus Stage II training	Ablation	SFT and RL each help, with the combined two-stage approach performing best.	That this exact training recipe is globally optimal.
RL algorithm comparisons	Robustness test	The approach is not tied to one specific RL optimizer.	That RL implementation details are irrelevant. They never are.
Repository-wise analysis	Robustness test	Gains appear across multiple sampled repositories.	That repository coverage is broad enough for enterprise transfer.
Action-level versus episode-level memory	Sensitivity/design test	Updating memory at episode level is more efficient and effective than action-level evolution in this setup.	That no finer-grained memory update could ever work.
Qualitative examples	Exploratory extension	Effective memories abstract repository patterns; ineffective ones overfit or misdirect.	Quantitative proof of why every task succeeds or fails.

In single-episode memory augmentation, MemOp reports absolute gains of up to 5.25 percentage points in success rate and 4.63 points in resolve efficiency. The table also shows the important negative result: non-finetuned memory models frequently degrade performance. For example, under the Devstral-Small agent baseline, several non-finetuned memory models reduce success rate by multiple points, while finetuned variants turn positive.

In cross-episode memory augmentation, the reported gains are smaller but still meaningful: up to 3.00 points in success rate and 3.17 points in localization accuracy. That smaller magnitude is not a weakness by itself. Cross-episode memory is a harder problem because memory must remain coherent across tasks. If anything, the result is more operationally relevant because it tests whether memory can function as an evolving repository-level knowledge base rather than a one-off postmortem.

The paper also reports cost improvements. In the computation table, evaluation-time API cost per 100 instances drops from 11.73 to 10.26 dollars for the Devstral-Small setting and from 9.97 to 8.94 dollars for the Qwen3-Coder setting. The authors summarize the computational cost reduction as at least 9.79%. Cost reductions appear because better memory can help the agent localize and resolve faster, not because the memory model is free. Training and dataset construction still cost money. Reality remains rude.
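The arithmetic behind those two rows is easy to check. The short sketch below recomputes the relative reduction from the per-100-instance costs quoted above; the function name is ours, not the paper’s.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative cost reduction, in percent."""
    return (before - after) / before * 100


# Evaluation-time API cost per 100 instances, in dollars, as quoted above.
settings = {
    "Devstral-Small": (11.73, 10.26),
    "Qwen3-Coder": (9.97, 8.94),
}

for name, (before, after) in settings.items():
    print(f"{name}: {relative_reduction(before, after):.2f}% cheaper")
# Devstral-Small: 12.53% cheaper
# Qwen3-Coder: 10.33% cheaper
```

Both rows clear the 9.79% floor in the authors’ summary, so that floor presumably comes from a configuration not quoted here.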

The ablations are not side quests

The appendix and additional figures are easy to skim past, but they clarify the mechanism.

The training-stage ablation is especially important. Stage I supervised fine-tuning teaches the model what accepted memories look like. Stage II reinforcement learning further aligns memory generation with performance differences. The paper reports that each stage contributes, and that the combined Stage I plus Stage II setup is stronger than either stage alone. That supports the claim that the loop is not merely filtering data once; optimization against validated utility matters.

The action-level versus episode-level memory comparison is also useful. One might assume that more frequent memory updates are better. After all, if memory is good, why not update it after every action? Because agents already suffer from context noise and long-horizon inefficiency, and action-level memory can amplify both. In the paper’s test, action-level memory evolution requires up to 11.82 times more average solving time and reduces success rate by 12.50 points, while episode-level evolution improves success rate by 2.75 points. This is a rare moment where “less frequent governance” is not laziness; it is architecture.

The preference batch-size test is another sensitivity check. The authors compare reinforcement learning with preference rollout batch sizes of $c = 2$ and $c = 4$. The larger batch produces more robust optimization across metrics, with reported success-rate gains of at least 2.25 points for $c = 4$ versus at most 0.75 points for $c = 2$. The business translation is mundane but important: validation signal quality depends on rollout configuration. You cannot just name something “closed loop” and assume the loop has enough contrast to teach anything.
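As a rough illustration of why batch size could matter, assume that preferences are formed by comparing every pair of rollouts within a batch. This is an assumption for intuition, not a detail taken from the paper. The number of candidate comparisons then grows quadratically with $c$:

$$ \binom{c}{2} = \frac{c(c-1)}{2}, \qquad \binom{2}{2} = 1, \qquad \binom{4}{2} = 6 $$

Under that assumption, a batch of two yields a single comparison, which carries no signal at all when both rollouts land on the same outcome, while a batch of four yields up to six.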

Repository-wise analysis shows gains across nine repositories and 90 instances, with only four degraded comparisons among 90 reported comparisons on localization efficiency and accuracy. That is encouraging, but it is still a robustness test inside a benchmark environment. It should increase confidence in the mechanism, not suspend disbelief.

The business value is governed reuse, not agent nostalgia

For software organizations, the paper’s value is not “agents can remember.” They already can, in the trivial sense. Any system can append text to a file. The question is whether remembered content improves work.

MemOp suggests a practical design pattern for coding-agent deployment:

Operational layer	MemOp-inspired principle	Business interpretation
Memory creation	Generate memories from completed trajectories.	Capture lessons from actual work, not from generic documentation alone.
Memory validation	Compare downstream performance with and without memory.	Treat memory as an intervention subject to A/B testing.
Memory rejection	Remove memories that are redundant or harmful.	Prevent context stores from becoming technical debt with markdown syntax.
Memory training	Fine-tune memory generation using accepted and rejected examples.	Improve the memory writer, not only the retrieval system.
Memory evolution	Update at episode level rather than every action.	Balance learning speed against noise, cost, and operational stability.
Local benchmarking	Evaluate on the organization’s own repositories and issue types.	Do not buy benchmark-shaped confidence and call it deployment readiness.

This is where the paper becomes relevant beyond SWE-Bench. Many enterprise AI programs already have logs, traces, tickets, pull requests, incident reports, and postmortems. The problem is not lack of historical material. The problem is that most historical material is not agent-compatible memory. It is too verbose, too stale, too local, too political, or too ambiguous.

MemOp’s deeper lesson is that memory should be selected by demonstrated effect on the task distribution. In a business setting, that task distribution might include bug fixes, test repairs, dependency upgrades, migration work, documentation changes, code review responses, or incident remediation. Each category may need different memory acceptance criteria.

For a coding-agent platform, useful memory governance might include:

Did memory improve issue resolution rate?
Did it reduce steps to identify relevant files?
Did it reduce tool calls, runtime, or API cost?
Did it reduce variance across repeated runs?
Did it preserve repository conventions without overfitting to old tasks?
Did it create new failure modes, such as stale-path fixation or unsafe edits?
Did it remain useful after dependency, architecture, or test-suite changes?
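Several of these questions reduce to a paired comparison over the same issues, run with and without memory. A minimal sketch of such a local report follows; the `Run` fields, the metric names, and the numbers are hypothetical, and a real evaluation would need far more runs than four per arm.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Run:
    resolved: bool          # did the agent fix the issue?
    steps_to_localize: int  # steps before touching the right file
    api_cost: float         # dollars spent on the run


def summarize(runs: list[Run]) -> dict[str, float]:
    """Aggregate one arm of the comparison."""
    costs = [r.api_cost for r in runs]
    return {
        "resolution_rate": mean([1.0 if r.resolved else 0.0 for r in runs]),
        "mean_steps_to_localize": mean([r.steps_to_localize for r in runs]),
        "mean_api_cost": mean(costs),
        "cost_stdev": pstdev(costs),  # crude proxy for run-to-run variance
    }


def compare(with_memory: list[Run], without_memory: list[Run]) -> dict[str, float]:
    """Per-metric difference: with memory minus without memory."""
    a, b = summarize(with_memory), summarize(without_memory)
    return {name: a[name] - b[name] for name in a}


# Hypothetical runs on the same four issues.
without_memory = [Run(False, 12, 0.25), Run(True, 8, 0.25),
                  Run(False, 10, 0.5), Run(True, 10, 0.5)]
with_memory = [Run(True, 6, 0.25), Run(True, 6, 0.25),
               Run(False, 8, 0.25), Run(True, 8, 0.25)]

report = compare(with_memory, without_memory)
for name, delta in report.items():
    print(f"{name}: {delta:+.3f}")
```

Here a positive delta is good for resolution rate and a negative delta is good for steps, cost, and variance. The questions about stale-path fixation and post-refactor usefulness need repeated evaluation over time, which no single report will answer.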

The annoying part is that this requires measurement. The useful part is also that this requires measurement.

What Cognaptus infers, and what the paper directly proves

The paper directly shows that MemOp improves selected software-engineering agents on sampled SWE-Bench-Verified tasks under the authors’ experimental setup. It also shows that non-finetuned memory generation can degrade performance, that two-stage training improves memory quality, and that episode-level memory evolution is preferable to action-level updating in their tested setting.

Cognaptus infers a broader operating principle: persistent memory should be managed as a validated capability, not a passive archive. That inference is reasonable because the failure mode is not unique to SWE-Bench. Enterprise agents often carry forward irrelevant context, stale assumptions, and over-specific lessons. A downstream-validated memory loop is a credible way to manage that risk.

What remains uncertain is the transfer curve. The paper does not establish that MemOp will work unchanged on private monorepos, multilingual codebases, legacy enterprise systems, non-Python stacks, security-sensitive workflows, or teams with unusual review and deployment conventions. It also does not remove the cost of generating validation data. The dataset construction process uses multiple rollouts and memory candidates; that is exactly what gives the signal value, and exactly why it is not free.

A sober adoption path would therefore look like this:

Decision question	Paper-supported answer	Enterprise answer still needed
Should coding-agent memory be evaluated by downstream outcomes?	Yes. That is the core contribution.	Which outcomes matter locally: pass rate, review burden, latency, incident risk, or cost?
Is generic memory generation enough?	Often no; non-finetuned memory can hurt.	Which existing memory features are quietly degrading local workflows?
Does training a memory model help?	Yes, in the tested settings.	Is local trajectory volume sufficient to train or adapt one safely?
Should memory update after every action?	Not in this paper’s test; episode-level evolution works better.	Are there narrow workflows where finer updates are justified?
Can benchmark gains justify procurement?	Only partially.	Local benchmark validation is still required. Sorry, procurement decks.

The limitation that matters: memory can inherit the agent’s blind spots

The most important limitation is not merely that the benchmark is limited to Python GitHub issue-style tasks, though it is. The more interesting limitation is that MemOp learns from agent trajectories. If the agent fails to explore the relevant part of the codebase, its memory may encode that blind spot. A memory system cannot reliably distill lessons from evidence the agent never collected.

The authors acknowledge this: memory optimization depends partly on the intrinsic problem-solving ability and behavior of the software-engineering agent. If the agent consistently searches the wrong files, runs the wrong commands, or misunderstands the repository architecture, then the memory model is learning from a distorted experience stream. The loop can reject harmful memories, but it cannot magically observe unvisited evidence.

This matters for business deployment. A weak coding agent plus optimized memory may become a more consistent weak coding agent. That is not transformation. That is operationalized mediocrity.

The second practical limitation is rollout sensitivity. The size and quality of the training dataset depend on configuration choices: number of repositories, tasks per repository, trajectories per task, memory candidates per trajectory, accepted versus rejected samples, preference batch size, and evaluation metrics. These are not incidental parameters. They are the machinery that creates the utility signal.

The third limitation is benchmark transfer. SWE-Bench-Verified is useful, but enterprise software work is not a single benchmark. Real repositories contain proprietary conventions, brittle integration tests, generated code, legacy build systems, permissions boundaries, security requirements, and humans who insist on having opinions. A memory system that improves benchmark issue resolution may still need substantial local adaptation.

The operator’s takeaway: make memory auditable or keep it small

The industry likes persistent memory because it sounds like progress. The agent remembers. The assistant becomes stateful. The system learns from experience. Very moving. Also insufficient.

The operational question is sharper: can the system prove that what it remembers improves future work?

MemOp’s answer is to close the loop between memory generation and downstream validation. That is the right direction. It turns memory into a measurable subsystem with acceptance criteria, rejection data, training signals, and cost accounting. The gains reported in the paper are meaningful but not magical. They are strongest as evidence for the mechanism: memory works when it is filtered and optimized against outcomes.

For business leaders, the lesson is not to demand “agent memory” as a feature. Demand memory utility as an evaluation object. Ask what gets stored, why it gets stored, how it is tested, when it is removed, and whether it improves the metrics your engineering organization actually cares about.

A memory feature without downstream validation is just a junk drawer with an API.

And software teams already have enough drawers.

Cognaptus: Automate the Present, Incubate the Future.

¹ Xuehang Guo, Zora Zhiruo Wang, Qingyun Wang, Graham Neubig, and Xingyao Wang, “Enhancing Software Engineering Through Closed-Loop Memory Optimization,” arXiv:2606.05646, submitted June 4, 2026, https://arxiv.org/abs/2606.05646.
