Memory is a familiar word. That is exactly why it can mislead us.
When people hear that coding agents need “memory,” the first image is often a giant scrapbook: past prompts, previous patches, command logs, successful code snippets, failed attempts, and whatever else the agent has dragged behind it like a very confident intern with a messy backpack. More memory sounds safer. More traces sound more useful. More remembered work sounds like less repeated work.
That instinct is understandable. It is also incomplete.
The paper “Memory Transfer Learning: How Memories are Transferred Across Domains in Coding Agents” studies a sharper question: can coding agents improve by borrowing memories from different coding domains, not just from prior tasks inside the same benchmark? [1] The answer is yes, but the interesting part is not the average performance gain. The interesting part is what actually transfers.
The study suggests that the useful memory is usually not a copied solution, a clever algorithm, or a reusable code fragment. It is meta-knowledge: inspect before editing, preserve interfaces, test locally, avoid blind overwrites, respect output contracts, adapt to tooling constraints, and make small reversible changes. In other words, the agent improves less because it remembers what to write and more because it remembers how not to behave like a caffeinated Stack Overflow paste machine.
That distinction matters for any company trying to deploy coding agents beyond demos. Enterprise value will not come from dumping every agent transcript into a vector database and calling it organizational learning. It will come from deciding which experience should survive as reusable operational guidance, and which experience should be politely buried.
The paper’s core mechanism is experience transfer, not code reuse
The authors define Memory Transfer Learning as a setting where a coding agent retrieves memories generated from heterogeneous source domains and uses them while solving a new target task. The important experimental detail is that, when evaluating a benchmark, the memory pool excludes memories from that same benchmark. This prevents the study from merely showing that agents can reuse near-duplicate in-domain experience.
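The exclusion rule is easy to state in code. The sketch below is a minimal illustration, not the authors' implementation: the `Memory` class, the `build_transfer_pool` helper, and the sample memory texts are all hypothetical names and data invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    source_benchmark: str  # benchmark the memory was generated from
    text: str              # the memory content itself (hypothetical examples below)


def build_transfer_pool(memories, target_benchmark):
    """Keep only memories whose source differs from the benchmark being evaluated."""
    return [m for m in memories if m.source_benchmark != target_benchmark]


all_memories = [
    Memory("LiveCodeBench", "Write a quick self-contained test before finalizing."),
    Memory("SWE-Bench Verified", "Locate the failing module before editing."),
    Memory("MLGym-Bench", "Read the evaluation protocol before tuning."),
]

# Evaluating SWE-Bench Verified: its own memories are held out of the pool.
pool = build_transfer_pool(all_memories, target_benchmark="SWE-Bench Verified")
print([m.source_benchmark for m in pool])
```

Whatever the real pipeline looks like, the filter has to run before retrieval. Otherwise a similarity search would happily pull back near-duplicate in-domain memories.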
The evaluation spans six coding benchmarks:
| Benchmark family | Benchmarks used in the paper | What this tests |
|---|---|---|
| Function-level and competitive coding | LiveCodeBench v6, Aider Polyglot | Whether memories help with compact programming tasks and language-diverse code edits |
| Repository and terminal engineering | SWE-Bench Verified, TerminalBench2 | Whether memories help agents navigate larger codebases, shell environments, and patch workflows |
| Scientific and ML coding | ReplicationBench, MLGym-Bench | Whether memories help with research-style coding tasks where environment, data, and evaluation routines matter |
The agent uses four memory formats, moving from low abstraction to high abstraction:
| Memory format | What it stores | Why it might help | Why it can hurt |
|---|---|---|---|
| Trajectory | Raw command and observation history | Concrete evidence of what happened | Brittle imitation of irrelevant commands |
| Workflow | Extracted reusable action sequence | A cleaner operational pattern | Still tied to source-task structure |
| Summary | Condensed account of task, actions, results, and analysis | Balances specificity and interpretation | May omit operational details needed later |
| Insight | Generalized lesson without source-task details | Transfers procedural guidance across domains | Can become too vague if poorly generated |
This design lets the paper separate two questions that are often blurred together. First, does cross-domain memory help? Second, what kind of memory is actually transferable?
The first answer is empirical. The second answer is operational.
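One way to picture the four formats is as a single record type with an ordered abstraction level. This is a sketch under assumptions: the `MemoryFormat` and `MemoryRecord` names are invented here, and the sample contents are hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass
from enum import IntEnum


class MemoryFormat(IntEnum):
    """Four formats, ordered from low to high abstraction."""
    TRAJECTORY = 1  # raw command and observation history
    WORKFLOW = 2    # extracted reusable action sequence
    SUMMARY = 3     # condensed task, actions, results, analysis
    INSIGHT = 4     # generalized lesson without source-task details


@dataclass
class MemoryRecord:
    fmt: MemoryFormat
    source_benchmark: str
    content: str


# One hypothetical episode rendered at each abstraction level.
episode = [
    MemoryRecord(MemoryFormat.SUMMARY, "LiveCodeBench",
                 "Task: parse intervals. First attempt failed an edge case; a local test exposed it."),
    MemoryRecord(MemoryFormat.TRAJECTORY, "LiveCodeBench",
                 "$ python solve.py < sample.txt  ->  wrong answer on case 3"),
    MemoryRecord(MemoryFormat.INSIGHT, "LiveCodeBench",
                 "Write a small self-contained check before declaring a fix done."),
    MemoryRecord(MemoryFormat.WORKFLOW, "LiveCodeBench",
                 "read spec -> write solution -> run samples -> fix -> rerun"),
]

# Sort from most to least abstract.
by_abstraction = sorted(episode, key=lambda r: r.fmt, reverse=True)
print([r.fmt.name for r in by_abstraction])
```

Using an ordered enum makes "more abstract" a comparable property, which is convenient if a system later wants to prefer higher-abstraction memories at retrieval time.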
The headline gain is useful, but the ranking of memory types is the real story
The main result reports Pass@3 performance. For GPT-5-mini, the zero-shot average across the six benchmarks is 0.523. Memory Transfer Learning improves the average for every memory type, but the more abstract formats perform better:
| Method | Average Pass@3 |
|---|---|
| Zero-shot | 0.523 |
| MTL with Trajectory memory | 0.534 |
| MTL with Workflow memory | 0.538 |
| MTL with Summary memory | 0.546 |
| MTL with Insight memory | 0.560 |
That final row is the paper’s headline: Insight memory improves the average by 3.7 percentage points over zero-shot for GPT-5-mini. The authors also report gains for other models: DeepSeek V3.2 improves by 2.6 points on average with Insight memory, and Qwen3-Coder-480B-A35B-Instruct improves by 1.8 points.
Those are not magical numbers. They are modest, and they should be read as benchmark improvements under the paper’s setup rather than as direct estimates of production ROI. Still, they matter because the gain appears across heterogeneous tasks and across multiple models. That is the first sign that the memories are not merely leaking task-specific code.
The ranking matters more than the headline. If memory were mainly useful as a store of concrete solutions, Trajectory should be the hero. It is not. Insight performs best; Summary comes next; Workflow and Trajectory trail behind. The agent benefits most when prior experience has been abstracted away from its original task.
That is the mechanism-first reading of the paper: transfer does not come from preserving detail. Transfer comes from removing the wrong detail.
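The gains and the ranking can be recomputed directly from the GPT-5-mini table above. The snippet uses only the averages quoted in this article; the variable names are incidental.

```python
# Average Pass@3 for GPT-5-mini, as quoted in the table above.
pass_at_3 = {
    "Zero-shot": 0.523,
    "Trajectory": 0.534,
    "Workflow": 0.538,
    "Summary": 0.546,
    "Insight": 0.560,
}

baseline = pass_at_3["Zero-shot"]

# Gain over zero-shot, in percentage points, rounded to one decimal.
gains = {
    name: round((score - baseline) * 100, 1)
    for name, score in pass_at_3.items()
    if name != "Zero-shot"
}

# Memory formats ordered from largest to smallest gain.
ranking = sorted(gains, key=gains.get, reverse=True)

for name in ranking:
    print(f"{name}: +{gains[name]} points")
```

The ordering that falls out is Insight, Summary, Workflow, Trajectory, which is the abstraction ordering in reverse.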
What transfers is mostly working discipline
The authors inspect cases where the zero-shot agent failed but the memory-augmented agent succeeded, then categorize the contribution of transferred memories. The largest categories are not “remembered algorithms.” They are behavioral and procedural.
| Memory benefit category | Share reported in the paper’s Figure 3 | Practical interpretation |
|---|---|---|
| Iterative workflow discipline | 15.0% | Inspect, edit, verify, repeat instead of attempting one heroic patch |
| Test-driven verification | 14.5% | Build local checks, reproduction scripts, or smoke tests before trusting the fix |
| Anti-pattern avoidance | 14.4% | Avoid known bad moves such as blind overwrites or brittle hardcoding |
| Input validation and robustness | 10.4% | Handle edge cases and heterogeneous inputs more carefully |
| API and interface compliance | 9.5% | Preserve expected signatures, schemas, and framework contracts |
| Interaction protocol adherence | 8.5% | Respect benchmark or tool-specific completion and formatting rules |
| Environmental adaptation | 8.1% | Cope with missing packages, toolchain quirks, shells, and runtime constraints |
| File and syntax management | 7.8% | Use safer file manipulation and quoting patterns |
| Repository exploration tactics | 6.4% | Locate relevant files and dependencies before editing |
| Algorithmic strategy transfer | 5.5% | Transfer actual algorithms or data-structure choices |
This table is the part a business reader should not skip. The model did not become better mainly because it imported brilliant programming ideas from another benchmark. The explicitly algorithmic category is the smallest one reported. The agent became better because the retrieved memories nudged it toward safer engineering behavior.
A case study makes this concrete. A memory generated from LiveCodeBench tells the agent to create quick self-contained tests using an inline Python here-doc to validate fixes. The target task is in SWE-Bench Verified, involving a Django codebase. The useful transfer is not Django knowledge and not the exact code from the source task. It is a validation habit: when making a small code fix, create a minimal check that exercises the behavior before declaring victory.
That is a mundane lesson. It is also exactly the kind of lesson that prevents expensive agent failures. Apparently, “run a small test before you celebrate” remains a breakthrough technology.
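For readers who like to check the arithmetic, the Figure 3 shares quoted above can be tallied in a few lines. The numbers are copied from the table; as quoted they sum to 100.1, which presumably reflects rounding in the reported percentages.

```python
# Benefit-category shares (percent), as quoted in the table above.
shares = {
    "Iterative workflow discipline": 15.0,
    "Test-driven verification": 14.5,
    "Anti-pattern avoidance": 14.4,
    "Input validation and robustness": 10.4,
    "API and interface compliance": 9.5,
    "Interaction protocol adherence": 8.5,
    "Environmental adaptation": 8.1,
    "File and syntax management": 7.8,
    "Repository exploration tactics": 6.4,
    "Algorithmic strategy transfer": 5.5,
}

smallest = min(shares, key=shares.get)
algorithmic = shares["Algorithmic strategy transfer"]

# Everything except the explicitly algorithmic category.
everything_else = round(sum(shares.values()) - algorithmic, 1)

# Combined weight of the three largest categories.
top_three = round(sum(sorted(shares.values(), reverse=True)[:3]), 1)

print(smallest, algorithmic, everything_else, top_three)
```

The three largest categories alone (workflow discipline, test-driven verification, anti-pattern avoidance) outweigh the algorithmic category roughly eight to one.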
Abstraction wins because it gives direction without handcuffs
The paper’s most important design principle is that abstraction governs transferability. The authors support this in three ways.
First, the embedding visualizations show a clear pattern. Task embeddings cluster by benchmark. Workflow memories still show benchmark-level structure. Summary memories become more mixed. Insight memories are sparse and intermingled across benchmarks. The visualization is not meant to prove performance by itself; it shows that higher-abstraction memories are less tied to their source domain.
Second, the performance ranking follows the same direction: Insight outperforms Summary, Summary outperforms Workflow, and Workflow outperforms Trajectory on the GPT-5-mini average. The evidence is not just aesthetic clustering. The more generalized representation also works better.
Third, the authors run a more controlled test inside the Insight format. They ask an LLM to infer the original task from each Insight memory. If the original task is easy to infer, the memory is treated as more task-specific; if it is hard to infer, the memory is treated as more task-agnostic. They then compare the 30% most task-specific Insights against the 30% most task-agnostic Insights.
| Insight subset | LiveCodeBench | SWE-Bench Verified | ReplicationBench | Average |
|---|---|---|---|---|
| Task-specific Insights | 0.887 | 0.617 | 0.067 | 0.523 |
| Task-agnostic Insights | 0.893 | 0.627 | 0.082 | 0.534 |
| Difference | +0.6 points | +1.0 point | +1.5 points | +1.1 points |
This is not the main headline result. It is an ablation-style test aimed at isolating whether abstraction itself matters, rather than merely the label “Insight.” The result supports the mechanism: even within the same memory format, the more task-agnostic memories transfer better.
The practical lesson is precise. A useful memory should tell the agent what kind of reasoning or procedure to apply, while leaving implementation details open. “Inspect the evaluation protocol before optimizing the model” is transferable. “Run this exact command with this exact library argument from another task” is a trap wearing a name tag.
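The task-specific versus task-agnostic split can be sketched as a simple ranking step. This assumes each Insight already carries a numeric "inferability" score, where a higher score means the original task was easier to guess. How the paper's LLM judge actually scores memories is not described here, so the scores, the `split_by_specificity` helper, and the sample data are all hypothetical.

```python
def split_by_specificity(scored_insights, percent=30):
    """Split insights into the most task-specific and most task-agnostic groups.

    scored_insights: list of (insight_text, inferability_score) pairs, where a
    higher score means the source task was easier to infer (more task-specific).
    """
    ranked = sorted(scored_insights, key=lambda pair: pair[1], reverse=True)
    k = max(1, len(ranked) * percent // 100)  # integer math avoids float surprises
    task_specific = [text for text, _ in ranked[:k]]
    task_agnostic = [text for text, _ in ranked[-k:]]
    return task_specific, task_agnostic


# Ten hypothetical insights with made-up inferability scores.
scored = [(f"insight-{i}", score) for i, score in enumerate(
    [0.91, 0.15, 0.72, 0.33, 0.88, 0.05, 0.64, 0.27, 0.49, 0.10]
)]

specific, agnostic = split_by_specificity(scored)
print("task-specific:", specific)
print("task-agnostic:", agnostic)
```

The middle 40% is deliberately left out of both groups, which keeps the comparison between clearly different ends of the spectrum.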
Low-level memories fail by becoming anchors
The paper does not present memory as a free upgrade. It explicitly examines negative transfer: cases where zero-shot succeeds but memory-augmented inference fails. The failure modes are worth taking seriously because they are exactly the failure modes that would appear in enterprise agent systems.
The authors identify three major categories:
| Negative transfer mode | What happens | Business analogue |
|---|---|---|
| Domain-mismatched anchoring | A superficially similar memory pushes the agent toward the wrong assumptions | A frontend fix pattern contaminates a backend migration task |
| False validation confidence | A memory about testing causes the agent to trust weak checks | A smoke test passes while the formal acceptance criteria fail |
| Misapplied best-practice transfer | A generally good pattern is used where the new task needs different semantics | “Always refactor” becomes “break the stable legacy interface” |
The appendix examples are almost painfully realistic. In one case, a memory from an R workflow encourages the agent to overwrite files in a C++ project without properly checking the existing structure and namespaces. In another, an Insight about pre-flight verification gets distorted into an excuse for a short smoke test when the task actually requires producing high-quality trained checkpoints.
The problem is not that the remembered principle is always bad. The problem is that the agent adapts it badly. Memory is not knowledge unless the agent can decide where the analogy ends.
That is why raw traces are dangerous. A trajectory carries many source-domain assumptions: file layout, command syntax, package versions, evaluation routines, language conventions, and one-off debugging hacks. When retrieved into the wrong target context, those details can become anchors. The agent does not just remember; it imitates.
For production systems, this means logging everything is not the same as learning. A full transcript may be valuable for audit, debugging, or postmortem analysis. But as retrieved guidance for future agents, it can be too specific. The memory layer needs distillation.
The comparison with self-evolving baselines is about memory quality, not just memory quantity
The paper compares Memory Transfer Learning with two self-evolving approaches, ReasoningBank and AgentKB, on three benchmarks: LiveCodeBench v6, SWE-Bench Verified, and ReplicationBench.
| Method | Number of memories | LCB | SWE-Bench Verified | ReplicationBench | Average |
|---|---|---|---|---|---|
| Zero-shot | — | 0.910 | 0.730 | 0.111 | 0.584 |
| ReasoningBank | 97 | 0.920 | 0.750 | 0.133 | 0.601 |
| AgentKB | 5,899 | 0.920 | 0.720 | 0.200 | 0.613 |
| MTL | 431 | 0.930 | 0.770 | 0.189 | 0.630 |
This test is a comparison with prior memory-based systems, not a proof that the proposed method dominates every possible memory architecture. The setup is narrower: three benchmarks, Pass@3, and the authors’ implementation choices. Still, the result is useful because it pushes against the common intuition that more memories are automatically better.
AgentKB uses roughly 5.9k memories and still underperforms the paper’s MTL setup with 431 memories. The lesson is not that small memory pools always beat large ones. The paper later shows that larger and more diverse memory pools can help. The lesson is that memory quality and domain relevance matter. A large pool of weakly relevant memories can become an expensive distraction engine.
For companies, this is the difference between a knowledge layer and a landfill.
The scaling result says diversity helps, but retrieval remains the bottleneck
The paper also tests how performance changes with memory pool size and number of source domains. These are sensitivity tests, not a second thesis. Their purpose is to check whether the gains depend on a tiny hand-picked pool or whether broader memory coverage helps.
The reported pattern is positive: average performance improves as the memory pool grows, and using more source domains generally increases the gain, with nine domains giving the best overall performance in the authors’ test. The interpretation is intuitive. A larger and more diverse pool increases the chance that retrieval finds useful meta-knowledge for a new task.
But the same section should make operators cautious. More memory also increases the chance of retrieving a misleading analogy. The paper’s negative transfer analysis and retrieval experiments make that boundary clear.
The retrieval comparison is especially interesting:
| Retrieval method | LCB | SWE-Bench Verified | ReplicationBench | Average |
|---|---|---|---|---|
| No memory | 0.910 | 0.730 | 0.111 | 0.584 |
| LLM reranking | 0.920 | 0.730 | 0.144 | 0.598 |
| Adaptive rewriting | 0.920 | 0.760 | 0.144 | 0.608 |
| Embedding similarity | 0.930 | 0.770 | 0.189 | 0.630 |
The more elaborate retrieval strategies underperform simple embedding similarity in this experiment. That does not mean reranking and rewriting are useless in general. It does mean static retrieval methods may struggle in dynamic agent settings, where the useful memory may depend on intermediate observations that are not visible at the initial prompt.
A coding agent does not know everything it needs at the beginning. It discovers the repository structure, error messages, hidden constraints, package versions, and test behavior along the way. A retrieval method that selects three memories before the agent has touched the environment is making a very early bet. Sometimes that bet is good. Sometimes it retrieves a beautifully irrelevant memory and the agent salutes.
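To make the "very early bet" concrete, here is a minimal sketch of up-front embedding-similarity retrieval: rank memories by cosine similarity to the task embedding and keep the top three. The vectors are hand-written toy values rather than output from a real embedding model, and the `cosine` and `retrieve` helpers and the memory texts are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(task_vec, memory_bank, k=3):
    """Return the k memory texts most similar to the task embedding."""
    ranked = sorted(memory_bank, key=lambda item: cosine(task_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy 3-dimensional "embeddings"; a real system would use an embedding model.
memory_bank = [
    ("Run a minimal reproduction before patching", [0.9, 0.1, 0.0]),
    ("Preserve public function signatures", [0.7, 0.6, 0.1]),
    ("Check the evaluation script before tuning", [0.1, 0.2, 0.9]),
    ("Quote shell paths that contain spaces", [0.0, 0.9, 0.3]),
]

task_vec = [1.0, 0.2, 0.0]  # embedding of the initial task prompt only
top3 = retrieve(task_vec, memory_bank)
print(top3)
```

Note what the query contains: the initial prompt and nothing else. Anything the agent learns later, such as the actual stack or the real error message, never reaches this function unless retrieval is run again.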
Cross-model memory transfer is promising, but self-generated memories still fit best
The paper tests whether memories generated by one model can help another model. This matters because enterprise deployments rarely run a single model forever. Teams may use a frontier model for complex tasks, an open model for cost-sensitive workflows, and smaller models for routine automation.
The cross-model tests use Average Pass@1 results on LiveCodeBench, SWE-Bench Verified, and ReplicationBench. The broad finding is that cross-model memories can beat zero-shot baselines. Memories generated by GPT-5-mini can help DeepSeek V3.2 and Qwen3-Coder; memories from other models can also help GPT-5-mini.
But self-generated memories generally perform best. For example, GPT-5-mini’s own memories produce an average of 0.543, while memories from DeepSeek V3.2 and Qwen3-Coder produce 0.518 and 0.528 respectively against a zero-shot baseline of 0.515. DeepSeek V3.2 similarly performs best with its own memories. Qwen3-Coder ties in average between GPT-5-mini-source and self-source memories in the reported table, but the broader pattern still supports model-specific fit.
This is an exploratory extension. It supports the claim that meta-knowledge is partly model-agnostic, but it also warns that memory style may encode model-specific habits. A memory generated by one model may assume a planning style, verbosity level, tool-use pattern, or risk tolerance that another model does not naturally execute.
For business use, this suggests a layered design. Some memory should be shared across models: environment constraints, interface rules, validation routines, deployment conventions, and known anti-patterns. Other memory may need model-specific adaptation: how much instruction detail the model needs, which tools it tends to overuse, when it requires stronger guardrails, and which failure modes recur in its own traces.
What this changes for enterprise coding-agent design
The paper directly shows that cross-domain memory can improve benchmark performance for coding agents, especially when memories are abstract and insight-like. It directly shows that transferred value is mostly procedural meta-knowledge rather than task-specific code. It also directly shows that memory can harm performance when retrieval or adaptation goes wrong.
The business inference is that agent memory should be treated as an operational knowledge system, not as chat-history persistence.
A practical enterprise memory layer should probably separate at least four stores:
| Memory store | What belongs there | How it should be used |
|---|---|---|
| Audit trace | Full prompts, commands, diffs, outputs, tool logs | Compliance, debugging, incident review; not automatically retrieved as guidance |
| Distilled insight | General lessons from successful and failed agent runs | Retrieved into future tasks as high-level procedural guidance |
| Environment rulebook | Stack-specific constraints, package quirks, CI requirements, deployment conventions | Retrieved when the current repository or runtime matches the environment |
| Failure anti-patterns | Known bad behaviors and their consequences | Used as guardrails, especially before risky edits or final submission |
The distinction matters because each store has a different risk profile. Audit traces need completeness. Retrieved memories need selectivity. Environment rules need freshness. Anti-patterns need severity labels. Combining them all into one undifferentiated vector soup is tempting, cheap, and exactly the sort of thing that produces impressive demos until the agent confidently ports an R file-writing trick into a C++ header.
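A small routing sketch shows how the separation might look in practice. It is an illustration of the table above, not a design from the paper: the `Store` enum, the `RETRIEVABLE` set, the `guidance_for` helper, and the environment tags are all assumptions made for this example.

```python
from enum import Enum


class Store(Enum):
    AUDIT_TRACE = "audit_trace"
    DISTILLED_INSIGHT = "distilled_insight"
    ENVIRONMENT_RULEBOOK = "environment_rulebook"
    FAILURE_ANTI_PATTERNS = "failure_anti_patterns"


# Stores that may be surfaced to an agent as guidance.
# Audit traces are kept for compliance and debugging, not retrieved by default.
RETRIEVABLE = {
    Store.DISTILLED_INSIGHT,
    Store.ENVIRONMENT_RULEBOOK,
    Store.FAILURE_ANTI_PATTERNS,
}


def guidance_for(task_env, entries):
    """Select guidance texts for a task running in `task_env`.

    entries: list of (store, env_tag, text); env_tag is None when not applicable.
    Environment rules are only returned when their tag matches the task.
    """
    selected = []
    for store, env_tag, text in entries:
        if store not in RETRIEVABLE:
            continue
        if store is Store.ENVIRONMENT_RULEBOOK and env_tag != task_env:
            continue
        selected.append(text)
    return selected


entries = [
    (Store.AUDIT_TRACE, None, "full transcript of run 1042"),
    (Store.DISTILLED_INSIGHT, None, "Inspect the file before overwriting it."),
    (Store.ENVIRONMENT_RULEBOOK, "cpp", "Headers live under include/; keep namespaces intact."),
    (Store.ENVIRONMENT_RULEBOOK, "r", "Write outputs with the project's file helper."),
    (Store.FAILURE_ANTI_PATTERNS, None, "Do not treat a passing smoke test as acceptance."),
]

print(guidance_for("cpp", entries))
```

With this shape, the R file-writing rule simply never reaches a C++ task, and the audit trace stays where auditors can find it.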
The paper also implies a shift in ROI thinking. The value of memory is not just cheaper inference or fewer repeated prompts. The more important value is lower failure probability in long-horizon coding tasks: fewer broken interfaces, fewer invalid patches, fewer fake validations, fewer environment errors, and fewer “it worked in the agent’s imagination” moments.
That is not glamorous. It is also where most production losses live.
A useful evidence map for operators
Not every result in the paper should be used the same way. Some findings are main evidence; others are ablations, robustness checks, or exploratory extensions. Mixing them together makes the paper sound more certain than it is.
| Paper component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Six-benchmark Pass@3 evaluation | Main evidence | Cross-domain memory can improve coding-agent performance | Production ROI across all enterprise repositories |
| Four memory-format comparison | Main mechanism evidence | More abstract memories transfer better than raw traces | That Insight is the final optimal memory schema |
| Benefit-category analysis | Mechanism interpretation | Gains mostly come from procedural meta-knowledge | Exact causal shares in every deployment setting |
| Task-specific vs task-agnostic Insight split | Ablation-style isolation | Abstraction itself contributes to transfer effectiveness | That all details should always be removed |
| Pass@1 appendix results | Robustness/sensitivity check | Gains are not limited to Pass@3 | That gains are large under every metric |
| Memory pool and domain scaling | Sensitivity test | More diverse pools can improve average performance | Unlimited scaling without retrieval risk |
| Cross-model transfer | Exploratory extension | Some memories are model-agnostic | Full portability of memory between model families |
| Retrieval method comparison | Implementation analysis | Static reranking/rewriting may underperform in agent settings | That simple embedding retrieval is always best |
For enterprise readers, this table is the antidote to both hype and dismissal. The paper is stronger than “memory helps a bit.” It is weaker than “just add a memory pool and your agents are reliable now.” The useful middle ground is where systems actually get built.
The deployment boundary: memory needs governance before scale
The paper’s limitations are not decorative. They affect how the result should be used.
First, the experiments are benchmark-based. The authors sample tasks and use benchmark evaluation protocols. That is appropriate for research, but business repositories have messier objectives: customer-specific constraints, legacy code, security policies, non-deterministic tests, proprietary build systems, and humans who become upset when an agent “almost” preserves an API.
Second, the memory generation process relies on LLM-generated representations and an LLM judge. That is practical, but it means memory quality depends on the model’s ability to summarize success and failure correctly. A bad memory is worse than no memory because it arrives with the authority of experience.
Third, retrieval happens at the beginning of inference. In real software work, relevance often emerges after inspection. The agent may need step-wise retrieval: initial memories for broad discipline, then repository-specific memories after it discovers the stack, error messages, and test layout.
Fourth, abstraction is not the same as vagueness. A memory such as “be careful and test things” is technically abstract and operationally useless. The best memories in this framing are specific about procedure but abstract about source-task details. They say what to do next without pretending the current task is the old task.
That is the governance challenge. A memory layer needs quality controls: source task, success/failure label, abstraction level, environment tags, confidence score, age, known failure modes, and retrieval history. It also needs pruning. Stale memory is just technical debt with a vector embedding.
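The quality controls listed above map naturally onto a record with metadata plus a pruning pass. The sketch below is one possible shape under stated assumptions: the `GovernedMemory` fields mirror the list in the text, while the `prune` helper and its age and confidence thresholds are hypothetical choices, not values from the paper.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class GovernedMemory:
    text: str
    source_task: str
    succeeded: bool              # success/failure label of the source run
    abstraction_level: str       # e.g. "insight" or "summary"
    environment_tags: list
    confidence: float            # 0.0 to 1.0
    created: date
    known_failure_modes: list = field(default_factory=list)
    retrieval_count: int = 0     # simple stand-in for retrieval history


def prune(memories, today, max_age_days=365, min_confidence=0.5):
    """Drop memories that are too old or too low-confidence (thresholds are illustrative)."""
    return [
        m for m in memories
        if (today - m.created).days <= max_age_days and m.confidence >= min_confidence
    ]


memories = [
    GovernedMemory("Reproduce the failure locally first.", "task-a", True,
                   "insight", ["python"], 0.8, date(2026, 1, 10)),
    GovernedMemory("Pin the build tool version.", "task-b", True,
                   "summary", ["cpp"], 0.9, date(2024, 3, 1)),
    GovernedMemory("Skip tests when time is short.", "task-c", False,
                   "insight", ["python"], 0.2, date(2026, 5, 1)),
]

kept = prune(memories, today=date(2026, 6, 1))
print([m.text for m in kept])
```

A real pruning policy would likely weigh retrieval history and known failure modes as well. Age and confidence are just the two easiest knobs to demonstrate.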
The better memory system is a learning organization in miniature
The most useful business reading of this paper is not that coding agents need larger memories. It is that they need better-shaped memories.
Human engineering teams already know this. A useful postmortem does not say, “Here is every terminal command Bob ran at 2:13 a.m.” It says: check the migration order before touching the schema; preserve the public interface; reproduce the failure locally; do not trust the mock service for the billing path; document the rollback path before deployment. That is experience made portable.
Memory Transfer Learning brings that same idea into coding agents. The agent’s past work becomes valuable only after it is transformed from event history into operational principle.
So the real takeaway is not “coding agents should remember more.” That is the shallow version. The better takeaway is this:
Coding agents should remember less raw behavior, more transferable discipline, and exactly enough context to know when a lesson no longer applies.
The companies that understand this will not treat memory as a feature checkbox. They will treat it as infrastructure: part knowledge management, part retrieval system, part safety layer, part institutional habit. The companies that do not will build agents with enormous autobiographies and very little judgment.
Which, to be fair, is also a recognizable management style.
Cognaptus: Automate the Present, Incubate the Future.
[1] Kangsan Kim, Minki Kang, Taeil Kim, Yanlai Yang, Mengye Ren, and Sung Ju Hwang, “Memory Transfer Learning: How Memories are Transferred Across Domains in Coding Agents,” arXiv:2604.14004v1, 15 April 2026, arXiv HTML version.