Agents With Memory: Turning Execution Logs into Institutional Knowledge

Logs are where automation failures usually go to become archaeology.

A business deploys an AI agent. The agent calls APIs, checks intermediate states, makes assumptions, retries after errors, occasionally succeeds by accident, and sometimes discovers a genuinely efficient route through a workflow. The full execution trace is stored somewhere. In theory, this is valuable evidence. In practice, it often becomes a swamp: too verbose for managers, too unstructured for engineers, and too raw for the next agent run.

The arXiv paper “Trajectory-Informed Memory Generation for Self-Improving Agent Systems” argues that this is the wrong ending for agent logs.¹ The paper’s central idea is simple but operationally important: an agent should not merely remember what happened. It should convert what happened into structured guidance about what to repeat, what to avoid, and when the lesson applies.

That distinction matters. A vector database full of previous traces is not institutional knowledge. It is a filing cabinet with embeddings. Useful agent memory needs diagnosis.

The paper proposes a framework that turns execution trajectories into typed “tips”: strategy tips from clean successes, recovery tips from failure-then-recovery sequences, and optimization tips from successful but inefficient runs. These tips are then generalized, consolidated, stored with metadata and provenance, retrieved for future tasks, and inserted into the agent prompt before execution.

So the paper is not really about giving agents a diary. It is about giving them a postmortem system that writes concise operating notes for the next shift.

The mechanism is the contribution: logs become diagnosis, then guidance

The paper’s real contribution is not a single memory component. It is the full conversion pipeline:

raw execution trajectory
→ semantic analysis of reasoning and actions
→ causal attribution of failures, recoveries, and inefficiencies
→ typed learning tips
→ generalized and consolidated memory entries
→ contextual retrieval at runtime
→ improved future execution

This is why “memory” is a slightly misleading word here. Memory sounds passive. The framework is active: it inspects, attributes, abstracts, curates, and retrieves.

The authors describe four technical components, but operationally the system works in three phases.

Phase	What happens	Business translation
Trajectory analysis and tip extraction	The system reads completed agent traces, classifies reasoning patterns, identifies outcomes, and extracts strategy, recovery, or optimization tips.	Execution logs become structured lessons instead of unreadable transcripts.
Tip storage and management	Tips are generalized, clustered, deduplicated, merged, and stored with embeddings plus metadata such as category, source trajectories, task context, and priority.	Memory becomes governable: searchable, auditable, and less redundant.
Runtime retrieval	A future task retrieves relevant tips using cosine similarity or LLM-guided selection, then injects them into the prompt as guidelines.	The next agent run receives targeted operating guidance before it starts making expensive little mistakes.
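
The last phase, prompt injection, can be sketched in a few lines. This is a minimal illustration that assumes tips arrive as plain strings already selected by retrieval; the function name `build_prompt` and the wording of the guideline header are hypothetical, not the paper's format.

```python
def build_prompt(task: str, tips: list[str]) -> str:
    """Prepend retrieved tips to the task as numbered guidelines.

    Hypothetical layout: the paper injects tips into the prompt as
    guidelines, but this exact format is an assumption.
    """
    if not tips:
        # No relevant memory: leave the task prompt untouched.
        return task
    guidelines = "\n".join(f"{i}. {tip}" for i, tip in enumerate(tips, start=1))
    return f"Guidelines from previous runs:\n{guidelines}\n\nTask: {task}"


prompt = build_prompt(
    "Check out the items in my cart.",
    ["Verify all prerequisites before starting checkout."],
)
```

Returning the bare task when no tips are selected keeps the no-memory path identical to the baseline agent.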

The paper’s framing is useful because it corrects a common misconception: agent memory is not just a vector store attached to a chatbot. That kind of memory is good for remembering user preferences, entities, facts, and relationships. But an enterprise agent executing workflows needs procedural and diagnostic memory. It needs to know that a checkout operation failed because payment prerequisites were not verified, that retrying without fixing credentials is pointless, or that removing cart items one by one is a charmingly inefficient way to burn API calls.

The system therefore treats trajectories as evidence. A trajectory is not merely a transcript of actions; it is a record of reasoning quality.

Why raw trajectories are too crude for enterprise memory

The paper identifies a practical problem that many agent demos politely ignore: valuable learning does not come only from failures.

A failed execution may teach what to avoid. A recovered execution may teach how to detect and fix a problem. A clean success may teach a reusable strategy. A successful but clumsy execution may teach an optimization opportunity. These are different kinds of knowledge.

Putting all of them into the same “memory” bucket is like storing incident reports, training manuals, and cost-reduction ideas in one spreadsheet column named notes. Technically possible. Not a management system.

The authors separate learning into three tip categories:

Tip type	Source trajectory	What it captures	Example business meaning
Strategy tip	Clean success	A pattern worth repeating	“Verify all prerequisites before starting checkout.”
Recovery tip	Failure followed by correction	A failure signal and corrective sequence	“If authentication fails, re-check app-specific credentials before retrying.”
Optimization tip	Successful but inefficient execution	A cheaper or cleaner way to complete the task	“Use a bulk operation instead of looping through item-level calls.”

This categorization is more than neat labeling. It changes how the memory can be used. A recovery tip should surface when a task is likely to encounter an error pattern. A strategy tip should appear before a complex multi-step workflow. An optimization tip may matter more when cost, latency, or API quotas are important.
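
The mapping from trajectory outcome to tip type in the table above can be sketched as follows. The three boolean flags are hypothetical labels that an upstream analyzer would have to produce, and the precedence (recovery before optimization) is an assumption of this sketch rather than something the paper specifies.

```python
from enum import Enum
from typing import Optional


class TipType(Enum):
    STRATEGY = "strategy"
    RECOVERY = "recovery"
    OPTIMIZATION = "optimization"


def tip_type_for(succeeded: bool, recovered_from_error: bool,
                 inefficient: bool) -> Optional[TipType]:
    """Map a trajectory outcome to one of the three tip categories."""
    if not succeeded:
        # The table above does not cover unrecovered failures,
        # so this sketch assigns them no tip type.
        return None
    if recovered_from_error:
        return TipType.RECOVERY       # failure followed by correction
    if inefficient:
        return TipType.OPTIMIZATION   # successful but wasteful
    return TipType.STRATEGY           # clean success
```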

The paper also emphasizes provenance. Each memory entry links back to the trajectory or trajectories from which it was derived. That is boring in the best possible way. Without provenance, memory becomes folklore: “the agent says this is a good idea, but nobody knows why.” With provenance, teams can audit guidance, investigate bad tips, and test whether similar failures decline after the tip is deployed.
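
Provenance is easiest to appreciate as a lookup. The sketch below assumes each tip carries a set of source trajectory IDs; the `Tip` fields, the helper name, and the IDs are illustrative, not the paper's schema.

```python
from dataclasses import dataclass, field


@dataclass
class Tip:
    text: str
    category: str  # "strategy", "recovery", or "optimization"
    source_trajectories: set[str] = field(default_factory=set)


def tips_from_trajectory(tips: list[Tip], trajectory_id: str) -> list[Tip]:
    """Return every tip whose provenance includes the given trajectory."""
    return [t for t in tips if trajectory_id in t.source_trajectories]


# Toy store with made-up trajectory IDs.
store = [
    Tip("Verify payment method before checkout.", "strategy", {"traj-001", "traj-007"}),
    Tip("Re-check app credentials before retrying login.", "recovery", {"traj-007"}),
]
suspect = tips_from_trajectory(store, "traj-007")
```

If a trajectory later turns out to be a bad example, this kind of query identifies which tips to review or retire.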

Enterprise AI needs more boring systems like this.

Causal attribution is the difference between logging and learning

One of the strongest parts of the framework is decision attribution.

Raw logs show sequence. They do not automatically show cause. If an agent fails at step 15, the decisive mistake may have occurred at step 3, when it assumed an API existed, skipped prerequisite verification, or used the wrong account context. A normal log viewer can show the corpse. It does not always identify the murder weapon.

The paper’s Decision Attribution Analyzer tries to trace outcomes back through the reasoning chain. For failures, it distinguishes immediate, proximate, and root causes. For recoveries, it identifies what went wrong, how the agent noticed, what corrective action worked, and why. For inefficiencies, it asks what made the execution suboptimal and what better alternative existed. For successes, it identifies which strategies contributed to clean completion.

This is the mechanism that makes the memory entries actionable. “The task failed” is not a lesson. “The agent attempted checkout before verifying payment configuration; future checkout tasks should call the payment-method API before checkout” is a lesson.
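
One way to picture the step from attribution to lesson is a small record type. The field names below mirror the immediate, proximate, and root cause distinction described above, but the structure, the `root_step` field, and the `to_tip` phrasing are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureAttribution:
    immediate_cause: str    # what failed at the point of error
    proximate_cause: str    # the decision that led directly to it
    root_cause: str         # the earliest decision that made failure likely
    root_step: int          # hypothetical: index of the step holding the root cause
    corrective_action: str  # what a future run should do instead


def to_tip(attr: FailureAttribution) -> str:
    """Phrase an attribution as guidance rather than as an incident report."""
    return f"Because {attr.root_cause}, {attr.corrective_action}."


attr = FailureAttribution(
    immediate_cause="checkout API returned an error",
    proximate_cause="checkout was attempted without a valid payment method",
    root_cause="payment configuration was never verified",
    root_step=3,
    corrective_action="call the payment-method API before checkout",
)
tip_text = to_tip(attr)
```

The tip is built from the root cause, not the immediate error, which is what separates it from a log line.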

The difference is not semantic decoration. It is the difference between a dashboard and an improvement loop.

Subtask-level memory is where reuse becomes plausible

The paper compares two extraction granularities: task-level tips and subtask-level tips.

Task-level tips treat the whole execution as one unit. They capture end-to-end workflow strategies. This is useful when the future task resembles the original task closely.

Subtask-level tips first segment the trajectory into logical phases: authentication, data retrieval, pagination, processing, validation, task completion, and so on. Tips are then extracted for each subtask. This matters because many enterprise workflows look different at the surface but reuse the same internal moves.

A Spotify recommendation task and a phone alarm task are not semantically similar as user requests. But both may involve authentication, API discovery, state retrieval, filtering, and final update. If memory remains at the whole-task level, the system may miss this transfer. If memory is decomposed into subtasks, an authentication lesson from one app can help another app.
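
Segmentation itself is mechanically simple once each step has a phase label. The sketch below assumes those labels already exist (in practice they would come from an LLM-based analyzer) and uses made-up action strings.

```python
from itertools import groupby

# Hypothetical trajectory: (phase label, action) pairs.
steps = [
    ("authentication", "login(user)"),
    ("authentication", "fetch_token()"),
    ("data_retrieval", "list_songs(page=0)"),
    ("data_retrieval", "list_songs(page=1)"),
    ("processing", "filter_by_genre()"),
    ("completion", "update_playlist()"),
]


def segment(trajectory):
    """Group consecutive steps that share a phase label into subtasks."""
    return [
        (phase, [action for _, action in group])
        for phase, group in groupby(trajectory, key=lambda step: step[0])
    ]


subtasks = segment(steps)
```

Tips extracted from the `authentication` segment can then be indexed under that phase and reused by any task that authenticates, regardless of the app.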

This is the paper’s most practical design choice. Enterprise process automation is rarely a set of identical tasks. It is a messy family of workflows sharing recurring operational fragments. Good memory should recognize the fragments.

However, this also creates a retrieval problem. Subtask-level memory is more reusable, but it can also retrieve different subsets of guidance for task variants that should be handled consistently. The evidence section shows this tradeoff clearly.

The evidence: consistency improves more than individual task completion

The paper evaluates the framework on AppWorld, a benchmark where agents complete realistic tasks across applications such as e-commerce, email, calendar, and file management. The agent interacts with APIs and is evaluated by programmatic checks of the final application state. The experiments use a simplified ReAct-style agent, and both the agent and the tip extraction pipeline use GPT-4.1.

The two key metrics are:

Task Goal Completion (TGC): whether an individual task passes all tests.
Scenario Goal Completion (SGC): whether all variants within a scenario succeed.

SGC is stricter. It punishes sporadic brittleness. A system can do reasonably well on individual tasks while still failing to behave consistently across related variants. In business terms, SGC is closer to “will this process automation behave reliably across the family of cases we actually care about?”
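
The gap between the two metrics is easy to see on toy data. The sketch below follows the definitions above; the scenario names and pass/fail values are invented for illustration and are not from the paper.

```python
from collections import defaultdict


def tgc_sgc(results):
    """Compute (TGC, SGC) in percent from (scenario_id, passed) pairs."""
    by_scenario = defaultdict(list)
    for scenario_id, passed in results:
        by_scenario[scenario_id].append(passed)
    # TGC: share of individual tasks that pass.
    tgc = 100 * sum(passed for _, passed in results) / len(results)
    # SGC: share of scenarios in which every variant passes.
    sgc = 100 * sum(all(v) for v in by_scenario.values()) / len(by_scenario)
    return tgc, sgc


# Toy data: two scenarios with three variants each.
results = [
    ("refund", True), ("refund", True), ("refund", True),
    ("checkout", True), ("checkout", True), ("checkout", False),
]
tgc, sgc = tgc_sgc(results)
```

Here five of six tasks pass, yet only one of two scenarios is fully consistent: a single failed variant costs the whole scenario.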

On the held-out test-normal partition, the strongest configuration for scenario consistency is subtask-level tips with LLM-guided selection.

Configuration	TGC (%)	Δ TGC	SGC (%)	Δ SGC
Baseline, no memory	69.6	—	50.0	—
Subtask tips + LLM-guided selection	73.2	+3.6 pp	64.3	+14.3 pp
Subtask tips + cosine retrieval, $\tau \ge 0.6$	73.8	+4.2 pp	57.1	+7.1 pp
Task-level tips + cosine retrieval, $\tau \ge 0.6$	72.0	+2.4 pp	62.5	+12.5 pp

The first interpretation is straightforward: memory helps. But the more interesting interpretation is that memory helps consistency more than raw task completion.

The best TGC result is actually subtask-level tips with cosine retrieval: 73.8%, a +4.2 percentage point gain. But the best SGC result is subtask-level tips with LLM-guided selection: 64.3%, a +14.3 percentage point gain. That is not a minor distinction. It suggests that completing individual tasks and behaving consistently across related variants are not the same optimization target.

For enterprise deployment, this matters. A customer-service automation that succeeds on many one-off tickets but fails unpredictably across similar refund cases is not “almost working.” It is a compliance meeting waiting for a calendar invite.

The effect is especially strong on difficult tasks. For Difficulty 3 tasks, the baseline SGC is 19.1%. With subtask-level tips and LLM-guided selection, it rises to 47.6%, a +28.5 percentage point increase, which the paper reports as a 149% relative gain. The TGC improvement on the same difficulty level is smaller: 54.0% to 58.7%, or +4.7 points.
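
The relative figure follows directly from the two SGC values quoted above:

$$
\frac{47.6 - 19.1}{19.1} = \frac{28.5}{19.1} \approx 1.49 \quad\Rightarrow\quad \text{about } 149\% \text{ relative gain}
$$

Applying the same arithmetic to TGC (a calculation made here, not a figure reported in the paper) gives a far smaller relative change:

$$
\frac{58.7 - 54.0}{54.0} = \frac{4.7}{54.0} \approx 0.087 \quad\Rightarrow\quad \text{about } 8.7\%
$$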

This tells us where the memory is doing most of its work. It is not magically making all hard tasks easy. It is reducing variant-level brittleness in complex multi-step workflows, especially where planning, prerequisite verification, and error recovery matter.

The ablations say retrieval design is not plumbing

The paper’s configuration comparisons are not just leaderboard decoration. They clarify which design choices matter and why.

Test or comparison	Likely purpose	What it supports	What it does not prove
Subtask tips + LLM-guided selection vs. baseline on held-out test-normal	Main generalization evidence	Memory improves unseen-task performance, especially SGC.	It does not prove the approach works across all models or open-ended enterprise environments.
Task-level cosine retrieval with different thresholds and top-$k$ settings	Retrieval sensitivity test	Retrieval settings matter; $\tau \ge 0.6$ without a top-$k$ cap outperforms $\tau \ge 0.5$, with or without a top-3 cap, in aggregate.	It does not establish a universal threshold for every embedding model or domain.
Subtask cosine vs. subtask LLM-guided retrieval	Retrieval strategy ablation	Cosine can slightly improve individual task completion, while LLM-guided retrieval improves scenario consistency.	It does not prove LLM-guided retrieval is always worth the cost.
Task-level cosine vs. subtask cosine	Granularity comparison	Subtask tips improve TGC; task-level tips can preserve more consistent whole-task behavior under simple retrieval.	It does not mean task-level memory is superior overall.
Train and dev source partitions	Recurring-task evidence	Tips help more when future tasks resemble the source trajectories.	These results are not as strong as held-out generalization evidence.

The cosine retrieval sensitivity is particularly useful. A task-level cosine configuration with $\tau \ge 0.5$ and top-3 restriction performs below baseline: 66.7% TGC and 48.2% SGC, compared with baseline 69.6% and 50.0%. For Difficulty 3, TGC falls from 54.0% to 46.0%.

That result is a quiet warning label for agent memory products: retrieval is not harmless. Badly selected memory can distract an agent or deprive it of relevant guidance. More memory is not automatically better. Narrow top-$k$ filtering can exclude useful lessons. Too loose a similarity threshold can admit noise. Too strict a threshold can miss paraphrased equivalents.

The best task-level cosine setting in the paper is $\tau \ge 0.6$ without top-$k$ restriction, reaching 72.0% TGC and 62.5% SGC. Lowering the threshold to $\tau \ge 0.5$ without top-$k$ gives weaker aggregate results: 70.2% TGC and 57.1% SGC. The paper interprets this as a signal-noise tradeoff: broader retrieval sometimes helps complex tasks, but it can damage medium-difficulty tasks by introducing irrelevant tips.
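
The two knobs discussed here, the similarity threshold $\tau$ and the optional top-$k$ cap, fit in a short function. This is a minimal sketch with hand-written 2-D vectors standing in for real embeddings; the function names and the toy tips are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, tips, tau=0.6, top_k=None):
    """Return tip texts with cosine similarity >= tau, most similar first.

    `tips` is a list of (text, embedding) pairs. `top_k=None` means no cap.
    """
    scored = [(cosine(query_vec, vec), text) for text, vec in tips]
    kept = sorted((p for p in scored if p[0] >= tau), key=lambda p: p[0], reverse=True)
    if top_k is not None:
        kept = kept[:top_k]
    return [text for _, text in kept]


# Toy embeddings, chosen so the similarities are 1.0, 0.8 and 0.0.
tips = [
    ("Verify payment method before checkout.", [1.0, 0.0]),
    ("Use bulk cart removal.", [0.8, 0.6]),
    ("Re-check credentials before retrying login.", [0.0, 1.0]),
]
query = [1.0, 0.0]
selected = retrieve(query, tips)              # threshold only
capped = retrieve(query, tips, tau=0.5, top_k=1)  # threshold plus top-1 cap
```

With the cap in place the second relevant tip is dropped even though it clears the threshold, which is the kind of exclusion the top-3 result above warns about.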

This is exactly the kind of boring deployment parameter that later becomes a production incident. It deserves attention before someone names it “context intelligence” and triples the price.

The business value is operational learning, not model mysticism

The practical pathway from the paper to business use is clear:

Capture full execution trajectories from deployed agents.
Analyze the reasoning and action sequence after completion.
Attribute success, failure, recovery, and inefficiency to specific decisions.
Convert those diagnoses into typed, actionable tips.
Generalize and consolidate tips so memory does not become a junk drawer.
Retrieve relevant guidance for future tasks.
Track whether similar failures decline after the guidance is introduced.
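
The final step, checking whether a failure pattern declines after a tip is deployed, can start as a simple before-and-after comparison. The data below is invented for illustration, and a real evaluation would need comparable task mixes on both sides.

```python
def failure_rate(runs):
    """Share of runs that hit the tracked failure pattern."""
    return sum(runs) / len(runs) if runs else 0.0


# Toy data: True means the run hit the failure pattern the tip targets
# (for example, attempting checkout without a verified payment method).
before_tip = [True, False, True, True, False, False, True, False]
after_tip = [False, False, True, False, False, False, False, False]

delta = failure_rate(after_tip) - failure_rate(before_tip)
```

A negative `delta` is evidence the tip is earning its place; a flat or positive one is a reason to trace the tip back to its source trajectories.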

This turns agent operations into a learning system. Not in the vague “AI gets smarter over time” sense, but in the more concrete “the workflow team stops rediscovering the same avoidable failure every Wednesday” sense.

For enterprises, the most immediate use cases are process-heavy domains with repeated but variable workflows:

Business setting	Why trajectory-informed memory fits
Customer support agents	Similar tickets vary in details but share authentication, eligibility, refund, escalation, and follow-up subtasks.
Finance operations	Invoice matching, reconciliation, approval routing, and exception handling generate repeated recoverable failure patterns.
HR and internal service desks	Employee requests often share policy checks, identity verification, document retrieval, and case closure steps.
Sales operations	CRM updates, lead enrichment, quote preparation, and follow-up tasks reuse common API and validation patterns.
IT operations	Incident triage and remediation depend heavily on recognizing failure signals and applying tested recovery sequences.

The ROI logic is also more specific than “better agents.” The paper points toward three measurable benefits.

First, fewer repeated failures. If an agent repeatedly forgets a prerequisite, the system should generate a strategy or recovery tip that prevents the same pattern.

Second, more consistent completion across task variants. The SGC improvements are especially relevant here. Business teams usually care less about an agent’s average demo success and more about whether it behaves reliably across the messy variations users actually submit.

Third, lower debugging cost. Provenance means every tip can be traced back to source trajectories. That makes it easier to audit guidance, remove harmful tips, and understand why the agent behaved a certain way.

This is not a replacement for process design. It is a way to make process design learn from execution evidence.

What businesses should not overread

The paper is promising, but its boundaries are important.

The evaluation is conducted on AppWorld-style API tasks. That is a good environment for studying agent execution because tasks have structured actions and automated evaluation. But it is still not the same as a messy enterprise deployment with changing APIs, incomplete permissions, ambiguous user intent, human approvals, regulatory constraints, and legacy systems that behave like they were designed during a power outage.

The experiments also use GPT-4.1 for both the agent and the extraction pipeline. The paper notes future evaluation with other models, but the reported results should not be assumed to transfer unchanged to weaker, cheaper, or specialized models. Tip quality depends on the extractor’s ability to understand trajectories and perform causal attribution. Retrieval quality depends on embeddings, metadata, thresholds, and the agent’s sensitivity to injected guidance.

There is also a cost tradeoff. LLM-guided retrieval improves SGC in the held-out setting, but it adds an LLM call at runtime. For high-volume workflows, that cost and latency may be justified only for complex tasks, regulated processes, or cases where consistency matters more than speed. For simpler workflows, cosine retrieval may be sufficient. For very simple workflows, the paper even shows a possible interference effect: on the train partition, Difficulty 1 baseline performance is already 100% TGC and 100% SGC, while the memory-enhanced version scores lower.
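
That tradeoff suggests routing tasks to different retrieval modes. The policy below is a hypothetical sketch of such routing, not something the paper proposes; the difficulty cut-offs are illustrative and would need tuning against your own measurements.

```python
def choose_retrieval(difficulty: int, consistency_critical: bool = False) -> str:
    """Pick a retrieval mode for a task (hypothetical policy)."""
    if difficulty <= 1 and not consistency_critical:
        return "none"        # simple tasks: injected tips may interfere
    if difficulty >= 3 or consistency_critical:
        return "llm_guided"  # pay for an extra LLM call where consistency matters
    return "cosine"          # cheaper default for everything in between
```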

That last point is worth pausing on. Memory can help agents. It can also get in the way. An agent that already knows how to do a simple task may not need extra advice from the corporate memory department. We have all met that department.

A practical implementation frame for enterprise teams

For teams building agent platforms, the paper suggests a useful design principle: separate experience capture, lesson generation, and runtime guidance.

Do not simply append previous traces to the prompt. Do not merely store all logs in a vector database and hope retrieval performs adult supervision. Instead, build a memory pipeline with explicit checkpoints.

Layer	Design question	Failure mode if ignored
Capture	Are trajectories complete enough to reconstruct reasoning, actions, observations, and outcomes?	The system cannot diagnose causes, only observe symptoms.
Attribution	Can the system identify which decisions caused failure, recovery, or inefficiency?	Tips become generic advice rather than operational guidance.
Tip generation	Are tips typed, actionable, scoped, and trigger-conditioned?	Memory becomes vague: “be careful,” “check things,” “try again.”
Consolidation	Are duplicate, conflicting, and overly specific tips merged or removed?	The memory store grows into noisy institutional clutter.
Retrieval	Are tips selected by task context, category, priority, and relevance?	Agents receive irrelevant guidance and follow it obediently, because of course they do.
Provenance	Can each tip be traced back to source executions?	Bad guidance becomes hard to audit or correct.
Evaluation	Are TGC-like and SGC-like metrics tracked separately?	Teams optimize one-off success while missing consistency failures.
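
As one concrete instance of the consolidation layer, the sketch below merges tips whose text matches after light normalization and unions their provenance. Exact-match merging is a deliberate simplification standing in for the clustering the paper describes; the helper names and sample tips are assumptions.

```python
def normalize(text: str) -> str:
    """Crude duplicate key: lowercase, drop punctuation, collapse spaces."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def consolidate(tips):
    """Merge tips with the same normalized text, unioning their sources.

    `tips` is a list of (text, source_trajectory_ids) pairs. The first
    wording seen for each duplicate group is the one that is kept.
    """
    merged = {}
    for text, sources in tips:
        key = normalize(text)
        if key in merged:
            merged[key][1].update(sources)
        else:
            merged[key] = (text, set(sources))
    return list(merged.values())


raw = [
    ("Verify payment method before checkout.", {"traj-001"}),
    ("verify payment method before checkout", {"traj-007"}),
    ("Use a bulk operation to clear the cart.", {"traj-003"}),
]
memory = consolidate(raw)
```

Keeping the union of source trajectories means a merged tip stays auditable even after its duplicates disappear.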

The most important business takeaway is not that every agent needs a large memory system. It is that agent memory should be designed as an operational learning loop. The loop should answer three questions after every meaningful execution:

What happened?
Why did it happen?
What should the next agent do differently in a similar context?

If the system cannot answer the third question, it has storage, not learning.

The paper’s best insight is institutional, not just technical

The title of the paper is about self-improving agents, but the deeper business idea is institutional knowledge.

Companies already know that experience is valuable. They write standard operating procedures, postmortems, playbooks, checklists, runbooks, and escalation guides. The problem is that AI agents generate execution experience at a pace humans will not manually summarize. A production agent might create thousands of trajectories before anyone has time to review the first hundred.

Trajectory-informed memory is a way to automate the first draft of institutional learning. It does not eliminate human governance. It makes governance possible at scale by turning raw traces into reviewable, typed, source-linked guidance.

That is the right mental model. Not “the agent remembers everything.” Not “the vector database solves memory.” Not “the model improves itself,” which is usually where precision goes to die.

A better description is this: the agent system builds an operational knowledge base from its own work history, then uses that knowledge base to reduce repeated mistakes and improve consistency.

That is less magical. It is also more useful.

Conclusion: memory should earn its place in the prompt

The paper’s strongest result is not the +3.6 percentage point TGC gain on held-out tasks, although that is useful. It is the +14.3 point SGC gain, and especially the +28.5 point SGC gain on Difficulty 3 tasks. Those numbers suggest that trajectory-informed memory is most valuable when tasks are complex, multi-step, and brittle across variants.

That is exactly where enterprise agents tend to disappoint.

The practical lesson is clear. Agent memory should not be treated as a passive archive. It should be a disciplined pipeline for turning execution evidence into actionable, retrievable, auditable guidance. The system must know the difference between a successful strategy, a recovery pattern, and an optimization opportunity. It must retrieve guidance carefully. And it must accept that sometimes the best memory is no memory at all, especially when the task is simple and the agent already knows what it is doing.

Execution logs are not institutional knowledge by default. They become institutional knowledge only after diagnosis, abstraction, curation, and disciplined reuse.

That is the useful version of agent memory. Everything else is just a very expensive scrapbook.

Cognaptus: Automate the Present, Incubate the Future.

¹ Gaodan Fang, Vatche Isahagian, K. R. Jayaram, Ritesh Kumar, Vinod Muthusamy, Punleuk Oum, and Gegi Thomas, “Trajectory-Informed Memory Generation for Self-Improving Agent Systems,” arXiv:2603.10600v1, 2026. https://arxiv.org/html/2603.10600
