Opening — Why this matters now
AI agents have quietly crossed a threshold: they no longer forget everything between conversations.
And yet, they still behave like they do.
Despite persistent memory layers—vector databases, RAG pipelines, archival stores—most agents fail at something deceptively simple: answering questions that require time, change, or context. Ask an agent what happened first, what changed, or how multiple events relate, and the system often collapses into guesswork.
The paper makes a subtle but devastating observation: the issue isn’t memory capacity. It’s memory representation.
Or put less politely—LLM agents don’t have bad memory. They have flat memory.
Background — Context and prior art
Most modern agent architectures treat memory as a retrieval problem.
| Approach | Core Idea | Limitation |
|---|---|---|
| RAG (Retrieval-Augmented Generation) | Retrieve relevant documents at query time | Stores facts, not context |
| Vector Memory | Embed and search past interactions | Similarity ≠ temporal reasoning |
| MemGPT / Letta | Tiered memory (core, recall, archival) | Still stores flattened summaries |
| Generative Agents | Store observations + reflections | Reflections summarize, not encode context |
These systems optimize how memory is retrieved, not how memory is encoded.
That distinction turns out to matter more than anyone expected.
Human cognition has known this for decades. The so-called drawing effect shows that people remember information far better when they draw it rather than simply write it. Not because drawing is visual—but because it forces elaborative encoding: committing to concrete, contextual details.
LLMs, of course, can’t draw.
But they can do something dangerously close.
Analysis — What the paper actually does
The authors introduce a deceptively simple idea: dual-trace memory encoding.
Instead of storing a single factual record, each memory consists of two linked components:
| Trace Type | Description | Role |
|---|---|---|
| Fact Trace | Structured factual record (what happened) | Baseline retrieval |
| Scene Trace | Narrative reconstruction with context (when, where, how) | Contextual anchors |
A typical system would store:
“User ran a 5K in 35 minutes and raised $200.”
The dual-trace system stores:
- Fact: same as above
- Scene: a vivid narrative (e.g., race bib, bulletin board, spatial cues)
This forces the agent to commit to context at encoding time—not just at retrieval.
The Architectural Twist
The system adds two key mechanisms:
1. Evidence Scoring Gate
Only meaningful interactions are stored.
| Dimension | Score Range |
|---|---|
| Relevance | 0–2 |
| Specificity | 0–2 |
| Explicitness | 0–2 |
The total score determines whether the memory is:
- Dropped
- Stored as fact only
- Stored as dual-trace
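One way the gate could work, assuming a simple additive score over the three dimensions; the cutoffs below are illustrative assumptions, not the paper’s exact thresholds:

```python
def score_interaction(relevance: int, specificity: int, explicitness: int) -> str:
    """Decide how to store an interaction from three 0-2 evidence scores.

    Thresholds are illustrative assumptions, not the paper's values.
    """
    for s in (relevance, specificity, explicitness):
        if not 0 <= s <= 2:
            raise ValueError("each dimension is scored 0-2")
    total = relevance + specificity + explicitness  # ranges 0-6
    if total <= 1:
        return "drop"          # trivial chit-chat: not worth storing
    if total <= 3:
        return "fact_only"     # store the bare fact trace
    return "dual_trace"        # rich enough to earn a scene trace
```

The design choice matters: because the gate runs at encoding time, low-value interactions never reach the store, which is part of why richer encoding doesn’t inflate cost.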
2. Three-State Retrieval Protocol
| State | Condition | Behavior |
|---|---|---|
| A | Fact + Scene found | Reconstruct scene → high confidence answer |
| B | Fact only | Answer cautiously |
| C | Nothing found | Explicit abstention |
This is not just storage—it’s a memory policy engine.
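The three states reduce to a small dispatch on what retrieval found. A sketch, with state labels from the table above and action names that are my own illustrative choices:

```python
def answer_policy(fact_hit: bool, scene_hit: bool) -> dict:
    """Map retrieval outcomes to the three-state protocol (illustrative sketch)."""
    if fact_hit and scene_hit:
        # State A: both traces found, reconstruct the scene and answer confidently
        return {"state": "A", "action": "reconstruct_scene", "confidence": "high"}
    if fact_hit:
        # State B: fact only, answer but flag the missing context
        return {"state": "B", "action": "answer_cautiously", "confidence": "low"}
    # State C: nothing found, abstain rather than guess
    return {"state": "C", "action": "abstain", "confidence": None}
```

State C is the quietly important one: explicit abstention replaces the guesswork that flat-memory agents fall back on.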
Findings — Results with actual signal
The results are not subtle.
From the LongMemEval-S benchmark (4,575 sessions):
| Metric | Fact-only | Dual-trace | Improvement |
|---|---|---|---|
| Overall Accuracy | 53.5% | 73.7% | +20.2 pp |
| Temporal Reasoning | 25% | 65% | +40 pp |
| Multi-session Aggregation | 20% | 50% | +30 pp |
| Knowledge Updates | 55% | 80% | +25 pp |
| Single-session Recall | 75% | 75% | 0 |
The chart on page 15 visualizes this stark divergence: gains appear only in tasks requiring temporal or cross-session reasoning.
This is the key insight.
Dual-trace encoding doesn’t make agents better at finding facts.
It makes them better at understanding history.
A More Interesting Result: Cost
One might expect richer encoding to be expensive.
It isn’t.
| Phase | Fact-only | Dual-trace |
|---|---|---|
| Encoding cost | baseline | 1.7% lower |
| Retrieval cost | baseline | 3.3% lower |
Yes—more memory, lower cost.
That’s not efficiency. That’s a structural advantage.
Implications — What this actually changes
The paper quietly shifts the design philosophy of AI systems.
1. Encoding > Retrieval
Most AI engineering effort today focuses on retrieval pipelines.
This work suggests a reversal:
If you encode memory correctly, retrieval becomes trivial.
2. Memory Becomes Narrative, Not Database
Flat facts behave like spreadsheets.
Dual-trace memory behaves like experience.
That difference enables:
- Temporal reasoning
- Change tracking
- Cross-session synthesis
In other words—actual intelligence.
3. Agent Design Becomes Cognitive Design
This is where it gets uncomfortable.
The architecture borrows directly from human cognitive psychology:
- Encoding specificity
- Dual coding
- Elaborative generation
We are no longer just building systems.
We are replicating memory theory in software.
4. High-Value Domains Become Feasible
The paper sketches extensions into:
- Software engineering agents (debugging histories, design rationale)
- Medical assistants (patient encounter narratives)
- Legal systems (case evolution tracking)
These are domains where context evolution matters more than static facts.
Exactly where current agents fail.
Conclusion — The quiet inversion
For years, the industry has asked:
How do we store more memory?
This paper asks a better question:
What if memory isn’t about storage at all?
Dual-trace encoding shows that the difference between a forgetful agent and a reliable one isn’t scale.
It’s structure.
Or, more precisely:
The difference between remembering and understanding is whether you store facts—or experiences.
And it turns out, even machines need to “draw” to remember.
Cognaptus: Automate the Present, Incubate the Future.