Memory has become the awkward invoice attached to every serious AI agent demo.

A short chatbot can survive on vibes. A long-running coding assistant cannot. After a few weeks of debugging sessions, architecture debates, config changes, rejected fixes, and “remember we tried this already?” moments, the agent’s past becomes valuable. It also becomes inconveniently large. The obvious solution is to stuff more transcript into the prompt. The obvious solution is usually how software gets expensive before it gets useful.

The paper behind this article, Structured Distillation for Personalized Agent Memory: 11× Token Reduction with Retrieval Preservation, studies a narrower and more practical question: can one developer’s long AI-agent conversation history be compressed into a structured memory layer without making future recall useless?1

The headline result is attractive: average exchanges are reduced from 371 tokens to 38 tokens, an 11× compression, while the best pure distilled retrieval configuration preserves 96% of the best verbatim retrieval MRR. But the useful lesson is not “summarization works.” That would be too easy, and also wrong.

The useful lesson is more architectural: compressed memory works when it is treated as a routing index, not as a replacement for the original conversation. It works better with vector retrieval than with keyword retrieval. And it becomes most practical when the distilled layer and the verbatim archive are allowed to do different jobs.

That is less glamorous than “agents now remember everything.” Good. It is also more likely to survive contact with production.

The mistake is treating memory as a shorter transcript

Most long-context systems face the same problem. The agent has accumulated useful history, but the context window is finite and the budget is not imaginary. When the transcript gets too large, systems often compact it: ask a model to summarize earlier turns, keep the summary, and discard or ignore the raw detail.

That approach has a familiar failure mode. The first summary is useful. The second summary is a summary of the summary. The third begins to sound confident about things it no longer actually knows. Eventually, the agent remembers that “there was an issue with authentication,” which is almost as useful as remembering that “software was involved.”

The paper takes a different position. The full conversation is not necessarily gone. In many agent workflows, especially local developer-agent workflows, the transcript already exists on disk. The challenge is not to replace it with a smaller diary. The challenge is to build a compact index that can point back to the right original exchange.

That framing changes the design problem.

Bad memory question Better memory question
How do we summarize everything? What fields must survive so the right exchange can be found later?
How short can the transcript become? Can the compressed object still route recall queries?
Can the summary answer the user? Can the summary identify the original conversation that should be shown?
Is compression good or bad? Which retrieval mechanism survives compression?

The paper is not testing generic conversation summarization across many users. It is testing structured, per-exchange distillation on one developer’s AI coding history across six projects. That boundary matters. The work is about personalized agent memory, not universal memory magic. As usual, the magic costs extra and is not included in the benchmark.

What the paper actually builds: a routing index, not a miniature diary

The system begins by splitting conversations into exchanges. An exchange is basically a user request plus the assistant’s substantive response, with tool-only round trips handled as part of the current exchange. Very short exchanges are filtered; very long exchanges are split.

Each exchange is then distilled into a structured object with four components:

Component Source What it keeps Practical analogy
exchange_core LLM-generated What was accomplished or decided Commit message
specific_context LLM-generated One concrete technical detail: parameter, error, file path, value The discriminating clue
room_assignments LLM-generated Thematic categories such as file, concept, or workflow Directory structure
files_touched Regex-extracted File paths mentioned in the raw exchange Changed files

For search, the core distilled text is mainly:

exchange_core + specific_context

That combination averages 38 tokens per object, compared with 371 tokens for the average verbatim exchange. The evaluated corpus contains 4,182 conversations, 14,340 exchanges, and 12,427 distilled objects from one developer’s Claude Code sessions across six software engineering projects.

Two details make the method more interesting than ordinary summarization.

First, the prompt emphasizes surviving vocabulary. The distiller is not asked to write elegant prose. It is asked to preserve the specific terms used in the exchange. That matters because future recall often starts with fragments: a file name, an error string, a parameter name, a library, a phrase the developer remembers using. Pretty paraphrase is not helpful when the query depends on ugly exact wording. Software engineering, regrettably, rewards ugly exact wording.

Second, the distilled object is not the final evidence shown to the user. It carries a back-reference to the original conversation ID and ply range. Search can operate over distilled text, but the result displayed to the user is the original verbatim exchange. This is the paper’s core architectural move: separate the retrieval representation from the display representation.

That separation is what prevents the article from becoming another sermon about “summarize less badly.” The distilled object does not have to be a perfect human-readable record. It has to route the system toward the right original record.

The compression result is real, but the retrieval mechanism decides whether it matters

The evaluation asks whether compressed memory still supports recall. The authors construct 201 recall-oriented queries over the same user’s history, covering conceptual queries, phrase queries, and exact-term queries. They test 107 configurations across pure and cross-layer retrieval modes, retrieving up to 10 results per query. Relevance is graded on a 0–3 scale by five local LLM graders, with a calibrated Claude Opus grader resolving no-majority cases. The final dataset contains 214,519 consensus-graded query-result pairs.

The headline results are compact enough to fit in one table:

Retrieval setup Main result What it means
Best verbatim baseline: Full Text / BM25-FTS MRR 0.745 Strongest pure search over raw conversation text
Best pure distilled: Distill Core+Rooms / Exact / Weighted MRR 0.717 Preserves 96% of best verbatim MRR
Best cross-layer: BM25 on verbatim + HNSW on distilled MRR 0.759 Slightly exceeds best pure verbatim by combining complementary signals

The tempting but sloppy reading is: “distilled memory is as good as raw memory.” Not quite.

A more accurate reading is: structured distillation preserves enough semantic information that vector-style retrieval remains close to verbatim retrieval, while keyword retrieval suffers because compression removes too much lexical surface area.

That distinction is the paper’s real mechanism.

In pure-mode comparisons, all 20 vector-search comparisons are non-significant after Bonferroni correction: 10 of 10 HNSW configurations and 10 of 10 Exact vector configurations. By contrast, all 20 BM25 comparisons degrade significantly: 10 of 10 BM25-Okapi and 10 of 10 BM25-FTS. Effect sizes range from 0.031 to 0.756, and the medium-sized effects all occur in BM25.

This is not mysterious. BM25 is hungry for lexical overlap. Compress an exchange from 371 tokens to 38, and many rare words disappear. The paper reports that only 27.0% of the top-15 highest-IDF tokens per verbatim exchange survive in the distilled text. That is bad news for keyword matching.

Vector retrieval, however, can survive when the semantic shape of the exchange remains intact. If the distilled text says the exchange fixed a retry timeout in the authentication middleware, a vector model can still match a later query about retry behavior or login failures even if many original tokens have vanished.

So the mechanism-first lesson is straightforward:

Mechanism What compression removes Expected effect Observed pattern
Vector search Many surface tokens, but some semantic structure remains Smaller degradation Non-significant degradation across HNSW and Exact comparisons
BM25 keyword search Rare terms and exact lexical overlap Larger degradation Significant degradation across all BM25 comparisons
Cross-layer fusion Uses both raw lexical signal and distilled semantic signal Potential complementarity Best overall MRR slightly exceeds pure verbatim

This is why “11× compression” alone is not the story. Compression is not valuable in isolation. Compression is valuable only when the retrieval mechanism still has something to grip.

Query type reveals what the memory object keeps and what it drops

The query-type results make the same point from another angle.

Exact-term queries sometimes hold up surprisingly well under distillation, especially when file metadata is available. The paper gives an example where Core+Files / HNSW / Weighted scores better than Full Text / HNSW on exact-term queries. That makes sense: if a file path or identifier is explicitly extracted, the distilled object may place the discriminating clue in a cleaner retrieval surface than the noisy full transcript.

Conceptual queries are harder. They show a larger verbatim advantage. Compression can preserve a decision or a parameter, but abstract discussions often depend on surrounding context, rationale, tradeoffs, or phrasing that a 38-token object cannot fully carry.

Phrase queries sit in the uncomfortable middle. They benefit from preserved terminology, but they also suffer when the exact phrase is among the lexical material discarded during compression.

This matters for business design because users do not ask only one kind of memory question.

A developer may ask:

  • “Where did we touch auth_middleware.py?”
  • “What was that retry timeout bug?”
  • “Why did we reject the simpler cache design?”

These are not the same retrieval problem. The first is close to exact metadata. The second is a mixed semantic-and-technical recall query. The third may require a richer verbatim conversation because the answer lives in reasoning, not just in labels.

A production memory system should not pretend those are equivalent. The paper does not pretend either.

The cross-layer result is complementarity, not distilled superiority

The best overall configuration fuses BM25 keyword search over verbatim text with HNSW vector search over distilled text. It reaches MRR 0.759, slightly above the best pure verbatim baseline at 0.745.

This is not evidence that distilled text is inherently better than raw text. The paper is explicit about that. The cross-layer system uses both representations. It benefits because each layer fails differently.

The coverage analysis is useful here. In a best-vs-best comparison, the best full-text configuration solves 122 of 201 queries at P@1, while the best distilled configuration solves 119. Their overlap is only 77 queries: 45 are solved only by full text, and 42 only by distilled text. That near-symmetry is more interesting than a small difference in aggregate score.

It means the distilled index is not merely a weaker copy of the verbatim index. It creates alternate retrieval pathways. Exchange cores, room assignments, and file metadata can surface conversations that raw text search misses, while raw text still catches details that distillation drops.

The paper also tests a post-hoc reranking idea: take the top-10 distilled candidates, score each candidate’s verbatim snippet with BM25, and blend that score with the original retrieval score. This narrows the P@1 gap from 22 queries to 2 queries: the best reranked distilled setup reaches 120/201, compared with 122/201 for the best full-text BM25 baseline.

But the reranker is not a free lunch. It promotes 36 queries by moving the right result up from rank 2–7. It also demotes 16 queries because the wrong snippet has more lexical overlap in the wrong context. The same BM25 signal creates both the improvement and the damage. The authors test possible gates, such as score margins and term-overlap fractions, but none separates gains from losses. The best single-feature gate performs worse than the ungated reranker.

That result is a nice antidote to lazy hybrid-search enthusiasm. Combining signals helps, until the signal confidently points at the wrong thing. Retrieval systems have a talent for being wrong with excellent posture.

How to read the experiments without overbuying them

The paper contains several result families. They should not all be interpreted as the same kind of evidence.

Paper component Likely purpose What it supports What it does not prove
Corpus compression statistics Main measurement The structured objects reduce exchange length from 371 to 38 tokens on this corpus That the same ratio holds across users, languages, or domains
Main retrieval table Main evidence Pure distilled retrieval can preserve much of verbatim MRR; cross-layer retrieval can add complementary signal That distilled-only memory is universally better than raw transcript search
Mechanism-specific significance tests Core mechanism evidence Vector retrieval survives compression much better than BM25 That all embedding models or vector databases will behave the same way
Query-type analysis Diagnostic interpretation Exact, phrase, and conceptual queries degrade differently That the query taxonomy fully covers real user recall behavior
Grade-distribution analysis Robustness/sanity check Distillation does not wreck relevance everywhere; differences concentrate by mechanism and query type That grading noise is solved
Coverage analysis Exploratory diagnostic Verbatim and distilled modes solve overlapping but distinct query sets That distilled text is inherently more searchable; configuration counts differ
Post-hoc reranking Exploratory extension Verbatim lexical reranking can recover many distilled misses but also causes losses That offline textual features are enough to safely decide when to rerank
Inter-rater agreement analysis Evaluation boundary The main directional pattern survives across graders That the absolute grades are highly reliable

The inter-rater agreement deserves attention. Fleiss’ kappa across the five local graders is only 0.175, which the paper classifies as slight agreement. That is not a decorative limitation; it affects how aggressively we should interpret small numerical differences.

The paper tries to manage this problem rather than hide it. It uses five local LLM graders, majority voting, conservative tie-breaking, and a calibrated Opus grader for the 741 no-majority pairs. Human validation of the Opus adjudication is strong within that decisive stratum. Still, the low agreement says the task itself is hard: deciding whether a fragment of a software-engineering conversation answers a short recall query requires context and domain judgment.

The reason the main finding remains useful is that the pattern is directional and mechanism-specific. If short distilled snippets were simply being rewarded by graders, distilled modes should look better across mechanisms. They do not. BM25 degrades sharply; vector modes do not. That structure is harder to explain away as generic grader bias.

The business value is cheaper continuity, not magical memory

For companies building AI workflow products, the paper points toward a practical memory stack:

  1. Keep the verbatim archive as the source of truth.
  2. Distill each completed exchange into a small structured object.
  3. Use the distilled object as a retrieval and context-routing layer.
  4. Use vector retrieval for semantic recall, but keep keyword search over verbatim text for exact matches.
  5. Show users the original conversation when they drill down.

This architecture is not glamorous. It is also exactly the sort of unglamorous architecture that makes agents usable after the demo.

The ROI path is not that the agent becomes omniscient. The ROI path is that persistent memory becomes cheaper to operate. The paper’s concrete memory-budget example is useful: 1,000 distilled exchanges require about 39,000 tokens, while the verbatim alternative would require about 407,000 tokens. That difference changes what can fit into a prompt, how often retrieval must run, and how much historical context can be carried into downstream reasoning.

For a coding assistant, support agent, research copilot, or operations automation tool, the benefit is not merely lower token cost. It is lower repetition cost. The agent stops asking for decisions already made. It stops rediscovering failed approaches. It can route back to old debugging context without forcing the user to become the memory system.

That is the quiet business case: less rework, fewer repeated explanations, better continuity across sessions. Not “AI remembers like a human.” More like “the system finally has an index card drawer.” Civilization has been built on less.

What builders should copy, modify, and avoid

The first thing to copy is the index/display separation. Do not force compressed memory to serve as both retrieval surface and evidence. A short object can be excellent for routing and poor as a complete record. That is not a defect if the original record remains available.

The second thing to copy is field discipline. The paper’s objects are not open-ended summaries. They contain a core action, one concrete technical detail, thematic placement, and file references. For another domain, those fields should change. A legal workflow might need matter, clause, jurisdiction, and document reference. A customer-support workflow might need account issue, product area, resolution, and escalation status. The principle is not the exact schema. The principle is that memory should be structured around future recall.

The third thing to copy is mechanism-aware evaluation. If the product relies on BM25-like search, aggressive compression may hurt. If it relies on vector retrieval, compression may be safer, but exact identifiers still need protection. A serious evaluation should test query types separately rather than average everything into one comforting dashboard number.

What should be modified? The paper uses one developer’s software-engineering corpus. A commercial system should validate across users, teams, language styles, and task types. It should also evaluate downstream agent performance, not only retrieval quality. Retrieval preservation is necessary for useful memory. It is not sufficient.

What should be avoided is the standard product-manager fairy tale: “We will summarize conversations and the agent will remember.” Summaries do not remember. Systems remember when they preserve the right handles, index them properly, and keep the underlying evidence available.

Where the evidence stops

The paper is candid about its boundaries, and those boundaries matter for business interpretation.

The dataset is single-user. That is not a flaw for a personalized-memory study, but it blocks broad generalization. Different users may phrase recall queries differently. Non-developer workflows may have different clue structures. Multilingual conversations may change both compression quality and retrieval behavior.

The evaluation does not compare against commercial or open-source memory systems such as Letta, Mem0, or Zep. So the result is not “this architecture beats memory products.” It is “within this system, structured distillation appears viable as a compressed retrieval layer.” That is still useful, just less convenient for slide decks.

The spatial “memory palace” interface is also not evaluated as a user experience. Rooms are tested as metadata for retrieval, not as a navigable interface where users walk through memory. Anyone selling the spatial metaphor should notice that distinction before hiring a 3D designer. Please do not build a VR filing cabinet unless the users explicitly deserve punishment.

Finally, the study does not directly test whether an agent performs tasks better when given distilled context instead of verbatim context. The retrieval evaluation is a proxy for information preservation. A good proxy, but still a proxy. The next business-relevant test would ask whether agents using this memory layer complete real follow-up tasks faster, make fewer repeated mistakes, and require fewer user corrections.

The article in one sentence

Structured distillation is not a way to throw away conversation history; it is a way to build a small, searchable routing layer above the history you wisely kept.

That is the practical contribution. The paper shows that, for one developer’s coding-agent history, a schema-guided distilled memory object can cut token volume by 11× while preserving much of retrieval usefulness, especially under vector search. It also shows why keyword retrieval is fragile under compression, why cross-layer retrieval is promising, and why the original transcript should remain the ground truth.

For AI agent builders, the lesson is refreshingly concrete. Do not ask memory to be one thing. Let distilled objects handle fast recognition. Let vector retrieval handle semantic recall. Let verbatim archives handle evidence. Let keyword search remain available for exact wording. Then test the whole stack by query type, because averages are where retrieval failures go to look respectable.

Long-term agent memory will not be solved by making the context window bigger and hoping nobody checks the bill. It will be solved by designing memory as infrastructure: structured, indexed, layered, and humble about what compression forgets.

Cognaptus: Automate the Present, Incubate the Future.


  1. Sydney Lewis, “Structured Distillation for Personalized Agent Memory: 11× Token Reduction with Retrieval Preservation,” arXiv:2603.13017, 2026. https://arxiv.org/abs/2603.13017 ↩︎