Memory sounds simple until it becomes a product requirement.

A sales assistant must remember that one client refuses cloud deployment. A software agent must remember that Redis was vetoed after a production incident. A research copilot must remember which hypothesis failed three weeks ago, not because it is charmingly nostalgic, but because repeating failed work is an expensive hobby.

The usual answer is to give the model a larger context window. Add more tokens. Add summaries. Add compressed chat history. Add another small layer of operational denial and call it “persistent memory.”

The paper “Facts as First-Class Objects: Knowledge Objects for Persistent LLM Memory” argues that this is the wrong mental model.[1] Its central claim is not that frontier LLMs are bad at retrieving facts from context. In fact, one of the paper’s more useful results is the opposite: Claude Sonnet 4.5 performs extremely well on structured fact retrieval while the facts remain inside the context window. The failure begins later, when the system has to live like a real product: across sessions, through compaction, under cost pressure, and amid a growing pile of project constraints that no one remembers to restate every morning.

That distinction matters. A bigger prompt may improve working memory. It does not automatically create institutional memory.

The paper’s real target is not retrieval failure, but memory lifecycle failure

The obvious version of the problem says: “LLMs forget because they cannot find the right fact in a long context.” The paper tests that intuition and finds something more interesting.

The authors construct synthetic pharmacology-style facts, such as compound-target binding affinities, and ask models to retrieve exact numeric values. This is not a soft benchmark where a vague answer receives partial credit. A queried subject-predicate pair maps to a precise object value. Either the model retrieves the value or it does not.

On this structured task, Claude Sonnet 4.5 achieves near-perfect exact-match retrieval up to 7,000 facts, around 189K tokens, close to the edge of its 200K context window. The paper reports 100% accuracy through most tested within-window sizes, with one seed showing 97% at some large sizes. It also tests 1,000 deliberately confusable facts with high within-family similarity and reports 100% accuracy with no sibling confusion.

That is an inconvenient result for lazy anti-long-context arguments. Inside the window, when facts are clearly structured and the model can attend to them, the model is not stumbling around like an intern searching email by vibes. Full attention over structured facts can work very well.

But the paper’s comparison frame begins exactly there: each memory architecture solves one problem and exposes another.

| Architecture | What it handles well | Where it breaks in the paper | Business interpretation |
|---|---|---|---|
| In-context memory | Exact retrieval of structured facts within the window | Capacity overflow, compaction loss, goal drift, rising query cost | Useful as working memory, weak as durable memory |
| Standard RAG | Capacity beyond the context window | Adversarial near-duplicate facts where embeddings cannot discriminate | Good for broad retrieval, risky for fact-dense operational records |
| Knowledge Objects | Persistent, exact, low-cost lookup of discrete facts | Requires parsing, predicate normalization, schema discipline | Treat decisions and facts as addressable records, not prompt decoration |

This is why the paper is better read as an architecture decision map than as another “LLM memory benchmark.” The question is not whether context windows are useful. They are. The question is what kind of memory we are asking them to be.

In-context memory works until the system has to survive its own housekeeping

The within-window result is the paper’s first important correction. If the facts are still present, and if the prompt fits, frontier models can retrieve structured information with impressive precision. This matters because it narrows the actual failure mode.

The failure is not primarily cognitive. It is systemic.

In-context memory stores facts by serializing them into the prompt. Every query reprocesses the whole memory pile. The advantage is simplicity: no database, no schema, no retrieval pipeline. The disadvantage is also simplicity: no database, no schema, no retrieval pipeline.

At 8,000 facts, the paper reports that Claude’s prompt overflows its 200K-token API limit. This is not a graceful degradation curve. It is a wall. Before the wall, the model can answer. After the wall, the request cannot even run.

A larger context window pushes the wall outward, but it does not remove the wall. Nor does it change the cost structure. In-context memory makes the model reread the entire filing cabinet every time a user asks for one folder. This is not intelligence. This is paying rent on redundancy.
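The location of the wall can be estimated with simple arithmetic. The sketch below assumes a flat per-fact token cost derived from the figures quoted above (about 189K tokens for 7,000 facts, or roughly 27 tokens per fact). Real tokenization varies from fact to fact, the estimate ignores the query and any system prompt, and the function names are illustrative.

```python
# Rough estimate only: assumes every fact costs the same number of tokens.
TOKENS_PER_FACT = 189_000 / 7_000   # ~27, derived from the figures quoted above
CONTEXT_LIMIT = 200_000             # context window size cited in the article


def prompt_tokens(num_facts: int) -> float:
    """Tokens needed to serialize the whole fact base into one prompt."""
    return num_facts * TOKENS_PER_FACT


def fits_in_window(num_facts: int) -> bool:
    """True if the full fact base still fits; every query pays this cost again."""
    return prompt_tokens(num_facts) <= CONTEXT_LIMIT


for n in (1_000, 7_000, 8_000):
    print(n, round(prompt_tokens(n)), fits_in_window(n))
```

Under this assumption, 8,000 facts need about 216K tokens, which is consistent with the overflow the paper reports. The same function also shows the cost structure: the token bill depends on the size of the fact base, not on how many facts the user actually asked about.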

The more serious problem comes from compaction. Real systems cannot keep every token forever. Long-running assistants and coding agents summarize older context to make room for new context. That summary then becomes the new memory. Later, the summary itself may be summarized. The paper calls the resulting degradation context rot: knowledge does not disappear all at once; it decays through ordinary lifecycle operations.

In the fact compaction experiment, the authors compress 2,000 pharmacology facts, roughly 111K characters, into a 3K-character summary. After this 36.7x compression, the model correctly answers only 40% of queried facts. The remaining 60% are lost, with 0% wrong answers reported in that test. This detail is important. The model is not hallucinating wildly. It is often honest that the information is gone.

That sounds reassuring until the business case appears. A system that politely says “I no longer know” after losing most of the fact base is still not a persistent memory system. It is just a well-mannered shredder.

Goal drift is worse than fact loss because the model remains coherent

Losing a binding affinity value is bad. Losing a project constraint is worse.

The paper’s goal drift experiment embeds 20 non-default project constraints into an 88-turn simulated conversation. These are the kinds of constraints that appear naturally during real work: use Python 3.11 rather than 3.12, avoid Redis because the client vetoed it, deploy only to EU regions, use a retry limit of 7 rather than the default 3.

Then the authors simulate cascading compaction. After one compaction round, about 91% of constraints are preserved. After two rounds, preservation falls to about 62%. After three rounds, only about 46% remain correct, with 53% lost and 1% partial across five seeds.
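To make the decay concrete, the short calculation below converts the quoted preservation rates into constraint counts and into per-round retention (what share of the still-surviving constraints each round keeps). The rates are the article's figures; the helper name is illustrative, and fractional counts appear because the rates are averages over seeds.

```python
# Preservation rates before compaction and after rounds 1-3, as quoted above.
preserved = [1.00, 0.91, 0.62, 0.46]
TOTAL_CONSTRAINTS = 20


def round_retention(rates):
    """Share of the surviving constraints that each compaction round keeps."""
    return [after / before for before, after in zip(rates, rates[1:])]


for rnd, rate in enumerate(preserved[1:], start=1):
    intact = rate * TOTAL_CONSTRAINTS
    print(f"round {rnd}: ~{intact:.1f} of {TOTAL_CONSTRAINTS} constraints intact")

print([round(r, 2) for r in round_retention(preserved)])
```

In plain terms, roughly 18, then 12, then 9 of the 20 constraints survive. No single round looks catastrophic on its own, which is part of why the loss is easy to miss.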

The mechanism is easy to miss because the model’s output does not necessarily become ugly. It may remain fluent, helpful, and plausible. It simply reverts toward defaults. Redis is popular, so it comes back. A 100ms latency target sounds reasonable, so it replaces the unusual 73ms target. Python 3.12 sounds modern, so it quietly displaces the environment constraint.

That is the dangerous part. Context rot does not always announce itself as failure. It often arrives as sensible advice that violates a forgotten decision.

For enterprise use, this is the difference between a chatbot that forgets trivia and an agent that silently changes the operating assumptions of a project. The former is annoying. The latter is a governance problem wearing a productivity costume.

Standard RAG solves capacity, then trips over near-duplicates

A natural response is: “Fine, do not put everything in the prompt. Use RAG.” That is partly correct.

The paper’s standard RAG baseline, using all-MiniLM-L6-v2 embeddings with top-5 retrieval feeding Claude Sonnet, reaches 100% accuracy across the tested pharmacology scaling range. This result should not be ignored. For many corpora, RAG is the right practical upgrade from brute-force context stuffing. It bypasses the context window limit by retrieving a small set of relevant documents.

But the paper then asks a sharper question: what happens when the corpus contains adversarial facts?

Here “adversarial” does not mean malicious. It means semantically near-identical but factually different. Think of multiple clinical trial records for the same compound and target, differing only by trial phase, year, or numeric value. The documents embed close to one another because they are genuinely similar. Unfortunately, the exact difference is the whole point.

On the adversarial benchmark, where within-group similarity is very high, embedding retrieval performs poorly. The headline figure is 20% precision@1, essentially random among five or so confusable candidates. The paper’s density-adaptive retrieval table has formatting gaps in its HTML rendering, but the surrounding text is clear: embedding-only retrieval cannot reliably distinguish the near-duplicates, and exact key matching is needed in high-density neighborhoods.

This is the business lesson: RAG is not one architecture. It is a family of retrieval choices. If your records are broad, varied, and semantically distinct, dense retrieval may be enough. If your records are contract clauses, experiment parameters, drug values, order terms, compliance exceptions, or architectural decisions differing by one field, embedding similarity can become a very polished coin toss.

The paper’s proposed fix is density-adaptive retrieval. The system first retrieves candidates by embedding similarity, then measures how similar those candidates are to one another. If the retrieved set is crowded, the system switches to structured key matching. Low density means embeddings are probably discriminative enough. High density means the semantic neighborhood is too compressed and exact matching should take over.
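The routing logic described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: density is taken here to be the mean pairwise cosine similarity of the top-k candidates, the threshold and k are arbitrary, the exact-key fallback is assumed to search the whole store, and the records use hand-written three-dimensional vectors in place of real embeddings.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def density(vectors):
    """Mean pairwise cosine similarity among the retrieved candidates."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def retrieve(query_vec, query_key, records, k=3, density_threshold=0.95):
    """Return (mode, results). k and density_threshold are illustrative values."""
    ranked = sorted(records, key=lambda r: cosine(query_vec, r["vec"]), reverse=True)
    candidates = ranked[:k]
    if density([r["vec"] for r in candidates]) < density_threshold:
        # Sparse neighborhood: trust the embedding ranking.
        return "embedding", candidates[:1]
    # Crowded neighborhood: fall back to exact key matching.
    # Assumption: the exact match runs over the whole store, not just the candidates.
    return "exact_key", [r for r in records if r["key"] == query_key]


# Toy records with hand-written 3-d "embeddings" (hypothetical data).
records = [
    {"key": ("cmpd-A", "target-X", "phase-1"), "vec": [1.0, 0.0, 0.01]},
    {"key": ("cmpd-A", "target-X", "phase-2"), "vec": [1.0, 0.0, 0.02]},
    {"key": ("cmpd-A", "target-X", "phase-3"), "vec": [1.0, 0.01, 0.0]},
    {"key": ("project", "cache", "redis-vetoed"), "vec": [0.0, 1.0, 0.0]},
]

crowded = retrieve([1.0, 0.0, 0.015], ("cmpd-A", "target-X", "phase-2"), records)
sparse = retrieve([0.0, 1.0, 0.0], ("project", "cache", "redis-vetoed"), records)
print(crowded[0], crowded[1][0]["key"])
print(sparse[0], sparse[1][0]["key"])
```

The first query lands among three near-identical trial records, so the function routes to exact key matching. The second lands in a sparse neighborhood, so the top embedding hit is returned directly.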

That is a useful design pattern beyond this paper. Retrieval systems should not ask only “what is closest to the query?” They should also ask “are the closest candidates too similar to trust the distance ranking?”

Knowledge Objects separate storage from reasoning

Knowledge Objects are the paper’s proposed external memory layer. A KO is a discrete, hash-addressed tuple, roughly:

$$(\text{subject},\ \text{predicate},\ \text{object},\ \text{metadata})$$

The LLM still does the language work. It parses a natural-language query into a structured tuple, retrieves the relevant KO from a database, and then uses the retrieved fact to answer. The key difference is that the fact itself is not stored as prose inside the prompt. It lives as an addressable object with provenance.
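A minimal sketch of what “addressable” could mean in practice follows. The article does not specify the paper's hashing scheme, so the key below (a SHA-256 digest of the normalized subject and predicate) is an assumption, as are the class names, the field names, and the example record.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class KnowledgeObject:
    subject: str
    predicate: str
    object_value: str                              # the "object" slot of the tuple
    metadata: dict = field(default_factory=dict)   # provenance, timestamps, etc.


def ko_key(subject: str, predicate: str) -> str:
    """Stable address for a (subject, predicate) pair; the scheme is an assumption."""
    canonical = f"{subject.strip().lower()}|{predicate.strip().lower()}"
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class KOStore:
    """In-memory stand-in for the external database."""

    def __init__(self):
        self._records = {}

    def put(self, ko: KnowledgeObject) -> None:
        self._records[ko_key(ko.subject, ko.predicate)] = ko

    def get(self, subject: str, predicate: str):
        """Exact lookup by address; returns None when nothing is stored."""
        return self._records.get(ko_key(subject, predicate))


store = KOStore()
store.put(KnowledgeObject("project", "retry_limit", "7",
                          {"source": "hypothetical design review"}))
print(store.get("Project", " retry_limit "))
print(store.get("project", "cache_backend"))
```

The lookup touches one record regardless of how many are stored, and a missing fact comes back as an explicit `None` rather than a plausible default.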

This gives the architecture three practical properties.

First, lookup cost does not grow with the full corpus. The paper describes a pipeline using a lightweight model for parsing and a primary model for answer generation, with a small fixed prompt per query. The full knowledge base is not reprocessed every time.

Second, compaction does not destroy the stored fact. A database record does not become shorter because a chat session is summarized. This sounds embarrassingly obvious, which is often what good systems design feels like after someone stops trying to make the model do every job.

Third, provenance becomes available as a first-class feature. In enterprise settings, “who said this, when, under what evidence, and with what confidence?” is not decorative metadata. It is how teams audit decisions, reverse bad assumptions, and avoid letting yesterday’s Slack message become tomorrow’s undocumented policy.

The paper reports KO accuracy of 100% across its core single-fact experiments, including scaling conditions that overflow in-context memory and compaction conditions where in-context memory loses facts. It also reports stronger multi-hop performance: 78.9% for KO-grounded retrieval versus 31.6% for full in-context presentation on 19 two-hop queries over a 500-fact corpus. On cross-domain synthesis, KO retrieval improves the judged composite score from 3.0/5 to 4.6/5, with groundedness rising from 2.2 to 4.8.

Those latter experiments should be interpreted carefully. The multi-hop test is small, and the cross-domain synthesis test uses an LLM judge. They are useful as exploratory extensions, not as proof that KOs solve all reasoning. Their value is directional: when the model receives the right facts rather than a large undifferentiated fact pile, compositional reasoning becomes easier and more grounded.

The evidence is strongest where the question is exact memory, not broad intelligence

The paper contains several experiments, but they do not all carry the same evidentiary weight.

| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Scaling structured facts in context | Main evidence | Frontier models can retrieve structured facts very well inside the context window | That long context is economical or persistent |
| 8,000+ fact overflow | Main evidence | Context capacity is a hard system limit | That all applications will hit the limit quickly |
| 36.7x fact compaction | Main evidence | Summarization destroys fact-level recall under heavy compression | That every production summarizer loses exactly 60% |
| Cascading goal compaction | Main evidence | Project constraints decay silently across repeated summaries | That the exact decay curve generalizes to all workflows |
| Cross-model compaction | Robustness check | Compaction loss is not obviously one-model-specific | That all future compaction methods fail identically |
| Adversarial retrieval | Stress test / comparison | Dense retrieval struggles with near-duplicate facts | That dense retrieval is poor for ordinary corpora |
| KO parsing robustness | Implementation test | Parsing works well in several noisy conditions | That predicate normalization is solved |
| Multi-hop and synthesis | Exploratory extension | Retrieved structured facts can improve grounded reasoning | That KOs are a full reasoning engine |

This distinction matters because the paper could easily be oversold. The claim to take away is not “Knowledge Objects make AI smart.” The better-supported and more useful claim is narrower: when applications require durable, exact, auditable memory of facts and constraints, storing those facts as prompt text is the wrong persistence layer.

That is enough. Not every paper needs to promise artificial general competence before it becomes operationally useful.

The business value is not a nicer chatbot; it is fewer silent reversions

For business teams, the most important implication is not that Knowledge Objects might reduce token cost, although the paper’s cost comparison is sharp. At 7,000 facts, the paper estimates in-context memory at about $0.568 per query versus $0.002 for KO retrieval, a 252x difference. For 25,000 annual queries, that translates to roughly $14,201 versus $56 under the paper’s pricing assumptions.
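The annual figures follow from multiplying the per-query cost by the query volume, as the short check below shows. It uses the per-query numbers as quoted above. Those appear to be rounded, since they reproduce the annual totals only approximately; that reading is an inference, not something the article states.

```python
# Per-query costs as quoted (apparently rounded) in the article.
IN_CONTEXT_PER_QUERY = 0.568
KO_PER_QUERY = 0.002
ANNUAL_QUERIES = 25_000


def annual_cost(per_query: float, queries: int = ANNUAL_QUERIES) -> float:
    """Yearly spend for a fixed per-query cost and query volume."""
    return per_query * queries


print(f"in-context: ${annual_cost(IN_CONTEXT_PER_QUERY):,.0f}")
print(f"KO:         ${annual_cost(KO_PER_QUERY):,.0f}")
```

With the rounded inputs this gives about $14,200 and $50, close to the quoted $14,201 and $56.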

Cost matters. But the deeper value is behavioral stability.

Many AI deployments fail not because the model cannot produce a plausible answer, but because it cannot maintain the specific history that makes the answer appropriate. The model knows general best practices. The organization needs remembered exceptions.

Those exceptions often look small:

| Durable memory item | Why prompt-summary memory is risky | KO-style operational handling |
|---|---|---|
| Client vetoes | Summaries drop “negative” decisions as less salient | Store vetoes as explicit constraints with provenance |
| Compliance boundaries | Defaults may violate jurisdiction-specific rules | Store deployment, data, and retention constraints as queryable objects |
| Technical decisions | Later advice reverts to common patterns | Store architecture choices and rejected alternatives |
| Experiment failures | Failed paths vanish from summaries | Store negative results as first-class evidence |
| Contact and ownership rules | Project coordination details compress poorly | Store responsibility and escalation facts separately |

This is where Cognaptus would translate the paper into implementation practice: do not ask the model to “remember the project.” Decide which parts of the project deserve durable identity.

A useful enterprise memory layer should distinguish at least four categories:

  1. Facts: values, parameters, metrics, IDs, dates, names, and source-grounded statements.
  2. Decisions: selected options, rejected options, and reasons.
  3. Constraints: rules that should govern future output even when not restated.
  4. Provenance: where each fact or decision came from, when it was recorded, and how reliable it is.
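One way to encode the four categories is sketched below. The type names, fields, sources, and dates are all hypothetical. Provenance is modeled as a record attached to every item rather than as a separate kind, which is a design choice of this sketch, not something the paper prescribes.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Kind(Enum):
    FACT = "fact"
    DECISION = "decision"
    CONSTRAINT = "constraint"


@dataclass
class Provenance:
    source: str         # e.g. a contract, ticket, or meeting note
    recorded_on: date
    reliability: str    # free-form here; a real system would constrain the values


@dataclass
class MemoryItem:
    kind: Kind
    subject: str
    predicate: str
    value: str
    provenance: Provenance
    rationale: Optional[str] = None   # mainly useful for decisions and vetoes


def active_constraints(items):
    """Constraints that should govern output even when nobody restates them."""
    return [item for item in items if item.kind is Kind.CONSTRAINT]


# Hypothetical records built from the constraints mentioned earlier in the article.
notes = Provenance("hypothetical kickoff notes", date(2026, 1, 15), "confirmed")
items = [
    MemoryItem(Kind.CONSTRAINT, "runtime", "python_version", "3.11", notes),
    MemoryItem(Kind.CONSTRAINT, "deployment", "allowed_regions", "EU only", notes),
    MemoryItem(Kind.DECISION, "cache", "rejected_option", "Redis", notes,
               rationale="client veto"),
    MemoryItem(Kind.FACT, "http_client", "retry_limit", "7", notes),
]

for item in active_constraints(items):
    print(item.subject, item.predicate, item.value)
```

Separating kinds makes it cheap to pull every standing constraint into a prompt on each turn, while facts and decisions are fetched only when a query needs them.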

The model can still summarize. It can still reason. It can still write the report, draft the email, generate the code, or explain the trade-off. But durable memory should not be a side effect of yesterday’s prompt.

Where the paper’s proposal still needs engineering discipline

Knowledge Objects are not magic beans, even if the name sounds suspiciously like something a vendor would put on a booth banner.

The paper’s own limitations point to the hard parts. The main corpora are synthetic pharmacology triples with unambiguous ground truth. Real business knowledge is messier. Predicates mutate. Terms drift. A “client veto” may later become a “temporary exception.” Two departments may assert conflicting rules. A decision may expire. A regulatory constraint may depend on jurisdiction, contract type, and data class.

The KO pipeline also depends on parsing natural language into the right tuple. The paper’s robustness tests show strong parsing under clean queries, clinical abstracts, conversational text, and coreference-heavy inputs, but only 80% accuracy under messy query phrasing. That failure mode is not fatal, but it is very real. Predicate normalization, synonym mapping, schema governance, and fallback search are not optional in serious deployments.
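Predicate normalization with a fallback can be sketched as a synonym table plus fuzzy matching. The table contents and the similarity cutoff below are hypothetical, and `difflib` stands in for whatever matcher a real deployment would use; the paper's own parser is an LLM, which this sketch does not attempt to reproduce.

```python
import difflib

# Hypothetical synonym table; a real deployment would maintain it under schema governance.
CANONICAL_PREDICATES = {
    "binding_affinity": ["binding affinity", "affinity", "ki", "binds with"],
    "deployment_region": ["deployment region", "deploy to", "hosting region"],
}

# Map every alias, and each canonical name itself, to its canonical predicate.
_LOOKUP = {
    alias: canon
    for canon, aliases in CANONICAL_PREDICATES.items()
    for alias in aliases + [canon]
}


def normalize_predicate(raw: str, cutoff: float = 0.75):
    """Return the canonical predicate, or None if nothing is close enough."""
    text = raw.strip().lower().replace("-", " ")
    if text in _LOOKUP:
        return _LOOKUP[text]
    # Fallback: tolerate typos and small phrasing differences.
    close = difflib.get_close_matches(text, list(_LOOKUP), n=1, cutoff=cutoff)
    return _LOOKUP[close[0]] if close else None


print(normalize_predicate("Binding-Affinity"))
print(normalize_predicate("bindng affinity"))
print(normalize_predicate("xyz"))
```

A `None` result is the important branch: it is the signal to run a broader search or ask a clarifying question instead of guessing a predicate.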

There is also a design question around granularity. Store too little, and the memory misses what matters. Store too much, and the system becomes a junk drawer with hash keys. Store facts without lifecycle rules, and the memory becomes stale. Store decisions without provenance, and no one knows whether a constraint came from a signed contract or a Monday brainstorm conducted under caffeine deprivation.

The architecture therefore shifts the challenge. It does not remove it. Instead of hoping the prompt summary preserves everything important, teams must define what “important” means operationally.

That is progress. A hard systems problem is still better than a soft hallucination disguised as memory.

Bigger prompts are useful working memory, not durable institutional memory

The paper’s most useful correction is subtle: it does not dismiss long-context models. It shows where they are strong.

Long context is excellent for active reasoning over a bounded workspace. If the facts are present, structured, and within the window, frontier models can retrieve them impressively well. For short projects, exploratory analysis, and temporary work sessions, in-context memory is convenient and often sufficient.

But enterprise memory has different requirements. It must survive session boundaries. It must resist compaction loss. It must support deletion, updates, provenance, audit, access control, and selective retrieval. It must remember not only what is generally true, but what this client, this system, this contract, this experiment, or this product team decided.

A larger context window delays the moment when memory management becomes necessary. It does not eliminate memory management. Eventually the system must compress, retrieve, archive, update, or forget. At that moment, architecture matters more than window size.

The paper’s comparison can be compressed into one practical rule:

Use context for thinking. Use structured storage for remembering.

That rule is not glamorous. It will not produce a keynote demo where the model appears to “remember everything.” It will, however, reduce the chance that an agent confidently violates a forgotten constraint because a summary decided it was less important than a polite transition sentence.

In other words, the future of AI memory may look less like a bigger brain and more like a disciplined database with an LLM interface. Less mystical, more useful. Tragic for the marketing department. Good for everyone else.

Cognaptus: Automate the Present, Incubate the Future.


[1] Oliver Zahn, Simran Chana, et al., “Facts as First-Class Objects: Knowledge Objects for Persistent LLM Memory,” arXiv:2603.17781, 2026, https://arxiv.org/abs/2603.17781