Memory is useful until nobody can find where it lives.

That, in miniature, is the operational problem with today’s language models. They can answer questions, imitate expertise, retrieve fragments of the past, and produce very confident nonsense with the composure of a senior consultant who has just discovered bullet points. But when a model gives a wrong factual answer, the organisation deploying it faces an awkward question: where, exactly, is that wrong fact stored?

In a database, you update a row. In a document system, you replace a file. In a human team, you send an email and hope somebody reads it. In a conventional Transformer, factual knowledge is distributed across parameters. The knowledge is not sitting in a convenient drawer labelled “presidents,” “drug contraindications,” or “pricing policy.” It is smeared across weights, activations, and learned associations. Elegant for scale. Annoying for maintenance. Terrible for audit.

The paper behind ExplicitLM attacks precisely that issue. Yu and colleagues propose a Transformer-style architecture that separates part of the model’s knowledge from its parameters and places it into explicit, inspectable memory banks.¹ This is not merely a faster retrieval wrapper or a polite rebranding of RAG. The interesting move is architectural: factual knowledge becomes a token-sequence memory object that the model can retrieve during its own computation, while the retrieval mechanism remains differentiable enough to train with the model.

The paper’s commercial relevance is therefore not “here is a better chatbot.” That would be the usual, and usually lazy, summary. The more useful reading is: ExplicitLM is an early attempt to make model knowledge behave less like mythology and more like infrastructure.

The mechanism: give factual knowledge an address

ExplicitLM begins from a simple division. Some knowledge is implicit: grammar, style, semantic patterns, the messy statistical texture of language. Some knowledge is explicit: facts, entity relationships, time-sensitive claims, and statements with clear truth conditions. The Eiffel Tower has a height. A president changes. A medical guideline can be superseded. These are not vibes. They are candidates for storage.

The model therefore introduces a shared Memory Bank accessible across Transformer layers. In the paper’s implementation, the memory bank has capacity $N = 10^6$ entries, and each entry is a token-index sequence of maximum length $L = 16$. In plain English: the memory is not an opaque vector blob only a GPU could love. Each entry can be decoded back into human-readable text.

That matters because the model’s knowledge becomes, at least partly, inspectable. A stored fact can theoretically be examined, traced to a source, modified, or removed. The paper also maps memory entries to source UUIDs during dataset construction, which points toward provenance tracking rather than just retrieval convenience.
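To make “decodable” concrete, here is a minimal Python sketch of a single memory entry. The `MemoryEntry` class, the toy `vocab`, the padding id and the UUID string are all hypothetical; only the 16-token limit and the idea of linking an entry to a source UUID come from the paper as described above.

```python
from dataclasses import dataclass
from typing import Dict, List

MAX_LEN = 16  # L = 16 in the paper's implementation
PAD_ID = 0    # hypothetical padding id

@dataclass
class MemoryEntry:
    token_ids: List[int]  # token-index sequence, at most MAX_LEN ids
    source_uuid: str      # provenance link back to the source record
    frozen: bool = False  # True for curated explicit knowledge

    def __post_init__(self) -> None:
        if len(self.token_ids) > MAX_LEN:
            raise ValueError(f"entry has {len(self.token_ids)} tokens; max is {MAX_LEN}")

    def padded(self) -> List[int]:
        # Fixed-length view, the way a tensor-backed bank would store it.
        return self.token_ids + [PAD_ID] * (MAX_LEN - len(self.token_ids))

def decode(entry: MemoryEntry, vocab: Dict[int, str]) -> str:
    # Map ids back to words, skipping padding: this is what makes an entry inspectable.
    return " ".join(vocab[t] for t in entry.token_ids if t != PAD_ID)

vocab = {1: "Eiffel", 2: "Tower", 3: "located", 4: "in", 5: "Paris"}  # toy vocabulary
entry = MemoryEntry([1, 2, 3, 4, 5], source_uuid="hypothetical-uuid-0001", frozen=True)
print(decode(entry, vocab))  # Eiffel Tower located in Paris
```

Because the entry is just ids plus a provenance string, inspecting, correcting or deleting it is an ordinary data operation rather than a weight edit.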

The architecture splits the Memory Bank into two regions:

| Memory region | What it stores | How it behaves | Operational interpretation |
|---|---|---|---|
| Frozen explicit memory | Curated factual entries such as entity relationships and time-sensitive facts | Immutable during training | Protects verified facts from being washed away by gradient updates |
| Updatable implicit memory | Learned linguistic and semantic patterns | Updated during training using EMA-style updates | Lets the model adapt without pretending every useful pattern is a clean database fact |

The paper initially describes a default freeze rate of $\rho = 0.2$, meaning 20% of the memory is frozen explicit knowledge and 80% remains updatable. Later sensitivity experiments find that $\rho = 0.4$ performs best across most training sizes. That distinction is important. The concept does not depend on a magic 20/80 ratio. The evidence suggests the explicit-versus-implicit balance is a tuning problem, not a slogan.
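A sketch of how a freeze rate might partition the bank and shield frozen slots from updates. Assumptions, loudly: frozen entries occupy the first $\rho N$ slots, each entry carries a vector representation that the EMA acts on, and the update targets and `decay` value are placeholders. The article says only that updatable memory uses “EMA-style updates.”

```python
from typing import List

Vector = List[float]

def freeze_mask(num_entries: int, rho: float) -> List[bool]:
    # Hypothetical layout: the first round(rho * N) slots hold curated, frozen facts.
    n_frozen = round(rho * num_entries)
    return [i < n_frozen for i in range(num_entries)]

def ema_update(memory: List[Vector], targets: List[Vector],
               frozen: List[bool], decay: float = 0.99) -> List[Vector]:
    # Move each updatable entry toward its target: m <- decay * m + (1 - decay) * t.
    # Frozen entries are copied unchanged, so training updates never touch them.
    updated = []
    for vec, target, is_frozen in zip(memory, targets, frozen):
        if is_frozen:
            updated.append(list(vec))
        else:
            updated.append([decay * m + (1 - decay) * t for m, t in zip(vec, target)])
    return updated

mask = freeze_mask(10, 0.4)
print(sum(mask))  # 4 frozen slots, 6 updatable
```

Changing `rho` moves the boundary between the two regions without touching the update rule, which is why the split can be treated as a tuning knob.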

This is the paper’s first real contribution: it changes the storage model. Instead of asking whether all knowledge can be compressed into parameters, it asks whether factual knowledge should be.

A reasonable enterprise architect might respond: “Congratulations, you have invented a database.” Not quite. A database waits outside the model. ExplicitLM’s memory is inside the model’s computational loop. That is where the second contribution appears.

This is not ordinary RAG, although the family resemblance is obvious

The likely misconception is that ExplicitLM is just Retrieval-Augmented Generation with a more academic haircut. The confusion is understandable. Both systems involve retrieving external information. Both are motivated by stale or incomplete model knowledge. Both promise some degree of updateability.

But their control surfaces differ.

In a typical RAG pipeline, a retriever finds documents or passages and appends them to the prompt context. The language model consumes the retrieved context but is usually not trained end-to-end with the retriever in the same architectural loop. The retrieval store is external, and in most deployments neither the retriever nor the model’s weights are optimised for how well the other performs.

ExplicitLM instead inserts memory retrieval into the model architecture. Each layer can retrieve from the shared Memory Bank. The retrieval mechanism is jointly optimised with the language modelling objective. The paper is explicit on this point: unlike RAG systems with frozen retrieval components, ExplicitLM uses differentiable selection so retrieval and generation can learn together.
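The difference in control flow can be sketched in a few lines. This is a toy, not ExplicitLM’s code: `rag_prompt` and `forward_with_layer_memory` are hypothetical names, layers are plain Python callables, and fusing memory into the hidden state by element-wise addition is an assumption chosen for brevity.

```python
from typing import Callable, List

Vector = List[float]

def rag_prompt(question: str, passages: List[str]) -> str:
    # RAG: retrieval happens once, outside the model; results become prompt text.
    context = "\n".join(passages)
    return f"{context}\n\nQuestion: {question}"

def forward_with_layer_memory(hidden: Vector,
                              layers: List[Callable[[Vector], Vector]],
                              retrieve: Callable[[Vector, int], Vector]) -> Vector:
    # In-architecture memory: every layer queries the shared bank with its current
    # hidden state and fuses the retrieved representation back in (here: by addition).
    for depth, layer in enumerate(layers):
        memory_vec = retrieve(hidden, depth)
        hidden = layer([h + m for h, m in zip(hidden, memory_vec)])
    return hidden
```

In the RAG function the model never influences what is retrieved. In the layered version, `retrieve` sees the evolving hidden state at every depth, which is what makes it possible to train retrieval and generation together.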

A compact way to see the difference:

| Question | Ordinary RAG | ExplicitLM |
|---|---|---|
| Where is the knowledge store? | Outside the model, in a separate store | Inside the model architecture as a Memory Bank |
| What is retrieved? | Usually text chunks or documents | Token-sequence knowledge entries |
| Is retrieval jointly trained with generation? | Often no, or only partially | Yes, through differentiable retrieval |
| Is memory inspectable? | The external corpus is inspectable | Individual model-accessible memory entries are designed to be human-readable |
| Main business value | Add current context at inference time | Make part of model knowledge auditable, editable, and traceable |

This does not make RAG obsolete. Please do not fire your retrieval engineers in a burst of architectural enthusiasm. RAG remains practical because it works with existing models and external knowledge bases. ExplicitLM is more fundamental and less immediately deployable: it proposes a different way to organise model knowledge.

The business distinction is crucial. RAG is usually an application pattern. ExplicitLM is a model design proposal.

The retrieval trick: discrete memory without breaking training

The difficult part is obvious once stated. If the memory bank contains a million entries, the model cannot afford a naive comparison of every query against every memory entry at every layer. That would be less “explicit memory” and more “expensive regret.”

ExplicitLM uses a two-stage retrieval mechanism.

First, it performs coarse filtering using product-key decomposition. The model maps the input into a query vector, compares it with product keys assigned to memory entries, and retrieves a candidate set $I$. Instead of searching over all $N$ entries directly, the keys are decomposed into Cartesian-product components, reducing the stated complexity from $O(N \cdot |I|)$ to $O(\sqrt{N} \cdot |I|)$, assuming the candidate-set size $|I|$ is much smaller than $\sqrt{N}$.
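The coarse filter resembles the product-key memory of Lample et al. (2019), and the sketch below assumes ExplicitLM’s version has that shape; the article confirms only that keys are decomposed into Cartesian-product components. Each full key is the concatenation of one sub-key from each of two sets of size $\sqrt{N}$, so the query is scored against $2\sqrt{N}$ sub-keys plus $k^2$ combined candidates instead of all $N$ keys. Function names and sizes are illustrative.

```python
import heapq
import random
from itertools import product
from typing import List

Vector = List[float]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def product_key_topk(query: Vector, keys_a: List[Vector], keys_b: List[Vector], k: int) -> List[int]:
    # Entry i * n + j has full key keys_a[i] + keys_b[j], so N = n * n entries
    # are addressed with only 2 * n stored sub-keys (n = sqrt(N)).
    half = len(query) // 2
    s_a = [dot(query[:half], key) for key in keys_a]  # n scores, not N
    s_b = [dot(query[half:], key) for key in keys_b]
    top_a = heapq.nlargest(k, range(len(s_a)), key=s_a.__getitem__)
    top_b = heapq.nlargest(k, range(len(s_b)), key=s_b.__getitem__)
    # Score only the k * k Cartesian-product candidates, then keep the best k.
    n = len(keys_b)
    candidates = [(s_a[i] + s_b[j], i * n + j) for i, j in product(top_a, top_b)]
    return [idx for _, idx in heapq.nlargest(k, candidates)]

def brute_force_topk(query: Vector, keys_a: List[Vector], keys_b: List[Vector], k: int) -> List[int]:
    # Reference implementation: score all N entries directly.
    half = len(query) // 2
    n = len(keys_b)
    scores = [(dot(query[:half], a) + dot(query[half:], b), i * n + j)
              for (i, a), (j, b) in product(enumerate(keys_a), enumerate(keys_b))]
    return [idx for _, idx in heapq.nlargest(k, scores)]

rng = random.Random(0)
dim, n, k = 8, 32, 4  # N = 32 * 32 = 1024 entries
query = [rng.gauss(0, 1) for _ in range(dim)]
keys_a = [[rng.gauss(0, 1) for _ in range(dim // 2)] for _ in range(n)]
keys_b = [[rng.gauss(0, 1) for _ in range(dim // 2)] for _ in range(n)]
print(product_key_topk(query, keys_a, keys_b, k) == brute_force_topk(query, keys_a, keys_b, k))  # True
```

Because a combined score is the sum of two half-scores, any entry in the true top-$k$ must also rank in the top $k$ of each half, so the shortlist loses nothing while touching far fewer keys.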

Second, it applies fine-grained similarity selection. Among the candidates, it computes cosine similarity and uses Gumbel-Softmax with a straight-through estimator. In the forward pass, the model selects a discrete memory entry. In the backward pass, it keeps gradients flowing through soft weights.
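A pure-Python sketch of the fine-grained step: cosine similarity over the shortlist, Gumbel noise, a discrete pick in the forward pass, and a straight-through gradient that treats the pick as if it were the soft distribution. The temperature `tau`, the toy vectors and the upstream gradient are assumptions; a real implementation would rely on an autodiff framework rather than the hand-written backward function used here.

```python
import math
import random
from typing import List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def gumbel_softmax_st_forward(logits: List[float], tau: float,
                              rng: random.Random) -> Tuple[List[float], List[float], int]:
    # Perturb each logit with Gumbel(0, 1) noise, -log(-log(u)), then divide by tau.
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1 - 1e-12)  # keep u strictly inside (0, 1)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    soft = softmax(noisy)
    choice = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == choice else 0.0 for i in range(len(soft))]  # discrete pick
    return hard, soft, choice

def gumbel_softmax_st_backward(soft: List[float], tau: float,
                               upstream: List[float]) -> List[float]:
    # Straight-through: the forward pass emitted `hard`, but gradients are computed
    # as if it had emitted `soft`: dL/dlogit_j = soft_j * (g_j - sum_i g_i * soft_i) / tau.
    inner = sum(g * s for g, s in zip(upstream, soft))
    return [s * (g - inner) / tau for s, g in zip(soft, upstream)]

rng = random.Random(0)
query = [1.0, 0.0]
candidates = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]  # shortlist from the coarse filter
logits = [cosine(query, c) for c in candidates]
hard, soft, choice = gumbel_softmax_st_forward(logits, tau=0.5, rng=rng)
grads = gumbel_softmax_st_backward(soft, tau=0.5, upstream=[0.3, -0.2, 0.5])
```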

That is the engineering compromise: choose one memory entry as if retrieval were discrete, but train the system as if the choice remained differentiable enough to optimise. It is inelegant in the way useful machine learning tricks often are. The plumbing matters more than the poetry.

The training objective combines three losses:

$$ L_{\text{total}} = L_{\text{CE}} + \lambda_{\text{sim}}L_{\text{sim}} + \lambda_{\text{div}}L_{\text{div}} $$

The cross-entropy loss keeps language modelling on track. The similarity loss pushes retrieval toward semantically relevant memories. The diversity loss discourages the system from collapsing into a tiny neighbourhood of overused entries. This last part matters because a memory system that retrieves the same fashionable facts repeatedly is not memory. It is a corporate slide deck.
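The article does not give the exact forms of $L_{\text{sim}}$ and $L_{\text{div}}$, so the sketch below uses hypothetical stand-ins: one minus cosine similarity for the similarity term, and the KL divergence between memory-usage frequencies and a uniform distribution for the diversity term. The $\lambda$ values are placeholders, not the paper’s settings.

```python
import math
from typing import List

def similarity_loss(cos_sims: List[float]) -> float:
    # Hypothetical L_sim: average (1 - cosine) between queries and selected memories.
    return sum(1.0 - c for c in cos_sims) / len(cos_sims)

def diversity_loss(usage: List[float]) -> float:
    # Hypothetical L_div: KL(usage || uniform) = log K - H(usage).
    # Zero when retrieval is spread evenly; log K when it collapses onto one entry.
    entropy = -sum(p * math.log(p) for p in usage if p > 0)
    return math.log(len(usage)) - entropy

def total_loss(l_ce: float, l_sim: float, l_div: float,
               lambda_sim: float, lambda_div: float) -> float:
    # L_total = L_CE + lambda_sim * L_sim + lambda_div * L_div
    return l_ce + lambda_sim * l_sim + lambda_div * l_div

print(diversity_loss([1.0, 0.0, 0.0, 0.0]))  # collapsed usage: log(4), about 1.386
```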

The main evidence: gains are largest when training data is scarce

The experiments are built around controlled knowledge tasks derived from the Memory Bank. The dataset combines Wikipedia, Project Gutenberg, and OpenWebText. Wikipedia-derived structured entries are used for explicit knowledge graph extraction and Memory Bank initialisation. The evaluation uses three tasks:

  1. Object Prediction: given a subject-predicate pair, predict the correct object among candidates.
  2. Relation Reasoning: given entity token pairs, infer their semantic relationship.
  3. Fact Verification: classify statements as true or false, with negative samples generated by token substitution.
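As an illustration of the third task, a negative example can be built by swapping one token of a true fact. Which token the paper substitutes is not specified here, so replacing the object with another entity from a hypothetical `object_pool` is an assumption.

```python
import random
from typing import List, Tuple

Fact = Tuple[str, str, str]  # (subject, predicate, object)

def make_verification_pair(fact: Fact, object_pool: List[str],
                           rng: random.Random) -> List[Tuple[Fact, bool]]:
    # Positive: the stored fact. Negative: the same fact with its object swapped
    # for a different entity (one plausible reading of "token substitution").
    subject, predicate, obj = fact
    alternatives = [o for o in object_pool if o != obj]
    corrupted = (subject, predicate, rng.choice(alternatives))
    return [(fact, True), (corrupted, False)]

pairs = make_verification_pair(("Eiffel Tower", "located_in", "Paris"),
                               ["Paris", "Rome", "Berlin"], random.Random(0))
```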

The paper’s design tries to prevent a trivial memorisation story: test samples come from the frozen partition, while training excludes tokens from those frozen entries. That does not make the benchmark equivalent to production reality, but it does make the controlled comparison more meaningful.
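One way to operationalise that exclusion is a leakage filter that drops any training sequence containing a frozen entry verbatim. This is a hedged reading: the article does not say whether the paper matches whole entries, n-grams or individual tokens, and the helper names are hypothetical.

```python
from typing import List, Sequence

def contains_subsequence(seq: Sequence[int], sub: Sequence[int]) -> bool:
    # True if `sub` appears as a contiguous run of token ids inside `seq`.
    n = len(sub)
    return any(list(seq[i:i + n]) == list(sub) for i in range(len(seq) - n + 1))

def leakage_free(train: List[List[int]], frozen_entries: List[List[int]]) -> List[List[int]]:
    # Keep only training sequences that contain no frozen entry verbatim.
    return [s for s in train if not any(contains_subsequence(s, e) for e in frozen_entries)]
```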

Here is the central result, comparing ExplicitLM with a standard Transformer baseline under different supervised fine-tuning data volumes:

| Data volume | Model | Object Prediction | Relation Reasoning | Fact Verification |
|---|---|---|---|---|
| 10k | Baseline | 7.86% | 38.27% | 61.71% |
| 10k | ExplicitLM | 28.42% | 70.02% | 66.03% |
| 25k | Baseline | 22.16% | 79.99% | 71.49% |
| 25k | ExplicitLM | 63.12% | 87.85% | 79.79% |
| 50k | Baseline | 30.23% | 83.80% | 83.34% |
| 50k | ExplicitLM | 73.90% | 90.41% | 86.25% |
| 75k | Baseline | 40.64% | 87.66% | 86.40% |
| 75k | ExplicitLM | 79.76% | 92.12% | 88.74% |
| 100k | Baseline | 56.80% | 91.91% | 88.92% |
| 100k | ExplicitLM | 80.94% | 92.73% | 89.75% |

The pattern is not uniform, which makes it more interesting.

Object Prediction shows the strongest and most persistent gains. At 10k samples, ExplicitLM reaches 28.42% versus 7.86% for the baseline, a 3.62× improvement. At 50k samples, the absolute gain reaches 43.67 percentage points. Even at 100k, the memory-augmented model still leads by 24.14 points.

Relation Reasoning also improves sharply at low data volume: 70.02% versus 38.27% at 10k. But by 100k, the baseline nearly catches up, with ExplicitLM ahead by only 0.82 points. Fact Verification shows more modest gains that narrow at higher data volumes, down to 0.83 points at 100k.
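The headline ratio and gaps can be checked directly against the results table; this short script recomputes them from the table’s accuracies.

```python
# Accuracies (%) as (baseline, ExplicitLM), copied from the results table.
object_prediction = {"10k": (7.86, 28.42), "50k": (30.23, 73.90), "100k": (56.80, 80.94)}
relation_reasoning = {"100k": (91.91, 92.73)}
fact_verification = {"100k": (88.92, 89.75)}

def ratio(baseline: float, explicit: float) -> float:
    # Multiplicative improvement, e.g. "3.62x".
    return explicit / baseline

def gap_points(baseline: float, explicit: float) -> float:
    # Absolute difference in percentage points.
    return explicit - baseline

print(round(ratio(*object_prediction["10k"]), 2))         # 3.62
print(round(gap_points(*object_prediction["50k"]), 2))    # 43.67
print(round(gap_points(*object_prediction["100k"]), 2))   # 24.14
print(round(gap_points(*relation_reasoning["100k"]), 2))  # 0.82
print(round(gap_points(*fact_verification["100k"]), 2))   # 0.83
```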

The correct interpretation is not “ExplicitLM beats Transformers everywhere forever.” The better interpretation is narrower and more useful: explicit memory helps most when the task depends on precise entity-level recall and the supervised data budget is limited. As the baseline sees more data, it learns enough parametric structure to reduce the advantage, especially on reasoning and verification tasks.

That is not a weakness. It is a map of where the architecture might matter.

The diagnostic tests show the bottleneck is retrieval quality

The paper does more than report a scoreboard. It asks whether memory retrieval is actually related to successful prediction.

The Memory Bank hit-rate analysis examines Relation Reasoning and checks whether relevant memory is retrieved at each Transformer layer. Correctly answered samples show much higher memory hit rates than incorrect ones. Across models trained on 100k, 50k, 25k, 10k, and 5k samples, correct samples show overall hit rates of 71%, 65%, 66%, 71%, and 71%, respectively. Incorrect samples show 23%, 21%, 21%, 22%, and 37%.

This is mechanism validation, not just performance decoration. It supports the claim that the memory system is doing useful work rather than merely adding parameters under a respectable name.

The layer-wise analysis also finds elevated hit rates at layers L1 and L3. That suggests some layers become more important integration points for external memory. For a future engineering team, this matters because it hints that memory access may not need to be uniformly heavy across all layers. Selective placement could become a cost-control lever.
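A minimal sketch of how such hit rates could be computed, both overall by correctness and per layer. The data layout (one retrieved index per layer per sample, plus a set of relevant indices per sample) and the function names are assumptions about how the analysis is organised, not the paper’s code.

```python
from collections import defaultdict
from typing import Dict, List, Set

def hit_rates(retrieved: List[List[int]], relevant: List[Set[int]],
              correct: List[bool]) -> Dict[str, float]:
    # retrieved[s][l] is the memory index chosen at layer l for sample s.
    # A "hit" is a layer whose retrieved entry is in that sample's relevant set.
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for layers, rel, ok in zip(retrieved, relevant, correct):
        group = "correct" if ok else "incorrect"
        for idx in layers:
            hits[group] += idx in rel
            totals[group] += 1
    return {g: hits[g] / totals[g] for g in totals}

def per_layer_hit_rate(retrieved: List[List[int]], relevant: List[Set[int]]) -> List[float]:
    # Fraction of samples whose retrieval at each layer hit a relevant entry.
    n_layers = len(retrieved[0])
    return [sum(sample[layer] in rel for sample, rel in zip(retrieved, relevant)) / len(retrieved)
            for layer in range(n_layers)]
```

Comparing the “correct” and “incorrect” groups is the mechanism check; the per-layer list is what would reveal layers that act as integration points.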

Then comes the perfect-retrieval test. The authors intervene at layers L1 and L3, replacing the top-ranked retrieved candidate with the oracle-relevant memory entry. This is not main evidence for production performance. It is an upper-bound diagnostic: if retrieval were better, how much headroom remains?
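The intervention itself is simple to express. In this sketch, `ranked` maps a layer number to its candidates (best first). “Retain” keeps the model’s top choice; “Replace” swaps in the oracle entry, but only at the intervened layers. The names, the data structures and the use of 1 and 3 as layer keys are hypothetical.

```python
from typing import Dict, List, Set

def retrieve_with_oracle(ranked: Dict[int, List[int]], oracle_entry: int,
                         oracle_layers: Set[int]) -> Dict[int, int]:
    # Retain mode: pass an empty oracle_layers set and every layer keeps candidates[0].
    # Replace mode: layers in oracle_layers get the oracle-relevant entry instead.
    return {layer: oracle_entry if layer in oracle_layers else candidates[0]
            for layer, candidates in ranked.items()}

ranked = {1: [7, 3], 2: [4, 1], 3: [8, 2]}
print(retrieve_with_oracle(ranked, 42, {1, 3}))  # {1: 42, 2: 4, 3: 42}
```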

The answer is: some, but not infinite.

| Data volume | Retrieval mode | Object Prediction | Relation Reasoning | Fact Verification |
|---|---|---|---|---|
| 50k | Retain | 70.87% | 89.87% | 85.12% |
| 50k | Replace | 74.49% | 92.12% | 87.24% |
| 75k | Retain | 77.12% | 90.25% | 88.00% |
| 75k | Replace | 79.85% | 91.87% | 90.25% |
| 100k | Retain | 79.12% | 90.50% | 90.37% |
| 100k | Replace | 81.00% | 92.25% | 91.12% |

The average gain from perfect retrieval is 2.11 percentage points. At 50k, the average improvement is 2.66 points; it falls to 2.20 at 75k and 1.46 at 100k. This suggests two things. First, retrieval quality remains a real bottleneck. Second, as training data grows, the model appears to develop compensating internal representations, so retrieval perfection matters less.
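The averages are straightforward to reproduce from the Retain and Replace rows of the table:

```python
# (Object Prediction, Relation Reasoning, Fact Verification) accuracies in %.
retain = {"50k": (70.87, 89.87, 85.12), "75k": (77.12, 90.25, 88.00), "100k": (79.12, 90.50, 90.37)}
replace = {"50k": (74.49, 92.12, 87.24), "75k": (79.85, 91.87, 90.25), "100k": (81.00, 92.25, 91.12)}

def mean(xs):
    return sum(xs) / len(xs)

# Per-task gain from swapping in the oracle entry, grouped by data volume.
gains = {v: [after - before for before, after in zip(retain[v], replace[v])] for v in retain}
per_volume = {v: round(mean(g), 2) for v, g in gains.items()}
overall = round(mean([x for g in gains.values() for x in g]), 2)
print(per_volume)  # {'50k': 2.66, '75k': 2.2, '100k': 1.46}
print(overall)     # 2.11
```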

This is a useful restraint on the architecture’s claims. If perfect retrieval only adds a few points in later regimes, then the business case cannot rely solely on accuracy gains. The stronger case is governance: inspection, correction, provenance, and updateability.

The freeze-rate experiment is a sensitivity test, not a second thesis

The freeze-rate experiment varies $\rho$, the share of memory allocated to frozen explicit knowledge, and evaluates Relation Reasoning performance. Its likely purpose is sensitivity analysis: does the architecture depend on one fragile memory split?

The paper reports that ExplicitLM outperforms the baseline across freeze-rate configurations. In the low-data setting of 10k samples, the method achieves at least an 83% relative improvement regardless of $\rho$. At 100k, where the baseline already reaches 91.91%, ExplicitLM still maintains improvements of 0.3% to 3.3%.
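The “at least 83%” figure only makes sense as a relative gain: the main table’s 10k Relation Reasoning baseline is 38.27%, so an 83-point absolute gain would exceed 100%. Assuming that reading, the helper below computes it; plugging in the 10k figures from the main results table gives roughly the same number, although the table’s freeze-rate setting is not stated here.

```python
def relative_improvement(baseline: float, model: float) -> float:
    # Improvement as a fraction of the baseline score, not in percentage points.
    return (model - baseline) / baseline

# 10k Relation Reasoning from the main results table: 38.27% vs 70.02%.
print(f"{relative_improvement(38.27, 70.02):.1%}")  # 83.0%
```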

The more revealing result is non-monotonicity. Performance peaks around $\rho = 0.4$ across most training set sizes. Too little frozen memory may weaken factual preservation. Too much frozen memory may reduce the model’s ability to adapt through updatable entries.

For business readers, the lesson is mundane but important: explicit memory is not free governance magic. The partition between stable facts and adaptable patterns becomes a configuration decision. In regulated or high-change domains, that configuration would probably need monitoring, not a one-time setting chosen by a research team and then blessed forever.

What the paper directly shows, and what business should infer

The paper directly shows that a Transformer-like model with explicit token-sequence memory banks can outperform a standard Transformer baseline on controlled knowledge-intensive tasks. It shows especially strong gains in low-data object prediction and relation reasoning. It shows that successful retrieval correlates strongly with correct answers. It shows that better retrieval can still improve performance, although the marginal gain shrinks with more training data.

What Cognaptus infers is more strategic.

If this line of work matures, explicit memory could change how enterprises manage AI knowledge. Instead of treating a model as a sealed statistical artefact, organisations could separate at least some factual knowledge into inspectable, source-linked, editable memory. That would not eliminate hallucinations. It would not solve reasoning. It would not make procurement departments suddenly enjoyable. But it could make several operational workflows less absurd.

| Business need | How ExplicitLM points toward it | What remains uncertain |
|---|---|---|
| Knowledge updates | Factual entries can be stored outside core parameters and potentially modified directly | The paper does not demonstrate a full enterprise update workflow |
| Auditability | Token-sequence memory entries can be decoded and linked to source UUIDs | Inspection at scale, policy review, and conflict resolution remain open |
| Model governance | Explicit memory creates a clearer boundary between stable facts and learned patterns | Governance depends on memory curation quality and retrieval reliability |
| Lower-cost correction | Updating memory could be cheaper than retraining or weight editing | The paper does not benchmark real update cost against RAG or model editing systems |
| Domain deployment | Low-data gains suggest value where labelled task data is scarce but factual knowledge can be curated | Results are controlled, not proven on messy enterprise workloads |

The most plausible near-term pathway is not replacing RAG. It is hybrid governance architecture: conventional retrieval for broad documents, explicit model memory for validated high-value facts, and parametric learning for language and generalisation. In other words, a system where different kinds of knowledge stop pretending to be the same thing.

That would be progress. Slightly less glamorous than “AGI,” admittedly, but much more useful to anyone who has ever had to explain an AI failure to a compliance committee.

Boundaries: explicit memory still needs somebody to keep it honest

The paper is careful enough to note the central limitation: the current implementation requires manual curation of explicit knowledge entries. That is not a footnote. It is the main operational bottleneck.

If explicit memory is filled with stale, biased, inconsistent, or low-quality entries, inspectability merely makes the problem easier to admire. The model can retrieve the wrong thing more transparently. Useful, perhaps, but not salvation.

There are other boundaries.

First, the evaluation tasks are controlled and memory-derived. They are well suited to testing whether the architecture can use its Memory Bank, but they do not prove broad performance on open-ended enterprise question answering, agentic workflows, legal review, clinical decision support, or finance operations.

Second, the paper compares against a standard Transformer baseline, not a full modern production stack with RAG, tool use, reranking, guardrails, caching, monitoring, and human review. ExplicitLM’s real competition in business settings is not a naked Transformer. It is the messy but effective engineering pile companies actually deploy.

Third, retrieval remains imperfect. The perfect-retrieval experiment shows headroom, but not enough to justify naive confidence. Memory architecture does not remove the need for retrieval evaluation. It makes retrieval evaluation more central.

Fourth, memory entries are short token sequences. That is attractive for factual units, but enterprise knowledge often arrives as policies, contracts, exceptions, diagrams, tables, and contradictory emails from people with impressive titles. Compressing that into clean explicit units is a knowledge-engineering task. The 1990s called; it would like to remind us that ontology work was never dead, only unfashionable.

The larger shift: from bigger models to maintainable knowledge

ExplicitLM is valuable because it reframes a basic question. Instead of asking only how much knowledge a model can absorb, it asks how knowledge should be organised once the model is deployed.

That is the right question for business.

Scaling has made models capable. It has not made them maintainable. A model whose facts cannot be located, audited, updated, or removed is not a knowledge system in the enterprise sense. It is a powerful statistical instrument with a memory problem and excellent posture.

ExplicitLM does not solve that problem outright. It offers a design direction: keep linguistic competence in learned parameters, move selected factual knowledge into explicit memory, and train the model to use that memory as part of its own computation. The experiments show that this can improve controlled knowledge tasks, especially under low-data conditions. The diagnostic tests show that retrieval quality matters. The limitations show that curation and real-world validation remain stubbornly present, as reality often is.

The business takeaway is therefore disciplined: ExplicitLM is not a production-ready answer to AI knowledge governance. It is a serious architectural argument that governance may need to be designed into the model’s memory, not patched around the model’s outputs.

That is a quieter claim than most AI headlines. It is also more likely to matter.

Cognaptus: Automate the Present, Incubate the Future.


  1. Chengzhang Yu, Zening Lu, Chenyang Zheng, Chiyue Wang, Yiming Zhang, and Zhanpeng Jin, “ExplicitLM: Decoupling Knowledge from Parameters via Explicit Memory Banks,” arXiv:2511.01581, 2025.