Memory is usually sold as a virtue.
An AI agent with memory sounds safer, smarter, more personal, more autonomous. A warehouse robot remembers where boxes were placed. A navigation agent remembers which corridor led to the exit. A workflow agent remembers what the user asked yesterday and uses that context tomorrow. This is the comforting version of memory: the past as an asset.
Now add one small inconvenience: the world changes.
The box is moved. The corridor sign is replaced. The user’s instruction is superseded by a later one. Yesterday’s state is no longer context; it is contamination. At that point, the agent’s problem is not whether it can remember. The problem is whether it can stop obeying the wrong memory.
That is the central idea behind “Memory Retention Is Not Enough to Master Memory Tasks in Reinforcement Learning,” a 2026 paper by Oleg Shchendrigin, Egor Cherepanov, Alexey K. Kovalev, and Aleksandr I. Panov.[1] The paper is not mainly another “agent memory benchmark” paper, although it does introduce benchmarks. Its sharper claim is mechanical: memory in reinforcement learning should not be treated as a storage buffer. It should be treated as an update process, where old beliefs are selectively preserved, weakened, overwritten, or reorganized.
In other words, the uncomfortable skill is not recall. It is controlled forgetting.
Retention is easy to admire because it is easy to measure
Most memory benchmarks in reinforcement learning have a familiar shape. The agent sees a cue, the cue disappears, and the agent must use that cue later. A T-Maze is the classic toy version: the agent observes “left” or “right” at the start, walks forward, and must choose the correct turn at the end.
This is a useful diagnostic, but it tests a narrow memory skill. It rewards persistence. It asks whether information can survive delay.
That is not the same as asking whether the agent can revise its internal state when the original information becomes false.
The paper formalizes this distinction with a simple memory-update view. An agent has a memory state $m_t$. New input arrives. The next memory state is not just “old memory plus new token.” It can be decomposed into three operations:
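Written with $x_t$ as an assumed symbol for the new input (the paper's own notation may differ), an update of this kind has the form:

$$
m_{t+1} = W_\phi\big(F_\phi(m_t),\; E_\phi(x_t)\big)
$$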
Here, $F_\phi$ selects or forgets parts of the old memory, $E_\phi$ encodes the new input, and $W_\phi$ integrates both. This equation is not decorative math. It is the paper’s conceptual lever. It says that forgetting is not a bug in the memory system. It is one of the operations required for memory to remain useful.
A retention-only agent behaves as if the past deserves default authority. A rewriting-capable agent behaves as if the past is a hypothesis under revision. That difference matters under partial observability, where the agent cannot simply look at the full current state and correct itself. It must act through an internal belief state, and that belief state can go stale.
For business readers, this is the first useful correction. Longer context windows, larger memory stores, and cached histories do not automatically solve memory-heavy agency. They may simply preserve more obsolete information with greater confidence. Very enterprise. Very expensive. Very annoying.
Endless T-Maze turns memory from storage into replacement
The paper introduces two benchmark families. The first is Endless T-Maze.
The name is almost too friendly. The task is not.
In the classic T-Maze, the agent receives one cue and remembers it until one decision point. Endless T-Maze chains this structure into repeated corridors. At the beginning of each corridor, a fresh left/right cue appears. That cue invalidates the previous one. Only the newest instruction matters.
The task therefore changes the role of memory. The agent is no longer rewarded for preserving the first cue. It must overwrite the previous cue at each stage and use the currently valid one at the next junction.
The authors vary the corridor length $l$, the number of corridors $n$, and the sampling regime. In the Fixed regime, corridor lengths are constant. In the Uniform regime, corridor lengths vary randomly. That distinction is important because it separates agents that can learn a predictable rhythm from agents that can adapt to variable timing.
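The task structure can be sketched in a few lines of Python. This is a toy reconstruction rather than the paper's environment: the function names, the two hand-written policies, and the choice to draw Uniform lengths from 1 to $l$ are all assumptions made for illustration. The turn taken at each junction is simply whatever the policy currently holds in memory.

```python
import random

def corridor_lengths(l, n, regime, rng):
    """Fixed: every corridor has length l.
    Uniform: lengths drawn from 1..l (this range is an assumption of the sketch)."""
    if regime == "fixed":
        return [l] * n
    if regime == "uniform":
        return [rng.randint(1, l) for _ in range(n)]
    raise ValueError(f"unknown regime: {regime}")

def corridors_cleared(policy, l, n, regime, rng):
    """Run one episode and return how many junctions were passed correctly."""
    memory = None
    cleared = 0
    for length in corridor_lengths(l, n, regime, rng):
        cue = rng.choice(["left", "right"])
        memory = policy(memory, cue)        # the cue is visible only on entry
        for _ in range(length - 1):
            memory = policy(memory, None)   # remaining steps carry no cue
        if memory != cue:                   # the agent turns wherever memory points
            break
        cleared += 1
    return cleared

def retain_first(memory, cue):
    """Retention-only: the first cue ever seen keeps its authority."""
    return cue if memory is None else memory

def overwrite_latest(memory, cue):
    """Rewriting: a fresh cue replaces the stored one; otherwise keep it."""
    return memory if cue is None else cue

def success_rate(policy, l, n, regime, episodes=2000, seed=0):
    """Fraction of episodes in which all n junctions are passed."""
    rng = random.Random(seed)
    wins = sum(
        corridors_cleared(policy, l, n, regime, rng) == n for _ in range(episodes)
    )
    return wins / episodes
```

With $n=1$ both policies succeed every time, because there is nothing to overwrite. With $n=5$ the retention-only policy succeeds only when all five cues happen to match the first one, which is about one episode in sixteen in expectation.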
The paper’s Table 1 gives the cleanest signal.
In pure retention cases where $n=1$, PPO-LSTM, SHM, and FFM perform well in many settings. For example, PPO-LSTM reaches $1.00 \pm 0.00$ success in all $n=1$ Endless T-Maze rows, and FFM also reaches perfect or near-perfect scores in those rows. This tells us something limited but useful: several memory mechanisms can preserve a cue long enough to act on it.
The story changes when $n>1$, because now old cues become liabilities.
PPO-LSTM is the strongest overall. In the Fixed regime, it achieves perfect success in most non-trivial configurations, although it drops to $0.69 \pm 0.31$ for $l=10,n=5$ and $0.83 \pm 0.17$ for $l=10,n=10$. In the Uniform regime, it is strikingly robust, reaching $1.00 \pm 0.00$ in most cases and $0.98 \pm 0.02$ for $l=10,n=3$.
FFM also performs strongly in the Fixed regime, reaching $1.00 \pm 0.00$ across many non-trivial rows. But its performance collapses in the Uniform regime, where timing is less predictable. With $l=5,n=5$, FFM falls to $0.03 \pm 0.03$; with $l=5,n=10$, it reaches zero; with $l=10,n=3$ and beyond, it also reaches zero.
That contrast is the paper’s mechanism-first lesson. FFM can forget, but its forgetting appears tied to predictable structure. It can exploit a pattern. It cannot reliably adapt when the timing of that pattern becomes stochastic.
SHM is more limited. It performs respectably in some Fixed rows, such as $0.93 \pm 0.07$ for $l=10,n=3$, but collapses in harder Fixed and most Uniform rewriting cases. GTrXL, the transformer-based architecture, performs poorly across most Endless T-Maze configurations. The memory-free PPO-MLP behaves as expected: when memory is required, it has very little to say. Silence, in this case, is mathematically appropriate.
The ranking is not “old models beat new models.” It is “adaptive update beats passive context.”
It would be tempting to turn the result into a lazy slogan: LSTMs beat transformers. That would be both memorable and wrong in the usual internet way.
The more precise reading is this: architectures differ in how they manage obsolete information.
| Architecture family | What the paper tests | Observed pattern | Business translation |
|---|---|---|---|
| PPO-LSTM | Recurrent memory with explicit learnable gates | Strongest across rewriting and generalization tests | Adaptive forgetting can be more valuable than simply preserving more history |
| FFM | Structured memory with systematic forgetting | Strong in predictable Fixed T-Maze, weak in stochastic Uniform settings | Rule-like decay may work when operations are regular, but degrade when timing changes |
| SHM | Structured memory with dynamic significance updates but no explicit forgetting mechanism | Some Fixed success, weaker generalization | Updating importance is not the same as reliably deleting stale belief |
| GTrXL | Transformer-based cached context with gating/stabilization | Often poor in Endless T-Maze, with limited partial progress | Attention over history does not guarantee selective revision |
| PPO-MLP | No explicit memory | Weak when memory matters | No surprise; at least this baseline fails honestly |
The important distinction is not “modern” versus “classic.” The important distinction is whether the architecture has a mechanism that can decide what part of the old state should lose authority.
That is why the GTrXL result is useful even though it should not be overgeneralized into a grand verdict on all transformers. A transformer can attend to history. But attending to history is not equivalent to rewriting belief. If stale context remains available and there is no sufficiently effective selective deletion process, the agent can keep reasoning through outdated traces.
In language-model products, this failure mode is already familiar. A model remembers an old preference after the user corrected it. A support agent continues following a previous workflow after the ticket status changes. A planning agent keeps optimizing against a constraint that was removed ten minutes ago. The agent is not “forgetful.” It is worse: it remembers too well and updates too weakly.
The figures separate failure from partial failure
The paper’s figures are not just decorative summaries. They clarify what kind of failure is happening.
Figure 4 looks at intermediate progress for SHM and GTrXL in Endless T-Maze. This is not the main leaderboard result. It is a diagnostic extension: if an agent fails the full task, how far does it get before failure?
That matters because a zero or low success rate can hide different behaviors. One agent may fail immediately. Another may pass several corridors and then lose track. Figure 4 shows that SHM and GTrXL can sometimes make partial progress, often navigating close to the required number of corridors before failing. In tasks with $n=3$ and $n=5$, the paper notes that both agents often manage one corridor fewer than required.
This is an important interpretive detail. These architectures are not necessarily incapable of any short-term cue update. Their failure is more specific: their memory representations deteriorate as rewriting demands accumulate.
For system designers, this is the difference between a useless agent and a dangerously plausible one. The dangerously plausible agent works for the first few state changes, passes demos, handles easy scripts, and then fails when the sequence becomes longer or less regular. That is exactly the kind of failure that survives procurement.
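The partial-progress view is easy to add to any evaluation log. The sketch below assumes each episode records how many corridors (or, more generally, state changes) the agent handled before failing; the function name and the two sample logs are hypothetical and are not data from the paper.

```python
from collections import Counter

def summarize_progress(cleared_per_episode, n_required):
    """Report success rate alongside how far the agent got before failing."""
    episodes = len(cleared_per_episode)
    if episodes == 0:
        raise ValueError("need at least one episode")
    successes = sum(c == n_required for c in cleared_per_episode)
    return {
        "success_rate": successes / episodes,
        # Average fraction of the required sequence that was completed.
        "mean_progress": sum(cleared_per_episode) / (episodes * n_required),
        # How many episodes ended after 0, 1, 2, ... correct junctions.
        "histogram": dict(Counter(cleared_per_episode)),
    }

# Made-up logs for two hypothetical agents on an n=5 task.
fails_immediately = [0, 0, 0, 0]
fails_late = [4, 4, 3, 4]
```

Both sample logs have a success rate of zero, but their mean progress differs sharply. A leaderboard that reports only success would treat them as the same agent.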
Figure 5 then tests interpolation and extrapolation in Endless T-Maze. This is closer to a robustness and generalization test than a second thesis. The question is whether agents trained under one number of corridors or corridor length can handle easier or harder variants.
Here again, PPO-LSTM sits at the top. It interpolates well and extrapolates, though incompletely, to longer corridor lengths in the relevant Uniform setting. FFM generalizes within Fixed configurations, but its flexibility remains conditional. SHM and GTrXL are more restricted, with SHM especially rigid.
So the hierarchy is not only about fitting the training task. It is about whether the learned memory policy transfers when the rhythm changes.
Color-Cubes shows where even adaptive forgetting is not enough
Endless T-Maze isolates rewriting in a clean way. Color-Cubes makes the problem messier.
In Color-Cubes, the agent moves in a grid containing colored cubes. At the start of a phase, it observes the target color and the cube configuration. Then it must navigate to the right cube and interact with it. Non-target cubes can teleport. Updates are only revealed under specific events, so the agent must rely on memory between updates.
The benchmark has three modes:
| Mode | What it tests | Result pattern |
|---|---|---|
| Trivial | One cube, one target, no real rewriting | GTrXL, SHM, and FFM reach $1.00 \pm 0.00$; PPO-LSTM reaches $0.52 \pm 0.10$; PPO-MLP scores zero |
| Medium | Multiple cubes with complete position/color updates | All baselines score zero |
| Extreme | Multiple cubes with incomplete updates; colors hidden during teleportation updates | All baselines score zero |
The Trivial row is a useful sanity check, and also a warning against cartoon interpretations. PPO-LSTM is not universally superior. In Trivial Color-Cubes, it performs worse than GTrXL, SHM, and FFM. The authors suggest it may not be robust to grid environments in that case. Since Trivial Color-Cubes does not require rewriting, this does not overturn the main rewriting claim. It simply prevents a lazy “LSTM wins everything” conclusion.
Medium and Extreme are more severe. All tested baselines score zero.
That result should not be softened. It is the boundary of the paper’s optimism. Endless T-Maze shows that adaptive forgetting helps with repeated cue replacement. Color-Cubes shows that once the agent must dynamically update a structured internal map, re-rank stored information, and in Extreme mode infer hidden color-position relationships, none of the evaluated mechanisms is enough.
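To make the difference between the two harder modes concrete, here is a hand-coded sketch of the bookkeeping involved. It is not the paper's environment and not a learned mechanism: the belief format (color mapped to grid position), the function names, and the assumption that exactly one cube moves per update are all illustrative.

```python
def apply_full_update(belief, update):
    """Medium-style update: positions and colors are both revealed,
    so the old map is simply replaced."""
    return dict(update)

def apply_partial_update(belief, occupied):
    """Extreme-style update: only occupied positions are revealed, colors are hidden.
    Infers which cube moved by comparing old and new positions.
    Assumes exactly one cube teleported (an assumption of this sketch)."""
    old_positions = set(belief.values())
    vacated = old_positions - occupied
    appeared = occupied - old_positions
    if not vacated and not appeared:
        return dict(belief)                 # nothing moved
    if len(vacated) == 1 and len(appeared) == 1:
        moved_from = next(iter(vacated))
        moved_to = next(iter(appeared))
        # The cube that left `moved_from` must be the one now at `moved_to`.
        return {
            color: (moved_to if pos == moved_from else pos)
            for color, pos in belief.items()
        }
    raise ValueError("update is ambiguous under the one-move assumption")

belief = {"red": (0, 0), "blue": (1, 2), "green": (3, 3)}
revised = apply_partial_update(belief, {(0, 0), (4, 1), (3, 3)})
```

In the example, the color of the moved cube is never shown; it has to be inferred from which remembered position went empty. None of the tested baselines managed the equivalent update.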
This matters because many real operational settings look more like Color-Cubes than T-Maze.
A warehouse robot does not only replace “turn left” with “turn right.” It tracks objects, locations, statuses, priorities, and exceptions. A logistics agent does not only overwrite one route instruction. It updates a changing map of vehicles, constraints, customer commitments, weather disruptions, and warehouse availability. A business-process agent does not only remember the latest command. It must maintain which information is still valid, which is superseded, which depends on other variables, and which must be reconstructed from partial evidence.
That is not simple forgetting. It is belief-state maintenance under change.
The paper’s Color-Cubes failure is therefore not an embarrassing footnote. It is probably the most business-relevant result in the study.
The ablation points to gates, but not magic gates
The authors then run an ablation study comparing PPO-LSTM with PPO-GRU and PPO-RNN in Endless T-Maze. The purpose is not to create a separate benchmark story. It is to test the likely mechanism behind PPO-LSTM’s advantage.
The comparison is well chosen:
- PPO-RNN has recurrence but no gating mechanism.
- PPO-GRU has gates, but fewer and simpler gates than LSTM.
- PPO-LSTM has separate cell and hidden states, along with input, output, and forget gates.
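What a forget gate buys can be shown with the standard LSTM cell-state update, $c' = f \cdot c + i \cdot g$, reduced to scalars. In this sketch the gate values are set by hand, whereas a trained LSTM computes them from its input and hidden state; the cue encoding and the two gate policies are assumptions made for illustration and do not correspond to any specific baseline in the paper.

```python
LEFT, RIGHT = 1.0, -1.0   # hypothetical scalar encoding of the two cues

def cell_update(cell, candidate, forget_gate, input_gate):
    """LSTM-style cell update with scalars: c' = f * c + i * g."""
    return forget_gate * cell + input_gate * candidate

def run(cues, gates_for):
    """Feed a cue sequence through the cell (None means no cue on that step).
    gates_for(cue) returns hand-set (forget_gate, input_gate) values."""
    cell = 0.0
    for cue in cues:
        f, i = gates_for(cue)
        candidate = 0.0 if cue is None else cue
        cell = cell_update(cell, candidate, f, i)
    return cell

def always_accumulate(cue):
    """Never forgets: every new cue is added on top of the old state."""
    return 1.0, 1.0

def overwrite_on_cue(cue):
    """Holds the state between cues; wipes it when a new cue arrives."""
    return (1.0, 0.0) if cue is None else (0.0, 1.0)

cues = [LEFT, None, None, RIGHT, None]
blended = run(cues, always_accumulate)   # old and new cue cancel out
latest = run(cues, overwrite_on_cue)     # only the newest cue survives
```

Without forgetting, the stored LEFT and the incoming RIGHT sum to zero and the state no longer says which way to turn. With the forget gate closed at the moment of the new cue, the state equals RIGHT.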
The reported pattern supports the gating hypothesis. PPO-RNN succeeds only in limited interpolation and extrapolation settings, similar to cases where SHM and FFM had partial success. PPO-GRU performs better, showing interpolation and extrapolation not only across corridor counts but also limited generalization across corridor length. PPO-LSTM remains stronger, especially in length generalization.
The careful interpretation is this: gating helps, and adaptive forgetting appears especially important. The careless interpretation would be: add an LSTM and enjoy intelligence. Please do not.
The ablation supports an architectural principle, not a product recipe. The principle is that an agent needs trainable control over memory flow: when to retain, when to attenuate, when to replace, and how strongly to let new evidence override old state. LSTM’s forget gate is one implementation of that principle. It is not the final form of agent memory.
This distinction matters because the paper’s hardest benchmark already shows the gap. Even with LSTM, Medium and Extreme Color-Cubes remain unsolved by the tested baselines. So adaptive forgetting is necessary-looking, but not sufficient.
What the paper directly shows, and what business should infer
The paper directly shows three things.
First, the authors introduce diagnostic environments designed to test memory rewriting rather than simple retention: Endless T-Maze and Color-Cubes. Endless T-Maze isolates sequential cue replacement. Color-Cubes stresses structured memory updating under partial observability.
Second, across these controlled experiments, PPO-LSTM is the strongest architecture for rewriting and generalization in Endless T-Maze. FFM and SHM can succeed in narrower, more predictable settings. GTrXL often fails beyond trivial or narrow cases in this benchmark setup. PPO-MLP confirms that memory-free control is not enough when the current observation does not contain the needed state.
Third, the ablation study supports the importance of gating and especially adaptive forgetting. PPO-GRU improves over ungated RNN behavior, while PPO-LSTM remains stronger in generalization over corridor lengths.
Now the business inference.
The most useful operational question is not, “Does our agent have memory?” That question is too easy. Almost every modern agent product can claim some memory layer, retrieval system, cache, or state store.
The better question is:
Can the agent detect when remembered information has lost authority?
That question should influence how companies test autonomous systems in robotics, routing, monitoring, customer operations, and workflow automation. A memory evaluation suite should include state changes that invalidate earlier information. It should include repeated updates, irregular timing, and partial observability. It should also distinguish between one-variable overwriting and structured belief maintenance.
A practical evaluation table might look like this:
| Operational test | What it checks | Example failure |
|---|---|---|
| Superseded instruction test | Whether the agent follows the latest valid instruction | Agent obeys an earlier user or supervisor command |
| Moving-object test | Whether the agent updates object-location memory | Robot goes to where the item used to be |
| Irregular update timing | Whether the agent depends on predictable rhythms | Agent works during scripted demos but fails under stochastic change |
| Partial update test | Whether the agent can infer state from incomplete evidence | Agent cannot reconstruct which entity changed |
| Long sequence rewrite test | Whether memory quality decays after repeated updates | Agent handles two changes but fails after five |
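A few of these rows can be turned into a small harness. The sketch below covers superseded instructions, irregular timing, and longer rewrite sequences. Everything in it is hypothetical: the agent is modeled as a plain function from an event list to an action, the route names are placeholders, and the two toy agents exist only to show what the harness separates.

```python
import random
from typing import Callable, List, Optional

Event = Optional[str]   # an instruction, or None for a step with no new information

def make_scenario(rng: random.Random, n_updates: int, max_gap: int) -> List[Event]:
    """Instructions separated by gaps of random length (irregular timing)."""
    events: List[Event] = []
    for _ in range(n_updates):
        events.append(rng.choice(["route_a", "route_b", "route_c"]))
        events.extend([None] * rng.randint(0, max_gap))
    return events

def pass_rate(agent: Callable[[List[Event]], str], n_updates: int,
              max_gap: int = 5, trials: int = 200, seed: int = 0) -> float:
    """Fraction of scenarios where the agent acts on the latest instruction."""
    rng = random.Random(seed)
    passed = 0
    for _ in range(trials):
        events = make_scenario(rng, n_updates, max_gap)
        expected = [e for e in events if e is not None][-1]
        if agent(events) == expected:
            passed += 1
    return passed / trials

def first_instruction_agent(events: List[Event]) -> str:
    """Toy agent that never lets go of the first instruction."""
    return next(e for e in events if e is not None)

def latest_instruction_agent(events: List[Event]) -> str:
    """Toy agent that overwrites its state whenever a new instruction arrives."""
    state = ""
    for e in events:
        if e is not None:
            state = e
    return state
```

Running `pass_rate` at several values of `n_updates` gives the long-sequence view: the first-instruction agent looks perfect at one update and degrades as soon as instructions start superseding each other.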
This is where the paper becomes valuable for business without becoming a sales brochure. The value is not that it proves which architecture to deploy. It does not. The value is diagnostic: it tells us what kind of hidden failure current agent benchmarks may be missing.
The uncomfortable boundary: real-world memory is harder than these tests
The limitations are not ceremonial here. They shape how the paper should be used.
The benchmarks are controlled. That is a strength for diagnosis, but a boundary for deployment claims. Endless T-Maze is deliberately minimal. Color-Cubes is more structured, but still a grid-world abstraction. The experiments compare a selected set of recurrent, transformer-based, structured-memory, and memory-free baselines. A broader set of architectures, training regimes, and environments would be needed before making sweeping claims about “the best” agent memory design.
The paper also points out that memory rewriting in RL has relatively limited direct prior literature, which makes comparison harder. That does not weaken the benchmark contribution. It explains why the field may have been measuring the wrong thing with great confidence. A classic academic hobby.
The most important boundary is empirical: all tested baselines fail on Medium and Extreme Color-Cubes. That means the paper is not saying, “We have solved adaptive memory.” It is saying, “Here is a sharper way to expose that we have not.”
For enterprise agent builders, that is actually more useful. A benchmark that reveals unsolved failure is more valuable than another benchmark where everyone gets a pleasant number and nobody changes design.
Forgetting is not deletion; it is governance over belief
The word “forgetting” can mislead. In business systems, forgetting sounds like data loss, compliance risk, or model degradation. That is not what the paper is arguing for.
The point is not to erase history indiscriminately. The point is to manage the authority of remembered information.
Old information may still be useful as history. It may be useful for audit, explanation, or long-term pattern analysis. But it should not automatically remain active as the agent’s current belief about the world.
A good memory system should separate at least three layers:
| Layer | Question | Business example |
|---|---|---|
| Historical record | What happened before? | Audit trail of prior instructions and state changes |
| Current belief | What is true enough to act on now? | Latest confirmed location, status, or user preference |
| Revision logic | When does old information lose authority? | Rules or learned mechanisms for supersession, decay, contradiction, and inference |
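The three layers can be kept apart in a data structure. This is a minimal sketch under stated assumptions, not a recommended production design: the class name, the timestamp-based supersession rule, and the `max_age` staleness threshold are all hypothetical choices.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple

@dataclass
class LayeredMemory:
    # Historical record: append-only, never revised.
    history: List[Tuple[int, str, Any]] = field(default_factory=list)
    # Current belief: key -> (timestamp, value) that the agent may act on.
    belief: Dict[str, Tuple[int, Any]] = field(default_factory=dict)
    # Revision parameter: beliefs older than this lose authority (hypothetical rule).
    max_age: int = 10

    def observe(self, t: int, key: str, value: Any) -> None:
        """Log every observation; let only the newest one per key become belief."""
        self.history.append((t, key, value))
        current = self.belief.get(key)
        if current is None or t >= current[0]:
            self.belief[key] = (t, value)

    def current(self, key: str, now: int) -> Optional[Any]:
        """Return the value to act on, or None if nothing valid is known."""
        entry = self.belief.get(key)
        if entry is None:
            return None
        t, value = entry
        if now - t > self.max_age:
            return None   # stale: still in history, no longer actionable
        return value
```

For example, after `observe(0, "box", "A")` and `observe(3, "box", "B")`, `current("box", 5)` returns `"B"` while both entries remain in `history` for audit. A late-arriving older observation is logged but does not displace the newer belief.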
Most product discussions collapse these layers into “memory.” That is convenient, and therefore suspicious.
An agent that stores everything but cannot rank current relevance is not memory-rich. It is administratively cluttered. An agent that retrieves old facts without knowing whether they were superseded is not context-aware. It is haunted.
The paper’s deeper contribution is to push RL memory evaluation toward this governance question: not how much past information can be preserved, but how memory is transformed when the environment changes.
Conclusion: the next agent benchmark should punish nostalgia
“Remember more” has become one of the easy slogans of agent design. It sounds sensible. It also avoids the harder engineering question.
The world is not an append-only log. Operational environments change, and under partial observability the agent must maintain a belief state that can be revised before failure becomes visible. The paper’s benchmarks make that problem concrete. Endless T-Maze shows that repeated cue replacement separates adaptive memory from passive retention. Color-Cubes shows that structured, partially observed updating remains unsolved for the tested baselines. The ablation study points toward gating and adaptive forgetting as useful ingredients, while also showing they are not the entire recipe.
The business lesson is plain: do not evaluate an agent only on whether it remembers. Evaluate whether it can demote stale information, incorporate new evidence, and act from the latest valid belief.
Memory is useful. Nostalgia is not.
[1] Oleg Shchendrigin, Egor Cherepanov, Alexey K. Kovalev, and Aleksandr I. Panov, “Memory Retention Is Not Enough to Master Memory Tasks in Reinforcement Learning,” arXiv:2601.15086, 2026, https://arxiv.org/abs/2601.15086.