Memory, Bias, and the Mind of Machines: How Agentic LLMs Mislearn

TL;DR for operators

Memory is becoming the fashionable upgrade for AI agents: let the system remember past tasks, extract lessons, and improve without retraining the model. Sensible. Also slightly dangerous, in the same way giving a junior analyst a notebook is useful until they start rewriting the notebook after every meeting.

The important result is not that memory sometimes contains bad facts. Everyone who has used software, people, or software made by people already knew that. The sharper point is that useful experience can become faulty during the act of consolidation. When an LLM agent compresses raw trajectories into reusable textual lessons, it may strip away conditions, merge unlike cases, or turn a narrow success into a general rule. The memory then looks cleaner while becoming less true. Very enterprise.

For business teams, the implication is straightforward: persistent memory should not be treated as a harmless productivity layer. It is a runtime knowledge system with write permissions over future behaviour. That means it needs evidence retention, versioning, conflict detection, scope control, and tests that measure whether memory helps the next task rather than merely making the agent sound more consistent.

What the paper directly shows: consolidated memory can degrade after repeated updates, and retaining raw episodes can be more robust than forcing every interaction into distilled lessons.¹ What Cognaptus infers: production agents should separate raw logs, validated memories, and policy-like heuristics instead of blending them into one cheerful “experience bank.” What remains uncertain: how far these benchmark dynamics transfer to every commercial workflow, especially where tasks are less abstract than ARC-style puzzles and more constrained by databases, APIs, and human approvals.

Memory is not the problem; rewriting memory is the problem

A customer-service agent remembers that a client dislikes long explanations. A sales agent remembers that a buyer cares about implementation risk. A coding agent remembers that a repository uses a peculiar test harness. In each case, memory sounds like the obvious missing piece. Stateless assistants are forgetful interns; memory-enabled agents at least bring their notes.

The tempting misconception is that memory is a monotonic upgrade. More experience should mean better judgement. More accumulated history should mean fewer repeated mistakes. More reflection should mean cleaner abstraction. This is the tidy managerial version of learning: collect experience, summarise lessons, improve process, repeat until the org chart starts emitting wisdom.

The actual mechanism is messier. Modern agent memory often contains two very different things. The first is episodic memory: raw traces of what happened, including the task, observations, actions, failures, corrections, and final outcome. The second is consolidated memory: a compressed lesson extracted from those traces. Episodic memory says, “In this exact situation, this sequence worked.” Consolidated memory says, “When you see this kind of situation, apply this rule.”
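The distinction can be made concrete as two data structures. This is a minimal sketch, not the paper's implementation: the class names `Episode` and `ConsolidatedLesson` and all field names are hypothetical, chosen only to show that a lesson is a derived object that should carry its applicability conditions and point back at the traces it came from.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Episode:
    """Episodic memory: a raw, immutable trace of one task."""
    episode_id: str
    task: str
    actions: tuple[str, ...]
    outcome: str  # e.g. "success" or "failure"


@dataclass
class ConsolidatedLesson:
    """Consolidated memory: a compressed rule derived from episodes."""
    rule: str
    conditions: list[str] = field(default_factory=list)  # when the rule applies
    source_episode_ids: list[str] = field(default_factory=list)  # evidence links


episode = Episode(
    episode_id="ep-001",
    task="run tests in repo A",
    actions=("read README", "invoke custom harness"),
    outcome="success",
)
lesson = ConsolidatedLesson(
    rule="Use the custom harness to run tests",
    conditions=["repository is repo A"],
    source_episode_ids=[episode.episode_id],
)
```

The episode is frozen because it is evidence; the lesson is mutable because it is an interpretation. A lesson with an empty `conditions` list is the shape an overgeneralised rule tends to take.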

That second step is where the system can mislearn. It is not simply storing reality. It is interpreting reality, compressing it, and then reusing the compression as if it were evidence. The agent does not merely remember; it editorialises. And as every editorial department knows, compression is where nuance goes to be quietly murdered.

What the paper actually tests

The central paper behind this revision, “Useful Memories Become Faulty When Continuously Updated by LLMs,” studies agentic memory systems that update a textual memory bank over repeated interactions.¹ The setup matters because it targets a design pattern now appearing across agent frameworks: after a task, the agent asks an LLM to transform experience into reusable memory. That memory is then retrieved during later tasks and may itself be rewritten as new interactions arrive.

This is a different failure mode from ordinary hallucination. A hallucination is often a bad output in a single context. Faulty memory is more durable. Once written, it can bias retrieval, steer reasoning, and influence future tool use. The agent can become more coherent and less correct at the same time, which is a marvellous little trap for dashboard-driven AI governance.

The authors distinguish between raw trajectories and consolidated abstractions. That distinction is the article’s load-bearing beam. If raw experience is the evidence, consolidation is the analyst memo. The memo may be useful, but only if it preserves the causal structure of the case. If it merges unrelated cases, drops the applicability conditions, or overfits to repeated surface patterns, it becomes a confident shortcut.

The paper’s striking finding is that memory utility can rise and then fall as consolidation proceeds. In the reported ARC-AGI setting, even when memory is built from ground-truth solutions, a strong model later fails on a substantial share of problems it previously solved without memory. The regression is traced not to bad source experience but to the consolidation step itself: the same underlying trajectories can yield different memory quality depending on how they are updated.¹

That is the expensive part of the argument. The paper is not saying “memory bad.” It is saying “automatic abstraction is not neutral.” This is less dramatic, more useful, and therefore less likely to trend on LinkedIn.

The result is counterintuitive because the source experience can be correct

Bad memory is easy to explain when the source data is bad. If an agent learns from wrong outputs, biased users, poisoned documents, or stale records, then corrupted memory is hardly a revelation. That is just garbage-in, garbage-out wearing a lab coat.

The more interesting result is that useful experience can still become faulty. In the paper’s account, consolidated memories produced by current LLMs can degrade even when derived from useful or correct trajectories.¹ The failure is therefore not only about data quality. It is about the transformation from episode to rule.

There are at least three mechanisms worth separating:

| Mechanism | What happens inside memory | Why it matters operationally |
| --- | --- | --- |
| Misgrouping | Different situations are treated as if they belong to the same pattern | The agent applies a lesson outside its valid domain |
| Overgeneralisation | The memory preserves the headline rule but loses boundary conditions | The agent becomes efficient at being wrong |
| Update drift | Repeated rewrites alter the memory even when the underlying evidence has not improved | Later behaviour reflects the rewrite history, not the original facts |

This is why “reflection” deserves less reverence. Reflexion-style agents showed that verbal feedback and episodic memory can improve decision-making without weight updates.² That was an important architectural move: learning could happen at runtime through text. But once runtime learning becomes continuous self-editing, reflection stops being a benign after-action report and starts becoming a mutable control surface.

The distinction is not academic. An agent that stores raw failed attempts can be debugged. An agent that stores a polished but distorted lesson may look more mature while becoming harder to audit. It has not merely made a mistake; it has produced a reason to repeat the mistake.

Bias here means path-dependence, not only demographics

The word “bias” usually drags the reader toward demographic fairness. That is one important category, but it is not the whole story. In agentic memory, bias also means path-dependence: earlier stored interpretations shape later reasoning even when the new task deserves fresh evaluation.

A memory-enhanced recruitment agent, for example, can personalise across interactions and still reinforce biased patterns. Research on memory-enhanced agents in recruitment finds that personalisation through memory can introduce and reinforce bias, even when the underlying LLMs are safety-trained.³ That is the obvious high-stakes case: stored impressions begin to influence candidate evaluation. The machine develops institutional memory. Unfortunately, institutional memory has never been famous for innocence.

But path-dependent bias also appears in less visibly sensitive workflows. A procurement agent may remember that one supplier is “usually slow” and overweight that note after conditions change. A finance agent may remember that a market signal “worked last quarter” and keep retrieving it after the regime shifts. A support agent may learn that a class of users “needs simplified explanations” and quietly lower answer quality for them. Bias, in this sense, is not always a forbidden attribute. Sometimes it is a stale shortcut with operational consequences.

The memory-induced tool-drift literature sharpens this point. When personality-like or preference-like memories are stored, they can silently affect tool calls in contexts where those memories are not relevant. One study reports that biased memories can push tool parameters away from unbiased baselines across professional domains, even when the task itself does not justify that preference.⁴ This is where memory stops being a note and starts becoming an invisible hand on the controls. Adam Smith would like a word.

The business risk is not bad answers; it is durable bad behaviour

Single-turn errors are irritating. Durable errors are managerial.

A stateless model may produce a bad answer, then forget it. A memory-enabled agent may produce a bad answer, summarise the episode, preserve the wrong lesson, retrieve that lesson later, and act on it through tools. This converts an output defect into a behavioural defect. It also makes incident analysis harder, because the failure may not sit in the model weights, the prompt, the retrieved document, or the tool schema alone. It may sit in the memory lifecycle between them.

That lifecycle has stages:

Capture: deciding which events deserve memory.
Consolidation: turning traces into reusable lessons.
Retrieval: deciding which memories enter the next context.
Application: allowing memory to influence reasoning or tool calls.
Revision: updating, merging, deleting, or demoting stored memories.
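One way to make the lifecycle inspectable during incident analysis is to tag every memory event with its stage. The sketch below is illustrative only: `Stage`, `LifecycleEvent`, and `history` are hypothetical names, and the events are invented sample data.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    CAPTURE = "capture"
    CONSOLIDATION = "consolidation"
    RETRIEVAL = "retrieval"
    APPLICATION = "application"
    REVISION = "revision"


@dataclass(frozen=True)
class LifecycleEvent:
    memory_id: str
    stage: Stage
    detail: str


def history(events, memory_id):
    """Return, in logged order, the stages one memory has passed through."""
    return [e.stage for e in events if e.memory_id == memory_id]


events = [
    LifecycleEvent("m1", Stage.CAPTURE, "episode ep-7 stored"),
    LifecycleEvent("m1", Stage.CONSOLIDATION, "lesson written from ep-7"),
    LifecycleEvent("m1", Stage.REVISION, "lesson merged with m4"),
    LifecycleEvent("m1", Stage.RETRIEVAL, "retrieved for task t-42"),
    LifecycleEvent("m1", Stage.APPLICATION, "changed a tool parameter"),
]

print([s.value for s in history(events, "m1")])
```

With a trail like this, a reviewer can see that memory `m1` was revised before it was retrieved and applied, which is exactly the kind of fact that goes missing when only retrieval is logged.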

Most companies focus on retrieval because retrieval is visible. Did the agent pull the right document? Did the vector search work? Did the top-k results look sensible? Necessary questions, yes. Sufficient, no.

The paper’s lesson sits earlier and later: how did the memory get written, and when should it stop being trusted? If consolidation can degrade useful experience, then retrieval quality alone will not save the system. Efficiently retrieving a malformed lesson is not intelligence. It is just a faster route to the wrong room.

The evidence supports memory governance, not memory panic

There is a lazy conclusion available here: disable memory, return to stateless models, and pretend the future can be cancelled by policy memo. That would be emotionally satisfying and commercially unserious.

Memory is useful. A-MEM, for example, proposes an agentic memory system that dynamically organises memories through structured notes, indexing, linking, and evolving contextual representations; experiments across foundation models report improvements over prior baselines.⁵ Agent debugging work also shows that agents can recover from failures when errors are classified and corrective feedback is targeted at the relevant module, including memory, planning, reflection, and action.⁶

So the practical lesson is not “forget everything.” It is “stop treating memory writes as free.” A memory entry should be closer to a governed record than a sticky note. Some entries are evidence. Some are hypotheses. Some are policies. Some are user preferences. Some are stale. Some are poison wearing a helpful filename.

A useful operating model separates these categories:

| Memory object | Should it be stored? | Should it be rewritten? | Should it affect tools? |
| --- | --- | --- | --- |
| Raw episode log | Yes | No, except redaction | Only through reviewed retrieval |
| User-stated preference | Yes, with scope | Only with explicit update or conflict evidence | Sometimes, with parameter limits |
| Derived lesson | Yes, if validated | Yes, but versioned | Only after confidence checks |
| Failed heuristic | Yes, for debugging | No; preserve as failure evidence | No |
| Policy constraint | Yes | Only through controlled release process | Yes, but separately from personal memory |

This table looks dull because good governance usually does. The alternative is letting a model maintain a diary and then giving the diary API access.
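The operating model above can be encoded as explicit configuration, so that the rules are enforced by code instead of being left to a prompt. This is a sketch under stated assumptions: the `MemoryPolicy` type, the rule labels, and the deny-by-default behaviour for unknown kinds are all hypothetical choices, not part of any cited system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryPolicy:
    store_rule: str    # condition for storing the object
    rewrite_rule: str  # condition for rewriting it
    tool_rule: str     # condition for letting it affect tool calls


# One entry per row of the operating model; labels are illustrative.
POLICIES = {
    "raw_episode_log": MemoryPolicy("always", "redaction_only", "reviewed_retrieval_only"),
    "user_preference": MemoryPolicy("with_scope", "explicit_update_or_conflict", "parameter_limited"),
    "derived_lesson": MemoryPolicy("if_validated", "versioned", "after_confidence_check"),
    "failed_heuristic": MemoryPolicy("for_debugging", "never", "never"),
    "policy_constraint": MemoryPolicy("always", "controlled_release", "separate_from_personal_memory"),
}


def tool_rule_for(kind: str) -> str:
    """Look up the tool rule; unknown memory kinds are denied by default."""
    policy = POLICIES.get(kind)
    return policy.tool_rule if policy else "never"


def can_affect_tools(kind: str) -> bool:
    """True if this kind may influence tool calls under some condition."""
    return tool_rule_for(kind) != "never"
```

The design choice worth noting is the default: a memory type nobody has classified should not be able to reach a tool call.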

What Cognaptus infers for deployment

The paper directly shows that continuous consolidation can damage memory quality in controlled settings.¹ Cognaptus infers four design principles for business deployment.

First, keep raw episodes as first-class evidence. Do not let summaries replace traces. A summary may be useful for retrieval and speed, but the original interaction should remain available for audit, replay, and contradiction. If the agent’s memory says “supplier X frequently misses deadlines,” the system should be able to show the cases, dates, exceptions, and confidence level. Otherwise the memory is gossip with embeddings.
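A minimal sketch of what "show the cases" could look like, assuming a lesson stores the ids of the cases it was derived from. The `Case` type, the `supporting_cases` helper, and the sample records are hypothetical; the dates and outcomes are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    case_id: str
    supplier: str
    date: str  # ISO date string
    missed_deadline: bool


def supporting_cases(cited_ids, case_store):
    """Resolve a lesson's cited case ids; fail loudly if evidence is missing."""
    missing = [cid for cid in cited_ids if cid not in case_store]
    if missing:
        raise LookupError(f"lesson cites missing evidence: {missing}")
    return [case_store[cid] for cid in cited_ids]


case_store = {
    "c1": Case("c1", "supplier X", "2025-01-10", True),
    "c2": Case("c2", "supplier X", "2025-02-14", True),
    "c3": Case("c3", "supplier X", "2025-03-03", False),
}

# The lesson "supplier X frequently misses deadlines" cites three cases.
cited = supporting_cases(["c1", "c2", "c3"], case_store)
miss_rate = sum(c.missed_deadline for c in cited) / len(cited)
```

Here the claim rests on three cases, one of which is an exception. A lesson whose citations cannot be resolved raises an error instead of being quietly trusted.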

Second, gate consolidation. Not every task deserves a reusable lesson. Many interactions are one-offs. Some are noisy. Some are shaped by temporary constraints. Some reflect user frustration rather than durable preference. Automatic “learn from every interaction” sounds elegant until the agent learns from Monday’s outage, Tuesday’s workaround, and Wednesday’s apology email as if they were a stable theory of operations.
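A consolidation gate can be as simple as a support threshold plus an exclusion for episodes recorded under abnormal conditions. The sketch below assumes episodes have already been labelled with a candidate `pattern` and an incident flag; those fields, the function name, and the default `min_support=3` are hypothetical and would need tuning per workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeSummary:
    episode_id: str
    pattern: str           # hypothetical label for the candidate lesson
    during_incident: bool  # e.g. recorded during an outage or workaround
    outcome_ok: bool


def should_consolidate(episodes, pattern, min_support=3):
    """Allow a lesson only if enough clean, successful episodes back it."""
    clean = [
        e for e in episodes
        if e.pattern == pattern and e.outcome_ok and not e.during_incident
    ]
    return len(clean) >= min_support


week = [
    EpisodeSummary("mon", "retry-on-timeout", during_incident=True, outcome_ok=True),
    EpisodeSummary("tue", "retry-on-timeout", during_incident=True, outcome_ok=True),
    EpisodeSummary("wed", "retry-on-timeout", during_incident=False, outcome_ok=True),
]
```

With this sample week, `should_consolidate(week, "retry-on-timeout")` returns `False`: two of the three episodes happened during an incident, so only one clean episode supports the pattern.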

Third, test memory against counterfactuals. A memory system should not only be evaluated on whether it helps similar future tasks. It should also be tested on near-miss cases where the stored lesson should not apply. This is where overgeneralisation becomes visible. If an agent learned a rule from one customer segment, does it incorrectly apply it to another? If it learned a workaround for one software version, does it keep using it after the upgrade? This is boring QA, which is another way of saying it is probably necessary.
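Near-miss testing can be sketched as a false-application rate: the share of out-of-scope cases on which a lesson still fires. The example below is an assumption-laden illustration, with lessons modelled as plain predicate functions and invented product and version values standing in for real test cases.

```python
def false_application_rate(applies, near_misses):
    """Share of near-miss cases on which the lesson wrongly fires."""
    if not near_misses:
        raise ValueError("need at least one near-miss case")
    fired = sum(1 for case in near_misses if applies(case))
    return fired / len(near_misses)


# Lesson learned on version 1.x: "use the legacy export workaround".
def scoped_lesson(case):
    return case["product"] == "exporter" and case["version"].startswith("1.")


def overgeneralised_lesson(case):
    # The version boundary was lost during consolidation.
    return case["product"] == "exporter"


# Cases that resemble the original situation but where the lesson must NOT apply.
near_misses = [
    {"product": "exporter", "version": "2.0"},
    {"product": "exporter", "version": "2.1"},
    {"product": "importer", "version": "1.4"},
]

scoped_rate = false_application_rate(scoped_lesson, near_misses)
overgeneralised_rate = false_application_rate(overgeneralised_lesson, near_misses)
```

The scoped lesson fires on none of the near misses, while the overgeneralised one fires on two of three. Both would look identical on a test set made only of version 1.x exporter tasks, which is why in-scope evaluation alone hides the problem.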

Fourth, monitor memory influence separately from model output. When an agent makes a decision, logs should show whether a memory was retrieved, how strongly it influenced the reasoning, and whether it affected tool parameters. Without that, teams will keep blaming “the model” for what is actually a memory-management failure. The model, naturally, will not file a complaint.
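One hedged way to measure memory influence on tool calls is to plan the call twice, once with memory in context and once without, and log the difference. The sketch assumes the team can afford that second planning pass; `memory_influence`, the log fields, and the purchase-order parameters are hypothetical.

```python
def memory_influence(baseline_params, memory_params):
    """Return tool parameters whose values differ once memory is in context.

    Each entry maps a parameter name to (value_without_memory, value_with_memory).
    """
    keys = set(baseline_params) | set(memory_params)
    return {
        k: (baseline_params.get(k), memory_params.get(k))
        for k in sorted(keys)
        if baseline_params.get(k) != memory_params.get(k)
    }


decision_log = {
    "task_id": "t-42",
    "retrieved_memory_ids": ["m1"],
    "tool": "create_purchase_order",
    "influence": memory_influence(
        {"supplier": "X", "lead_time_days": 14},  # planned without memory
        {"supplier": "Y", "lead_time_days": 14},  # planned with memory m1
    ),
}
```

In this sample, the log shows that memory `m1` changed the supplier from X to Y and left the lead time alone. Whether that change was justified is a separate question, but at least it is now a visible one.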

Staleness is a separate failure mode

Faulty consolidation is one problem. Stale memory is another. A memory can be accurate when written and harmful later because the world has changed.

This matters for business systems because many enterprise facts decay quietly. A customer changes role. A vendor improves service levels. A compliance rule is updated. A product SKU is deprecated. A user preference was temporary. An internal process changed after a migration. The old memory is not false in the historical sense; it is false as an operating premise.

The STALE benchmark frames this issue as the ability of LLM agents to recognise when prior memories are no longer valid, especially when later observations implicitly invalidate earlier ones rather than explicitly contradicting them.⁷ That distinction is important. Real organisations rarely announce clean negations. They produce messy signals. The new invoice format implies a process change. The repeated exception implies a policy update. The new manager’s approvals imply the old escalation path is dead.

A memory system that can retrieve both old and new evidence but cannot adjudicate between them is not state-aware. It is a filing cabinet with confidence.
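A minimal adjudication sketch, assuming each memory has a time-to-live and that some upstream step has already judged whether a newer observation is consistent with it. That judgement is the hard part, particularly for implicit invalidation, and is not solved here. All names, the 90-day TTL, and the sample dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Memory:
    subject: str
    claim: str
    written_on: date
    ttl_days: int


@dataclass(frozen=True)
class Observation:
    subject: str
    observed_on: date
    consistent_with_memory: bool  # assumed to be decided upstream


def status(memory, observations, today):
    """Classify a memory as 'trusted', 'expired', or 'needs_review'."""
    newer_conflicts = [
        o for o in observations
        if o.subject == memory.subject
        and o.observed_on > memory.written_on
        and not o.consistent_with_memory
    ]
    if newer_conflicts:
        return "needs_review"  # newer evidence disagrees: escalate, do not guess
    if today > memory.written_on + timedelta(days=memory.ttl_days):
        return "expired"
    return "trusted"


mem = Memory("vendor A", "usually slow", date(2025, 1, 1), ttl_days=90)
new_invoice = Observation("vendor A", date(2025, 1, 20), consistent_with_memory=False)
```

Conflict is checked before expiry on purpose: a memory contradicted by newer evidence should trigger review even if its TTL has not run out.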

Boundaries: ARC puzzles are not your operating model

The strongest quantitative evidence in the faulty-memory paper comes from controlled benchmark environments, including ARC-AGI-style tasks.¹ That is useful because control makes the mechanism visible. It is also a boundary.

Business workflows are different. They often contain explicit databases, permission systems, validation rules, human approvals, and domain-specific constraints. These can reduce the freedom of a bad memory to cause damage. A procurement agent cannot invent a payment if the ERP workflow blocks it. A support agent cannot issue a refund if the API requires policy verification. Reality, occasionally, provides guardrails.

But the boundary cuts both ways. Business workflows also contain ambiguity, incomplete records, shifting policies, and incentives to automate exceptions. That gives memory plenty of room to mislead. The fact that a benchmark is artificial does not make the mechanism irrelevant. It means operators should translate the mechanism, not copy the failure rate.

The practical question is not: “Will our agent fail exactly like the benchmark?” It is: “Where does our agent compress experience into reusable rules, and how do we know those rules remain valid?” If the answer is “the framework handles that,” please enjoy your upcoming incident review.

The sane architecture is evidence before abstraction

Agent memory should be designed like a legal file, not a motivational journal.

The raw record comes first. The interpretation comes second. The interpretation must cite the record. The record must outlive the interpretation. Conflicts must create review events. Derived rules must have scope, confidence, and expiry. Tool-affecting memories must face stricter tests than conversational memories. A preference that changes wording style is not the same as a memory that changes loan eligibility, hiring rank, fraud thresholds, or trading exposure.
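The requirement that tool-affecting memories face stricter tests can be sketched as a single check combining evidence, scope, and a use-specific confidence bar. The thresholds (0.5 and 0.9), the `DerivedRule` type, and the `may_apply` function are hypothetical, and how confidence is estimated is left open.

```python
from dataclasses import dataclass

# Hypothetical thresholds: tool-affecting use must clear a higher bar.
MIN_CONFIDENCE = {"conversation": 0.5, "tool_call": 0.9}


@dataclass(frozen=True)
class DerivedRule:
    text: str
    scope: frozenset     # contexts where the rule is allowed to apply
    confidence: float    # 0.0 to 1.0, however the team chooses to estimate it
    evidence_ids: tuple  # the rule must cite the raw record


def may_apply(rule, context, use):
    """Apply a rule only with evidence, in scope, and above the bar for this use."""
    if not rule.evidence_ids:
        return False
    if context not in rule.scope:
        return False
    return rule.confidence >= MIN_CONFIDENCE[use]


rule = DerivedRule(
    text="prefer short answers",
    scope=frozenset({"support"}),
    confidence=0.7,
    evidence_ids=("ep-1",),
)
```

With these sample values the rule may shape wording in a support conversation, but it may not influence a tool call, and it does not apply at all in a sales context.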

That architecture is not glamorous. It will not demo as beautifully as an agent that announces, “I have learned from my past experience.” But production systems do not need the agent to sound enlightened. They need it to stop converting yesterday’s anecdote into tomorrow’s policy.

The deeper shift is conceptual. Memory is not a feature. It is a second model of the world, written in text, maintained at runtime, and injected into decisions. Once framed that way, the governance requirements become less mysterious. Version it. Test it. Scope it. Expire it. Audit it. And when the memory claims to know something, ask what it has confused with knowledge.

Conclusion: teach agents to remember less beautifully

The next generation of agentic systems will not become reliable by remembering everything. They will become reliable by remembering with discipline.

The paper’s uncomfortable contribution is to break the pleasant assumption that memory consolidation is a neutral path to self-improvement. The agent may learn from experience, yes. It may also mislearn from the process of summarising experience. The danger is not that the machine becomes human-like. The danger is that it inherits the worst office habit of humans: turning a few vivid cases into a rule and then calling it experience.

For operators, the answer is not to abandon memory. It is to make memory accountable. Preserve episodes. Gate abstraction. Test boundaries. Detect staleness. Keep tool control on a shorter leash than conversational recall. Let the agent remember, but do not let it rewrite the past unsupervised.

The mind of the machine, such as it is, will not be judged by how much it stores. It will be judged by how well it knows when memory has become bias.

Cognaptus: Automate the Present, Incubate the Future.

1. Dylan Zhang, Yanshan Lin, Zhengkun Wu, Yihang Sun, Bingxuan Li, Dianqi Li, and Hao Peng, “Useful Memories Become Faulty When Continuously Updated by LLMs,” arXiv:2605.12978, 2026.
2. Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao, “Reflexion: Language Agents with Verbal Reinforcement Learning,” arXiv:2303.11366, 2023.
3. Himanshu Gharat, Himanshi Agrawal, and Gourab K. Patro, “From Personalization to Prejudice: Bias and Discrimination in Memory-Enhanced AI Agents for Recruitment,” arXiv:2512.16532, 2025.
4. Mahavir Dabas, Jihyun Jeong, Ming Jin, and Ruoxi Jia, “Memory-Induced Tool-Drift in LLM Agents,” arXiv:2605.24941, 2026.
5. Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang, “A-MEM: Agentic Memory for LLM Agents,” arXiv:2502.12110, 2025.
6. Kunlun Zhu et al., “Where LLM Agents Fail and How They can Learn From Failures,” arXiv:2509.25370, 2025.
7. Hanxiang Chao, Yihan Bai, Rui Sheng, Tianle Li, and Yushi Sun, “STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?” arXiv:2605.06527, 2026.
