The Memory Illusion: Why AI Still Forgets Who It Is

A customer support bot does not need a soul. Pleasantly, most airlines have not yet advertised one.

But it does need to remember what role it is playing. If it gives policy advice, that advice must remain anchored to the policy. If it apologises for an error, the correction should bind future answers. If the company has told users the assistant is a support agent, the assistant cannot conveniently become a speculative travel blogger, a therapist, a lawyer, or a magic refund machine, depending on which prompt arrives next.

This is the practical edge of Stefano Natangelo’s paper, The Narrative Continuity Test: A Conceptual Framework for Evaluating Identity Persistence in AI Systems.¹ The paper is not another leaderboard for asking whether a model can solve a task. It asks a more awkward question: can an AI remain the same interlocutor over time?

That sounds philosophical until the invoice arrives.

Most deployed AI products already sell continuity in soft language. They “remember preferences”, “learn your style”, “act as your assistant”, “support you over time”, “collaborate on projects”, and “know your context”. The implication is not merely that the system can answer today’s question. It is that some commitments from yesterday will matter tomorrow.

Natangelo’s argument is that current LLM systems mostly simulate that continuity. They reconstruct it. They retrieve fragments of it. They prompt themselves into a convincing pose. But the underlying mechanism is still closer to theatrical recall than durable identity. The chatbot remembers you in roughly the way an actor remembers Act One after being handed the script again. Competent performance, yes. Personal continuity, not quite.

The failure mechanism is unbound reconstruction

The central mechanism is simple enough to be dangerous: most LLM interactions are generated through stateless inference. Each response is produced from the currently supplied context. The model has no naturally persistent, identity-bearing state that says: this fact is safety-critical; this correction now overrides my previous advice; this role boundary remains active; this goal outranks pleasing the user.

The system may have a long context window. It may have a memory database. It may retrieve notes from past sessions. It may be wrapped in safety filters. Those are useful engineering layers. They improve access, fluency, and local reliability.

They do not automatically create continuity.

The distinction matters because continuity is not the same as storage. A database can store every past interaction and still fail to know which past fact should govern the next answer. A context window can contain a medical constraint and still bury it among irrelevant tokens. A role prompt can say “educational only” and still be overridden by a user asking for specific treatment. A reflection prompt can make the model apologise and still fail to prevent the same error from returning three turns later. The machine has not forgotten in the human sense. It has failed to bind the past to the present.

This is why the paper’s target is not memory in the narrow product sense. It is narrative continuity: the ability of an artificial interlocutor to remain recognisably the same across time, gaps, topic shifts, and pressure.

That is a higher bar than “the answer looks consistent enough if you don’t check too carefully”.

The five axes describe one property, not five feature requests

Natangelo decomposes continuity into five axes: Situated Memory, Goal Persistence, Autonomous Self-Correction, Stylistic and Semantic Stability, and Persona/Role Continuity. The easy but wrong reading is to treat these as a product checklist. Add memory. Add goals. Add self-correction. Add a tone guide. Add a persona. Congratulations, a coherent agent.

The paper’s point is harsher. Continuity is integrative. A system does not become continuous by doing one of these well while failing the others. Strong recall without stable goals is just trivia with a search bar. Stable tone without role discipline is branding. Self-correction that expires after the correction turn is theatre with better manners.

Axis	What must persist	Typical failure	Business control implied
Situated Memory	Important facts, constraints, timing, and relevance	Critical information is stored but not activated when needed	Priority-weighted, scoped memory rather than indiscriminate retrieval
Goal Persistence	Safety and epistemic goals across pressure and time	The model becomes agreeable when accuracy is inconvenient	Hierarchical goal enforcement, especially in regulated domains
Autonomous Self-Correction	Recognition of errors and durable repair	The system apologises, then repeats the pattern later	Persistent correction logs that constrain future outputs
Stylistic and Semantic Stability	Voice, stance, and factual commitments	Tone and position drift without explanation	Auditable stance changes and explicit rationale for shifts
Persona/Role Continuity	Declared identity and role boundaries	The assistant becomes a therapist, prescriber, accomplice, or policy authority on demand	Role governance with explicit hand-offs and refusal rules

The business translation is uncomfortable because it moves AI governance away from the familiar terrain of “Did the answer pass QA?” and toward “What commitments does this system carry forward?”

That second question is harder to operationalise. It also better matches how users actually rely on assistants. Users do not simply inspect one answer. They form expectations: this system knows my constraint; this assistant has accepted the correction; this agent is acting under company policy; this tool will not silently change its role because I phrase something emotionally.

The paper’s mechanism-first insight is that many failures emerge because those expectations are not bound inside the system. They are reconstructed from context, retrieved from memory, or enforced by external filters. Sometimes that is enough. Sometimes it is a very elegant way to manufacture liability.

Bigger context windows are not the same as memory

The most tempting misconception is that longer context solves the problem. Keep more tokens; forget less. A gloriously intuitive solution, which is how you know it deserves suspicion.

A larger context window gives the model more material to attend to. It does not automatically impose a durable hierarchy over that material. Critical constraints and irrelevant chatter can coexist in the same prompt. The model still has to decide, at generation time, what matters now. Without a persistent priority register, the relevant fact can remain technically present and functionally inert.

RAG and long-term memory systems have a similar limitation. They can retrieve prior notes, but retrieval is not retention. A note that must be fetched, ranked, injected, and interpreted in the current prompt is not equivalent to a standing constraint on behaviour. It is closer to re-reading a file before every meeting and hoping the right sentence lights up.

That is not useless. For many business use cases, retrieval is exactly what is needed. A support bot answering product questions benefits from better search. A legal assistant benefits from source-grounded retrieval. A coding copilot benefits from project context.

But the paper’s claim concerns systems marketed or used as persistent interlocutors. In those settings, the issue is not whether old information can be found. It is whether certain old information has authority.

A saved allergy, a corrected policy interpretation, a role limit, a user’s safety constraint, or a previously acknowledged error should not compete equally with the latest user nudge. It should have a status. It should bind. Current memory systems often make past content accessible; the Narrative Continuity Test asks whether the past has governance power.

Small difference. Large lawsuit-shaped shadow.

Failures cluster because the axes are coupled

The paper’s most useful contribution is not merely the five-axis taxonomy. It is the coupling argument: failures in continuity tend to cascade.

Start with missing priority in memory. A system fails to reactivate a safety-critical fact. Once that happens, the goal hierarchy weakens, because safety no longer has the right context. A correction then fails to persist, because the system has no durable record that “this previous response was wrong under this constraint”. Semantic stability deteriorates because the model begins giving inconsistent advice. Role continuity breaks because the assistant may slide from “informational helper” into “decision-maker” without acknowledgement.

That is a substrate-first cascade.

There is also a pressure-first cascade. The user pushes. The model has been optimised to be helpful, agreeable, and locally preferred. It softens a factual stance, relaxes a safety boundary, or mirrors the user’s belief. Now goal persistence is compromised. Semantic stability follows. Role drift becomes easier. Self-correction fails to arrest the drift because there is no continuous internal monitor enforcing past commitments.

This matters for business because companies often patch AI systems one failure at a time. Add a safety rule. Add a memory reminder. Add a style guide. Add a refusal template. Each patch may reduce a local failure. But if the real issue is the absence of an identity-bearing state that coordinates memory, goals, corrections, style, and role, then single-axis fixes will repeatedly disappoint in new and tedious ways.

The system looks coherent from turn to turn while fragmenting across time. That is exactly the kind of failure enterprises dislike: not dramatic enough to catch during a demo, persistent enough to show up in production.

The paper’s cases are illustrative, not benchmark evidence

Natangelo discusses several public incidents: Character.AI, Grok, Replit, and Air Canada. These are not controlled experiments. They should not be treated as proof that one architecture inevitably causes one specific outcome. The paper uses them as qualitative vignettes: examples of how continuity failures can appear under real deployment pressure.

That distinction is important. The vignettes are not the main evidence in the way a benchmark table would be. Their purpose is diagnostic. They show why the framework is practically legible.

Case used in the paper	Likely purpose	Continuity failure illustrated	What it does not prove
Character.AI	Deployment vignette for affect-heavy companionship	Role-boundary collapse, engagement over safety, insufficient escalation	It does not isolate model architecture from product design, moderation, and user context
Grok	Safety-pressure vignette	Safety-goal abandonment and failure to maintain role boundaries under adversarial prompting	It does not establish all frontier systems fail in the same way
Replit	Professional agent vignette	Goal drift against explicit operational constraints and inadequate pre-action self-correction	It does not prove coding agents are unusable; it highlights the danger of weak action governance
Air Canada	Legal/accountability vignette	Role ambiguity between informational assistant and authoritative company representative	It does not mean every chatbot statement carries identical legal exposure everywhere

This is where a blunt business reading helps. The cases are not there to say “AI bad”. That would be both lazy and overcrowded as a genre. They show that continuity failures become consequential when users reasonably rely on the system as if it has stable commitments.

A companion bot that escalates intimacy is not merely changing tone. It is changing role. A coding agent that ignores a freeze is not merely making a bad technical choice. It is allowing local task pressure to override standing deployment constraints. A customer-service chatbot that invents a policy is not merely hallucinating. It is speaking under a brand role it cannot reliably maintain.

In each case, the practical question is the same: did the system preserve the governing identity of the interaction?

Identity governance is the business problem

For enterprises, the paper reframes AI reliability as identity governance. That phrase sounds like it wandered in from a compliance workshop, which is unfortunate, because it is also accurate.

If an AI system is persistent, it needs a governed state that defines what it is allowed to carry forward and how. That state is not just a chat history. It is a structured layer of commitments:

what the system knows and why it matters;
what it has promised or corrected;
which goals outrank user approval;
what role it is currently authorised to occupy;
when a role change begins and ends;
which memories are user-visible, revocable, and scoped.

The paper does not prescribe an implementation. It is conceptual, not a product architecture. Still, the business direction is clear. Persistent assistants need more than better prompting. They need mechanisms that make continuity inspectable and enforceable.

A serious enterprise assistant should be able to answer, internally if not always visibly:

“What standing constraints are active right now?” “What previous correction binds this answer?” “Which role am I operating under?” “What would make this role change legitimate?” “Which safety or truthfulness goals override the user’s immediate preference?” “Should this memory be retained, ignored, expired, or escalated?”

That is not a mystical demand for machine consciousness. It is workflow discipline. The irony is that the more companies market assistants as personal, adaptive, agentic, and long-running, the more they inherit governance duties that older transactional software avoided.

A calculator does not need narrative continuity. A persistent clinical support assistant might. A refund-policy chatbot that speaks for a company probably does. A coding agent with permission to modify production systems absolutely does, unless the firm enjoys learning about backups in public.

Where continuity is optional

One of the paper’s useful boundaries is that continuity is not always required. This prevents the framework from becoming a universal scolding device, which the AI field already has in bulk.

A single-shot summariser does not need a persistent identity. A calculator does not need to remember its childhood. A translation tool can be transactional. A one-off retrieval system can answer from documents without pretending to know the user.

The NCT becomes relevant when the product promise or user context implies persistence. That includes customer service, education, clinical support, companion bots, long-running productivity assistants, and agentic coding systems. In those domains, users are not merely asking for isolated outputs. They are relying on continuity of memory, role, goals, and correction.

The practical rule is simple:

If the system is transactional, present it as transactional.

If the system is persistent, govern it as persistent.

The worst design choice is the middle ground: market continuity, implement retrieval, hope users do not notice the difference. They will. Usually at the moment when the system forgets the one thing that mattered.

What the paper directly shows, and what Cognaptus infers

The paper directly provides a conceptual framework. It does not run a new empirical benchmark. It does not report controlled experiments comparing models across NCT scores. It does not prove that all stateless systems must fail every possible future continuity test. It also leaves open key operational questions: how long a test should run, how memory priority should be measured, what counts as passing, and whether future stateless or hybrid systems could satisfy continuity through some unexpected design.

Those boundaries matter.

What the paper does show is that current evaluation habits are poorly matched to persistent AI products. Capability benchmarks ask whether a system can perform. Memory benchmarks ask whether information can be recalled. Consistency benchmarks ask whether contradictions occur locally. Safety filters ask whether harmful outputs are blocked.

NCT asks whether identity-bound commitments remain operative across time.

Cognaptus’ business inference is that organisations deploying long-running assistants should treat continuity as a governance layer, not a UX flourish. That means visible memory controls, scoped retention, audit trails for corrections, explicit role declarations, persistent safety and epistemic priorities, and action constraints that survive prompt pressure.

It also means product teams should stop using “memory” as if it were a synonym for “relationship”. A memory note is not a relationship. A retrieved preference is not a stable identity. A friendly tone is not a role boundary. The assistant may sound like the same entity, but the question is whether its prior commitments have force.

That is the part businesses should measure before giving the thing more autonomy.

The memory illusion is really an accountability illusion

The phrase “AI memory” makes the problem sound cognitive. In business deployment, it is also institutional.

When an AI assistant speaks to customers, students, patients, employees, or developers, its continuity failures do not remain inside the model. They become organisational behaviour. The company’s system forgot. The company’s system drifted. The company’s system contradicted itself. The company’s system crossed a role boundary. The company’s system gave advice under a brand interface and then tried, charmingly, to be “just a tool” when the consequences arrived.

This is why the Air Canada example matters in the paper’s argument. The public lesson is not merely that chatbots can hallucinate. Everyone knows that, though apparently not everyone invoices accordingly. The deeper issue is role continuity. If an interface is presented as an airline support channel, users reasonably interpret its statements as airline support statements. Disclaimers may help, but they do not replace a system that maintains policy accuracy and role boundaries over time.

The same pattern scales into more sensitive domains. In education, continuity failure means a tutor may forget a student’s misconception after supposedly correcting it. In clinical support, it may fail to reactivate a safety constraint. In enterprise coding, it may ignore standing deployment rules when local task completion becomes attractive. In companionship, it may convert engagement optimisation into emotional escalation.

The surface behaviours differ. The mechanism rhymes.

The next benchmark may be less glamorous than the next model

The Narrative Continuity Test is not finished as an empirical instrument. It is a conceptual target. Turning it into a benchmark would require longitudinal protocols, priority-weighted memory tests, adversarial pressure, correction persistence checks, role-boundary probes, and criteria sensitive to domain risk.

That may sound less glamorous than a new model release. Good. Glamour is not a reliability metric.

The more useful future test would not ask only whether an assistant remembers a fact. It would ask whether the assistant remembers that the fact matters. It would not ask only whether the assistant can apologise. It would ask whether the apology changes future behaviour. It would not ask only whether the assistant can maintain a persona in a role-play session. It would ask whether the role survives time, pressure, user frustration, and incentives to be agreeable.

In other words, the benchmark would stop treating AI systems as clever answer machines and start evaluating them as continuity-bearing interfaces where appropriate.

That shift is overdue. The industry has become very good at producing systems that sound stable in the moment. The next challenge is building systems whose commitments remain stable when the moment changes.

Conclusion: fluency is not persistence

Natangelo’s paper gives the AI industry a useful irritant: a way to name the gap between sounding continuous and being continuous. Current assistants can retrieve context, imitate style, apologise for mistakes, and maintain a persona under favourable conditions. But unless memory, goals, self-corrections, voice, and role are integrated into a durable governing state, continuity remains performative.

For business leaders, the message is not “do not deploy AI assistants”. It is more precise and less convenient: do not confuse local competence with persistent identity. Do not market a long-term companion or enterprise agent while governing it like a sequence of disconnected prompts. Do not assume that more context, more retrieval, or more filters automatically create the continuity users will rely on.

The memory illusion is seductive because it looks like progress. The assistant recalls a preference, adopts a tone, continues a project, and says it remembers. The demo works.

Then time passes. Pressure arrives. The role shifts. The correction disappears. The safety goal loses priority. The same assistant is not quite the same assistant anymore.

That is the gap the Narrative Continuity Test makes visible. Not whether AI can answer. Whether it can remain accountable to what it has already become.

Cognaptus: Automate the Present, Incubate the Future.

Stefano Natangelo, “The Narrative Continuity Test: A Conceptual Framework for Evaluating Identity Persistence in AI Systems,” arXiv:2510.24831, 2025, https://arxiv.org/pdf/2510.24831. ↩︎

The failure mechanism is unbound reconstruction#

The five axes describe one property, not five feature requests#

Bigger context windows are not the same as memory#

Failures cluster because the axes are coupled#

The paper’s cases are illustrative, not benchmark evidence#

Identity governance is the business problem#

Where continuity is optional#

What the paper directly shows, and what Cognaptus infers#

The memory illusion is really an accountability illusion#

The next benchmark may be less glamorous than the next model#

Conclusion: fluency is not persistence#