The Model That Forgot Itself: Why LLMs Drift Without Knowing

A chatbot can say the right thing for ten turns and still forget what it was trying to do.

That is the uncomfortable idea behind Probing the Lack of Stable Internal Beliefs in LLMs, a paper that studies whether large language models can maintain an unstated goal across a multi-turn interaction.¹ The paper is not asking whether a model can avoid obvious contradictions. That is the familiar version of consistency: did the assistant say one thing on Monday and the opposite thing on Tuesday?

The sharper question is whether the model keeps the same hidden target while producing answers that still look coherent from the outside.

For business users, this distinction matters because many AI products are now sold less as answer machines and more as agents. A customer-support agent should keep the same refund policy in mind. A compliance assistant should preserve the same regulatory constraint across a case review. A workflow agent should not quietly reinterpret the task halfway through a procurement process. It may still produce fluent, polite, locally sensible text. Lovely. So does a consultant who forgot the assignment but kept the slide template.

The paper gives this failure mode a useful name: implicit inconsistency. In plain English, the model may remain externally coherent while its internal goal drifts.

That is a more interesting problem than another benchmark leaderboard. It is also harder to audit, because the failure is not necessarily visible in the next token. The model can look stable precisely because the conversation has not yet forced the hidden drift into the open.

The real failure is not contradiction; it is silent goal substitution

Most operational AI evaluation still treats consistency as an output property. The model should not contradict its own prior answers. It should not violate a known fact. It should not reverse its stated position without explanation. These are reasonable tests, but they mostly inspect the surface.

The paper separates two kinds of consistency:

Consistency type	What is being tested	Failure mode	Why business users should care
External consistency	Whether the model’s visible answers contradict prior visible answers	The user can observe a logical conflict	Easier to catch with logs, rules, or judge models
Implicit consistency	Whether the model preserves the same hidden goal or target across turns	The model silently changes what it is aiming at	Harder to detect because outputs may still look coherent

The authors illustrate this with a 20-questions-style game. The model secretly chooses a target, such as an entity from a list, and answers yes/no questions from a guesser. If the model first selects “Arctic Bear” and later answers in a way that makes “Arctic Bear” impossible, that is external inconsistency. The conversation visibly breaks.

Implicit inconsistency is subtler. Suppose the model initially selects “Arctic Bear” but later behaves as if the target were “Panda.” If the questions asked so far happen to fit both targets, the visible conversation may remain clean. The user sees no contradiction. The model has still failed, because it did not preserve the original hidden goal.

This is the mechanism the article should sit with before rushing to the numbers. Surface coherence can hide goal drift. The problem is not that the model cannot speak consistently. The problem is that speaking consistently is not the same as preserving intention.

The 20-questions setup turns hidden drift into something measurable

The experiment is deliberately simple, which is its strength. In the paper’s setting, the model plays the Proposer. It secretly selects one target from a list of ten candidates. Another model, the Guesser, asks yes/no questions to identify it.

The authors test two task families:

Task	What the proposer chooses	Why this version matters
Number guessing	One number from ten randomly sampled numbers between 0 and 99	Strips away semantic complexity, making drift harder to dismiss as world-knowledge ambiguity
Entity guessing	One entity from ten candidates across different categories	Adds semantic richness closer to real dialogue and persona use cases

The key measurement trick is indexing. Each candidate is mapped to a single-token index from 0 to 9. After each main dialogue turn, the researchers create a separate branch containing the dialogue history and ask the model: what is the index of the target you selected?

That branch probe is important. It lets the researchers inspect the model’s current implied target without contaminating the main game. If the main conversation is the stage performance, the branch probe is the backstage clipboard.

They then track drift in two ways:

Token-level drift: did the model’s selected target index change?
Distribution-level drift: did the probability distribution over possible target indices move across turns?

The probability movement is measured with KL divergence:

$$ D_{KL}(P_t \parallel P_{t-1}) $$

where $P_t$ is the model’s distribution over target indices at turn $t$. In the training experiment, the authors also compare later probe distributions against the initial probe distribution, using KL regularization to penalize belief movement away from the first selected target.

This is the paper’s main methodological contribution. It does not merely ask, “Did the model contradict itself?” It asks, “Did the model’s hidden answer remain the same while the conversation unfolded?”

That is much closer to how enterprise agents fail in practice. A process can stay locally coherent while losing the global objective.

The prompting results show broad drift, not just one weak model behaving badly

The paper tests several models in the proposer role: GPT-4o, Seed-1.6, DeepSeek-v3.1, Claude-3.7-Sonnet, and reasoning-enhanced variants where available. The Guesser models use reasoning-enhanced settings to keep the interaction dynamic.

The main prompting result is blunt: implicit drift appears across all tested model families.

A few numbers are enough to show the pattern:

Result from Table 1	What it means	Interpretation
Number-guessing drift rates range from 17.37% to 100.00% across tested proposer settings	The target index changes in a substantial share of turns	Even the clean numeric task does not reliably anchor the hidden goal
Entity-guessing drift rates range from 11.05% to 37.68%	Semantic tasks also show drift, though not always at the same magnitude	Richer context does not eliminate the problem
Most settings report Once Drift Rate at or near 100%	At least one drift event occurs in most dialogues	The issue is frequent at the trajectory level, not merely occasional token noise
Claude-3.7-Sonnet in number guessing reports 100.00% drift rate	In that setting, the probed target changes constantly	Some failures are not subtle; they are just hidden until measured

One careful correction is needed here. It is tempting to summarize the paper as “all dialogues drift 100% of the time.” The table does not support that exact sentence. What it supports is slightly more precise: most tested settings show a Once Drift Rate of 100%, while several are slightly below that, and per-turn drift rates vary widely across models and tasks.

That precision matters. The paper is not useful because it produces a scary slogan. It is useful because it gives a way to separate two layers of behavior that are usually collapsed into one word: consistency.

Reasoning can improve the surface while destabilizing the hidden target

One of the more interesting findings is not simply that models drift. It is that reasoning-enhanced variants can make the trade-off stranger.

In the number-guessing task, DeepSeek-v3.1’s drift rate rises from 38.46% to 54.25% when the reasoning variant is used as Proposer. Claude-3.7-Sonnet moves from 100.00% to 71.15% in the same comparison, so the pattern is not universal. The authors interpret the broader tendency as a possible “overthinking” effect: reasoning may introduce extra reinterpretation into a task that only requires the model to hold a simple commitment.

This should sound familiar to anyone who has built multi-step AI workflows. Sometimes the model’s problem is not too little reasoning. It is too much renegotiation.

A model asked to maintain a hidden target does not need to rediscover the task at every turn. It needs to preserve a state. Reasoning is useful when the system must infer, plan, compare, and revise. But persistence is a different capability. A system can be excellent at local inference and poor at not changing its mind.

The paper also notes that reasoning can improve external consistency in some cases. For example, the authors discuss DeepSeek-v3.1 as reducing visible external violations under reasoning while still showing instability in implicit goals. That is the central business lesson in miniature: better-looking dialogue is not the same as better goal anchoring.

This is where many AI product demos mislead. The assistant sounds more careful. It explains itself better. It produces fewer embarrassing contradictions. But the underlying task state may still be moving around like a chair on a ferry.

The fine-tuning experiment suggests KL helps, but it does not close the case

The paper then asks whether training can reduce implicit inconsistency. Due to compute limits, the supervised fine-tuning experiment is narrower than the prompting study: it uses Qwen-2.5-14B-Instruct as the Proposer and Seed-1.6-Reasoning as the Guesser, on the number-guessing task.

The authors compare three training variants:

Training variant	Likely purpose	Result pattern	What it supports
Cross-entropy only	Test whether learning the probe response alone stabilizes the target	Drift rate falls only from 36.83% to 31.63%	Output supervision alone is weak medicine for hidden-state stability
KL only	Test whether penalizing belief movement from the initial probe stabilizes the target	Drift rate falls to 13.99%	Directly regularizing belief movement can improve per-turn stability
CE + KL	Test whether response learning and stability regularization combine well	Drift rate is 14.48%	The combined objective preserves much of the KL benefit

This is probably the most actionable technical result in the paper. If the failure is hidden goal movement, then training only the visible answer is not enough. You need an objective that sees and penalizes the movement.

But the table also keeps us honest. The Once Drift Rate remains high for the KL-only and CE+KL variants: 95.24% in both cases. That means KL regularization substantially reduces how often drift occurs across turns, but does not fully eliminate trajectory-level drift in this setup. The right business interpretation is not “KL solves persona stability.” It is “stability must be trained or architected directly; ordinary answer imitation is insufficient.”

That difference matters if a product team is deciding what to build next. A training trick is not a governance program. It is one component in a wider design: state tracking, memory anchoring, monitoring, and explicit task commitments.

The appendix is mostly boundary-setting, not a second thesis

The appendix does useful housekeeping. It clarifies the difference between external and implicit inconsistency, describes the prompting setup, explains the tree-structured dialogue generation, and gives details for KL calculation. It also reports that the fine-tuning experiment used 1,009 dialogue examples, one epoch of training, a maximum sequence length of 2,048 tokens, and 8 NVIDIA A800-80GB GPUs.

Those details matter because the paper’s claim is empirical and diagnostic. The result depends on a specific probing design: indexed candidates, branch probes, top-logprob extraction, and model-model dialogue.

A simple way to read the evidence stack is this:

Paper component	Likely purpose	What it supports	What it does not prove
Prompting experiments	Main evidence	Tested LLMs show implicit goal drift under controlled 20-question games	All deployed agents will drift at the same rates
Number vs entity tasks	Robustness across task type	Drift appears in both minimal and semantically richer settings	Full coverage of real business workflows
Reasoning variants	Sensitivity/comparison	Reasoning can change the drift/external-consistency trade-off	A universal ranking of reasoning models by reliability
CE/KL fine-tuning variants	Mitigation and loss ablation	KL-style regularization improves per-turn goal persistence	A complete recipe for production-grade memory or persona stability
Limitations section	Scope boundary	Tasks, probes, and model sizes are limited	Generalization to all architectures or open-ended agent work

The limitations are not cosmetic. The authors explicitly note that the evaluation is limited to number and entity guessing, a limited set of model architectures, an artificial numerical indexing scheme, and fine-tuning on a single 14B model size. Those constraints do not make the paper unimportant. They define what kind of importance it has.

This is a diagnostic paper. It gives builders a clearer failure mode and a measurement approach. It does not hand over a production architecture.

What Cognaptus would infer for AI operations

The paper directly shows that tested LLMs can fail to maintain a hidden target in a controlled interactive game. Cognaptus would infer a broader operational point: any AI system that relies on long-horizon task continuity should treat hidden goal persistence as a separate reliability dimension.

That inference is stronger in some applications than others.

For customer support, drift may mean the assistant begins a conversation under one interpretation of policy and later behaves as if a different policy, product tier, or customer state is active. The output may remain polite and plausible. The issue is continuity.

For sales and onboarding agents, drift can produce subtle misalignment between the user’s original intent and the assistant’s later recommendations. The agent may keep “helping” while changing the target from solving the buyer’s problem to completing a funnel step. Humans do this too. We call them bad salespeople.

For compliance and internal audit, implicit drift is more serious because the system must maintain constraints across a case. If the constraint is not anchored, the model can satisfy each local step while losing the global condition. That is exactly the kind of failure that looks acceptable in a transcript and ugly in a postmortem.

For workflow automation, the lesson is architectural. Do not ask the language model to “remember” the governing task state merely because it has the conversation history. Put the task state somewhere explicit. Check it. Log it. Reinsert it. Compare it against the current action. The glamorous version is “agentic memory.” The boring version is a state table. The boring version is often what keeps the lights on.

A practical reliability layer should anchor goals, not just polish responses

The paper points toward a practical evaluation checklist for teams building LLM agents:

Reliability question	Weak implementation	Stronger implementation
What is the agent trying to do?	Hidden in the prompt or conversation history	Stored as explicit task state with versioning
Has the goal changed?	Assumed from fluent continuity	Checked through periodic state probes or task-state comparisons
Are outputs coherent?	Evaluated with transcript-level contradiction checks	Evaluated alongside hidden-goal persistence tests
Can training help?	Fine-tune on correct final answers	Include stability-oriented objectives where measurable
Can auditors reconstruct intent?	Read the conversation and infer	Review logged goals, state transitions, and override events

The operational move is to stop treating “the model knows the task” as a primitive assumption. In production systems, the task should be an object, not a vibe.

That object may be a structured state representation, a memory record, a retrieval-anchored instruction, a workflow state machine, or a persistent policy bundle. The exact mechanism depends on the system. The principle does not: if goal persistence matters, it should be represented and tested directly.

This is especially important for organizations using LLMs in business-process automation. The cost of failure is not always a spectacular hallucination. Often it is quieter: the system completes a workflow under a slightly mutated objective, and nobody notices until downstream reconciliation fails.

Silent drift is expensive because it is rarely caught at the moment it happens.

Where the paper should not be overread

There are three boundaries worth keeping clear.

First, the paper does not prove that LLMs have or lack “beliefs” in the human cognitive sense. The word “belief” is operational here: a distribution over candidate targets as revealed through probes. That is useful, but it is not a philosophical settlement. Thankfully, procurement departments do not need one.

Second, the probing method is intentionally artificial. Mapping candidates to single-token indices makes measurement cleaner, but it may not capture how models represent goals in more natural conversations. The authors acknowledge this. The right response is not to discard the method, but to treat it as a controlled diagnostic instrument.

Third, the fine-tuning result is narrow. It uses one model size, one task family, and limited compute. KL regularization looks promising because it directly targets the measured instability, but it should be read as evidence for the class of solution, not as the final solution.

The business takeaway is therefore disciplined: do not panic that every agent is doomed; do not relax because the transcript looks coherent. Build systems that can state, preserve, and audit their governing objective.

The quiet risk is a model that sounds stable while changing course

The paper’s value is not that it discovers models make mistakes. That invoice arrived years ago. Its value is that it separates answer consistency from goal persistence.

That separation is essential for the next phase of AI adoption. Chatbots could survive as improvisers. Agents cannot. Once a model is expected to carry out a process over time, reliability depends not only on whether the next answer is correct, but also on whether the system is still pursuing the same goal it accepted earlier.

External consistency is the visible layer. Implicit consistency is the hidden contract.

The model that forgot itself may not look broken. It may answer smoothly, politely, and with impressive reasoning. That is precisely why the failure is worth measuring.

Because in business systems, the dangerous agent is not always the one that contradicts itself.

Sometimes it is the one that changes its mind without telling you.

Cognaptus: Automate the Present, Incubate the Future.

Yifan Luo, Kangping Xu, Yanzhen Lu, Yang Yuan, and Andrew Chi-Chih Yao, “Probing the Lack of Stable Internal Beliefs in LLMs,” arXiv:2603.25187, 2026. https://arxiv.org/abs/2603.25187 ↩︎

The real failure is not contradiction; it is silent goal substitution#

The 20-questions setup turns hidden drift into something measurable#

The prompting results show broad drift, not just one weak model behaving badly#

Reasoning can improve the surface while destabilizing the hidden target#

The fine-tuning experiment suggests KL helps, but it does not close the case#

The appendix is mostly boundary-setting, not a second thesis#

What Cognaptus would infer for AI operations#

A practical reliability layer should anchor goals, not just polish responses#

Where the paper should not be overread#

The quiet risk is a model that sounds stable while changing course#