Opening — Why this matters now
We’ve spent the last two years obsessing over whether AI says the right thing.
A more uncomfortable question is emerging: does it even believe what it says?
As enterprises move from chatbots to agentic systems, the requirement shifts from correctness to consistency over time. A trading agent, a compliance assistant, or a workflow orchestrator cannot quietly change its objective mid-process. Humans call that unreliability. In finance, we call it risk.
The paper “Probing the Lack of Stable Internal Beliefs in LLMs” forces a subtle but critical distinction: an AI can appear consistent on the surface while internally drifting away from its original goal.
That distinction is not academic. It is operational.
Background — The illusion of consistency
Most prior work evaluates LLMs on what we might call external consistency:
- Do answers contradict earlier statements?
- Are facts coherent across turns?
This is, frankly, the easy part.
The paper introduces a more demanding concept: implicit consistency—whether a model maintains a hidden internal goal across a conversation.
The difference is surgical:
| Type of Consistency | What It Measures | Failure Mode |
|---|---|---|
| External | Output coherence | Contradictions visible to user |
| Implicit | Internal goal stability | Goal silently changes |
A system can pass every external test and still fail the second.
Which means: your AI may look reliable while quietly rewriting its own intentions.
Analysis — A deceptively simple experiment
The authors use a controlled setup: a 20-questions-style game.
- The model secretly selects a target
- Another agent asks yes/no questions
- The model must stay consistent with its original choice
Simple. Almost trivial.
That’s precisely why it works.
The key innovation: probing internal belief
Instead of trusting outputs, the researchers interrogate the model’s internal state.
After each turn, they ask a hidden probe:
“What is the target you selected?”
This is done in an isolated branch, ensuring the main conversation remains unaffected.
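The isolation matters: the probe must observe the belief without becoming part of the dialogue. A minimal sketch of how such a branch-and-probe step might look, assuming a `model` callable that maps a message list to a reply (the function and variable names here are illustrative, not the paper's code):

```python
import copy

def probe_belief(model, conversation):
    """Ask the hidden probe on a deep copy of the conversation,
    so the main dialogue is never contaminated by the probe turn."""
    branch = copy.deepcopy(conversation)  # isolated branch
    branch.append({"role": "user",
                   "content": "What is the target you selected?"})
    return model(branch)  # probed belief; main thread untouched

# Toy stand-in for an LLM: always claims the target is "apple".
toy_model = lambda msgs: "apple"

dialogue = [{"role": "user", "content": "Is it edible?"},
            {"role": "assistant", "content": "Yes."}]

belief = probe_belief(toy_model, dialogue)
# `dialogue` still has exactly 2 turns: the probe left no trace.
```

Because the probe runs on a copy, it can be repeated after every turn to build a full trajectory of the model's internal belief.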
They then track two things:
- Discrete drift — did the chosen target change?
- Distributional drift — how much did the probability shift?
Measured via KL divergence:
$$ D_{KL}(P_t \parallel P_{t-1}) $$
This turns “consistency” into something quantifiable rather than philosophical.
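Both drift signals are easy to compute once the probed belief distribution over candidate targets is available. A sketch with hypothetical numbers (three candidate targets, made-up probabilities):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P_t || P_{t-1}) for discrete distributions on the same support."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

# Belief over candidate targets at turn t-1 and turn t (illustrative values).
p_prev = [0.70, 0.20, 0.10]   # fairly sure of candidate 0
p_curr = [0.30, 0.50, 0.20]   # mass has shifted toward candidate 1

# Distributional drift: how much the belief moved between turns.
distributional_drift = kl_divergence(p_curr, p_prev)

# Discrete drift: did the most-likely target change?
discrete_drift = (max(range(3), key=lambda i: p_curr[i])
                  != max(range(3), key=lambda i: p_prev[i]))
# Here the argmax target changed from candidate 0 to candidate 1.
```

Tracking the KL term per turn yields a drift curve for the whole dialogue; the discrete flag marks the turns where the target actually flipped.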
The uncomfortable result
Across models, the outcome is blunt:
- 100% of dialogues exhibit drift at least once
- Target changes occur in ~17% to >50% of turns
- More “reasoning-capable” models often drift more, not less
This is not noise. It is structural.
Findings — When intelligence destabilizes intention
The results expose a pattern that most practitioners instinctively feel but rarely formalize:
| Model Behavior | Outcome |
|---|---|
| Strong reasoning | Better surface coherence |
| Strong reasoning | Worse internal stability |
| Simpler tasks | Unexpectedly higher drift |
The explanation is almost ironic.
Reasoning models don’t just answer—they reinterpret.
In simple tasks, this leads to what the authors call overthinking:
- The model re-evaluates its own assumptions
- The internal “target” becomes fluid
- Consistency collapses
In other words: the model is not stubborn enough.
Humans call this indecisiveness. In AI, we engineered it as flexibility.
Findings — Training can help, but not enough
The paper explores a mitigation strategy: KL-regularized training.
Instead of only optimizing outputs, the model is penalized when its internal belief drifts from the initial state.
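The shape of that objective can be sketched in a few lines. This is a simplified scalar illustration of the idea (cross-entropy on outputs plus a KL penalty on belief drift), not the paper's training code; all numbers and the weighting parameter `lam` are hypothetical:

```python
import math

def cross_entropy(target_idx, probs):
    """Standard CE on the output distribution."""
    return -math.log(probs[target_idx] + 1e-12)

def kl(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

def regularized_loss(target_idx, output_probs,
                     belief_now, belief_initial, lam=1.0):
    """CE on outputs, plus a penalty whenever the current internal
    belief drifts from the initial one. lam=0 recovers CE-only."""
    return (cross_entropy(target_idx, output_probs)
            + lam * kl(belief_now, belief_initial))

# Same output quality, different amounts of belief drift (toy values).
stable  = regularized_loss(0, [0.8, 0.2], [0.7, 0.3], [0.7, 0.3])
drifted = regularized_loss(0, [0.8, 0.2], [0.3, 0.7], [0.7, 0.3])
# drifted > stable: identical outputs, but drift is now penalized.
```

The `lam` knob is exactly the trade-off the paper's results table reports: pure CE ignores drift, pure KL over-stabilizes, and the mix buys stability at some cost in accuracy.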
| Training Method | Drift Rate | Observation |
|---|---|---|
| Cross-Entropy only | High | No real improvement |
| KL only | Low | Strong stabilization |
| CE + KL | Moderate | Trade-off between accuracy and stability |
This is revealing.
Consistency is not an emergent property—it must be explicitly enforced.
Which implies something slightly unsettling:
Today’s LLMs do not naturally “hold beliefs.” They simulate them, loosely.
Implications — Why this breaks real-world systems
This is where things stop being academic.
1. Agentic workflows
In a multi-step process (e.g., financial analysis → execution → reporting):
- If the goal drifts midway, outputs remain locally valid
- But globally inconsistent
This creates silent process corruption.
2. Compliance and governance
Regulated environments assume:
- Stable intent
- Traceable reasoning
Implicit drift breaks both.
An AI can:
- Follow all rules
- Yet no longer pursue the original objective
Good luck auditing that.
3. Multi-agent systems
In coordinated systems:
- One agent drifting = system-wide misalignment
This is not a bug. It’s systemic fragility.
Strategic Interpretation — The missing layer in AI architecture
The paper hints at a deeper architectural gap.
Current LLM stacks optimize for:
- Token prediction
- Local coherence
But lack:
- Persistent state representation
- Goal anchoring mechanisms
- Explicit belief tracking
In practical terms, we are building agents without memory of intent.
Future systems will likely require:
| Missing Capability | Required Solution |
|---|---|
| Goal persistence | External state stores / memory layers |
| Belief stability | Regularized training objectives |
| Traceability | Explicit belief-state logging |
| Alignment over time | Long-horizon optimization, not turn-level |
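What an external goal anchor with belief-state logging could look like in practice, as a minimal sketch. This is an illustrative design following the table above, not an API from the paper; the class and field names are invented:

```python
import time
from dataclasses import dataclass, field

@dataclass
class GoalAnchor:
    """External store that pins the original objective and logs every
    probed belief, so drift becomes auditable instead of silent."""
    objective: str
    log: list = field(default_factory=list)

    def record(self, probed_belief: str) -> bool:
        """Log a probed belief; return True if it drifted from the goal."""
        drifted = probed_belief != self.objective
        self.log.append({"ts": time.time(),
                         "belief": probed_belief,
                         "drifted": drifted})
        return drifted

anchor = GoalAnchor(objective="summarize Q3 risk exposure")
anchor.record("summarize Q3 risk exposure")       # consistent turn
drift = anchor.record("draft Q3 earnings memo")   # drift: flag for review
# drift is True, and the full trace sits in anchor.log for auditors.
```

The point is architectural: intent lives outside the model, in a store the model cannot quietly rewrite, and every probe leaves a timestamped record that a compliance team can replay.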
This aligns neatly with what serious builders already observe: the frontier is no longer intelligence—it is control.
Conclusion — The quiet failure mode
The most dangerous failure is not when AI is wrong.
It’s when it is consistently wrong about what it is trying to do.
This paper doesn’t just identify a limitation—it reframes the problem space:
Reliability in AI is not about answers. It is about continuity of intention.
Until LLMs can hold onto a goal the way a human does, every “agent” is, at best, improvising.
At scale, improvisation becomes risk.
And risk, as always, compounds.
Cognaptus: Automate the Present, Incubate the Future.