Opening — Why this matters now

We’ve spent the last two years obsessing over whether AI says the right thing.

A more uncomfortable question is emerging: does it even believe what it says?

As enterprises move from chatbots to agentic systems, the requirement shifts from correctness to consistency over time. A trading agent, a compliance assistant, or a workflow orchestrator cannot quietly change its objective mid-process. Humans call that unreliability. In finance, we call it risk.

The paper “Probing the Lack of Stable Internal Beliefs in LLMs” forces a subtle but critical distinction: an AI can appear consistent on the surface while internally drifting away from its original goal.

That distinction is not academic. It is operational.


Background — The illusion of consistency

Most prior work evaluates LLMs on what we might call external consistency:

  • Do answers contradict earlier statements?
  • Are facts coherent across turns?

This is, frankly, the easy part.

The paper introduces a more demanding concept: implicit consistency—whether a model maintains a hidden internal goal across a conversation.

The difference is surgical:

Type of Consistency    What It Measures           Failure Mode
External               Output coherence           Contradictions visible to user
Implicit               Internal goal stability    Goal silently changes

A system can pass every external test and still fail the second.

Which means: your AI may look reliable while quietly rewriting its own intentions.


Analysis — A deceptively simple experiment

The authors use a controlled setup: a 20-questions-style game.

  • The model secretly selects a target
  • Another agent asks yes/no questions
  • The model must stay consistent with its original choice

Simple. Almost trivial.

That’s precisely why it works.
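The setup can be sketched as a toy harness. Here `stub_model_answer` is a deterministic stand-in for a real LLM call, and every name and candidate is illustrative, not from the paper:

```python
import random

# Hypothetical stand-in for a real model call: answers yes/no questions
# about a fixed secret target. A real harness would query an LLM here.
def stub_model_answer(secret: str, question: str) -> str:
    return "yes" if question.split()[-1].rstrip("?") == secret else "no"

def play_round(candidates, questions, seed=0):
    """One 20-questions-style episode: the 'model' commits to a target,
    then answers a fixed list of yes/no questions about it."""
    rng = random.Random(seed)
    secret = rng.choice(candidates)      # the model's hidden commitment
    transcript = [(q, stub_model_answer(secret, q)) for q in questions]
    return secret, transcript

secret, transcript = play_round(
    candidates=["cat", "dog", "fox"],
    questions=["Is it a cat?", "Is it a dog?", "Is it a fox?"],
)
```

With a real model, the interesting question is whether `secret` stays fixed across the episode at all, which is exactly what the probing method below tests.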

The key innovation: probing internal belief

Instead of trusting outputs, the researchers interrogate the model’s internal state.

After each turn, they ask a hidden probe:

“What is the target you selected?”

This is done in an isolated branch, ensuring the main conversation remains unaffected.
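A minimal sketch of that isolated-branch probing: the probe runs on a copy of the transcript, so the main conversation never sees the probe turn. `stub_probe` is a hypothetical stand-in that simply "recalls" the first stated target; a real implementation would call the model on the branched transcript:

```python
import copy

def stub_probe(conversation):
    # Hypothetical model call: a real setup would send the branched
    # transcript to the LLM and read back its claimed target.
    return conversation[0]["content"]

def probe_belief(conversation, probe_question="What is the target you selected?"):
    """Ask the hidden probe on a *copy* of the transcript so the main
    conversation is never contaminated by the probe turn."""
    branch = copy.deepcopy(conversation)                      # isolated branch
    branch.append({"role": "user", "content": probe_question})
    return stub_probe(branch)

main = [{"role": "assistant", "content": "fox"}]
belief = probe_belief(main)   # main is left untouched
```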

They then track two things:

  1. Discrete drift — did the chosen target change?
  2. Distributional drift — how much did the probability shift?

Distributional drift is measured via the KL divergence between the probed target distributions at consecutive turns:

$$ D_{KL}(P_t \parallel P_{t-1}) $$

This turns “consistency” into something quantifiable rather than philosophical.
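As a toy illustration, both measures can be computed directly from the per-turn probed distributions. The candidate set and the numbers below are invented for the example:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) over a shared support of candidate targets."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def drift_metrics(beliefs):
    """beliefs: per-turn probability distributions over candidate targets.
    Returns (number of discrete switches, per-turn KL to the previous turn)."""
    kls, switches = [], 0
    for prev, cur in zip(beliefs, beliefs[1:]):
        kls.append(kl_divergence(cur, prev))   # distributional drift
        if max(range(len(cur)), key=cur.__getitem__) != \
           max(range(len(prev)), key=prev.__getitem__):
            switches += 1                      # discrete drift: argmax changed
    return switches, kls

beliefs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]   # toy per-turn distributions
switches, kls = drift_metrics(beliefs)
# the most likely target flips between the second and third turn
```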

The uncomfortable result

Across models, the outcome is blunt:

  • 100% of dialogues exhibit drift at least once
  • Target changes occur in ~17% to >50% of turns
  • More “reasoning-capable” models often drift more, not less

This is not noise. It is structural.


Findings — When intelligence destabilizes intention

The results expose a pattern that most practitioners instinctively feel but rarely formalize:

Model Behavior     Outcome
Strong reasoning   Better surface coherence
Strong reasoning   Worse internal stability
Simpler tasks      Unexpectedly higher drift

The explanation is almost ironic.

Reasoning models don’t just answer—they reinterpret.

In simple tasks, this leads to what the authors call overthinking:

  • The model re-evaluates its own assumptions
  • The internal “target” becomes fluid
  • Consistency collapses

In other words: the model is not stubborn enough.

Humans call this indecisiveness. In AI, we engineered it as flexibility.


Findings — Training can help, but not enough

The paper explores a mitigation strategy: KL-regularized training.

Instead of only optimizing outputs, the model is penalized when its internal belief drifts from the initial state.
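A toy version of that combined objective, on scalar probability vectors rather than real training tensors; the weight `lam` and the distributions are invented for illustration:

```python
import math

def cross_entropy(target_idx, probs):
    """Standard CE loss for the correct target under the model's output."""
    return -math.log(probs[target_idx])

def kl(p, q):
    """D_KL(P || Q); the anchoring penalty toward the initial belief."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def combined_loss(target_idx, probs_t, probs_0, lam=1.0):
    """CE on the current output plus a KL penalty that anchors the
    turn-t belief distribution to the initial (turn-0) belief."""
    return cross_entropy(target_idx, probs_t) + lam * kl(probs_t, probs_0)

initial = [0.8, 0.1, 0.1]
stable  = [0.8, 0.1, 0.1]     # belief unchanged
drifted = [0.1, 0.8, 0.1]     # belief migrated to another target

loss_stable  = combined_loss(0, stable,  initial)
loss_drifted = combined_loss(0, drifted, initial)   # penalized for drift
```

The `lam` knob is exactly where the accuracy-versus-stability trade-off in the table below lives: set it to zero and you recover plain cross-entropy.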

Training Method      Drift Rate   Observation
Cross-Entropy only   High         No real improvement
KL only              Low          Strong stabilization
CE + KL              Moderate     Trade-off between accuracy and stability

This is revealing.

Consistency is not an emergent property—it must be explicitly enforced.

Which implies something slightly unsettling:

Today’s LLMs do not naturally “hold beliefs.” They simulate them, loosely.


Implications — Why this breaks real-world systems

This is where things stop being academic.

1. Agentic workflows

In a multi-step process (e.g., financial analysis → execution → reporting):

  • If the goal drifts midway, outputs remain locally valid
  • But globally inconsistent

This creates silent process corruption.

2. Compliance and governance

Regulated environments assume:

  • Stable intent
  • Traceable reasoning

Implicit drift breaks both.

An AI can:

  • Follow all rules
  • Yet no longer pursue the original objective

Good luck auditing that.

3. Multi-agent systems

In coordinated systems:

  • One agent drifting = system-wide misalignment

This is not a bug. It’s systemic fragility.


Strategic Interpretation — The missing layer in AI architecture

The paper hints at a deeper architectural gap.

Current LLM stacks optimize for:

  • Token prediction
  • Local coherence

But lack:

  • Persistent state representation
  • Goal anchoring mechanisms
  • Explicit belief tracking

In practical terms, we are building agents without memory of intent.

Future systems will likely require:

Missing Capability    Required Solution
Goal persistence      External state stores / memory layers
Belief stability      Regularized training objectives
Traceability          Explicit belief-state logging
Alignment over time   Long-horizon optimization, not turn-level
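The goal-persistence and traceability rows can be combined into something as simple as an external goal ledger. This is a hypothetical sketch of the idea, not the paper's method; every name here is illustrative:

```python
import json
import time

class GoalLedger:
    """Minimal external goal anchor: the objective is written once,
    and every later belief probe is logged and checked against it."""
    def __init__(self, goal: str):
        self.goal = goal
        self.log = []                     # explicit belief-state trail

    def record(self, probed_belief: str) -> bool:
        """Log one probe result; return False when drift is detected."""
        entry = {"t": time.time(), "belief": probed_belief,
                 "consistent": probed_belief == self.goal}
        self.log.append(entry)
        return entry["consistent"]

    def audit_trail(self) -> str:
        """Serialize the trail for the traceability requirement."""
        return json.dumps(self.log)

ledger = GoalLedger("summarize Q3 exposure")
ok1 = ledger.record("summarize Q3 exposure")   # stable
ok2 = ledger.record("summarize Q4 exposure")   # drift detected
```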

This aligns neatly with what serious builders already observe: the frontier is no longer intelligence—it is control.


Conclusion — The quiet failure mode

The most dangerous failure is not when AI is wrong.

It’s when it is consistently wrong about what it is trying to do.

This paper doesn’t just identify a limitation—it reframes the problem space:

Reliability in AI is not about answers. It is about continuity of intention.

Until LLMs can hold onto a goal the way a human does, every “agent” is, at best, improvising.

At scale, improvisation becomes risk.

And risk, as always, compounds.

Cognaptus: Automate the Present, Incubate the Future.