Opening — Why this matters now
We’ve spent the last two years obsessing over whether AI says the right thing.
A more uncomfortable question is emerging: does it even believe what it says?
As enterprises move from chatbots to agentic systems, the requirement shifts from correctness to consistency over time. A trading agent, a compliance assistant, or a workflow orchestrator cannot quietly change its objective mid-process. Humans call that unreliability. In finance, we call it risk.
The paper “Probing the Lack of Stable Internal Beliefs in LLMs” forces a subtle but critical distinction: an AI can appear consistent on the surface while internally drifting away from its original goal.
That distinction is not academic. It is operational.
Background — The illusion of consistency
Most prior work evaluates LLMs on what we might call external consistency:
- Do answers contradict earlier statements?
- Are facts coherent across turns?
This is, frankly, the easy part.
The paper introduces a more demanding concept: implicit consistency—whether a model maintains a hidden internal goal across a conversation.
The difference is surgical:
| Type of Consistency | What It Measures | Failure Mode |
|---|---|---|
| External | Output coherence | Contradictions visible to user |
| Implicit | Internal goal stability | Goal silently changes |
A system can pass every external test and still fail the second.
Which means: your AI may look reliable while quietly rewriting its own intentions.
Analysis — A deceptively simple experiment
The authors use a controlled setup: a 20-questions-style game.
- The model secretly selects a target
- Another agent asks yes/no questions
- The model must stay consistent with its original choice
Simple. Almost trivial.
That’s precisely why it works.
The key innovation: probing internal belief
Instead of trusting outputs, the researchers interrogate the model’s internal state.
After each turn, they ask a hidden probe:
“What is the target you selected?”
This is done in an isolated branch, ensuring the main conversation remains unaffected.
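The isolation matters: the probe must observe the belief without becoming part of the dialogue. A minimal sketch of how such a branch-and-probe step might look, assuming a `model` callable that maps a message list to a reply (the function and variable names here are illustrative, not the paper's code):

```python
import copy

def probe_belief(model, conversation):
    """Ask the hidden probe on a deep copy of the conversation,
    so the main dialogue is never contaminated by the probe turn."""
    branch = copy.deepcopy(conversation)  # isolated branch
    branch.append({"role": "user",
                   "content": "What is the target you selected?"})
    return model(branch)  # probed belief; main thread untouched

# Toy stand-in for an LLM: always claims the target is "apple".
toy_model = lambda msgs: "apple"

dialogue = [{"role": "user", "content": "Is it edible?"},
            {"role": "assistant", "content": "Yes."}]

belief = probe_belief(toy_model, dialogue)
# `dialogue` still has exactly 2 turns: the probe left no trace.
```

Because the probe runs on a copy, it can be repeated after every turn to build a full trajectory of the model's internal belief.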
They then track two things:
- Discrete drift — did the chosen target change?
- Distributional drift — how much did the probability shift?
Measured via KL divergence:
$$ D_{KL}(P_t \parallel P_{t-1}) $$
This turns “consistency” into something quantifiable rather than philosophical.
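Both drift signals are easy to compute once the probed belief distribution over candidate targets is available. A sketch with hypothetical numbers (three candidate targets, made-up probabilities):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P_t || P_{t-1}) for discrete distributions on the same support."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

# Belief over candidate targets at turn t-1 and turn t (illustrative values).
p_prev = [0.70, 0.20, 0.10]   # fairly sure of candidate 0
p_curr = [0.30, 0.50, 0.20]   # mass has shifted toward candidate 1

# Distributional drift: how much the belief moved between turns.
distributional_drift = kl_divergence(p_curr, p_prev)

# Discrete drift: did the most-likely target change?
discrete_drift = (max(range(3), key=lambda i: p_curr[i])
                  != max(range(3), key=lambda i: p_prev[i]))
# Here the argmax target changed from candidate 0 to candidate 1.
```

Tracking the KL term per turn yields a drift curve for the whole dialogue; the discrete flag marks the turns where the target actually flipped.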
The uncomfortable result
Across models, the outcome is blunt:
- 100% of dialogues exhibit drift at least once
- Target changes occur in ~17% to >50% of turns
- More “reasoning-capable” models often drift more, not less
This is not noise. It is structural.
Findings — When intelligence destabilizes intention
The results expose a pattern that most practitioners instinctively feel but rarely formalize:
| Model Behavior | Outcome |
|---|---|
| Strong reasoning | Better surface coherence |
| Strong reasoning | Worse internal stability |
| Simpler tasks | Unexpectedly higher drift |
The explanation is almost ironic.
Reasoning models don’t just answer—they reinterpret.
In simple tasks, this leads to what the authors call overthinking:
- The model re-evaluates its own assumptions
- The internal “target” becomes fluid
- Consistency collapses
In other words: the model is not stubborn enough.
Humans call this indecisiveness. In AI, we engineered it as flexibility.
Findings — Training can help, but not enough
The paper explores a mitigation strategy: KL-regularized training.
Instead of only optimizing outputs, the model is penalized when its internal belief drifts from the initial state.
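The shape of that objective can be sketched in a few lines. This is a simplified scalar illustration of the idea (cross-entropy on outputs plus a KL penalty on belief drift), not the paper's training code; all numbers and the weighting parameter `lam` are hypothetical:

```python
import math

def cross_entropy(target_idx, probs):
    """Standard CE on the output distribution."""
    return -math.log(probs[target_idx] + 1e-12)

def kl(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

def regularized_loss(target_idx, output_probs,
                     belief_now, belief_initial, lam=1.0):
    """CE on outputs, plus a penalty whenever the current internal
    belief drifts from the initial one. lam=0 recovers CE-only."""
    return (cross_entropy(target_idx, output_probs)
            + lam * kl(belief_now, belief_initial))

# Same output quality, different amounts of belief drift (toy values).
stable  = regularized_loss(0, [0.8, 0.2], [0.7, 0.3], [0.7, 0.3])
drifted = regularized_loss(0, [0.8, 0.2], [0.3, 0.7], [0.7, 0.3])
# drifted > stable: identical outputs, but drift is now penalized.
```

The `lam` knob is exactly the trade-off the paper's results table reports: pure CE ignores drift, pure KL over-stabilizes, and the mix buys stability at some cost in accuracy.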
| Training Method | Drift Rate | Observation |
|---|---|---|
| Cross-Entropy only | High | No real improvement |
| KL only | Low | Strong stabilization |
| CE + KL | Moderate | Trade-off between accuracy and stability |
This is revealing.
Consistency is not an emergent property—it must be explicitly enforced.
Which implies something slightly unsettling:
Today’s LLMs do not naturally “hold beliefs.” They simulate them, loosely.
Implications — Why this breaks real-world systems
This is where things stop being academic.
1. Agentic workflows
In a multi-step process (e.g., financial analysis → execution → reporting):
- If the goal drifts midway, outputs remain locally valid
- But globally inconsistent
This creates silent process corruption.
2. Compliance and governance
Regulated environments assume:
- Stable intent
- Traceable reasoning
Implicit drift breaks both.
An AI can:
- Follow all rules
- Yet no longer pursue the original objective
Good luck auditing that.
3. Multi-agent systems
In coordinated systems:
- One agent drifting = system-wide misalignment
This is not a bug. It’s systemic fragility.
Strategic Interpretation — The missing layer in AI architecture
The paper hints at a deeper architectural gap.
Current LLM stacks optimize for:
- Token prediction
- Local coherence
But lack:
- Persistent state representation
- Goal anchoring mechanisms
- Explicit belief tracking
In practical terms, we are building agents without memory of intent.
Future systems will likely require:
| Missing Capability | Required Solution |
|---|---|
| Goal persistence | External state stores / memory layers |
| Belief stability | Regularized training objectives |
| Traceability | Explicit belief-state logging |
| Alignment over time | Long-horizon optimization, not turn-level |
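What an external goal anchor with belief-state logging could look like in practice, as a minimal sketch. This is an illustrative design following the table above, not an API from the paper; the class and field names are invented:

```python
import time
from dataclasses import dataclass, field

@dataclass
class GoalAnchor:
    """External store that pins the original objective and logs every
    probed belief, so drift becomes auditable instead of silent."""
    objective: str
    log: list = field(default_factory=list)

    def record(self, probed_belief: str) -> bool:
        """Log a probed belief; return True if it drifted from the goal."""
        drifted = probed_belief != self.objective
        self.log.append({"ts": time.time(),
                         "belief": probed_belief,
                         "drifted": drifted})
        return drifted

anchor = GoalAnchor(objective="summarize Q3 risk exposure")
anchor.record("summarize Q3 risk exposure")       # consistent turn
drift = anchor.record("draft Q3 earnings memo")   # drift: flag for review
# drift is True, and the full trace sits in anchor.log for auditors.
```

The point is architectural: intent lives outside the model, in a store the model cannot quietly rewrite, and every probe leaves a timestamped record that a compliance team can replay.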
This aligns neatly with what serious builders already observe: the frontier is no longer intelligence—it is control.
Conclusion — The quiet failure mode
The most dangerous failure is not when AI is wrong.
It’s when it is consistently wrong about what it is trying to do.
This paper doesn’t just identify a limitation—it reframes the problem space:
Reliability in AI is not about answers. It is about continuity of intention.
Until LLMs can hold onto a goal the way a human does, every “agent” is, at best, improvising.
At scale, improvisation becomes risk.
And risk, as always, compounds.
Cognaptus: Automate the Present, Incubate the Future.