Opening — Why This Matters Now

Large language models now behave like overeager junior analysts: they think harder, write longer, and try very hard to sound more certain than they should. Iterative reasoning techniques—Chain-of-Thought, Debate, and the new wave of inference-time scaling—promise deeper logic and better truth-seeking. Yet the empirical reality is more awkward: the more these models “reason,” the more they entrench their initial assumptions. The result is polished but stubborn outputs that deviate from Bayesian rationality.

A new paper, Martingale Score: An Unsupervised Metric for Bayesian Rationality in LLM Reasoning, cuts through the noise with a statistical lens. Instead of asking whether the model’s reasoning sounds right, it asks whether the reasoning process behaves rationally, using the Martingale property as a yardstick for truth-seeking.

Background — Confirmation Bias: Not Just a Human Problem

Belief entrenchment is a close cousin of human confirmation bias: the tendency to selectively interpret information in a way that reinforces prior beliefs. The paper points out that LLMs inherit this through their reasoning dynamics. Across domains where new evidence should update beliefs—forecasting future events, evaluating arguments, reviewing research—models still drift toward their priors.

Traditionally, reasoning quality in AI is measured through outcomes (accuracy, Brier scores, rubric-based judgments). But this outcome-only approach hides the process itself. A model can arrive at the right answer for the wrong reasons—or worse, overfit to spurious patterns simply because they resemble confident human writing.

The Martingale property offers a cleaner, process-level benchmark: your current belief should not predict the direction of your next belief update. If it does, you’re not learning from evidence—you’re just reinforcing yourself.
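
Formally, the Martingale property says that the best forecast of the next belief is the current belief. In standard probability notation (a textbook statement of the property, not an equation quoted from the paper):

$$ E[b_{t+1} \mid b_1, \dots, b_t] = b_t \quad\Longleftrightarrow\quad E[b_{t+1} - b_t \mid b_1, \dots, b_t] = 0 $$

Any systematic dependence of the next update on the current belief violates this condition.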

Analysis — What the Paper Actually Does

The authors operationalize belief entrenchment using a simple regression:

$$ \Delta b = b_{\text{posterior}} - b_{\text{prior}} = \beta_0 + \beta_1\, b_{\text{prior}} + \varepsilon $$

Under Bayesian rationality, the Martingale property demands:

$$ E[\Delta b \mid b_{\text{prior}}] = 0 $$

Meaning: no predictable drift from prior beliefs.

So β₁ ≈ 0 is the hallmark of rational updating. A positive β₁? Entrenchment.
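
As a concrete illustration, here is a minimal sketch of how that regression could be estimated from logged (prior, posterior) belief pairs. The function name, the plain OLS estimator, and the toy data are illustrative assumptions, not the paper’s exact implementation.

```python
import numpy as np

def martingale_score(prior_beliefs, posterior_beliefs):
    """Estimate beta_1 in Delta_b = beta_0 + beta_1 * b_prior + eps via ordinary least squares.

    A coefficient near zero is consistent with Martingale-style (Bayesian) updating;
    a positive coefficient indicates belief entrenchment. Illustrative sketch only.
    """
    b_prior = np.asarray(prior_beliefs, dtype=float)
    delta_b = np.asarray(posterior_beliefs, dtype=float) - b_prior

    # Design matrix: intercept column plus the prior belief.
    X = np.column_stack([np.ones_like(b_prior), b_prior])
    _beta0, beta1 = np.linalg.lstsq(X, delta_b, rcond=None)[0]
    return beta1

# Toy example: updates that drift in the direction of the prior yield a positive score.
priors = [0.2, 0.4, 0.6, 0.8]
posteriors = [0.15, 0.38, 0.65, 0.90]
print(round(martingale_score(priors, posteriors), 2))  # 0.26
```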

The paper evaluates:

  • Reasoning modes: CoT vs. Debate
  • Domains: Forecasting, r/ChangeMyView, OpenReview
  • Models: GPT-4o, DeepSeek R1/V3, Gemini 2.0 Flash, Llama 4
  • Prompting styles: Prior-conforming, no prompt, critical-thinking

Across this full grid, belief entrenchment appears pervasively in LLM reasoning.

Key Observation

CoT magnifies entrenchment. Debate mitigates it.

Debate produces smaller and less predictable belief updates. CoT—ironically the poster child for LLM reasoning—systematically amplifies its own priors.

Findings — What the Data Shows

Based on Table 1 in the paper, CoT exhibits positive Martingale Scores in 51 out of 54 cases, meaning belief entrenchment is not an edge case; it’s structural.

A simplified visualization of the patterns:

Table: Entrenchment Across Domains (Lower is Better)

| Domain | Avg. Martingale Score (CoT) | Interpretation |
| --- | --- | --- |
| Forecasting | ~0.037 | Still entrenched, but less severe |
| r/ChangeMyView | ~0.103 | Highest entrenchment—models harden opinions even when evidence should challenge them |
| OpenReview | ~0.086 | Similar to value-laden domains; subjective reasoning deepens bias |

The most striking relationship appears when mapping Martingale Scores to accuracy.

Chart: The Higher the Entrenchment, the Worse the Accuracy

Models with even mild entrenchment (e.g., a Martingale Score around 0.04) achieve worse forecasting Brier scores than random guessing. The paper visualizes this correlation clearly on page 9.
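
For context, the Brier score is the mean squared error between forecast probabilities and realized 0/1 outcomes, so lower is better, and always predicting 0.5 yields the uninformative baseline of 0.25. A minimal sketch (the helper name is mine, not the paper’s):

```python
import numpy as np

def brier_score(forecast_probs, outcomes):
    """Mean squared error between probability forecasts and binary outcomes (lower is better)."""
    p = np.asarray(forecast_probs, dtype=float)
    y = np.asarray(outcomes, dtype=float)
    return float(np.mean((p - y) ** 2))

# Always forecasting 0.5 gives the uninformative-baseline score of 0.25.
print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 1]))  # 0.25
```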

Additional Patterns

  • Debate reduces entrenchment across all models.
  • DeepSeek R1 is consistently the least entrenched model.
  • Prior-conforming prompts inflate entrenchment—but critical-thinking prompts only mildly reduce it.

Implications — Why This Matters for Business, Regulation, and Real-World Use

1. Business AI tools may sound rational but update irrationally.

Reasoning-heavy workflows (analysis, forecasting, complex planning, compliance reviews) may embed subtle but systematic self-reinforcement. Models that “explain their reasoning” may be the worst offenders.

2. Process-based evaluation is becoming essential.

Outcome accuracy alone no longer guarantees model robustness. Enterprises need internal metrics—like the Martingale Score—to detect when AI-based reasoning pipelines drift from evidence.

3. Regulators may soon look beyond outputs to reasoning dynamics.

If entrenchment is structural, it raises governance concerns for AI-assisted decision-making, especially in:

  • healthcare,
  • financial advice,
  • legal recommendations,
  • automated assessments.

4. Agentic AI must solve this before autonomy becomes safe.

The rise of inference-time scaling and autonomous agents means models increasingly act on their own reasoning loops. If those loops reinforce priors rather than evidence, the risk compounds.

Conclusion — A Metric for the Next Phase of AI Reasoning

The Martingale Score is refreshingly simple yet conceptually rigorous. It exposes a subtle but pervasive flaw: LLMs don’t naturally behave like Bayesian reasoners, even when they produce elegant reasoning chains.

The next wave of AI evaluation—and AI safety—will require tools that look inside the reasoning process, not just at the final answer. Martingale Score is a clean, domain-agnostic step in that direction.

Cognaptus: Automate the Present, Incubate the Future.