Meetings have a familiar failure mode. Someone states an early opinion, then spends the next thirty minutes “thinking through the issue” in a way that somehow makes the original opinion look increasingly inevitable. Evidence enters the room. Counterarguments are acknowledged. The conclusion remains suspiciously loyal to the opening bid.
Apparently, large language models have been attending the same meetings.
The paper “Martingale Score: An Unsupervised Metric for Bayesian Rationality in LLM Reasoning” studies a deceptively simple question: when an LLM reasons step by step, is it actually updating its belief from evidence, or is it dressing up its first instinct in a nicer suit? [1] The authors call the failure mode belief entrenchment: a tendency for later beliefs to move in a direction that is predictable from the model’s earlier belief, rather than from new information.
That wording sounds abstract. The operational idea is not. A rational Bayesian reasoner should not have a systematic update direction that can be predicted from its current belief alone. If the model begins at 70% confidence, its expected update should be zero, not systematically upward. It may move up after strong supporting evidence. It may move down after contradictory evidence. But before seeing the next evidence, the prior itself should not already tell us the average direction of travel.
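To make the principle concrete, here is a small simulation sketch. It is not taken from the paper: the two-hypothesis Gaussian evidence model, the function names, and the parameter values are all assumptions chosen for illustration. A calibrated Bayesian reasoner that starts at 0.7 moves up on some trials and down on others, but its average update should sit near zero.

```python
import math
import random


def bayes_posterior(prior, evidence, shift=0.5):
    """P(H=1 | evidence) when evidence ~ Normal(+shift, 1) under H=1
    and Normal(-shift, 1) under H=0 (an assumed toy evidence model)."""
    # For these two unit-variance Gaussians the likelihood ratio
    # simplifies to exp(2 * shift * evidence).
    likelihood_ratio = math.exp(2.0 * shift * evidence)
    return prior * likelihood_ratio / (prior * likelihood_ratio + (1.0 - prior))


def simulate_updates(prior, trials, shift=0.5, seed=0):
    """Belief updates (posterior - prior) of a calibrated Bayesian reasoner."""
    rng = random.Random(seed)
    updates = []
    for _ in range(trials):
        # Calibration assumption: the hypothesis is true with probability `prior`.
        hypothesis_true = rng.random() < prior
        evidence = rng.gauss(shift if hypothesis_true else -shift, 1.0)
        updates.append(bayes_posterior(prior, evidence, shift) - prior)
    return updates


updates = simulate_updates(prior=0.7, trials=100_000)
mean_update = sum(updates) / len(updates)
print(f"largest downward update: {min(updates):+.3f}")
print(f"largest upward update:   {max(updates):+.3f}")
print(f"mean update:             {mean_update:+.4f}")  # expected to be close to zero
```

Individual updates can be large in either direction; only their average, conditional on the prior, is pinned down by the Bayesian property.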
The paper turns that principle into a metric: Martingale Score. The contribution is not another leaderboard where Model A beats Model B by a festive decimal point. The useful move is more structural: it gives evaluators a process-level diagnostic for reasoning. Not “was the final answer right?” but “did the reasoning process behave like evidence-sensitive updating, or like confirmation bias with punctuation?”
That matters because many business uses of LLMs are migrating from short answers to longer reasoning loops: forecasts, research assistants, review agents, dispute mediation, due diligence, policy analysis, customer-risk assessment, and internal decision support. Longer reasoning is often sold as safer reasoning. This paper suggests a less comforting possibility: longer reasoning can amplify the first bad idea.
The mechanism is simple: a good reasoner should not have a predictable update direction
The paper borrows its core test from the Martingale property in Bayesian statistics. In plain language, a Bayesian reasoner’s expected future belief, conditional on its current belief, should equal the current belief. The current belief is already the best summary of what the reasoner knows so far. If later evidence is unknown, the current belief alone should not predict whether the reasoner will become more confident or less confident.
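The paper expresses this as:

$$
\mathbb{E}\left[\Delta b \mid b_{\text{prior}}\right] = 0
$$

where $b_{\text{prior}}$ is the model’s earlier expressed belief and $\Delta b = b_{\text{posterior}} - b_{\text{prior}}$ is the later belief update.

The authors then estimate a simple regression:

$$
\Delta b = \beta_0 + \beta_1 \, b_{\text{prior}} + \varepsilon
$$

The Martingale Score is the OLS estimate of $\beta_1$:

$$
M = \hat{\beta}_1
$$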
If $M$ is close to zero, the model’s prior belief does not systematically predict the direction or size of its update. If $M$ is positive, higher prior belief predicts a more positive update. That is the statistical smell of entrenchment: the model’s reasoning is not merely starting from a belief; it is being pulled by it.
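As a minimal sketch of that computation, the slope can be obtained with ordinary least squares on pairs of prior and posterior beliefs. The function name and the toy belief values below are hypothetical; in the paper's setup the beliefs would come from a judge scoring the model's reasoning steps, which this sketch does not reproduce.

```python
def martingale_score(priors, posteriors):
    """OLS slope of the update (posterior - prior) regressed on the prior,
    with an intercept. Hypothetical helper, not the authors' code."""
    if len(priors) != len(posteriors) or len(priors) < 2:
        raise ValueError("need two equally long lists with at least two beliefs")
    updates = [post - pri for pri, post in zip(priors, posteriors)]
    n = len(priors)
    mean_prior = sum(priors) / n
    mean_update = sum(updates) / n
    sxx = sum((p - mean_prior) ** 2 for p in priors)
    if sxx == 0:
        raise ValueError("priors must not all be identical")
    sxy = sum((p - mean_prior) * (u - mean_update) for p, u in zip(priors, updates))
    return sxy / sxx


priors = [0.2, 0.4, 0.6, 0.8]

# Toy entrenched reasoner: every belief drifts further away from 0.5.
entrenched = [0.14, 0.38, 0.62, 0.86]
# Toy reasoner whose updates are unrelated to where it started.
unanchored = [0.25, 0.35, 0.55, 0.85]

print(f"entrenched: {martingale_score(priors, entrenched):.3f}")
print(f"unanchored: {martingale_score(priors, unanchored):.3f}")
```

In the first toy series the update grows with the prior, so the slope is positive; in the second the updates cancel out against the prior and the slope is zero.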
This is a neat metric because it avoids needing ground truth in every case. In forecasting, we can eventually check whether the prediction was right. In peer review, value judgments, legal reasoning, moderation, or strategic advice, “truth” may be delayed, noisy, contested, or not cleanly observable. Martingale Score does not solve that whole mess. Nothing does. But it offers a way to inspect whether the reasoning process is overly anchored to its own starting position.
That distinction is the paper’s central business value. In many enterprise AI systems, final-answer evaluation is expensive, late, or impossible. Process diagnostics are cheaper and earlier. They are not substitutes for outcome evaluation, but they catch a different kind of failure: not whether the answer is wrong, but whether the reasoning process is becoming self-protective.
The paper measures expressed belief, not the model’s secret inner soul
The authors are careful about the word “belief.” They do not claim that LLMs have beliefs in the human psychological sense. Good. The last thing AI evaluation needs is a tiny philosophy seminar hiding inside every dashboard.
Instead, they measure expressed belief: the probability implied by the model’s output at different reasoning stages. A separate LLM judge (GPT-4o in the main setup) assigns belief scores to the model’s reasoning steps. The first output is treated as a prior belief; later or final outputs are treated as posterior beliefs.
This design has a practical advantage. Users do not interact with a model’s hidden activations; they interact with what the model says. If a model expresses rising confidence in its first view while appearing to reason carefully, that is exactly the user-facing risk. The user sees “analysis.” The system may be producing “rationalized inertia.”
There is also an obvious methodological risk: if the judge is biased, the metric could be contaminated. The authors address this with cross-judge and human-judge checks. They compare GPT-4o’s belief evaluations with other LLM judges and two human evaluators. The reported correlations with GPT-4o are high and statistically significant: Human Evaluator 1 has Pearson $r = 0.8822$ and Spearman $\rho = 0.8770$; DeepSeek-v3 has Pearson $r = 0.7774$ and Spearman $\rho = 0.7620$; GPT-4.1-mini and Gemini-2.5-pro also show strong positive agreement; Human Evaluator 2 shows Pearson $r = 0.7152$ and Spearman $\rho = 0.6812$.
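For readers who want to run the same kind of agreement check on their own judges, the sketch below computes Pearson and Spearman correlations between two sets of belief scores. The scores are made-up placeholders, not data from the paper, and the helper names are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder belief scores from two hypothetical judges on the same five steps.
judge_a = [0.10, 0.35, 0.50, 0.70, 0.90]
judge_b = [0.20, 0.30, 0.65, 0.60, 0.85]

print(f"Pearson r:    {pearson(judge_a, judge_b):.3f}")
print(f"Spearman rho: {spearman(judge_a, judge_b):.3f}")
```

Pearson rewards agreement on the actual belief values, while Spearman only asks whether the two judges order the steps the same way, which is why reporting both is informative.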
This does not prove that every belief score is perfect. It does support a narrower claim: the main results are unlikely to be just an artifact of one idiosyncratic judge model.
For business evaluation, that is the right level of confidence. You do not need metaphysical access to model belief. You need a repeatable signal that tracks whether the system’s expressed reasoning behaves dangerously. Martingale Score is trying to be that signal.
The experiment is built around situations where reasoning should update
The paper tests three domains:
| Domain | Why it matters for the metric | What it resembles in business use |
|---|---|---|
| Forecasting | Ground truth eventually becomes available, and good performance requires calibrated updates under uncertainty | Market outlook, risk forecasting, demand prediction, incident probability |
| r/ChangeMyView | Value-laden questions expose whether models revise opinions after counterarguments | Moderation, policy advice, negotiation support, customer dispute handling |
| OpenReview | Academic paper review includes initial impressions, reviews, rebuttals, and final decisions | Due diligence, expert review, grant screening, procurement evaluation |
The authors also compare reasoning modes and prompts. They test Chain-of-Thought and debate-style reasoning, across models including GPT-4o, DeepSeek R1, DeepSeek V3, Gemini 2.0 Flash, Llama 4 Scout, and Llama 4 Maverick. They use three prompt conditions: no system prompt, a prior-conforming prompt, and a critical-thinking prompt.
The prior-conforming prompt is deliberately bad: it tells the model to emphasize arguments in favor of its existing belief and avoid critical reflection. That works as a sanity check. If the metric cannot detect entrenchment there, we should politely escort it out of the building.
The critical-thinking prompt does the opposite: it tells the model to consider that it may be wrong and be cautious about reinforcing existing beliefs. This tests a popular assumption in prompt engineering: perhaps a little “be critical” instruction is enough to reduce reasoning bias.
Spoiler: sometimes yes, often not enough.
The main result: Chain-of-Thought is frequently entrenched
The headline empirical result is broad: belief entrenchment appears across most tested setups. In the paper’s Table 1, almost all Chain-of-Thought setups show positive Martingale Scores: 51 out of 54.
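The figure of 54 is consistent with crossing the six models, three domains, and three prompt conditions listed earlier. Assuming that is how the Chain-of-Thought setups are counted (an inference from the numbers in this article, not a statement from the paper), the grid can be enumerated as follows.

```python
from itertools import product

models = [
    "GPT-4o", "DeepSeek R1", "DeepSeek V3",
    "Gemini 2.0 Flash", "Llama 4 Scout", "Llama 4 Maverick",
]
domains = ["Forecasting", "r/ChangeMyView", "OpenReview"]
prompts = ["no system prompt", "prior-conforming", "critical-thinking"]

# Assumed full factorial grid of Chain-of-Thought setups.
cot_setups = list(product(models, domains, prompts))
print(len(cot_setups))  # 54

# Share of setups reported as having a positive Martingale Score.
positive_share = 51 / len(cot_setups)
print(f"{positive_share:.1%}")  # 94.4%
```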
This is important because Chain-of-Thought is often treated as the normal remedy for shallow answers. Ask the model to reason step by step, and it should become more careful. The paper complicates that assumption. Step-by-step reasoning may expose more intermediate structure, but it may also create more surface area for rationalization.
The magnitude varies by domain. Forecasting shows lower entrenchment than judgment-heavy domains. r/ChangeMyView is consistently worse than forecasting under matched model and prompt conditions. OpenReview is also severe in many setups. The paper reports average CoT Martingale Scores with 95% confidence intervals of:
| CoT domain | Reported average Martingale Score |
|---|---|
| Forecasting | $0.037 \pm 0.011$ |
| r/ChangeMyView | $0.103 \pm 0.013$ |
| OpenReview | $0.086 \pm 0.012$ |
The interpretation is not that a score of 0.103 means “the model is 10.3% biased” in a casual sense. It means that, per unit increase in the prior belief, the belief update increases by 0.103 in the regression setup. Since the paper measures this per reasoning step, small-looking coefficients can compound across a longer reasoning trajectory.
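A rough worked reading of the r/ChangeMyView coefficient, offered as an illustration under the linear regression model rather than a figure from the paper: compare two runs whose priors differ by 0.5. The expected updates differ by

$$
\mathbb{E}[\Delta b \mid b_{\text{prior}} = 0.8] - \mathbb{E}[\Delta b \mid b_{\text{prior}} = 0.3] = 0.103 \times (0.8 - 0.3) \approx 0.052
$$

If one further assumes that the same slope applies unchanged at every step and ignores the fact that beliefs are bounded in $[0, 1]$, the gap between two belief trajectories grows by a factor of $1 + 0.103$ per step, so after five steps

$$
(1 + 0.103)^{5} \approx 1.63
$$

Under those simplifying assumptions, an initial gap of 0.10 between two trajectories would widen to roughly 0.16 with no help from the evidence.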
This is where the mechanism-first view matters. The risk is not only that the final answer is wrong. The risk is that the reasoning path creates a confidence-building loop. A model starts with a leaning, generates reasons, then uses those reasons to become more confident in the leaning. The prose looks disciplined. The statistical structure looks less noble.
Debate helps, but it is not magic foam for every AI fire
The paper also tests debate, where two clones of the same model argue opposing sides. Debate tends to reduce belief entrenchment compared with Chain-of-Thought. The appendix regression analysis identifies debate as a mitigating factor, and the authors note that debate produces smaller and more conservative belief updates.
This is plausible. Debate forces the system to represent opposition explicitly. Instead of a single reasoning stream quietly defending its first instinct, the model must process a counter-position. That makes it harder, though not impossible, to slide into one-directional rationalization.
For business use, debate-style architectures deserve attention. They map naturally to review workflows: investment committee memos, contract-risk review, compliance exceptions, product launch decisions, and incident postmortems. A single “think step by step” agent may be too loyal to its first interpretation. A structured opposition process may reduce that loyalty.
But the result should not be oversold. Debate mitigates entrenchment; it does not eliminate it. Some debate scores remain positive, and some are large. The operational lesson is not “use debate and relax.” It is “if the task is judgment-heavy, design the workflow so that counterarguments are structurally represented, then measure whether the process actually updates.”
In other words: governance by architecture, then governance by metric. Not vibes.
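One way to wire those two ideas together is sketched below. This is not the paper's debate protocol: the `Debater` and `Judge` interfaces, the round structure, and the stub functions are all hypothetical stand-ins for real model calls. The point is only that a debate loop can emit (prior, posterior) belief pairs that a process metric can consume afterwards.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces: in a real system these would wrap model calls.
Debater = Callable[[str, str, List[str]], str]  # (side, question, transcript) -> argument
Judge = Callable[[str, List[str]], float]       # (question, transcript) -> belief in [0, 1]


def run_debate(question: str, debater: Debater, judge: Judge,
               rounds: int = 2) -> List[Tuple[float, float]]:
    """Run a pro/con debate and return one (prior, posterior) belief pair per round."""
    transcript: List[str] = []
    beliefs = [judge(question, transcript)]  # belief before any argument is heard
    for _ in range(rounds):
        for side in ("pro", "con"):  # both sides speak in every round
            transcript.append(f"[{side}] {debater(side, question, transcript)}")
        beliefs.append(judge(question, transcript))
    return list(zip(beliefs[:-1], beliefs[1:]))


# Stand-in callables so the sketch runs without any model API.
def stub_debater(side, question, transcript):
    return f"argument {len(transcript) + 1} for the {side} side"


def stub_judge(question, transcript):
    # Placeholder rule: belief drifts toward 0.5 as arguments accumulate.
    return 0.5 + 0.2 / (1 + len(transcript))


pairs = run_debate("Will the launch slip past Q3?", stub_debater, stub_judge)
print(pairs)
```

Collecting such pairs across many questions is what makes it possible to check whether the architecture actually changes update behavior, rather than assuming it does.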
Critical-thinking prompts help forecasting more than judgment-heavy tasks
The critical-thinking prompt is one of the most practically interesting tests because it resembles what many teams already do. They add instructions like “consider alternative viewpoints,” “be objective,” or “do not be biased.” It feels cheap, elegant, and morally satisfying. Naturally, reality is less cooperative.
The paper finds that the critical-thinking prompt reduces belief entrenchment in forecasting, while the prior-conforming prompt increases it. That is encouraging, but limited. In r/ChangeMyView and OpenReview, the same clean prompt pattern does not reliably appear. The authors report that the overall difference between critical-thinking and no-prompt conditions is small and statistically insignificant, while the prior-conforming prompt is much worse than both.
This creates a more realistic prompt-engineering lesson. A bad prompt can clearly damage reasoning. A good prompt may not reliably fix it.
That asymmetry matters for enterprise deployment. Many organizations treat system prompts as behavioral policy. The prompt says the agent should be balanced, careful, skeptical, and evidence-driven. Fine. But if the underlying process remains prior-anchored in judgment-heavy settings, the prompt is a label on the machine, not a guarantee about the machine.
The better pattern is to use prompts as one control among several: structured counterargument generation, evidence separation, confidence tracking, independent review agents, and process metrics such as Martingale Score.
The accuracy result is strongest where ground truth is clean
The paper’s strongest validation comes from forecasting, where outcomes eventually resolve and probabilistic accuracy can be measured with Brier Score, the mean squared error between forecast probabilities and realized 0/1 outcomes. Lower Brier Score is better; an uninformative forecast of 50% on every binary question scores exactly 0.25.
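A minimal sketch of the Brier Score for binary questions follows. The forecasts and outcomes are invented for illustration, and the function name is an assumption.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need two non-empty lists of equal length")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


# Invented resolutions for six binary questions (1 = event happened).
outcomes = [1, 0, 1, 1, 0, 0]

uninformative = [0.5] * len(outcomes)              # always says 50%
informed = [0.8, 0.2, 0.7, 0.9, 0.1, 0.3]          # leans the right way
confidently_wrong = [0.2, 0.8, 0.3, 0.1, 0.9, 0.7]  # leans the wrong way

baseline = brier_score(uninformative, outcomes)
print(f"uninformative:     {baseline:.3f}")
print(f"informed:          {brier_score(informed, outcomes):.3f}")
print(f"confidently wrong: {brier_score(confidently_wrong, outcomes):.3f}")
```

The third case shows how a forecaster can score worse than the 0.25 baseline: being confident in the wrong direction is penalized more heavily than admitting ignorance.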
The authors find a positive relationship between the absolute Martingale Score and Brier Score: more predictable belief updating is associated with worse prediction accuracy. In the paper’s CoT forecasting example, a Martingale Score of 0.0 corresponds to a Brier Score below 0.25, while a Martingale Score of 0.04 corresponds to a Brier Score above 0.25, which is worse than random guessing.
That is the bridge from “interesting process metric” to “possible operational quality signal.” If Martingale Score were only a beautiful statistical object, it would be academically pleasant and commercially sleepy. The link to Brier Score suggests that belief entrenchment is not merely a weird behavioral artifact; it can degrade task performance.
The regression analysis in Figure 5 adds another layer. The authors predict Brier Score using the absolute Martingale Score while controlling for domain, reasoning method, model, and prompt in different specifications. The coefficient becomes significant in the richer specifications that include combinations of domain, reasoning mode, model, or prompt controls. Reported $p$-values include $0.011$, $0.043$, and $0.016$ in the significant controlled models, while simpler specifications are weaker or marginal.
This matters because the metric is not just detecting “some models are worse.” The controlled analysis suggests Martingale Score can explain accuracy variation beyond obvious setup differences, though the amount of explained variance is modest. The reported incremental $R^2$ values range from about 0.011 to 0.067 across the shown specifications.
That is not a miracle. It is useful precisely because it is not pretending to be one. A process metric that explains a modest but meaningful slice of downstream accuracy, before or without clean labels, is valuable in domains where evaluation is otherwise painfully delayed.
OpenReview is the cautionary boundary, not a failed section
The paper includes OpenReview because academic review is a realistic judgment-heavy task: initial abstract, reviewer comments, rebuttal, final decision. It resembles many business workflows where early impressions meet later evidence.
But the authors explicitly state a limitation: in OpenReview, they do not demonstrate a correlation between Martingale Score and Brier Score. They suggest one reason is that final acceptance decisions are community-voted and noisy. That interpretation is plausible. Peer review outcomes are not pure ground truth; they are institutional decisions under uncertainty. Anyone who has met a review process knows this, usually with the thousand-yard stare of lived experience.
So OpenReview should not be read as “Martingale Score predicts review accuracy.” It should be read as “belief entrenchment appears in a judgment-heavy review-like domain, but the paper does not validate the accuracy link there.”
That boundary is important for business adoption. In forecasting, Martingale Score has a clearer validation path. In judgment-heavy domains, it is better treated as a risk indicator rather than a proven accuracy proxy. A high score says: this reasoning process appears unusually anchored to its prior. It does not automatically say: the final decision is wrong.
That distinction prevents overreach. It also keeps the metric useful. Many governance signals are not truth machines. They are triage tools.
A practical evaluation stack for businesses
Cognaptus’ interpretation is straightforward: Martingale Score is most useful as part of an AI reasoning QA stack, especially for systems that produce multi-step judgments.
A practical deployment would not ask, “What is the Martingale Score of our model?” in isolation. It would ask where the score belongs in the operating process.
| Layer | What the paper directly supports | Business use | Boundary |
|---|---|---|---|
| Process diagnosis | Prior belief predicts later updates in many LLM reasoning setups | Flag self-reinforcing reasoning before final outcomes are known | Measures expressed belief, not internal belief |
| Forecast evaluation | Higher absolute Martingale Score is associated with worse Brier accuracy in forecasting | Use as an early warning signal in probabilistic prediction workflows | Strongest where ground truth eventually resolves cleanly |
| Judgment-heavy review | Entrenchment is high in r/ChangeMyView and OpenReview-like tasks | Add counterargument, debate, and escalation rules for subjective decisions | Accuracy linkage is not demonstrated for OpenReview |
| Prompt governance | Prior-conforming prompts worsen entrenchment; critical prompts are not universally corrective | Audit prompts for hidden prior reinforcement | “Be critical” is not a sufficient control |
| Agent architecture | Debate tends to mitigate entrenchment relative to CoT | Use adversarial review or red-team agents in high-stakes workflows | Debate reduces risk; it does not certify truth |
The most immediate business use is not training. It is monitoring.
For example, an AI market analyst might produce an initial probability for an event, then reason through new evidence. A Martingale-style monitor could track whether high initial confidence systematically leads to upward updates, regardless of evidence. A compliance-review assistant could be tested across synthetic and historical cases to see whether early risk labels dominate later analysis. A research-review agent could be evaluated for whether it becomes too attached to its first impression after reading the abstract.
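A Martingale-style monitor of that kind could look roughly like the sketch below. The class name, the alert rule (lower confidence bound above zero, using a normal approximation), and the synthetic agent are all assumptions made for illustration; the paper does not prescribe a monitoring design or an alert threshold.

```python
import math
import random


class EntrenchmentMonitor:
    """Accumulates (prior, posterior) belief pairs and reports a Martingale-style slope."""

    def __init__(self):
        self.priors = []
        self.updates = []

    def record(self, prior, posterior):
        self.priors.append(prior)
        self.updates.append(posterior - prior)

    def report(self, z=1.96):
        n = len(self.priors)
        if n < 3:
            raise ValueError("need at least three belief pairs")
        mean_p = sum(self.priors) / n
        mean_u = sum(self.updates) / n
        sxx = sum((p - mean_p) ** 2 for p in self.priors)
        if sxx == 0:
            raise ValueError("priors must not all be identical")
        sxy = sum((p - mean_p) * (u - mean_u) for p, u in zip(self.priors, self.updates))
        slope = sxy / sxx
        intercept = mean_u - slope * mean_p
        residual_ss = sum(
            (u - intercept - slope * p) ** 2 for p, u in zip(self.priors, self.updates)
        )
        std_error = math.sqrt(residual_ss / (n - 2) / sxx)
        # Assumed alert rule: flag when the lower confidence bound is above zero.
        return {"slope": slope, "std_error": std_error, "flagged": slope - z * std_error > 0}


rng = random.Random(0)
monitor = EntrenchmentMonitor()
for _ in range(500):
    prior = rng.uniform(0.2, 0.8)
    # Synthetic entrenched agent: beliefs drift away from 0.5 in the prior's direction.
    posterior = prior + 0.1 * (prior - 0.5) + rng.gauss(0.0, 0.03)
    monitor.record(prior, min(1.0, max(0.0, posterior)))

report = monitor.report()
print(report)
```

Reporting a standard error alongside the slope matters in practice: with only a handful of cases, a positive slope can easily be noise, so the flag should depend on the uncertainty and not just the point estimate.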
This is not glamorous. It is better than glamorous. It is measurable.
The real misconception: more reasoning is not always more truth-seeking
The reader misconception worth correcting is simple: longer reasoning is often treated as an automatic upgrade. Chain-of-Thought, multi-turn reflection, self-critique, and agentic loops all create the impression that the model is “thinking harder.”
But thinking harder and updating better are different things.
The paper’s mechanism reveals the gap. A model can generate more reasoning while becoming more predictable from its first belief. It can produce a long explanation that feels careful while statistically reinforcing its starting point. It can sound open-minded while the update process remains anchored.
This is especially relevant as AI products move toward agentic workflows. Agents do not merely answer. They plan, inspect, revise, debate, search, summarize, and recommend. Each additional step creates another opportunity for useful updating—or for elegant self-confirmation.
The problem is not that reasoning is bad. The problem is that reasoning needs evaluation as a process. “The agent produced a detailed rationale” is not a quality signal. Sometimes it is a warning label with bullet points.
What remains uncertain
The paper is valuable, but its boundaries matter.
First, Martingale Score measures expressed belief using judge evaluations. It does not reveal internal model belief. For user-facing applications, expressed belief is highly relevant; for mechanistic interpretability, it is not the whole story.
Second, the strongest accuracy validation is in forecasting. The paper reports an association between Martingale Score and Brier accuracy where ground truth is clean. It does not establish the same accuracy relationship for OpenReview.
Third, the study focuses on internal reasoning processes more than external evidence-search agents. Many commercial systems now combine retrieval, browsing, tool use, and multi-agent planning. Whether Martingale Score behaves similarly when agents actively search for evidence remains an open question.
Fourth, the paper does not systematically study reinforced reasoning despite its growing popularity. That is not a minor footnote. If inference-time reinforcement changes update behavior, Martingale Score may become even more important—or need adaptation.
Fifth, turning Martingale Score into a training objective is still future work. The paper proposes that possibility, but does not show a trained model becoming both lower in Martingale Score and better in downstream accuracy across domains.
These boundaries do not weaken the paper’s central insight. They prevent the usual AI-industry disease: taking a useful diagnostic and immediately declaring it a cure.
The business lesson is to audit the reasoning path, not admire the final paragraph
The paper’s best idea is not that LLMs have confirmation bias. We already suspected that. The sharper idea is that belief entrenchment can be expressed as a violation of a Bayesian process property, then measured without always needing ground truth.
That changes the evaluation conversation.
For low-stakes tasks, final-answer checks may be enough. For high-stakes reasoning systems, especially those used in forecasting, review, policy, risk, or dispute-heavy settings, companies should ask a different question: does the model update like evidence matters, or does it update like its first answer matters?
Martingale Score offers one way to ask that question systematically. It will not replace accuracy tests, human review, or domain-specific validation. But it can expose a failure mode that normal evaluation often misses: the system may be wrong not because it lacked reasoning, but because its reasoning got stuck on repeat.
And in business, as in meetings, the most dangerous bad idea is not the one stated confidently at the beginning. It is the one that survives thirty minutes of “careful analysis” and comes out looking inevitable.
Cognaptus: Automate the Present, Incubate the Future.
[1] Zhonghao He, Tianyi Qiu, Hirokazu Shirado, and Maarten Sap, “Martingale Score: An Unsupervised Metric for Bayesian Rationality in LLM Reasoning,” arXiv:2512.02914, 2025. https://arxiv.org/abs/2512.02914