Conformal Thinking: Teaching LLMs When to Stop Thinking

Thinking is not free.

That sentence should not need explaining to anyone who has paid an inference bill, waited for a reasoning model to finish its theatrical inner monologue, or watched an AI agent spend half its budget trying to solve a task it was never going to solve. Reasoning models have become better at using more tokens. They have not automatically become better at knowing when more tokens have stopped helping.

This is the practical irritation behind Conformal Thinking: Risk Control for Reasoning on a Compute Budget.¹ The paper is not merely another attempt to make chain-of-thought shorter. Its more useful move is to reframe adaptive reasoning as risk management. Instead of asking, “How many tokens should we allow?” or “What confidence threshold feels safe?”, it asks a cleaner operational question: “How much error risk are we willing to tolerate when we stop early?”

That sounds modest. It is not. It changes the control interface from an opaque technical knob to a business-facing tolerance. The model may still be uncertain. The signal may still be imperfect. The validation set may still be annoyingly finite, because reality rarely ships with infinite calibration data. But the decision is no longer hidden inside a threshold whose value means one thing for one model, another thing for another model, and absolutely nothing to the finance team.

The common misconception is that longer reasoning is always safer. The paper’s answer is more inconvenient: longer reasoning is sometimes useful, sometimes wasteful, and sometimes simply a polite way of burning compute while pretending progress is imminent.

Early stopping has two failure modes, not one

Most early-stopping work around reasoning models begins with a sensible idea: monitor the model’s confidence or uncertainty as it reasons, and stop once the model seems confident enough. In the paper’s terms, this is an upper-threshold rule. The reasoning trace grows step by step; an uncertainty or confidence signal is computed at intermediate points; once the transformed signal crosses a threshold, the system stops and asks for the final answer.

This handles one obvious waste pattern: the model has already reached a good answer, but keeps talking. Anyone who has used modern reasoning models has seen this. The model finds the answer, circles the airport three more times, and then lands with great ceremony.

But upper-threshold stopping misses a second waste pattern. Some problems do not become more solvable just because the model is allowed to continue. The confidence signal may fluctuate, stagnate, or fail to approach the upper threshold. If the only exit condition is “stop when confident,” difficult or unsolvable instances often run to the maximum token budget. That is not caution. That is a very expensive way to say “I don’t know.”

The paper’s central mechanism is therefore dual-threshold early stopping:

Exit mechanism	What it detects	What can go wrong	Operational interpretation
Upper threshold	The model appears confident enough to answer	False positive: it stops and answers incorrectly	“This looks solved; answer now.”
Lower threshold	The model is not making enough progress	False negative: it stops even though later reasoning could have succeeded	“This looks unlikely to improve; abstain, escalate, or stop spending.”

This distinction is the heart of the paper. Early stopping is not just about catching success early. It is also about recognizing failure early enough that failure does not become a premium-tier subscription plan.

The upper threshold saves tokens after convergence

The upper threshold is the familiar part. If the confidence signal rises above some cutoff, the system exits. In a reasoning model, this means that a partial reasoning trajectory is considered sufficient to elicit a final answer.

The risk here is straightforward. If the model exits early and the answer is wrong, the system has made a false positive error: it believed the answer was ready when it was not. The paper formalizes this as a predictive-performance loss for upper-threshold exits.

The efficiency loss is different. Suppose the model first becomes correct at step $t^\ast$ but exits only later. Every step after $t^\ast$ is wasted computation. The upper-threshold efficiency loss measures that delay relative to the full reasoning budget. This distinction matters because correctness and efficiency are not the same objective. A threshold can be safe but lazy. It can avoid wrong answers while still allowing the model to recite a small dissertation after solving the problem.
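The two upper-threshold losses can be sketched in a few lines. This is a minimal illustration, not the paper's definition: it assumes a trace is stored as per-step signal values and per-step correctness flags, the function names are hypothetical, and normalising the delay by the total number of steps is only one plausible reading of "relative to the full reasoning budget."

```python
from typing import List, Optional


def upper_exit_step(signal: List[float], threshold: float) -> Optional[int]:
    """Index of the first step whose signal reaches the threshold, else None."""
    for t, value in enumerate(signal):
        if value >= threshold:
            return t
    return None


def first_correct_step(correct: List[bool]) -> Optional[int]:
    """t*: index of the first step whose elicited answer is correct, else None."""
    for t, is_correct in enumerate(correct):
        if is_correct:
            return t
    return None


def false_positive_loss(signal: List[float], correct: List[bool], threshold: float) -> float:
    """1.0 if the rule exits through the upper threshold on a wrong answer."""
    t_exit = upper_exit_step(signal, threshold)
    return 1.0 if t_exit is not None and not correct[t_exit] else 0.0


def efficiency_loss(signal: List[float], correct: List[bool], threshold: float) -> float:
    """Steps spent after t*, as a fraction of the full budget (assumed form)."""
    t_star = first_correct_step(correct)
    if t_star is None:
        return 0.0  # never correct, so there is no post-convergence waste to count
    t_exit = upper_exit_step(signal, threshold)
    if t_exit is None:
        t_exit = len(signal) - 1  # the trace ran to the end of the budget
    return max(0, t_exit - t_star) / len(signal)


# Toy trace: the answer becomes correct at step 1, the signal reaches 0.9 at step 3.
signal = [0.2, 0.5, 0.7, 0.9, 0.95]
correct = [False, True, True, True, True]
print(false_positive_loss(signal, correct, 0.9), efficiency_loss(signal, correct, 0.9))
```

On this toy trace a threshold of 0.9 is the "safe but lazy" case: no false positive, but two of five steps are spent after the answer was already correct.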

For business deployment, this maps cleanly to a familiar product choice. In customer support, compliance review, code generation, or financial research assistants, many tasks are “solved enough” before the model reaches its maximum budget. The upper threshold is a way to reclaim those unnecessary tokens while keeping the probability of premature wrong answers within a chosen tolerance.

But if the task is genuinely hard, the upper threshold has a blind spot. The confidence signal may never cross it. The model then keeps reasoning until the budget ends. This is where the lower threshold matters.

The lower threshold makes the model earn the right to continue

The lower threshold is the paper’s more interesting contribution. It does not ask whether the model is already confident. It asks whether confidence is improving fast enough to justify continued reasoning.

A static lower threshold would be too crude. Confidence can start low and rise gradually. Stopping just because early confidence is low would punish solvable problems before they have time to develop. The paper instead proposes a dynamic, parametric lower threshold: a progress schedule that becomes stricter as token usage grows.

In plain language:

$$ \text{continue reasoning only if the confidence trajectory is improving enough for this stage of the budget.} $$

If the confidence signal fails to keep up with that schedule, the system exits through the lower threshold. The intended interpretation is not “answer now because we are confident.” It is closer to “continued reasoning is unlikely to pay for itself.” In production, that lower-threshold exit should usually trigger abstention, fallback, retrieval, human review, a cheaper diagnostic path, or a different tool. Treating it as a normal answer would be operationally weird. The paper’s experiments also handle lower-threshold abstentions conservatively in some accuracy–token curves, counting them as wrong when measuring accuracy.
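A rough sketch of a dual-threshold exit rule follows. The sigmoid shape of the lower schedule and every parameter name (`ceiling`, `steepness`, `midpoint`) are assumptions made for illustration; the paper's own parametrization may differ. The point is only to show the two exits side by side.

```python
import math
from typing import List, Tuple


def lower_schedule(frac_used: float, ceiling: float, steepness: float, midpoint: float) -> float:
    """Hypothetical progress schedule: a sigmoid in the fraction of budget used.

    It is close to 0 early on and approaches `ceiling` as the budget is consumed,
    so the bar for continuing rises with token usage.
    """
    return ceiling / (1.0 + math.exp(-steepness * (frac_used - midpoint)))


def dual_threshold_exit(
    signal: List[float],
    upper: float,
    ceiling: float = 0.5,
    steepness: float = 10.0,
    midpoint: float = 0.5,
) -> Tuple[str, int]:
    """Return (route, step), where route is 'upper', 'lower', or 'budget'."""
    n = len(signal)
    for t, value in enumerate(signal):
        if value >= upper:
            return "upper", t  # confident enough: elicit the answer
        frac_used = (t + 1) / n
        if value < lower_schedule(frac_used, ceiling, steepness, midpoint):
            return "lower", t  # not keeping up: abstain or escalate
    return "budget", n - 1  # neither exit fired before the budget ran out


rising = [0.1, 0.3, 0.6, 0.92, 0.95]
stalled = [0.1, 0.12, 0.11, 0.13, 0.12]
print(dual_threshold_exit(rising, upper=0.9))
print(dual_threshold_exit(stalled, upper=0.9))
```

In this toy run the rising trace leaves through the upper exit and the stalled trace through the lower one. What the caller does with a `'lower'` result (abstain, escalate, retrieve, switch tools) is a product decision, not something the rule settles.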

This is a useful shift. Many deployed AI systems do not merely need models that answer faster. They need models that know when the current path is not promising. A failed reasoning trajectory should be a signal, not just an invoice.

Risk control replaces threshold folklore

The dual-threshold mechanism creates a new problem: how should the thresholds be set?

The naive answer is validation tuning. Try different confidence signals and thresholds on a validation set, pick the ones that appear to satisfy the target risk, and hope the test distribution behaves. This is the ancient ritual of machine learning: sweep, select, deploy, suffer.

The paper argues that this is not enough, especially when the validation set is small and the threshold search is flexible. A threshold that looks safe on validation data can be safe only because it got lucky. The more candidate signals and thresholds we scan, the easier it becomes to exploit random downward fluctuations in empirical risk. That is not intelligence. That is overfitting wearing a lab coat.

Conformal Thinking uses distribution-free risk control, specifically a UCB-style finite-sample correction, to select thresholds. The workflow is:

1. Choose a user-specified risk tolerance, $\epsilon$.
2. Evaluate candidate confidence signals and thresholds on a held-out validation set.
3. Adjust empirical risk upward using a finite-sample correction.
4. Keep only candidates whose adjusted risk respects the target.
5. Among feasible candidates, choose the one with the best efficiency loss.
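The workflow above can be sketched as follows. The Hoeffding bound used here is a stand-in chosen for simplicity; the paper's correction may be a different and tighter bound. The candidate numbers are invented, and the scan-and-stop order leans on the assumption that risk is monotone in the threshold.

```python
import math
from typing import List, Optional, Tuple

# (threshold, empirical_risk, efficiency_loss); all values below are made up.
Candidate = Tuple[float, float, float]


def hoeffding_ucb(empirical_risk: float, n: int, delta: float) -> float:
    """Upper confidence bound for the mean of a loss bounded in [0, 1]."""
    return empirical_risk + math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def select_threshold(
    candidates: List[Candidate], n: int, epsilon: float, delta: float
) -> Optional[Candidate]:
    """Scan from the most conservative candidate and stop at the first failure.

    Assumes `candidates` is ordered so that risk is non-decreasing along the list.
    """
    feasible: List[Candidate] = []
    for cand in candidates:
        if hoeffding_ucb(cand[1], n, delta) > epsilon:
            break
        feasible.append(cand)
    return min(feasible, key=lambda c: c[2]) if feasible else None


def naive_select(candidates: List[Candidate], epsilon: float) -> Optional[Candidate]:
    """Validation tuning without any correction, for contrast."""
    feasible = [c for c in candidates if c[1] <= epsilon]
    return min(feasible, key=lambda c: c[2]) if feasible else None


candidates = [
    (0.95, 0.02, 0.40),
    (0.90, 0.05, 0.30),
    (0.85, 0.11, 0.22),
    (0.80, 0.14, 0.15),
    (0.75, 0.22, 0.10),
]
print(select_threshold(candidates, n=200, epsilon=0.2, delta=0.1))
print(naive_select(candidates, epsilon=0.2))
```

With these toy numbers the corrected rule settles on the 0.85 threshold while the naive rule accepts the more aggressive 0.80, whose empirical risk sits under the target only by a margin smaller than the finite-sample correction.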

This matters because the output is no longer “threshold = 0.73,” a number with all the interpretability of a hotel-room thermostat. The output is “under this calibration setup, this stopping rule is selected to respect a target risk tolerance with high probability.”

That is the business-facing interface: not a confidence cutoff, but an error budget.

What the evidence is actually testing

The experiments are not one giant claim. They serve different purposes, and mixing them together would make the paper sound cleaner but less accurate. Here is the useful map.

Evidence component	Likely purpose	What it supports	What it does not prove
Figure 4: naive vs UCB risk control	Main evidence for risk-control calibration	UCB-style correction keeps realized test risk below target more reliably than naive validation tuning across random splits	It does not prove robustness under arbitrary distribution shift
Figure 5: signal ensemble	Efficiency evidence and signal-selection demonstration	Selecting the most efficient feasible signal can improve the accuracy–token trade-off versus committing to one signal	It does not imply any one signal is universally best
Figure 6: upper, lower, and dual thresholds under solvable/unsolvable ratios	Main mechanism evidence for the lower threshold	Lower-threshold gains grow when unsolvable instances are common; upper and lower thresholds do different jobs	It does not show the lower threshold is equally useful when most tasks are solvable
Figure 7: validation-set size ablation	Robustness/sensitivity test	UCB becomes more valuable when validation sets are small	It does not remove the need for representative validation data
Figure 8: length shift	Robustness/sensitivity test	Lower-threshold risk is more fragile under reasoning-length shift because its shape depends on the reasoning horizon	It does not support careless transfer across token-length regimes
Figure 9: dataset shift	Robustness/sensitivity test	UCB remains better controlled than naive tuning under the tested math/science dataset shift	It does not guarantee risk control under severe domain change

The setup is deliberately reasoning-heavy. The paper evaluates Qwen3-8B, Qwen3-30B-A3B, DeepSeek-R1-Distill-Qwen-32B, and Qwen3-VL-8B. The datasets include AIME, DeepScaleR, GPQA-Diamond, and MathVision. These are structured reasoning benchmarks, mostly math and science, with answer formats that make correctness checking and intermediate signal extraction feasible.

That scope is important. The paper is not claiming that any free-form business writing agent can now be safely stopped mid-thought with a universal threshold. It is working in settings where intermediate answers, final answers, and uncertainty signals can be operationalized. Boring, perhaps. Also how reliability work usually starts before marketing departments discover it.

The risk-control result is about calibration, not magic

The first empirical result verifies the core calibration idea. The authors use Qwen3-8B on AIME, generate 40 random validation-test splits, and use a small validation set of 50 samples, about 5 percent of the data. They compare naive threshold selection with UCB-adjusted selection for both false positive and false negative risk.

The point is not that UCB makes the model smarter. It does not. It makes threshold selection less overconfident.

Naive validation can appear controlled on average, but individual runs can exceed the target risk. This is especially visible for lower-threshold false negative risk, where the threshold mechanism has more flexibility and is therefore more prone to noise. UCB correction makes the selected thresholds more conservative, so realized test risk stays below the target more consistently.

The business interpretation is simple: when the calibration dataset is finite, the validation estimate is not the deployment risk. If a system scans many stopping rules and picks the most efficient one, it must pay a statistical penalty for that search. Otherwise, it is quietly buying lower cost with unpriced risk.
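To get a feel for the size of that penalty, here is a back-of-the-envelope calculation using a Hoeffding-style margin. This is an illustrative bound, not necessarily the one the paper uses, and the confidence level `delta=0.1` is an assumed value.

```python
import math


def hoeffding_margin(n: int, delta: float) -> float:
    """Amount added to the empirical risk of a [0, 1]-bounded loss."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n))


for n in (50, 200, 1000):
    print(n, round(hoeffding_margin(n, delta=0.1), 3))
# 50 0.152
# 200 0.076
# 1000 0.034
```

Under this particular bound, a 50-sample validation set adds roughly 15 points of risk to whatever the validation estimate says, which is why small calibration sets force conservative thresholds.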

This is the part of the paper that should interest AI operations teams. The method is not just “early stop to save tokens.” It is “early stop only after accounting for the uncertainty introduced by calibration itself.” Less glamorous. More deployable.

Signal selection is part of the product, not a detail

The paper evaluates several stopping signals: confidence, entropy-after-thinking-token style signals, token count, and, for Qwen3-8B, a probe trained on AIME reasoning trajectories. The results show that no single signal dominates everywhere. The efficient choice depends on the model, dataset, and risk level.

This is where risk control becomes useful beyond threshold setting. Given a target risk tolerance, the framework can choose not only a threshold but also the signal-threshold pair that minimizes efficiency loss among feasible candidates. In effect, it creates an ensemble over stopping signals: not by averaging them, but by selecting the operationally best one under a risk constraint.
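A minimal sketch of that selection step, assuming each signal already comes with a list of thresholds, their adjusted risks, and their efficiency losses. The signal names and all numbers are hypothetical, and a real implementation would also have to account for how searching across more candidates affects the guarantee.

```python
from typing import Dict, List, Optional, Tuple

# (threshold, adjusted_risk, efficiency_loss); values are invented for illustration.
Candidate = Tuple[float, float, float]


def select_signal_and_threshold(
    per_signal: Dict[str, List[Candidate]], epsilon: float
) -> Optional[Tuple[str, float, float]]:
    """Pick the feasible (signal, threshold) pair with the lowest efficiency loss."""
    best: Optional[Tuple[str, float, float]] = None
    for name, candidates in per_signal.items():
        for threshold, adjusted_risk, eff_loss in candidates:
            if adjusted_risk > epsilon:
                continue  # violates the risk budget
            if best is None or eff_loss < best[2]:
                best = (name, threshold, eff_loss)
    return best


per_signal = {
    "confidence": [(0.9, 0.08, 0.35), (0.8, 0.18, 0.20)],
    "entropy": [(0.3, 0.09, 0.30), (0.5, 0.25, 0.12)],
    "token_count": [(4000.0, 0.15, 0.28)],
}
print(select_signal_and_threshold(per_signal, epsilon=0.1))
print(select_signal_and_threshold(per_signal, epsilon=0.2))
```

In this toy table the winning signal changes with the risk budget: the entropy signal at the tighter tolerance, the confidence signal at the looser one. That is the sense in which no single signal needs to be the favorite.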

For deployment, this suggests a more mature architecture than “we use confidence because confidence sounds reassuring.” A production reasoning system may need multiple monitoring signals:

Signal type	Possible role	Practical weakness
Confidence over elicited answer	Detect likely readiness to answer	Can be miscalibrated across tasks
Entropy-style uncertainty	Capture distributional uncertainty	May be noisy or hard to interpret
Token count	Simple budget baseline	Blind to instance difficulty
Probe-based signal	Directly predicts stepwise correctness	Requires training data and may transfer poorly

The paper’s framework does not require faith in one signal. It treats signals as candidates and selects among them under the same risk budget. That is the right attitude. Inference systems should not have favorite heuristics. They should have accountable operating points.

The lower threshold matters most when failure is common

The most operationally interesting experiment is Figure 6. The authors construct evaluation sets with controlled solvable-to-unsolvable ratios: 3:1, 1:1, and 1:3. They pool AIME and GPQA, label instances by whether the model can reach the correct final answer under the full token budget, and compare upper-only, lower-only, and dual-threshold policies.
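For teams that want to run a similar stress test on their own data, a controlled mix can be assembled along these lines. This is a sketch under assumptions: the function name, the string instance IDs, and the sampling scheme are hypothetical, and the solvable/unsolvable labels are taken as given from full-budget runs.

```python
import random
from typing import List, Sequence, Tuple


def build_mix(
    solvable: Sequence[str],
    unsolvable: Sequence[str],
    ratio: Tuple[int, int],
    total: int,
    seed: int = 0,
) -> List[Tuple[str, bool]]:
    """Sample `total` instances with a solvable:unsolvable split given by `ratio`."""
    rng = random.Random(seed)
    solvable_parts, unsolvable_parts = ratio
    n_solvable = total * solvable_parts // (solvable_parts + unsolvable_parts)
    n_unsolvable = total - n_solvable
    if n_solvable > len(solvable) or n_unsolvable > len(unsolvable):
        raise ValueError("not enough instances for the requested mix")
    picked = [(q, True) for q in rng.sample(list(solvable), n_solvable)]
    picked += [(q, False) for q in rng.sample(list(unsolvable), n_unsolvable)]
    rng.shuffle(picked)
    return picked


solvable_ids = [f"s{i}" for i in range(100)]
unsolvable_ids = [f"u{i}" for i in range(100)]
for ratio in [(3, 1), (1, 1), (1, 3)]:
    mix = build_mix(solvable_ids, unsolvable_ids, ratio, total=40)
    print(ratio, sum(is_solvable for _, is_solvable in mix))
```

Running the same stopping policies over each mix then shows how much of the token saving comes from each exit as the share of unsolvable instances grows.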

The result is exactly what the mechanism predicts.

When solvable instances dominate, upper-threshold stopping captures most of the savings. The model often becomes confident enough to exit, so a lower threshold adds little. When unsolvable instances become common, upper-threshold-only stopping becomes inefficient because many runs never cross the confidence cutoff and therefore consume the full budget. Adding the lower threshold shifts the accuracy–token curve left: similar accuracy with fewer tokens.

The bottom-row analysis of Figure 6 is useful because it confirms the division of labor. Solvable instances tend to exit via the upper threshold. Unsolvable instances tend to exit via the lower threshold. That is not a cosmetic ablation. It shows that the two thresholds correspond to different operational states.

This is where the paper becomes relevant to agentic systems. Agents often face task mixes with unknown solvability: messy documents, incomplete databases, ambiguous user requests, broken APIs, missing permissions, or impossible instructions. A confidence-only stopping rule is poorly matched to such environments. It knows how to stop after success. It does not know how to stop after persistent non-progress.

The lower threshold is a first step toward progress-aware inference governance.

What businesses can use from this paper

The paper directly shows a risk-controlled method for early stopping in structured reasoning tasks. It shows that finite-sample correction improves realized risk control relative to naive threshold tuning, that signal selection improves efficiency, and that the lower threshold is especially useful when unsolvable instances are common.

Cognaptus’ business inference is broader but should be stated carefully: the same framing can inform how companies operate expensive reasoning and agentic systems. Not by copying the exact lower-threshold sigmoid into every workflow. That would be the classic enterprise AI move: take a research mechanism out of context and deploy it with the confidence of a dashboard. Rather, the transferable idea is the interface:

$$ \text{Reasoning policy} = \text{risk tolerance} + \text{calibration data} + \text{efficiency objective}. $$

That interface can support several product decisions.

Business problem	Risk-controlled reasoning interpretation	Likely operational action
High inference cost	Some tokens are spent after success or during hopeless non-progress	Calibrate early-exit policies by task type
Slow agent workflows	Agents keep trying when tool paths are failing	Add progress-aware lower exits and escalation
Unclear reliability targets	Teams tune thresholds without knowing what risk they imply	Express thresholds through error budgets
Model-specific behavior	A signal threshold transfers poorly across models	Calibrate per model, task, and signal
Overconfident validation results	Offline threshold sweeps look safer than deployment	Use finite-sample correction and holdout tests

The ROI logic is not “use fewer tokens everywhere.” The better logic is “allocate expensive reasoning only where marginal reasoning has expected value.” That distinction matters. A system that stops too aggressively can save money by producing bad outputs. Congratulations, the cloud bill is lower and the product is worse. The paper’s contribution is to make that trade-off explicit.

A practical deployment pattern

A company adapting this idea would probably not begin with the full research pipeline. A reasonable deployment pattern would look like this:

1. Segment tasks into categories with stable answer formats and comparable difficulty.
2. Collect validation traces with full-budget reasoning.
3. Define the stopping outcome: answer, abstain, escalate, retrieve more context, or switch tools.
4. Choose measurable losses: false confident answer, false early abstention, wasted token fraction, latency cost.
5. Extract candidate signals at intermediate reasoning steps.
6. Calibrate stopping policies against a target risk tolerance using finite-sample correction.
7. Monitor realized risk and recalibrate when model, prompt, task mix, or input length distribution changes.
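The monitoring step at the end of that list could start as something as simple as a rolling error-rate check. The sketch below is an assumption-heavy illustration: the class name, window size, and trigger rule are all hypothetical, and it presumes that some audited subset of early exits gets labeled as right or wrong after the fact.

```python
from collections import deque
from typing import Deque, Optional


class RiskMonitor:
    """Tracks the realized error rate of early exits over a rolling window."""

    def __init__(self, epsilon: float, window: int = 200, min_samples: int = 50) -> None:
        self.epsilon = epsilon
        self.min_samples = min_samples
        self._outcomes: Deque[bool] = deque(maxlen=window)

    def record(self, was_error: bool) -> None:
        """Log one audited early exit; True means the exit turned out to be wrong."""
        self._outcomes.append(was_error)

    def realized_risk(self) -> Optional[float]:
        """Error rate over the window, or None until enough samples exist."""
        if len(self._outcomes) < self.min_samples:
            return None
        return sum(self._outcomes) / len(self._outcomes)

    def needs_recalibration(self) -> bool:
        """True once the rolling error rate exceeds the target tolerance."""
        risk = self.realized_risk()
        return risk is not None and risk > self.epsilon


monitor = RiskMonitor(epsilon=0.1, window=100, min_samples=20)
for _ in range(19):
    monitor.record(False)
monitor.record(True)
print(monitor.realized_risk(), monitor.needs_recalibration())
```

A raw rolling average like this is noisy on small windows, so in practice the trigger would likely need its own tolerance band. The sketch only shows where the check sits in the loop.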

The fourth step is where many AI teams will be tempted to cheat. They will want one metric. They will want it to fit every workflow. It will not.

For a coding assistant, a false positive might mean shipping a broken patch. For a legal-document assistant, it might mean giving an unsupported clause interpretation. For a customer-support agent, it might mean sending a confident but wrong refund instruction. For a market-research agent, it might mean fabricating a conclusion from insufficient evidence. Same word, different cost.

Risk tolerance is only useful when the loss function reflects the actual business damage.

Where the method thins out

The paper is appropriately clear about its boundaries.

First, the method assumes monotonic risk behavior with respect to the hyperparameters. In other words, changing thresholds should move risk in a predictable direction. That assumption is convenient and often plausible, but it is not a law of nature. If the confidence signal behaves pathologically, risk control has less to stand on.

Second, the two-step procedure for combining upper and lower thresholds is practical but not theoretically perfect. The paper notes that the upper risk guarantee can break if the lower threshold filters out proportionally more correct samples than incorrect ones. In plain terms: if the lower threshold is bad at identifying hopeless cases, it can distort the population left for the upper threshold. This is not a minor footnote. It is the exact failure mode operators should watch.
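A toy count makes the distortion concrete. The numbers below are invented, and "error rate" here simply means the share of wrong answers among the samples still left for the upper threshold, which is a simplification of the paper's risk definition.

```python
from fractions import Fraction


def upper_error_rate(correct: int, incorrect: int) -> Fraction:
    """Share of wrong answers among samples that reach the upper threshold."""
    return Fraction(incorrect, correct + incorrect)


# Before any lower-threshold filtering: 80 correct exits, 20 incorrect exits.
before = upper_error_rate(correct=80, incorrect=20)

# A poorly aimed lower threshold removes 40 correct samples but only 5 incorrect ones.
skewed = upper_error_rate(correct=80 - 40, incorrect=20 - 5)

# A proportional filter removes half of each group and leaves the rate unchanged.
proportional = upper_error_rate(correct=80 - 40, incorrect=20 - 10)

print(before, skewed, proportional)
```

The skewed filter pushes the error rate from 1/5 to 3/11 without the upper threshold changing at all, which is why the exit mix is worth monitoring alongside the headline risk.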

Third, the lower threshold is sensitive to reasoning length. The length-shift ablation shows that when validation traces are shorter than test traces, false negative risk for the lower threshold becomes more fragile. This makes sense. A progress schedule calibrated for one reasoning horizon may not behave well under another. Time-dependent thresholds are useful precisely because they depend on time. Shocking, but apparently worth repeating.
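The sensitivity is easy to see with a schedule defined on the fraction of budget used. The sigmoid form and its parameters below are assumptions made for illustration, not the paper's exact schedule.

```python
import math


def lower_schedule(
    tokens_used: int,
    horizon: int,
    ceiling: float = 0.5,
    steepness: float = 10.0,
    midpoint: float = 0.5,
) -> float:
    """Hypothetical lower threshold as a sigmoid in the fraction of the horizon used."""
    frac_used = tokens_used / horizon
    return ceiling / (1.0 + math.exp(-steepness * (frac_used - midpoint)))


# The same 6,000 tokens of reasoning, judged against two different horizons.
short_horizon = lower_schedule(tokens_used=6000, horizon=8000)
long_horizon = lower_schedule(tokens_used=6000, horizon=16000)
print(round(short_horizon, 3), round(long_horizon, 3))
```

At the same absolute token count, the schedule demands far more progress under the shorter horizon than under the longer one, so parameters calibrated for one reasoning length do not carry over cleanly to another.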

Fourth, the experiments focus on structured math, science, and vision-language reasoning tasks. The authors explicitly note that existing uncertainty signals are less suitable for tasks with less structured reasoning and output. That limits immediate transfer to open-ended writing, negotiation, strategy, or business analysis agents.

The practical boundary is therefore:

Safe reading	Unsafe reading
Risk-controlled early stopping can improve compute efficiency under calibrated, structured conditions	Any reasoning model can now be safely stopped early in any task
Lower thresholds help when unsolvable instances consume budget	Lower thresholds are universally useful
UCB correction reduces validation overfitting	UCB eliminates distribution shift
Signal selection improves efficiency among tested candidates	The best signal will transfer across models and domains

This is not a weakness of the paper. It is the difference between a research contribution and a magic amulet.

The real lesson is not “think less”

The tempting headline is that this paper teaches LLMs to think less. That is close, but not quite right.

The better interpretation is that it teaches systems to treat thinking as a controlled resource. Sometimes the right action is to stop because the model is confident. Sometimes it is to stop because the trajectory is going nowhere. Sometimes it is to continue because the risk budget justifies more computation. And sometimes it is to abstain, escalate, or ask for missing information instead of pretending that another 8,000 tokens will summon truth from the void.

The industry has spent the last few years treating test-time compute as a performance lever. More thinking, better answers, larger budgets, grander demos. Useful, up to a point. But deployment is where averages go to be audited. A real system needs operating points, tolerances, monitoring, and failure modes that can be named before they become incidents.

Conformal Thinking is valuable because it makes reasoning budgets governable. It moves the conversation from “how long should the model think?” to “what risk are we taking when we stop?” That is a more adult question. Naturally, it is also less fun at parties.

But for AI products that need to be reliable, affordable, and inspectable, it is the right question.

Cognaptus: Automate the Present, Incubate the Future.

¹ Xi Wang, Anushri Suresh, Alvin Zhang, Rishi More, William Jurayj, Benjamin Van Durme, Mehrdad Farajtabar, Daniel Khashabi, and Eric Nalisnick, “Conformal Thinking: Risk Control for Reasoning on a Compute Budget,” arXiv:2602.03814v2, 2026, https://arxiv.org/abs/2602.03814.
