An AI system gives an answer. The answer looks plausible. The reasoning trace is long enough to seem serious. The user asks the next question, which is the one that actually matters:
How sure is it?
For ordinary software, this question is already annoying. For reasoning language models, it is worse. These models do not just emit a short response; they may spend thousands of tokens walking through a problem before landing on an answer. Asking them again is not free. Asking them eight times is not diligence. It is a budget line with philosophical decoration.
That is why the paper “How Uncertainty Estimation Scales with Sampling in Reasoning Models” is useful.[1] It studies a very practical question: when reasoning models are expensive to sample, should we estimate reliability by asking for confidence, by checking whether multiple answers agree, or by combining the two?
The paper’s answer is not “sample more.” Conveniently, the answer is also not “trust the model when it says it is confident,” which would make deployment engineering a little too magical. The strongest practical result is narrower and more interesting: a two-sample hybrid of verbalized confidence and self-consistency beats deeper scaling of either signal alone, even up to eight samples.
That is the kind of result enterprise AI teams should read as a compute-allocation rule, not as a benchmark curiosity.
The obvious strategy is to ask the model many times
The standard intuition behind sampling-based uncertainty is simple. If the model is asked the same question several times and keeps giving the same answer, we treat the answer as more reliable. This is self-consistency: agreement as confidence.
There is a second family of methods: ask the model how confident it is. This is verbalized confidence. It sounds almost embarrassingly simple, but in many language-model settings it provides useful ranking information. Not perfect calibration. Not a legally binding probability. But useful discrimination between answers more likely to be correct and answers more likely to be wrong.
The difference becomes economically important once we move from normal LLMs to reasoning language models. A reasoning model sample is not just another short completion. It is another full reasoning trace. If each trace is long, then increasing $K$ from 2 to 8 means paying for six more chains of thought. The old sampling advice—“just generate more candidates”—starts to look like buying six additional smoke alarms for a kitchen already on fire.
The paper compares three strategies:
| Strategy | Signal used | Practical intuition | Cost profile |
|---|---|---|---|
| Verbalized confidence (VC) | Model-reported confidence | “The model tells us how sure it is.” | Available from one sample; can be averaged across samples |
| Self-consistency (SC) | Agreement across sampled answers | “If repeated samples agree, the answer is probably safer.” | Requires at least two samples; improves with more |
| SCVC hybrid | Confidence plus agreement | “Use both introspection and cross-sample agreement.” | Requires two samples, then can scale further |
The paper’s comparison-based framing matters because the real deployment question is not whether uncertainty estimation is useful. Of course it is. The real question is which reliability signal gives the best marginal return when every extra sample is expensive.
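The three signals in the table can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the use of a simple mean for VC, majority share for SC, and a convex mix with a `weight` parameter for the hybrid are all assumptions made for the example.

```python
from collections import Counter
from statistics import mean


def verbalized_confidence(confidences):
    """VC: mean of the model-reported confidences, each assumed to lie in [0, 1]."""
    return mean(confidences)


def self_consistency(answers):
    """SC: share of sampled answers that match the most common answer."""
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)


def scvc(answers, confidences, weight=0.5):
    """Hybrid: convex mix of VC and SC. The mixing form and `weight` are assumptions."""
    vc = verbalized_confidence(confidences)
    sc = self_consistency(answers)
    return weight * vc + (1 - weight) * sc


# Two hypothetical samples for one question.
answers = ["42", "42"]
confidences = [0.9, 0.7]
print(verbalized_confidence(confidences))
print(self_consistency(answers))
print(scvc(answers, confidences))
```

Under this definition, SC at $K=2$ can take only two values (0.5 or 1.0), so on its own it is a coarse ranking signal. Mixing in the continuous confidence score is one way to break those ties.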
What the paper actually tests
The authors evaluate black-box uncertainty estimation across three open-source reasoning models: gpt-oss-20b, Qwen3-30B-A3B, and DeepSeek-R1-8B. The task suite spans 17 tasks across mathematics, STEM, and humanities. The math tasks include MMLU-Pro Math, GSM8K, and AIME 2024/2025; the non-math tasks include MMLU-Pro subject areas and GPQA Diamond.
The metric is AUROC, the area under the receiver operating characteristic curve, shown in this article on a 0 to 100 scale. This is important. AUROC measures whether a confidence signal ranks correct answers above incorrect answers. It does not mean the model’s “80% confidence” is actually calibrated to 80% correctness.
That distinction should not be treated as fine print. It determines what the result can safely support.
If AUROC improves, the uncertainty signal is better at separating answers that are likely correct from answers that are likely wrong. That is valuable for triage, routing, abstention, escalation, and selective prediction. It does not automatically give a production-ready probability threshold. If someone turns AUROC into “the model knows exactly how likely it is to be right,” they have mistaken a ranking test for a crystal ball. Happens often. Still not recommended.
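The ranking-versus-calibration distinction is easy to see in code. The sketch below computes AUROC directly from its pairwise definition (on a 0 to 1 scale) and then applies a monotone transform to the scores. The toy scores and labels are hypothetical.

```python
def auroc(scores, labels):
    """Probability that a random correct answer outranks a random incorrect one.

    Ties between a correct and an incorrect answer count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical confidence scores and correctness labels (1 = correct).
scores = [0.9, 0.8, 0.6, 0.55]
labels = [1, 1, 0, 1]

# A monotone transform changes every value but preserves the ordering.
squashed = [s ** 3 for s in scores]

print(auroc(scores, labels))
print(auroc(squashed, labels))
```

Both calls return the same value, because AUROC depends only on the ordering of the scores. The cubed scores would be a very different set of "probabilities," which is why a good AUROC says nothing by itself about whether 0.8 means 80%.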
The authors generate pools of samples per question, then use bootstrap draws to estimate performance at different sample budgets. The main budgets reported are $K=1$, $K=2$, $K=5$, and $K=8$. This setup is well aligned with the actual operational question: how much uncertainty quality do we buy as we spend more sampling compute?
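The resampling step can be sketched as follows. This is a hedged illustration of the general idea, not the paper's procedure: it assumes draws are made with replacement from a per-question pool, uses majority share as the score, and the pool itself is hypothetical. A full evaluation would repeat this across questions and compute AUROC per replicate.

```python
import random
from collections import Counter


def sc_score(answers):
    """Share of drawn answers that match the most common answer."""
    return Counter(answers).most_common(1)[0][1] / len(answers)


def bootstrap_scores(pool, k, n_boot, score_fn, seed=0):
    """Draw k answers from one question's pool n_boot times and score each draw.

    Sampling with replacement is an assumption of this sketch.
    """
    rng = random.Random(seed)
    return [score_fn(rng.choices(pool, k=k)) for _ in range(n_boot)]


# Hypothetical pool of final answers for a single question.
pool = ["A", "A", "A", "B", "A", "C", "A", "A"]

for k in (2, 5, 8):
    draws = bootstrap_scores(pool, k, n_boot=1000, score_fn=sc_score)
    print(k, round(sum(draws) / len(draws), 3))
```

The useful property of this design is that the expensive part, generating the pool, happens once. Every budget $K$ is then simulated from the same pool at negligible cost.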
Read the evidence as a cost comparison, not a leaderboard
The main table in the paper reports AUROC for VC, SC, and SCVC across domains and sampling budgets. The compressed version is below.
| Domain average | VC $K=1$ | VC $K=8$ | SC $K=2$ | SC $K=8$ | SCVC $K=2$ | SCVC $K=8$ |
|---|---|---|---|---|---|---|
| Mathematics | 71.3 | 81.4 | 70.6 | 79.4 | 84.2 | 88.4 |
| STEM | 73.8 | 78.3 | 66.6 | 75.5 | 80.2 | 82.0 |
| Humanities | 68.5 | 72.6 | 63.3 | 71.3 | 74.9 | 77.0 |
The table should be read in three steps.
First, verbalized confidence is already a strong baseline. At one sample, it reaches 71.3 AUROC in mathematics, 73.8 in STEM, and 68.5 in humanities. That does not mean the model’s stated confidence is perfectly calibrated. It means the confidence score contains useful ranking information.
Second, self-consistency starts weaker. At $K=2$, SC reaches 70.6 in mathematics, 66.6 in STEM, and 63.3 in humanities, below single-sample VC in every domain. It improves as $K$ grows, but at $K=8$ it still trails VC at the same budget in all three domains (79.4 vs. 81.4, 75.5 vs. 78.3, and 71.3 vs. 72.6).
Third, the hybrid is the real story. SCVC at $K=2$ reaches 84.2 in mathematics, 80.2 in STEM, and 74.9 in humanities. In all three domains, two hybrid samples beat eight samples of either VC or SC alone.
This is not a subtle “statistically detectable but commercially sleepy” improvement. In mathematics, SCVC at two samples beats VC at eight samples by 2.8 AUROC points and SC at eight samples by 4.8 points. In STEM, it beats VC@8 by 1.9 and SC@8 by 4.7. In humanities, it beats VC@8 by 2.3 and SC@8 by 3.6.
The practical interpretation is blunt: if your reliability layer spends extra calls only to make one signal deeper, it may be buying redundancy. The paper suggests that the first extra sample is most valuable when it creates a second kind of signal.
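The margins quoted above follow directly from the compressed table. A short check, using only the table's own values (the dictionary layout and function name are just for illustration):

```python
# AUROC values copied from the compressed table above.
auroc_table = {
    "Mathematics": {"VC@8": 81.4, "SC@8": 79.4, "SCVC@2": 84.2},
    "STEM": {"VC@8": 78.3, "SC@8": 75.5, "SCVC@2": 80.2},
    "Humanities": {"VC@8": 72.6, "SC@8": 71.3, "SCVC@2": 74.9},
}


def hybrid_margins(table):
    """Return, per domain, how far SCVC@2 sits above VC@8 and SC@8 (AUROC points)."""
    return {
        domain: (
            round(row["SCVC@2"] - row["VC@8"], 1),
            round(row["SCVC@2"] - row["SC@8"], 1),
        )
        for domain, row in table.items()
    }


for domain, (vs_vc, vs_sc) in hybrid_margins(auroc_table).items():
    print(f"{domain}: +{vs_vc} over VC@8, +{vs_sc} over SC@8")
```

In every domain the two-sample hybrid uses a quarter of the samples of the eight-sample baselines and still comes out ahead.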
Why two samples can beat eight
The hybrid works because verbalized confidence and self-consistency are not identical signals, especially at low sample counts.
Verbalized confidence is introspective. It asks the model to report its own uncertainty about the answer. Self-consistency is behavioral. It observes whether multiple sampled answers converge. These signals can disagree. When they disagree, the disagreement itself is informative.
Consider four simplified cases:
| Confidence | Agreement | Possible reading | Operational response |
|---|---|---|---|
| High | High | Stable and internally confident | Usually safe to proceed |
| High | Low | Confident but unstable | Escalate or resample; overconfidence risk |
| Low | High | Consistent but hesitant | Check task ambiguity or answer format |
| Low | Low | Unstable and unsure | Route to fallback or human review |
A pure VC system sees only the confidence column. A pure SC system sees only the agreement column. The hybrid sees the interaction.
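The four cases in the table map naturally onto a small routing function. The sketch below is illustrative only: the thresholds and action labels are hypothetical and would need to be set against an organization's own validation data.

```python
def route(confidence, agreement, conf_threshold=0.8, agree_threshold=1.0):
    """Map a (confidence, agreement) pair onto one of the four cases.

    Both thresholds are hypothetical defaults, not values from the paper.
    """
    high_conf = confidence >= conf_threshold
    high_agree = agreement >= agree_threshold
    if high_conf and high_agree:
        return "proceed"                    # stable and internally confident
    if high_conf:
        return "escalate_or_resample"       # confident but unstable
    if high_agree:
        return "check_ambiguity_or_format"  # consistent but hesitant
    return "fallback_or_human_review"       # unstable and unsure


print(route(0.9, 1.0))
print(route(0.9, 0.5))
print(route(0.3, 1.0))
print(route(0.3, 0.5))
```

A VC-only system collapses the first two branches into one, and an SC-only system collapses the first and third. The off-diagonal cases are where the hybrid has something extra to say.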
The paper’s correlation analysis supports this interpretation. It finds that rank correlation between VC and SC increases with more samples: as $K$ grows, the two signals become more aligned. In other words, complementarity is front-loaded. Early on, confidence and agreement capture different uncertainty information. Later, they converge toward a more shared view of uncertainty, so the marginal value of adding more samples shrinks.
That is precisely why $K=2$ matters. The second sample creates the first opportunity to measure agreement while still preserving the distinct information inside verbalized confidence. By $K=8$, the two signals overlap far more, so each additional sample adds less new information. More evidence is good. Redundant evidence is less good. Enterprise AI teams occasionally rediscover this after receiving the invoice.
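Teams that want to check this complementarity on their own data can measure the rank correlation between the two signals at each budget. A minimal sketch of Spearman correlation in plain Python follows. The per-question VC and SC values are hypothetical, and the paper's exact correlation procedure may differ.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank across the tie group
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    denom = (vx * vy) ** 0.5
    if denom == 0:
        raise ValueError("rank correlation is undefined when one signal is constant")
    return cov / denom


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-question signals at K=2 (SC can only be 0.5 or 1.0 here).
vc = [0.95, 0.60, 0.80, 0.40, 0.90]
sc = [1.0, 0.5, 0.5, 0.5, 1.0]
print(round(spearman(vc, sc), 3))
```

If the correlation on a given workload is already high at $K=2$, the hybrid has less room to help there, and that is worth knowing before building the pipeline.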
Self-consistency is not useless; it is just inefficient alone
A lazy reading of the paper would say: “Self-consistency is bad.” That is not the result.
Self-consistency improves with more samples. In mathematics, SC rises from 70.6 at $K=2$ to 79.4 at $K=8$. In STEM, it rises from 66.6 to 75.5. In humanities, from 63.3 to 71.3. This is meaningful scaling.
The problem is sample efficiency. SC needs more samples to become competitive, and reasoning-model samples are expensive. In a shallow generation setting, where each sample is short, that might be acceptable. In a reasoning-model setting, each sample may contain a long deliberation trace. The cost of waiting for agreement becomes nontrivial.
This matters for agentic workflows. Many agent systems use repeated calls as a verification strategy: ask the model again, compare outputs, ask another model, vote, repeat. That architecture can work, but it is often designed as if inference cost were a secondary detail. The paper argues, indirectly but clearly, that cost should be part of the uncertainty design itself.
A better default is not “more votes.” It is “more diverse uncertainty evidence per call.”
Mathematics benefits most, but that is not a universal deployment license
The domain pattern is one of the paper’s most useful findings. Mathematics shows the strongest scaling and the strongest hybrid gain. Moving from VC@1 to SCVC@2 gives mathematics a +12.9 AUROC gain. STEM and humanities both show smaller initial gains of about +6.4 AUROC.
The later scaling pattern also differs. In mathematics, SCVC continues improving from 84.2 at $K=2$ to 86.8 at $K=5$ and 88.4 at $K=8$. In STEM, SCVC rises from 80.2 to 82.0 across the same range. In humanities, it rises from 74.9 to 77.0.
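Dividing each gain by the number of samples it cost makes the diminishing returns explicit. The sketch below uses only the AUROC values quoted in this article. The path structure and function name are illustrative, and the first step mixes two changes at once (one extra sample and the switch from VC to the hybrid).

```python
def gain_per_sample(points):
    """points: list of (k, auroc) sorted by k.

    Returns (k_from, k_to, AUROC points gained per added sample) for each step.
    """
    return [
        (k0, k1, round((a1 - a0) / (k1 - k0), 2))
        for (k0, a0), (k1, a1) in zip(points, points[1:])
    ]


# VC@1 followed by SCVC at the larger budgets, as reported above.
paths = {
    "Mathematics": [(1, 71.3), (2, 84.2), (5, 86.8), (8, 88.4)],
    "STEM": [(1, 73.8), (2, 80.2), (8, 82.0)],
    "Humanities": [(1, 68.5), (2, 74.9), (8, 77.0)],
}

for domain, points in paths.items():
    for k_from, k_to, gain in gain_per_sample(points):
        print(f"{domain}: K={k_from} -> K={k_to}: {gain} AUROC points per added sample")
```

In all three domains the second sample is worth far more than any later one. Mathematics is the only domain here with a reported $K=5$ value, and its per-sample return still falls below one AUROC point after $K=2$.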
The paper does not prove the training cause, but the interpretation is plausible: reasoning models are heavily optimized for mathematical reasoning, often through reinforcement learning with verifiable rewards. Math problems also have clearer correctness boundaries than many humanities questions. That makes both introspective uncertainty and cross-sample agreement easier to exploit.
For business use, this creates a useful but uncomfortable rule:
| Deployment domain | What the paper suggests | What remains uncertain |
|---|---|---|
| Math-heavy reasoning | Hybrid uncertainty is especially promising; gains persist beyond $K=2$ | Need validation on the organization’s own problem distribution |
| STEM and technical QA | Hybrid still helps, but returns saturate earlier | AUROC gains may not translate directly into calibrated production thresholds |
| Humanities, policy, legal, business reasoning | Hybrid improves ranking but less dramatically | Ambiguity, multi-answer questions, and subjective judgment complicate correctness labels |
So no, this result does not mean every enterprise can slap SCVC onto a chatbot and declare reliability solved. The more heterogeneous the task, the more careful the validation must be. But it does suggest a sensible starting point: use a two-sample hybrid before trying brute-force voting.
The appendix tests robustness, not a second thesis
The paper includes several analyses that are easy to misread as side quests. They are better understood as robustness and implementation guidance.
| Test or analysis | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Weighting parameter for SCVC | Robustness / sensitivity test | The hybrid does not require delicate tuning; many non-degenerate mixes work | It does not identify a universal optimal weight for every deployment |
| Rank correlation between VC and SC | Mechanism analysis | Hybrid gains are largest when signals are less correlated and shrink as signals align | It does not show causal independence between the signals |
| Six confidence elicitation / judge variants | Method comparison and implementation detail | More complex confidence extraction is often not worth the cost | It does not prove all judge models are useless in all settings |
| Per-task and per-model appendix tables | Breakdown / heterogeneity check | The headline result is not only a single aggregate artifact | It does not remove the need for domain-specific validation |
The weighting analysis is especially practical. The authors test different weights between VC and SC in the hybrid signal. Performance is stable across a wide interior range and degrades mainly when one signal is removed entirely. This is good news for implementation. The hybrid advantage does not appear to depend on discovering a sacred coefficient through ritual grid search.
The confidence-variant comparison is also worth reading carefully. The authors compare elicitation methods, where the model reports confidence alongside the answer, with judge methods, where a separate pass reads the reasoning trace and outputs confidence. The results are mixed by domain: vanilla elicitation is strongest in STEM at 77.15 AUROC; verification judging is strongest in mathematics at 73.82; humanities differences are narrower, with judge variants slightly ahead in some cases. The consistently weak method is epistemic-marker judging, at 67.59 in math, 72.00 in STEM, and 67.97 in humanities.
The practical point is not that all judges are bad. The point is cost-benefit. Judge-based methods require an additional pass over the reasoning trace. SCVC with two samples delivers a larger gain while also producing agreement information. If the reliability layer must spend another reasoning-model pass, the paper suggests spending it on a second sample plus hybrid scoring before spending it on fancy introspection theater.
What Cognaptus would infer for production systems
The paper directly shows a benchmark result: across the tested models and tasks, a two-sample hybrid of verbalized confidence and self-consistency outperforms deeper scaling of either signal alone for uncertainty discrimination.
The business inference is broader but still disciplined: in many reasoning-model workflows, reliability should be designed as a signal-fusion problem before it becomes a brute-force sampling problem.
A practical pipeline might look like this:
- Ask the reasoning model for an answer and a confidence score.
- Draw one additional independent sample with the same answer format and confidence requirement.
- Extract the final answers and confidence scores.
- Compute agreement between answers.
- Combine verbalized confidence and self-consistency into a single ranking score.
- Use that score for routing: accept, resample, retrieve more evidence, call tools, or escalate.
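The steps above can be sketched end to end. Everything named here is hypothetical: `sample_fn` stands in for whatever call returns a parsed final answer and a confidence in [0, 1], the equal weighting is an assumption, and the routing thresholds are placeholders that would need local validation.

```python
from collections import Counter
from statistics import mean
from typing import Callable, List, Tuple

Sample = Tuple[str, float]  # (final answer, verbalized confidence in [0, 1])


def hybrid_score(samples: List[Sample], weight: float = 0.5) -> Tuple[str, float]:
    """Return the most common answer and a VC/SC mix (the mixing form is assumed)."""
    answers = [a for a, _ in samples]
    # On a tie, Counter keeps first-seen order, so the first sample's answer wins.
    top_answer, top_count = Counter(answers).most_common(1)[0]
    vc = mean([c for _, c in samples])
    sc = top_count / len(samples)
    return top_answer, weight * vc + (1 - weight) * sc


def answer_with_routing(
    question: str,
    sample_fn: Callable[[str], Sample],
    k: int = 2,
    accept_at: float = 0.85,    # hypothetical threshold
    resample_at: float = 0.6,   # hypothetical threshold
) -> Tuple[str, float, str]:
    samples = [sample_fn(question) for _ in range(k)]
    answer, score = hybrid_score(samples)
    if score >= accept_at:
        action = "accept"
    elif score >= resample_at:
        action = "resample_or_retrieve"
    else:
        action = "escalate"
    return answer, score, action


def fake_model(question: str) -> Sample:
    """Stub standing in for a real reasoning-model call."""
    return ("42", 0.9)


print(answer_with_routing("What is 6 * 7?", fake_model))
```

The score is used only to rank and route. Turning it into a probability threshold would need a separate calibration step on labeled data from the target workflow.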
The operational value is not simply lower inference cost. It is cheaper diagnosis.
A system that only samples repeatedly can tell whether answers agree. A system that also asks for confidence can detect overconfident disagreement and uncertain agreement. That creates better routing options. In finance research, compliance review, technical QA, document automation, or coding assistance, this matters because the system rarely needs only a final answer. It needs to decide what to do next.
For example:
| Workflow | How SCVC can help | Boundary |
|---|---|---|
| Financial research assistant | Route low-SCVC answers to retrieval, analyst review, or additional model checks | Does not replace regulated financial advice controls |
| Code generation | Flag cases where two traces disagree or confidence is low before executing changes | Needs tests, static analysis, and sandboxing |
| Contract or policy QA | Detect unstable answers before presenting them as confident summaries | Legal ambiguity may break simple correctness assumptions |
| Customer-support automation | Escalate cases with low hybrid reliability rather than relying on one polished answer | Requires calibration against actual support outcomes |
| Agentic task planning | Decide whether an agent should act, ask for more context, or call tools | Multi-step errors can compound beyond single-answer uncertainty |
This is where the paper becomes more than an evaluation result. It suggests a design principle: do not treat “reasoning longer” and “knowing reliability” as the same thing.
A model can produce a long chain of thought and still be wrong. A model can be consistent and still be consistently wrong. A model can say it is confident and still be performing confidence cosplay in a lab coat. Reliability comes from how signals behave against correctness, not from how authoritative the trace looks.
The limits are precise, not decorative
There are four boundaries that matter for implementation.
First, the paper evaluates discrimination using AUROC, not calibration. If a production system needs probabilities that map cleanly to risk thresholds, additional calibration is required. AUROC can support ranking and routing; it cannot by itself guarantee that “0.8” means “80% likely correct.”
Second, the study uses three mid-sized open-source reasoning models. That is broad enough to be informative, but it is not a guarantee for every closed model, every model size, or every future post-training recipe.
Third, the tasks are mostly benchmark tasks with defined correctness labels. Real enterprise tasks may involve ambiguous instructions, incomplete data, contested standards, or multiple acceptable answers. The further the workflow moves away from verifiable answers, the more the SCVC score needs local validation.
Fourth, the hybrid score is an uncertainty estimator, not a truth oracle. It tells the system which answers look more or less reliable under the tested signals. It does not verify facts, inspect databases, run code, check legal authority, or understand whether the business context changed yesterday afternoon.
These limits do not weaken the paper’s practical value. They clarify where the value sits: SCVC is a cheap reliability layer for ranking and routing reasoning outputs. It is not a substitute for evidence, tools, or governance.
The decision rule is simple: buy signal diversity before sample depth
The paper’s strongest contribution is not merely that SCVC performs well. It is that the comparison changes the default compute strategy.
The common instinct is:
“If uncertainty matters, ask the model more times.”
The better rule suggested by this paper is:
“If uncertainty matters, first make the second sample produce a different kind of evidence.”
That difference is small in code and large in system design.
Instead of building reliability layers that spend compute on repeated agreement alone, teams can build layers that combine introspective and behavioral uncertainty. The second sample then does two jobs: it contributes another confidence estimate and creates the first agreement signal. That is why two samples can beat eight.
Not because two is magic. Because the second sample changes the information structure.
The less glamorous version is also the more useful one: reasoning models do not just need more thinking. They need cheaper ways to know when their thinking should not be trusted.
And in enterprise AI, knowing when not to trust the answer is often the most profitable feature nobody wants to demo.
[1] Maksym Del, Markus Kängsepp, Marharyta Domnich, Ardi Tampuu, Lisa Yankovskaya, and Meelis Kull, “How Uncertainty Estimation Scales with Sampling in Reasoning Models,” arXiv:2603.19118, 2026. https://arxiv.org/abs/2603.19118