Error Bars for the Algorithmic Mind: What ReasonBENCH Reveals About LLM Instability

A demo is not a deployment.

In a demo, the model answers once. The answer looks correct. The cost looks tolerable. The team nods, the slide deck gains a green checkmark, and someone says the usual fatal sentence: “This seems reliable enough.”

Then production happens. The same prompt goes through the same provider endpoint. The same workflow runs again. Sometimes the answer changes. Sometimes the reasoning trace wanders. Sometimes the bill is higher. Sometimes a supposedly more “thoughtful” strategy spends extra tokens to become confidently less useful. Beautiful. The machine has developed not consciousness, but variance.

That is the problem ReasonBENCH tries to make measurable.¹ The paper’s central move is simple but important: stop treating a reasoning system as if it has one benchmark score. A model-strategy-task configuration is not a point. It is a distribution of possible outcomes. Quality varies. Cost varies. The ranking between systems can vary. And the uncomfortable part is that this instability persists even when the input is unchanged.

For business readers, the paper is not mainly about which model wins. That is the leaderboard reflex, and the leaderboard reflex is exactly what the paper is trying to discipline. The more useful lesson is this: if an AI workflow will be executed repeatedly, evaluated repeatedly, or charged repeatedly, then its stability profile is part of the product. Mean accuracy alone is not enough. It is the algorithmic version of judging a factory by one perfect unit produced during a visitor tour.

The benchmark score is a mean pretending to be a personality

The standard evaluation habit is to compress a model or reasoning method into a single score: accuracy, pass rate, exact match, win rate, whatever the benchmark prefers. This is convenient. It is also a quiet act of information destruction.

A single benchmark score answers a narrow question: what was the observed average under this evaluation run? It does not answer the operational question: what range of outcomes should I expect if I run this system again tomorrow, under the same configuration, through the same API, against the same task family?

ReasonBENCH targets that second question. It evaluates repeated executions of reasoning systems rather than treating one execution as definitive. The paper runs 30 independent trials across reasoning strategies, models, and tasks, reporting quality and cost as empirical distributions. The task suite covers Game of 24, SciBench, HumanEval, HotpotQA, Humanity’s Last Exam, and Shakespearean Sonnet Writing. That mix matters because the authors are not only checking mathematical puzzles or code generation. They are testing whether instability appears across different reasoning regimes: symbolic arithmetic, scientific reasoning, programming, multi-hop question answering, hard general reasoning, and constrained creative generation.
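As a minimal sketch of what "reporting a distribution rather than a point" can look like in an internal harness: `run_once` below is a hypothetical stand-in for one full execution of a model-strategy-task configuration, simulated here with a seeded random generator rather than a real API call. The 30 matches the paper's trial count; the rest is an assumption for illustration, not the paper's code.

```python
import random
import statistics

def run_once(rng: random.Random) -> float:
    """Hypothetical stand-in for one execution; returns a quality score in [0, 1]."""
    return min(1.0, max(0.0, rng.gauss(0.45, 0.05)))

def summarize_runs(scores: list[float]) -> dict[str, float]:
    """Describe repeated runs as a distribution instead of a single score."""
    q1, median, q3 = statistics.quantiles(scores, n=4)
    return {
        "n": len(scores),
        "mean": statistics.fmean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(scores),
    }

rng = random.Random(0)  # seeded only so the simulation is repeatable
scores = [run_once(rng) for _ in range(30)]
summary = summarize_runs(scores)
```

In a real harness, `run_once` would call the provider and score the output; the summary shape stays the same, and the same function can be pointed at per-run cost.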

The unit of analysis is also carefully chosen. This is not only a model benchmark. ReasonBENCH studies two axes:

Axis	What varies	What stays fixed	What the axis tests
Strategy axis	Reasoning method	GPT-4.1 Nano backbone	Whether scaffolding, search, reflection, voting, or agent-style procedure changes quality, cost, and stability
Model axis	Model	Shared input-output prompting setup	Whether base model capability changes quality, cost, and stability

That separation is one of the paper’s better design choices. In most product conversations, “use a stronger model” and “use a better reasoning strategy” are treated as interchangeable remedies. They are not. A stronger model changes the base capability available to the system. A reasoning strategy changes the procedure through which that capability is used. One raises the ceiling. The other changes the path taken through the room. Occasionally the path leads into a wall.

ReasonBENCH separates structural unevenness from rerun instability

The paper’s most useful conceptual contribution is its distinction between two kinds of noise: Global Noise and Run Noise.

Global Noise measures cross-benchmark unevenness after normalization. A system with high Global Noise may perform very well on some tasks and poorly on others. It is not necessarily random; it may simply be specialized. In business terms, Global Noise is the “great in finance, useless in customer support” problem.

Run Noise measures within-benchmark variance across repeated executions. A system with high Run Noise is unstable even within a given task environment. In business terms, Run Noise is the “same invoice, different answer” problem.

That distinction is not cosmetic. It tells you what kind of failure you are dealing with.

Stability pattern	Technical meaning	Business reading	Likely remedy
Low Global Noise, low Run Noise	Performs evenly across tasks and reruns	General-purpose candidate	Broader deployment may be reasonable, still with monitoring
High Global Noise, low Run Noise	Consistent but task-sensitive	Specialist tool	Route it only to matching task families
Low Global Noise, high Run Noise	Broad but rerun-unstable	Looks good in aggregate, unreliable per case	Add repeated-run checks, confidence gates, or fallback logic
High Global Noise, high Run Noise	Task-sensitive and rerun-unstable	Expensive trouble with a dashboard	Use only with strong containment, or preferably not at all

This is where the paper moves beyond ordinary “models are stochastic” commentary. The authors do not merely say variance exists. They show that instability has structure. Strategies populate different parts of the Global Noise / Run Noise plane, and the quadrant a strategy occupies is linked to architecture. Direct methods, search procedures, adaptive methods, planning-style methods, and evolutionary-style methods do not merely differ in average score; they differ in the shape of their outcome distributions.

This matters because a procurement decision based only on mean score can select the wrong system for the wrong reason. A high-scoring but high-Run-Noise system may be acceptable for brainstorming. It may be a bad choice for compliance triage. A high-Global-Noise system may be excellent if routed to its specialty and terrible if sold internally as a general assistant. Same score, different operational meaning. Annoying, but reality does enjoy paperwork.
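The quadrant reading in the table above can be approximated with very little code. This is a rough sketch, not the paper's metric definitions: `noise_proxies` uses the spread of per-task means as a stand-in for Global Noise (without the normalization the paper applies) and the average within-task rerun variance as a stand-in for Run Noise. The cut-off values are hypothetical and would need to be set per organization.

```python
import statistics

def noise_proxies(scores_by_task: dict[str, list[float]]) -> dict[str, float]:
    """Crude proxies: cross-task spread and mean within-task rerun variance."""
    task_means = [statistics.fmean(runs) for runs in scores_by_task.values()]
    task_vars = [statistics.variance(runs) for runs in scores_by_task.values()]
    return {
        "global_proxy": statistics.pstdev(task_means),  # unevenness across tasks
        "run_proxy": statistics.fmean(task_vars),       # instability across reruns
    }

def stability_quadrant(global_noise: float, run_noise: float,
                       global_cut: float, run_cut: float) -> str:
    """Map a (global, run) noise pair to the table's business reading."""
    high_global = global_noise > global_cut
    high_run = run_noise > run_cut
    if not high_global and not high_run:
        return "general-purpose candidate"
    if high_global and not high_run:
        return "specialist tool"
    if not high_global and high_run:
        return "looks good in aggregate, unreliable per case"
    return "expensive trouble with a dashboard"

# Illustrative, made-up scores: steady on both tasks, but far better on one.
scores_by_task = {"math": [0.8, 0.6, 0.7], "code": [0.2, 0.2, 0.2]}
proxies = noise_proxies(scores_by_task)
label = stability_quadrant(proxies["global_proxy"], proxies["run_proxy"],
                           global_cut=0.1, run_cut=0.01)
```

With these made-up numbers the system lands in the specialist quadrant: consistent on reruns, uneven across tasks.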

The winners win, but not always securely

The paper’s headline comparisons are useful precisely because they complicate the usual ranking story.

On the strategy axis, using GPT-4.1 Nano as the backbone, Fleet of Agents reaches the highest reported average quality among the strategy panel: 0.4549 with a 95% confidence interval of [0.45, 0.46]. Tree of Thoughts with breadth-first search follows at 0.4145 [0.40, 0.42]. Graph of Thoughts and RAP also outperform simple direct prompting on average, though at much higher cost. The basic IO baseline sits at 0.1341 [0.13, 0.14], while Chain-of-Thought improves to 0.2792 [0.27, 0.29].

That looks like a clean story: more elaborate reasoning strategies generally help. But the distributional view adds a second sentence: they do not help in the same way, and their costs do not scale politely.

FoA has the best mean quality in the strategy table and relatively low quality noise compared with many alternatives. But it is not cheap. ToT-BFS and GoT also perform well, yet their costs are much higher than direct prompting. ToT-DFS is the cautionary joke: it costs more than direct prompting and performs worse. “More reasoning” is not automatically intelligence; sometimes it is just a more expensive route to the wrong answer.

On the model axis, the main model table reports a clearer quality leader. Gemini 3 Flash achieves an average quality of 0.7615 [0.75, 0.77], followed by DeepSeek V3.2 at 0.6479 [0.64, 0.66] and GPT-5.4 Mini at 0.6220 [0.61, 0.63]. DeepSeek R1 is at the bottom of that table with 0.2326 [0.22, 0.24], while also having the highest average cost in the same table. That is not a misreading: the most expensive model in that comparison is also the weakest performer in that table.

The paper also reports that the top strategy beats its nearest competitor in only 77% of head-to-head runs, while a top model comparison reaches 88%. Those are not terrible win rates. They are also not the near-certainty implied by a neat leaderboard row. A single observed score can still misrank systems when distributions overlap.

The practical reading is not “never rank models.” Ranking is necessary. The practical reading is: rank with error bars, and do not pretend overlapping distributions are decisive because the table looks tidy.
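A head-to-head win rate of the kind quoted above can be estimated from two sets of repeated runs. The sketch below compares every run of system A against every run of system B and counts ties as half a win. How the paper pairs runs is not described here, so the all-pairs comparison is an assumption, and the scores are made up for illustration.

```python
from itertools import product

def head_to_head_win_rate(a_scores: list[float], b_scores: list[float]) -> float:
    """Fraction of (A run, B run) pairs in which A scores higher; ties count half."""
    wins = 0.0
    for a, b in product(a_scores, b_scores):
        if a > b:
            wins += 1.0
        elif a == b:
            wins += 0.5
    return wins / (len(a_scores) * len(b_scores))

# Made-up runs: A has the higher mean, but the distributions overlap.
a_runs = [0.46, 0.45, 0.44, 0.47]
b_runs = [0.41, 0.45, 0.42, 0.46]
rate = head_to_head_win_rate(a_runs, b_runs)  # 0.75 for these inputs
```

A value well below 1.0 is the numerical form of "the winner wins, but not always securely": a single run from each system would misrank them a meaningful fraction of the time.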

Most variance is structured, but the leftover still hurts

One of the paper’s more subtle findings comes from its hierarchical variance decomposition. The authors expand runs into item-level observations and allocate variance across benchmark effects, system effects, benchmark-system interaction, item difficulty, system-item interaction, repeat effects, token usage, and residual variance.

The result is easy to misread, so it deserves careful handling.

The paper finds that roughly three quarters of variance is structured on both the strategy and model axes. That means much of the variation is not mysterious rerun chaos. It is linked to benchmark choice, system behavior, item difficulty, and their interactions. In plain language: systems fail differently because tasks differ, items differ, and methods interact with those differences.

That should be reassuring, but only partly. Structured variance is still variance. If a system performs well on one kind of task and badly on another, the problem does not disappear because we can name it. Naming the shark is not the same as making swimming safe.

The residual quarter is the uncomfortable part. After the authors account for observed design factors, a meaningful portion remains unexplained. This residual does not prove mystical model behavior. It simply means that the evaluation pipeline cannot fully explain why repeated runs differ. The paper points to possible sources such as provider-side infrastructure, non-determinism, mixture-of-experts routing, and strategy-internal randomness. It does not claim to mechanistically isolate all of them.

That boundary matters. ReasonBENCH is strongest as an evaluation framework and diagnostic lens. It is not a full causal theory of every divergent reasoning path. Still, for deployment decisions, the distinction may be secondary. If repeated calls produce materially different outcomes and costs, the operations team must manage the variance whether or not the exact micro-cause has been found.
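To make the structured-versus-leftover split concrete, here is a deliberately simplified sketch. It is far cruder than the hierarchical decomposition described in this section: it only separates variance between items (a stand-in for "structured") from variance across reruns of the same item, using the law of total variance. It assumes every item has the same number of runs, and the scores are made up.

```python
import statistics

def decompose_variance(scores_by_item: dict[str, list[float]]) -> dict[str, float]:
    """Split total variance into between-item and within-item (rerun) parts.

    Assumes equal run counts per item, so that
    total variance = variance of item means + mean of within-item variances.
    """
    item_means = [statistics.fmean(runs) for runs in scores_by_item.values()]
    item_vars = [statistics.pvariance(runs) for runs in scores_by_item.values()]
    between = statistics.pvariance(item_means)
    within = statistics.fmean(item_vars)
    total = between + within
    if total == 0:
        return {"between_items": 0.0, "within_items": 0.0,
                "item_share": 0.0, "rerun_share": 0.0}
    return {
        "between_items": between,
        "within_items": within,
        "item_share": between / total,
        "rerun_share": within / total,
    }

# Made-up pass/fail outcomes for two items, four reruns each.
result = decompose_variance({
    "easy_item": [1.0, 1.0, 1.0, 0.0],
    "hard_item": [0.0, 0.0, 1.0, 0.0],
})
```

A full analysis would add benchmark, system, and interaction terms. The point of the toy version is only that "how much is item difficulty, how much is reruns" is a question a team can answer with its own logs.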

The convenient fixes mostly fail

The paper’s source-elimination tests are especially valuable because they target the easy excuses. These tests should not be read as a separate thesis. They are controlled interventions designed to ask whether common explanations actually reduce instability.

Test	Likely purpose	What it supports	What it does not prove
Temperature sweep	Robustness / sensitivity test for decoding randomness	Variance persists even under greedy decoding; token sampling alone is not sufficient	It does not identify the full upstream source of non-determinism
Prompt and parser refinement	Controlled intervention separating bias from variance	Better prompts and parsers improve mean quality, but confidence-interval widths remain similar	It does not mean prompt engineering is useless; it means it is not a stability cure
Oracle evaluator on Game of 24	Causal diagnostic for evaluator quality in search	Better intermediate evaluation improves mean quality and reduces variance	It is shown in a task where ground-truth evaluation is available, not every domain
Reasoning-effort sweep	Sensitivity test for test-time compute scaling	More effort raises output tokens but does not reliably improve quality or reduce variance	It does not rule out all compute-scaling benefits under all methods
Model-scale comparison	Exploratory extension / scale case study	Larger siblings improve mean quality and tighten distributions within model families	It does not eliminate variance or replace system-level evaluation

The temperature result is the first useful slap on the wrist. Many practitioners assume that if variance exists, the fix is to lower temperature. ReasonBENCH tests this and finds the run distribution does not collapse under greedy decoding. In several cells, the interquartile range at temperature zero is comparable to, or wider than, the range at higher temperature. In other words, turning down the randomness knob does not turn the system into a spreadsheet.

The prompt and parser result is also important. Refining prompts and output parsing improves average quality across all ten strategies in the reported table. IO improves from 0.106 to 0.313. CoT-SC improves from 0.228 to 0.410. FoA improves from 0.458 to 0.546. These are not trivial gains. But the paper’s interpretation is that these refinements correct systematic error more than run-to-run variance. Better formatting helps the system be less wrong on average. It does not make repeated executions fully stable.

The evaluator-quality intervention is the positive exception. For search-based methods on Game of 24, replacing a heuristic evaluator with a ground-truth evaluator improves quality and reduces variance. That is an important mechanism: if a reasoning strategy searches through intermediate states, noisy intermediate evaluation can amplify instability before the final answer is even produced. For businesses using agentic workflows, this is the part to underline. If your workflow uses intermediate scoring, verification, ranking, or critique, evaluator quality is not an accessory. It is part of the control system.

The reasoning-effort result is the most operationally blunt. The paper tests low, medium, and high effort settings for selected reasoning models. Output tokens rise. Quality gains are not statistically reliable after accounting for repeated-run variability. Variance does not reliably fall. This does not mean test-time compute is always useless. It means “let it think harder” is not a reliability strategy unless you can show, for your task and configuration, that the extra compute buys something measurable.

Very technical. Very unromantic. Also probably cheaper.

Models and strategies are different axes, not two names for the same lever

One of ReasonBENCH’s strongest business lessons is that model choice and strategy choice should not be merged into a single “AI capability” conversation.

A model changes the underlying competence distribution. A strategy changes how the system explores, aggregates, evaluates, or revises possible answers. These interact, but they are not substitutes. A stronger model may raise mean quality. A structured strategy may reduce, amplify, or redirect variance. A search-heavy strategy may increase both quality and cost, or just cost. An evaluator-dependent strategy may become unstable if the evaluator is noisy. A direct method may be weaker but cheaper and more predictable.

The appendix cross-axis analysis reinforces this distinction. Benchmark correlations differ across the model and strategy axes. On the model axis, benchmarks cluster more by content domain. On the strategy axis, they cluster more by reasoning mode. That is exactly what one would expect if models and strategies are affecting different parts of the system.

The paper also notes zero-score collapses, especially where a strategy fails because its procedure or format mismatches the task rather than because the underlying model has no capability. This is a useful warning for applied teams. If a strategy fails on a benchmark, the failure may not mean the model cannot solve the task. It may mean the scaffold is badly matched to the task interface. In workflow design, “the model is weak” and “the orchestration is wrong” are different diagnoses. Confusing them is how companies buy larger models to solve smaller design mistakes.

Cost is a reliability variable, not a procurement footnote

ReasonBENCH treats cost as a distribution alongside quality. That is not merely a finance detail. It changes the risk analysis.

The paper’s joint cost-quality failure tables ask a practical question: how often does a system produce both low quality and high cost under a given threshold rule? This is closer to real deployment pain than average score alone. A bad cheap answer may be tolerable in exploratory use. An expensive bad answer is the kind that produces meetings.

Under the median-threshold analysis for strategies, IO and CoT have 0.0% joint failure because they never trip the cost threshold, even though IO has poor quality. FoA has the best quality-failure rate among strategies at 12.8%, but because it is costly, its joint failure is also 12.8%. GoT reaches 43.3% joint failure, RAP 41.3%, and ToT-DFS 25.0%.

On the model axis under the same threshold logic, DeepSeek R1 reaches 81.7% joint failure, Claude Haiku 4.5 reaches 60.0%, and GPT-5 Nano reaches 45.0%. Gemini 3 Flash is expensive under this threshold, failing the cost condition on every run, but rarely falls below the quality threshold, producing only 1.7% joint failure. The pattern is not “cheap is always better” or “expensive is always worse.” The pattern is asymmetry: cheap systems can be structurally immune to joint cost-quality failure because they cannot fail the cost side, while expensive systems remain exposed even when their quality is strong.

The authors test threshold sensitivity using stricter criteria and per-pair mean-standard-deviation thresholds. The qualitative pattern remains: cost-quality risk is not reducible to average price or average accuracy. It is a joint distribution problem.

For business use, this shifts the evaluation question from:

Which system has the highest score?

to:

Which model-strategy pair gives acceptable quality with an acceptable probability of cost-quality failure under our task mix?

That is less glamorous. It is also the question that keeps budgets alive.
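The joint-failure analysis in this section can be sketched as follows. The thresholding rule is an assumption: the quality and cost cut-offs are taken as the medians over all runs pooled across systems, and a run counts as a joint failure when it is below the quality median and above the cost median. The paper's exact rule may differ, and the system names and numbers are made up.

```python
import statistics

Run = tuple[float, float]  # (quality, cost) for one execution

def joint_failure_rates(runs_by_system: dict[str, list[Run]]) -> dict[str, float]:
    """Share of each system's runs that are both low-quality and high-cost."""
    all_quality = [q for runs in runs_by_system.values() for q, _ in runs]
    all_cost = [c for runs in runs_by_system.values() for _, c in runs]
    quality_cut = statistics.median(all_quality)  # assumed pooled-median threshold
    cost_cut = statistics.median(all_cost)        # assumed pooled-median threshold
    rates = {}
    for name, runs in runs_by_system.items():
        joint = sum(1 for q, c in runs if q < quality_cut and c > cost_cut)
        rates[name] = joint / len(runs)
    return rates

rates = joint_failure_rates({
    "cheap_weak": [(0.1, 1.0), (0.2, 1.0)],
    "costly_strong": [(0.9, 5.0), (0.8, 5.0)],
    "costly_unstable": [(0.9, 5.0), (0.1, 6.0)],
})
```

The toy data reproduces the asymmetry described above: `cheap_weak` never fails jointly because it never trips the cost side, while `costly_unstable` fails jointly on half of its runs.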

How an applied team should change its evaluation protocol

ReasonBENCH does not imply that every company must reproduce the whole paper before deploying a chatbot. It does imply that single-run evaluation is underpowered for serious workflow decisions.

A practical evaluation protocol should include at least five changes.

First, run repeated trials for each serious model-strategy-task configuration. The paper uses 30 runs, supported by its power analysis, and reports stratified-bootstrap confidence intervals. A smaller internal evaluation may be acceptable for low-risk tasks, but the principle should remain: do not infer deployment reliability from one execution.

Second, report mean quality together with confidence intervals, Global Noise, and Run Noise. If your team cannot compute the exact ReasonBENCH metrics, use simpler approximations: per-task mean, per-task standard deviation, cross-task spread, and rerun variance. The exact metric matters less than the refusal to worship a naked average.

Third, evaluate cost as a random variable. Track not only average spend but output-token dispersion, high-cost runs, and joint cost-quality failure. A system whose bad runs are also expensive should be treated differently from a system whose bad runs are cheap.

Fourth, validate the axis you are comparing. If you are choosing among models, use a benchmark suite that discriminates models for your task family. If you are choosing among strategies, test strategies directly. Do not assume a benchmark that ranks models well will also rank orchestration designs well.

Fifth, invest in evaluator calibration before spending effort on superficial decoding tweaks. For search-based and agentic workflows, the intermediate evaluator is often the hidden steering wheel. If it is noisy, the workflow can wander expensively. Lowering temperature may feel like control; ReasonBENCH suggests it is often theater.
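The first two steps above lean on confidence intervals, so here is one way a team might compute them. This is a sketch of a stratified percentile bootstrap under stated assumptions: runs are resampled with replacement within each task, the statistic is the mean of per-task means, and the interval is read off the sorted resampled estimates. The paper's exact procedure is not reproduced here, and the scores are illustrative.

```python
import random
import statistics

def stratified_bootstrap_ci(scores_by_task: dict[str, list[float]],
                            n_resamples: int = 2000,
                            alpha: float = 0.05,
                            seed: int = 0) -> tuple[float, float]:
    """Percentile CI for the mean of per-task means, resampling within each task."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        task_means = []
        for runs in scores_by_task.values():
            resample = rng.choices(runs, k=len(runs))  # with replacement, per stratum
            task_means.append(statistics.fmean(resample))
        estimates.append(statistics.fmean(task_means))
    estimates.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]

# Made-up repeated-run scores for two tasks.
scores_by_task = {
    "game_of_24": [0.40, 0.50, 0.45, 0.55, 0.50],
    "humaneval": [0.80, 0.85, 0.80, 0.90, 0.75],
}
ci_low, ci_high = stratified_bootstrap_ci(scores_by_task)
```

Resampling within each task keeps the task mix fixed, so the interval reflects rerun variability rather than an accidental change in which tasks were drawn.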

Here is the operational translation:

Paper result	Directly shown	Cognaptus inference for business use	Boundary
Repeated runs produce overlapping quality and cost distributions	Same configuration can yield different outcomes and costs	Use repeated-run evaluation before procurement or workflow approval	The paper’s 30-run protocol supports moderate-tail analysis, not extreme guarantees
Global Noise and Run Noise separate instability types	Cross-task unevenness and within-task rerun variance differ	Route specialists and guard unstable generalists differently	Metrics depend on benchmark mix and normalization
Temperature and prompt/parser fixes do not remove variance	Common quick fixes do not solve repeated-run instability	Do not treat formatting and decoding knobs as reliability engineering	Prompt refinement still improves mean quality
Oracle evaluator reduces variance in Game of 24 search	Intermediate evaluation quality affects stability	Audit evaluators, critics, rankers, and verifiers in agentic workflows	Ground-truth evaluators are not available for every business task
Cost and quality decouple asymmetrically	Expensive systems can still fail jointly on cost and quality	Optimize model-strategy pairs by joint risk, not score alone	Threshold choice must match business context

This is not a call to make every AI pilot bureaucratic. It is a call to distinguish exploration from production. For exploration, one good answer may be enough to continue. For production, one good answer is only an anecdote with nice formatting.

Where the paper’s conclusions should not be overextended

The paper is strong, but its boundary conditions matter.

It evaluates six tasks, ten reasoning strategies, and a specific set of model configurations. The main model table focuses on contemporary reasoning-model configurations, and appendix material extends the model panel with additional baselines. Other domains, especially long-horizon tool use, regulated professional workflows, multimodal tasks, or interactive customer environments, may show different variance structure.

The 30-run design is useful for estimating means, confidence intervals, and moderate-tail behavior. It is not an extreme-quantile safety guarantee. If a hospital, bank, or legal workflow needs evidence about one-in-ten-thousand failures, ReasonBENCH is a starting philosophy, not a sufficient audit.
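A short calculation shows why 30 runs cannot carry that weight. If a system is run n times with zero failures, and the runs are assumed independent with a constant failure probability (an idealization), the largest failure rate still consistent with that outcome at 95% confidence is 1 - 0.05^(1/n). The function names below are hypothetical.

```python
import math

def zero_failure_upper_bound(n_runs: int, confidence: float = 0.95) -> float:
    """Upper bound on the failure rate after n_runs failure-free independent runs."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n_runs)

def runs_needed(target_rate: float, confidence: float = 0.95) -> int:
    """Failure-free runs needed before the upper bound drops to target_rate."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - target_rate))

bound_30 = zero_failure_upper_bound(30)  # roughly 0.095
runs_for_1e4 = runs_needed(1e-4)         # on the order of 30,000
```

Thirty clean runs rule out failure rates above roughly one in ten. Supporting a one-in-ten-thousand claim takes on the order of thirty thousand clean runs, which is an audit, not a benchmark.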

The paper also does not fully identify the mechanistic origin of every divergent run. It tests candidate sources: temperature, prompt/parser specification, evaluator quality, reasoning effort, and model scale. But provider infrastructure, hardware nondeterminism, routing behavior, training history, RLHF effects, and other hidden system details remain outside the intervention set.

Finally, Humanity’s Last Exam uses an LLM-as-judge evaluation protocol. The authors note that judge stochasticity may contribute to measured Run Noise. That does not invalidate the finding, but it reminds us that evaluation systems can have their own variance. The ruler can wobble too. Wonderful, exactly what we needed.

The useful lesson is not pessimism; it is instrumentation

ReasonBENCH is easy to summarize as “LLMs are unstable.” That summary is true but too blunt. The paper’s more useful message is that instability can be measured, decomposed, and used as a design variable.

A reasoning system is not just a model. It is a model, a strategy, a prompt, an evaluator, a parser, a task environment, a cost structure, and a provider execution substrate. The benchmark score is only the visible receipt. The operational system is the distribution behind it.

For AI teams, this changes the evaluation culture. A leaderboard tells you who looked good in a run. A distribution tells you who is likely to behave under repetition. A cost-quality joint failure rate tells you who may embarrass both the product team and the finance team at the same time. That last one deserves its own dashboard, preferably before deployment rather than after the invoice.

The central misconception is that reliability follows naturally from higher single-run scores, lower temperature, cleaner prompts, or more reasoning effort. ReasonBENCH pushes back. Higher scores help, but they do not erase variance. Lower temperature helps less than many expect. Cleaner prompts fix bias more than instability. More reasoning effort can increase cost without buying reliability. Strategies and models are separate levers. Cost is part of risk. Evaluation needs error bars.

The algorithmic mind, it turns out, needs not just benchmarks but error bars. Not because error bars are elegant. Because without them, the system may look stable only because nobody asked it the same question twice.

Cognaptus: Automate the Present, Incubate the Future.

¹ Nearchos Potamitis, Vansh Ramani, Har Ashish Arora, Dhairya Kuchhal, Lars Klein, and Akhil Arora, “ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning,” arXiv:2512.07795v2, 30 May 2026, https://arxiv.org/abs/2512.07795.
