Opening — Why this matters now
The comforting myth of enterprise AI is that setting an LLM’s temperature to zero makes it deterministic. A nice little checkbox. A procedural sedative. Press it, and the machine behaves.
The paper *Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models* is useful because it attacks that myth directly. Its central claim is not that LLMs are chaotic by nature. That would be dramatic, and therefore probably a conference keynote. The claim is sharper: even when a model is asked to decode at $T = 0$, the surrounding inference environment can introduce enough tiny numerical variation to produce divergent outputs.[^1]
That matters because modern AI deployment is moving from “generate a nice paragraph” to operational workflows: compliance review, customer escalation, claims processing, financial document extraction, contract triage, medical intake, audit support, and agentic systems that make multi-step decisions. In those settings, “the answer usually looks similar” is not a control standard. It is a shrug wearing a blazer.
The authors propose a name for this hidden randomness: background temperature, written as $T_{bg}$. It is the effective randomness induced by the implementation stack even when the user has nominally selected zero temperature. In other words, the model may be set to cold, but the infrastructure may still be sweating.
Background — Context and prior art
The usual explanation of LLM randomness is simple. At each generation step, the model assigns probabilities to possible next tokens. Temperature modifies how sharply or loosely the model samples from those probabilities. Low temperature concentrates probability mass around the most likely tokens. Higher temperature spreads probability mass and increases variation.
At $T = 0$, practitioners usually expect greedy decoding: pick the most probable next token every time. If the same prompt enters the same model, the same output should come out. That expectation is convenient. It is also incomplete.
The paper builds on recent work showing that nondeterminism can arise from the inference environment itself: batch-size variation, co-batching with other requests, kernel non-invariance, floating-point non-associativity, reduction ordering, precision choices, hardware differences, and concurrency.[^2] These are not philosophical defects. They are engineering details — the sort that sit quietly in the basement until the audit committee asks why two identical requests produced two different answers.
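To make one of those basement details concrete, here is a minimal sketch, not from the paper, of floating-point non-associativity: summing the same numbers in a different order changes the result in its last few bits, and in a near-tie between two token logits, that is all it takes to flip an argmax.

```python
# Minimal illustration (not from the paper): floating-point addition is not
# associative, so the order of a reduction can change the result in its last
# few bits; in a near-tie between two logits, that can flip the argmax.
import numpy as np

rng = np.random.default_rng(0)
terms = rng.standard_normal(10_000).astype(np.float32)

forward = np.float32(0.0)
for t in terms:                   # accumulate in one order
    forward += t

backward = np.float32(0.0)
for t in terms[::-1]:             # accumulate in the reverse order
    backward += t

print(forward == backward)        # typically False: same terms, different sum
print(abs(float(forward) - float(backward)))
```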
Prior studies have already observed nondeterminism under supposedly deterministic settings. The paper cites work reporting sizable accuracy variation across repeated runs, evaluation instability under greedy decoding, and nondeterministic code-generation behavior even when temperature is set to zero.[^3] What Messina and Scotta add is not merely another complaint about reproducibility. They offer a formal wrapper: treat implementation-induced variation as if it were equivalent to running a stable reference system at a small nonzero temperature.
That is the conceptual move. Instead of saying, “the model is sometimes inconsistent,” the paper asks: how much temperature-like randomness does the deployment stack add?
Analysis — What the paper does
The authors start with the standard token-generation setup. A model produces logits, converts them into probabilities with softmax, and selects or samples tokens. Temperature is represented as a transformation on the probability distribution. Ideally, when $T = 0$, the transformation behaves like an identity operation before greedy selection.
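As a point of reference, here is a minimal sketch of that standard setup, with made-up logits rather than anything from the paper: temperature rescales the logits before softmax, and $T = 0$ is conventionally handled as greedy argmax.

```python
# Sketch of the standard decoding setup (illustrative, not the paper's code):
# temperature divides the logits before softmax; T = 0 is treated as greedy argmax.
import numpy as np

def next_token(logits: np.ndarray, temperature: float, rng=None) -> int:
    if temperature == 0.0:
        return int(np.argmax(logits))               # greedy decoding
    scaled = logits / temperature                    # lower T sharpens the distribution
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                             # softmax
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([2.01, 2.00, -1.0])                # a near-tie between tokens 0 and 1
print(next_token(logits, temperature=0.0))            # always 0 in exact arithmetic
print([next_token(logits, temperature=0.8) for _ in range(5)])  # nonzero T varies
```

The near-tie is the case to watch: in exact arithmetic the $T = 0$ branch always returns token 0, which is precisely the expectation the rest of the paper stress-tests.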
But real inference systems are not ideal mathematical objects. The authors denote the inference environment as $I$: the complete operational context of execution, including batch composition, hardware, backend, precision, kernel choices, concurrency, and numeric reduction order. They then model an implementation-dependent perturbation, indexed by $I$, which changes the effective probability distribution even when the nominal temperature is zero.
The key definition is:
$$ T_{bg} \triangleq \mathbb{E}_{I \in \mathcal{I}}[T_n(I)] $$
Here, $T_n(I)$ is the equivalent temperature: the nonzero temperature that an ideal reference system would need in order to exhibit the same kind of output variability caused by the actual environment $I$ at nominal $T = 0$.
That sounds abstract, so here is the operational interpretation:
| Concept | Plain-English meaning | Business interpretation |
|---|---|---|
| Nominal temperature | The temperature parameter the user sets | “We configured the model to be deterministic.” |
| Inference environment $I$ | Runtime conditions: batching, hardware, kernels, concurrency, precision | “What actually happened inside the serving stack.” |
| Equivalent temperature $T_n(I)$ | The temperature that would mimic the observed variability in a stable reference setup | “How random the stack behaved, in temperature units.” |
| Background temperature $T_{bg}$ | Expected equivalent temperature across environments | “The hidden randomness budget of the deployment.” |
The important point is not that temperature literally causes the variation. It does not. The point is that temperature becomes a common measurement scale for otherwise messy implementation effects. That is useful because businesses understand calibrated risk better than vague warnings.
The paper’s proposed measurement protocol has four steps.
| Step | What to do | Why it matters |
|---|---|---|
| 1. Build a prompt set | Use general, task-specific, adversarial, long-context, rare-token, and synthetic near-tie prompts | Some prompts are stable; others expose small perturbations quickly. |
| 2. Run repeated zero-temperature inference | Repeat each prompt many times under the system being tested | This reveals whether “same input, same settings” actually produces the same output. |
| 3. Build reference distributions | Run a stable local or quasi-ideal reference model at known temperatures | This creates a calibration curve between temperature and observed variation. |
| 4. Fit equivalent temperature | Find the reference temperature whose variability distribution best matches the tested system | This produces an estimate of hidden randomness. |
The authors suggest metrics such as exact-match rate, first-divergence token index, edit distance, distributional divergence over token probabilities, and entropy. The pilot experiments mostly use exact-match fraction, which is deliberately simple. It is not the final metric civilization deserves, but it is the one available before lunch.
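To make the protocol concrete, here is a hedged sketch of steps 2 through 4 in Python. It is not the authors' code: `query_system` and `query_reference` are placeholder callables standing in for the tested provider and the local reference model, and the fit simply picks the reference temperature whose exact-match distribution sits closest in Kolmogorov–Smirnov distance.

```python
# Hedged sketch of steps 2-4 above (not the authors' code). `query_system` and
# `query_reference` are placeholders that return one completion string per call.
from collections import Counter
import numpy as np
from scipy.stats import ks_2samp

def exact_match_fraction(outputs: list[str]) -> float:
    """Fraction of runs that agree with the most common output."""
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)

def stability_profile(generate, prompts, n_runs: int) -> np.ndarray:
    """One exact-match fraction per prompt, over repeated generations."""
    return np.array([
        exact_match_fraction([generate(p) for _ in range(n_runs)])
        for p in prompts
    ])

def fit_equivalent_temperature(system_profile, reference_profiles: dict) -> float:
    """Reference temperature whose profile is closest in KS distance."""
    return min(
        reference_profiles,
        key=lambda t: ks_2samp(system_profile, reference_profiles[t]).statistic,
    )

# Usage (placeholders): tested system at nominal T = 0 vs. a reference grid.
# system = stability_profile(lambda p: query_system(p, temperature=0.0), prompts, 100)
# refs = {t: stability_profile(lambda p: query_reference(p, temperature=t), prompts, 32)
#         for t in np.arange(0.0, 1.01, 0.05)}
# print("equivalent temperature:", fit_equivalent_temperature(system, refs))
```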
Findings — Results with visualization
The paper’s pilot study estimates $T_{bg}$ for several provider-hosted LLMs by comparing repeated $T = 0$ outputs against reference runs from local open models.
The first experiment uses gpt-4.1-nano accessed through Microsoft Azure AI services. The prompt set consists of the first 200 questions from TruthfulQA. The reference model is SmolLM3-3B, run at a grid of temperatures from 0 to 1. For each temperature and prompt, the authors generate 32 responses limited to 32 tokens and compute the maximum fraction of identical answers. Then they run the tested model 100 times per prompt at $T = 0$ and compare distributions using Kolmogorov–Smirnov distance.
In the first reference setup, the closest match for gpt-4.1-nano is $T = 0.05$. After adding a second reference model, Llama-3.2-3B-Instruct, the estimate becomes $0.075$, because the SmolLM reference gives $0.05$ and the Llama reference gives $0.10$.[^4]
The paper then extends the same logic to three additional models, using the first 30 prompts and the same style of repeated zero-temperature testing.
| Tested model | Reference estimate using SmolLM3-3B | Reference estimate using Llama-3.2-3B-Instruct | Average estimated $T_{bg}$ |
|---|---|---|---|
| grok-3-mini | 0.01 | 0.02 | 0.015 |
| gemini-2.0-flash | 0.05 | 0.08 | 0.065 |
| gpt-4.1-nano | 0.05 | 0.10 | 0.075 |
| claude-sonnet-4 | 0.00 | 0.00 | 0.000 |
This table should not be read as a universal model ranking. The authors are careful to note that these are pilot experiments. Prompt selection is narrow, output length is capped, the tested provider environments are not controlled, and the metric is exact-match based. The Claude result, for example, means that under this particular setup its outputs were identical across the tested prompts. It does not mean “Claude has solved determinism” and should now be issued a small marble statue.
The more valuable finding is methodological: hidden randomness can be measured, compared, and governed. That turns nondeterminism from an anecdote into an operational variable.
A useful way to visualize the paper’s contribution is as a shift in AI evaluation maturity:
| Evaluation maturity level | Typical practice | Problem | Upgrade suggested by the paper |
|---|---|---|---|
| Level 1: Single-shot demo | Run one prompt once | No reproducibility signal | Repeat prompts across runs |
| Level 2: Static benchmark | Report average score | Ignores run-to-run variation | Report variance and exact-match stability |
| Level 3: Deterministic setting | Set $T = 0$ and assume stability | Confuses configuration with behavior | Measure observed variability |
| Level 4: Environment-aware evaluation | Test across load, batching, hardware, regions | More expensive, but real | Estimate $T_n(I)$ by environment |
| Level 5: Governed deployment | Monitor hidden randomness as part of model assurance | Requires standards and tooling | Report $T_{bg}$ alongside accuracy, latency, and cost |
This is where the paper becomes business-relevant. Accuracy tells you how good the model is on average. Latency tells you how fast it responds. Cost tells you how painful the invoice will be. Background temperature tells you whether the same system will behave consistently enough to be trusted in repeatable workflows.
Implementation — How this becomes an enterprise control
For business deployments, $T_{bg}$ should be treated less like an academic curiosity and more like a model operations control metric.
Consider a document-classification workflow. A customer email arrives. The model decides whether it is a billing dispute, cancellation request, technical issue, or regulatory complaint. If repeated runs under the same input sometimes route the message differently, the problem is not just “LLM creativity.” It is operational instability. The downstream effects include SLA breaches, inconsistent customer treatment, duplicate reviews, wrong escalation paths, and unreliable management reporting.
The same applies to agentic workflows. Agents compound small differences. A slight output variation in Step 1 can change retrieval in Step 2, tool calls in Step 3, and final recommendations in Step 4. Background temperature is therefore not only a generation-quality issue. It is a workflow branching-risk issue.
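A hedged sketch of what that looks like as a metric, with `classify_ticket` standing in as a placeholder for whatever model call produces the routing decision: run the same ticket repeatedly at nominal $T = 0$ and count how often the route disagrees with the modal choice.

```python
# Hedged sketch of a branch-flip check for a routing step (not from the paper).
# `classify_ticket` is a placeholder for the model call returning a route label
# at nominal T = 0; a flip is any run that disagrees with the modal route.
from collections import Counter

def branch_flip_rate(classify_ticket, ticket: str, n_runs: int = 20) -> float:
    routes = [classify_ticket(ticket) for _ in range(n_runs)]
    modal_count = Counter(routes).most_common(1)[0][1]
    return 1.0 - modal_count / n_runs   # 0.0 means the routing is perfectly repeatable

# Tickets with a nonzero flip rate are the ones that will send Step 2 onward
# down different paths, and they are the natural candidates for review queues.
```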
A practical Cognaptus-style implementation would look like this:
| Control layer | Practical mechanism | Example metric |
|---|---|---|
| Prompt stability testing | Run repeated prompts across representative cases | Exact-match rate, semantic-match rate |
| Environment sampling | Test under low load, high load, region variation, and batch variation | $T_n(I)$ by runtime condition |
| Reference calibration | Use stable local models or controlled model variants | Equivalent temperature curve |
| Workflow sensitivity mapping | Identify which output fields cause downstream branching | Branch-flip rate |
| Human review threshold | Route unstable cases to review | Low-confidence + high-variance trigger |
| Vendor assurance | Ask providers for reproducibility and serving-stack controls | Reported variance under $T = 0$ |
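As one illustration of the human-review row above, here is a minimal sketch of the trigger logic; the thresholds and inputs are assumptions for the example, not values from the paper.

```python
# Illustration of the human-review trigger in the table above (thresholds and
# inputs are illustrative assumptions): route a case to review when model
# confidence is low or repeated runs disagree too often.
def needs_human_review(confidence: float, exact_match_rate: float,
                       min_confidence: float = 0.80,
                       min_stability: float = 0.95) -> bool:
    return confidence < min_confidence or exact_match_rate < min_stability

# High confidence but only 80% run-to-run agreement still escalates:
print(needs_human_review(confidence=0.92, exact_match_rate=0.80))   # True
```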
This matters particularly in regulated or quasi-regulated workflows:
| Use case | Why hidden randomness matters | Recommended control |
|---|---|---|
| Compliance memo drafting | Same evidence should not produce materially different risk wording | Store repeated-run variance for audit samples |
| Invoice coding | Different category assignment affects accounting records | Use deterministic post-rules and exception queues |
| Customer complaint triage | Inconsistent routing changes response time and liability | Monitor route-flip frequency |
| Contract clause extraction | Missed or changed clause classification alters legal review | Require structured extraction with validation rules |
| AI agent task planning | Small differences cascade across tool calls | Freeze plans before execution and log branch decisions |
There is an uncomfortable implication here: “temperature = 0” is not a governance control. It is merely a configuration input. Governance must measure realized behavior.
Implications — Next steps and significance
The paper has three implications for AI deployment.
First, determinism should be reported, not assumed. Model documentation should not stop at accuracy, context length, latency, and price. For serious deployments, reproducibility metrics belong in the same family as uptime and error rate. A provider saying “use temperature zero” is not enough. The relevant question is: under what batching, hardware, load, and precision conditions was output stability tested?
Second, benchmark comparisons need uncertainty margins. If model A outperforms model B by a tiny benchmark difference, but implementation-induced variability is larger than that difference, the claimed improvement may be noise wearing a lab coat. This is especially important for procurement, vendor comparisons, and internal model upgrades.
Third, architecture choices affect governance risk. Batch-invariant kernels, deterministic reductions, fixed precision, stable hardware configurations, and concurrency controls are not merely infrastructure preferences. They can reduce the hidden randomness of the application. That means AI assurance is no longer only a policy problem; it is also a systems-engineering problem.
The paper also leaves open several hard questions.
| Open issue | Why it remains difficult |
|---|---|
| Reference-model dependence | Different reference LLMs respond differently to temperature, so $T_{bg}$ estimates depend on calibration choices. |
| Prompt-set dependence | A safe-looking prompt set may miss edge cases where small logit perturbations flip outputs. |
| Metric dependence | Exact string match is simple but may overstate or understate business impact. Semantic equivalence matters. |
| Provider opacity | Remote APIs rarely expose batch composition, hardware, kernel choices, or deployment region internals. |
| Drift over time | A provider may silently update infrastructure, changing background temperature without changing the model name. |
These limitations do not weaken the paper’s usefulness. They define the next layer of tooling. The industry does not need mystical confidence in deterministic AI. It needs boring, measured, repeatable controls. Boring is underrated. Boring is how airplanes land.
Conclusion — A thermometer for AI operations
The paper’s best contribution is conceptual discipline. It gives practitioners a way to talk about a real deployment problem without hand-waving. Background temperature reframes “LLMs are sometimes inconsistent” into a measurable property of the inference stack.
For businesses, the lesson is straightforward: do not confuse a model parameter with an operational guarantee. A zero-temperature setting may reduce randomness, but it does not automatically eliminate environment-induced variability. In high-stakes workflows, the correct posture is not faith. It is measurement, calibration, logging, and escalation.
The next generation of AI governance will not only ask whether a model is accurate. It will ask whether the system is stable enough to repeat itself when repetition matters.
And yes, that means enterprise AI now needs a thermometer. Apparently even machines can run a low-grade fever.
Cognaptus: Automate the Present, Incubate the Future.
[^1]: Alberto Messina and Stefano Scotta, “Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models,” arXiv:2604.22411, 2026. https://arxiv.org/abs/2604.22411

[^2]: The paper specifically builds on Horace He and Thinking Machines Lab, “Defeating nondeterminism in LLM inference,” cited by the authors as a systems-level account of batch-size effects, batch-invariant kernels, and floating-point non-associativity.

[^3]: The paper discusses related work including Berk Atil et al. on nondeterminism under deterministic LLM settings, Yifan Song et al. on evaluation instability, and Shuyin Ouyang et al. on nondeterminism in ChatGPT code generation.

[^4]: In the pilot, SmolLM3-3B and Llama-3.2-3B-Instruct are used as reference models. The authors compare exact-match fraction distributions using Kolmogorov–Smirnov distance.