Opening — Why this matters now
The comforting myth of enterprise AI is that setting an LLM’s temperature to zero makes it deterministic. A nice little checkbox. A procedural sedative. Press it, and the machine behaves.
The paper *Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models* is useful because it attacks that myth directly. Its central claim is not that LLMs are chaotic by nature. That would be dramatic, and therefore probably a conference keynote. The claim is sharper: even when a model is asked to decode at $T = 0$, the surrounding inference environment can introduce enough tiny numerical variation to produce divergent outputs.[^1]
That matters because modern AI deployment is moving from “generate a nice paragraph” to operational workflows: compliance review, customer escalation, claims processing, financial document extraction, contract triage, medical intake, audit support, and agentic systems that make multi-step decisions. In those settings, “the answer usually looks similar” is not a control standard. It is a shrug wearing a blazer.
The authors propose a name for this hidden randomness: background temperature, written as $T_{bg}$. It is the effective randomness induced by the implementation stack even when the user has nominally selected zero temperature. In other words, the model may be set to cold, but the infrastructure may still be sweating.
Background — Context and prior art
The usual explanation of LLM randomness is simple. At each generation step, the model assigns probabilities to possible next tokens. Temperature modifies how sharply or loosely the model samples from those probabilities. Low temperature concentrates probability mass around the most likely tokens. Higher temperature spreads probability mass and increases variation.
At $T = 0$, practitioners usually expect greedy decoding: pick the most probable next token every time. If the same prompt enters the same model, the same output should come out. That expectation is convenient. It is also incomplete.
The paper builds on recent work showing that nondeterminism can arise from the inference environment itself: batch-size variation, co-batching with other requests, kernel non-invariance, floating-point non-associativity, reduction ordering, precision choices, hardware differences, and concurrency.[^2] These are not philosophical defects. They are engineering details — the sort that sit quietly in the basement until the audit committee asks why two identical requests produced two different answers.
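To make one of those basement details concrete, here is a minimal sketch, not from the paper, of floating-point non-associativity: summing the same numbers in a different order changes the result in its last few bits, and in a near-tie between two token logits, that is all it takes to flip an argmax.

```python
# Minimal illustration (not from the paper): floating-point addition is not
# associative, so the order of a reduction can change the result in its last
# few bits; in a near-tie between two logits, that can flip the argmax.
import numpy as np

rng = np.random.default_rng(0)
terms = rng.standard_normal(10_000).astype(np.float32)

forward = np.float32(0.0)
for t in terms:                   # accumulate in one order
    forward += t

backward = np.float32(0.0)
for t in terms[::-1]:             # accumulate in the reverse order
    backward += t

print(forward == backward)        # typically False: same terms, different sum
print(abs(float(forward) - float(backward)))
```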
Prior studies have already observed nondeterminism under supposedly deterministic settings. The paper cites work reporting sizable accuracy variation across repeated runs, evaluation instability under greedy decoding, and nondeterministic code-generation behavior even when temperature is set to zero.[^3] What Messina and Scotta add is not merely another complaint about reproducibility. They offer a formal wrapper: treat implementation-induced variation as if it were equivalent to running a stable reference system at a small nonzero temperature.
That is the conceptual move. Instead of saying, “the model is sometimes inconsistent,” the paper asks: how much temperature-like randomness does the deployment stack add?
Analysis — What the paper does
The authors start with the standard token-generation setup. A model produces logits, converts them into probabilities with softmax, and selects or samples tokens. Temperature is represented as a transformation on the probability distribution. Ideally, when $T = 0$, the transformation behaves like an identity operation before greedy selection.
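As a point of reference, here is a minimal sketch of that standard setup, with made-up logits rather than anything from the paper: temperature rescales the logits before softmax, and $T = 0$ is conventionally handled as greedy argmax.

```python
# Sketch of the standard decoding setup (illustrative, not the paper's code):
# temperature divides the logits before softmax; T = 0 is treated as greedy argmax.
import numpy as np

def next_token(logits: np.ndarray, temperature: float, rng=None) -> int:
    if temperature == 0.0:
        return int(np.argmax(logits))               # greedy decoding
    scaled = logits / temperature                    # lower T sharpens the distribution
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                             # softmax
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([2.01, 2.00, -1.0])                # a near-tie between tokens 0 and 1
print(next_token(logits, temperature=0.0))            # always 0 in exact arithmetic
print([next_token(logits, temperature=0.8) for _ in range(5)])  # nonzero T varies
```

The near-tie is the case to watch: in exact arithmetic the $T = 0$ branch always returns token 0, which is precisely the expectation the rest of the paper stress-tests.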
But real inference systems are not ideal mathematical objects. The authors denote the inference environment as $I$: the complete operational context of execution, including batch composition, hardware, backend, precision, kernel choices, concurrency, and numeric reduction order. They then model an implementation-dependent perturbation, indexed by $I$, which changes the effective probability distribution even when the nominal temperature is zero.
The key definition is:
$$ T_{bg} \triangleq \mathbb{E}_{I \in \mathcal{I}}[T_n(I)] $$
Here, $T_n(I)$ is the equivalent temperature: the nonzero temperature that an ideal reference system would need in order to exhibit the same kind of output variability caused by the actual environment $I$ at nominal $T = 0$.
That sounds abstract, so here is the operational interpretation:
| Concept | Plain-English meaning | Business interpretation |
|---|---|---|
| Nominal temperature | The temperature parameter the user sets | “We configured the model to be deterministic.” |
| Inference environment $I$ | Runtime conditions: batching, hardware, kernels, concurrency, precision | “What actually happened inside the serving stack.” |
| Equivalent temperature $T_n(I)$ | The temperature that would mimic the observed variability in a stable reference setup | “How random the stack behaved, in temperature units.” |
| Background temperature $T_{bg}$ | Expected equivalent temperature across environments | “The hidden randomness budget of the deployment.” |
The important point is not that temperature literally causes the variation. It does not. The point is that temperature becomes a common measurement scale for otherwise messy implementation effects. That is useful because businesses understand calibrated risk better than vague warnings.
The paper’s proposed measurement protocol has four steps.
| Step | What to do | Why it matters |
|---|---|---|
| 1. Build a prompt set | Use general, task-specific, adversarial, long-context, rare-token, and synthetic near-tie prompts | Some prompts are stable; others expose small perturbations quickly. |
| 2. Run repeated zero-temperature inference | Repeat each prompt many times under the system being tested | This reveals whether “same input, same settings” actually produces the same output. |
| 3. Build reference distributions | Run a stable local or quasi-ideal reference model at known temperatures | This creates a calibration curve between temperature and observed variation. |
| 4. Fit equivalent temperature | Find the reference temperature whose variability distribution best matches the tested system | This produces an estimate of hidden randomness. |
The authors suggest metrics such as exact-match rate, first-divergence token index, edit distance, distributional divergence over token probabilities, and entropy. The pilot experiments mostly use exact-match fraction, which is deliberately simple. It is not the final metric civilization deserves, but it is the one available before lunch.
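To make the protocol concrete, here is a hedged sketch of steps 2 through 4 in Python. It is not the authors' code: `query_system` and `query_reference` are placeholder callables standing in for the tested provider and the local reference model, and the fit simply picks the reference temperature whose exact-match distribution sits closest in Kolmogorov–Smirnov distance.

```python
# Hedged sketch of steps 2-4 above (not the authors' code). `query_system` and
# `query_reference` are placeholders that return one completion string per call.
from collections import Counter
import numpy as np
from scipy.stats import ks_2samp

def exact_match_fraction(outputs: list[str]) -> float:
    """Fraction of runs that agree with the most common output."""
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)

def stability_profile(generate, prompts, n_runs: int) -> np.ndarray:
    """One exact-match fraction per prompt, over repeated generations."""
    return np.array([
        exact_match_fraction([generate(p) for _ in range(n_runs)])
        for p in prompts
    ])

def fit_equivalent_temperature(system_profile, reference_profiles: dict) -> float:
    """Reference temperature whose profile is closest in KS distance."""
    return min(
        reference_profiles,
        key=lambda t: ks_2samp(system_profile, reference_profiles[t]).statistic,
    )

# Usage (placeholders): tested system at nominal T = 0 vs. a reference grid.
# system = stability_profile(lambda p: query_system(p, temperature=0.0), prompts, 100)
# refs = {t: stability_profile(lambda p: query_reference(p, temperature=t), prompts, 32)
#         for t in np.arange(0.0, 1.01, 0.05)}
# print("equivalent temperature:", fit_equivalent_temperature(system, refs))
```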
Findings — Results with visualization
The paper’s pilot study estimates $T_{bg}$ for several provider-hosted LLMs by comparing repeated $T = 0$ outputs against reference runs from local open models.
The first experiment uses gpt-4.1-nano accessed through Microsoft Azure AI services. The prompt set consists of the first 200 questions from TruthfulQA. The reference model is SmolLM3-3B, run at a grid of temperatures from 0 to 1. For each temperature and prompt, the authors generate 32 responses limited to 32 tokens and compute the maximum fraction of identical answers. Then they run the tested model 100 times per prompt at $T = 0$ and compare distributions using Kolmogorov–Smirnov distance.
In the first reference setup, the closest match for gpt-4.1-nano is $T = 0.05$. After adding a second reference model, Llama-3.2-3B-Instruct, the estimate becomes $0.075$, because the SmolLM reference gives $0.05$ and the Llama reference gives $0.10$.[^4]
The paper then extends the same logic to three additional models, using the first 30 prompts and the same style of repeated zero-temperature testing.
| Tested model | Reference estimate using SmolLM3-3B | Reference estimate using Llama-3.2-3B-Instruct | Average estimated $T_{bg}$ |
|---|---|---|---|
| grok-3-mini | 0.01 | 0.02 | 0.015 |
| gemini-2.0-flash | 0.05 | 0.08 | 0.065 |
| gpt-4.1-nano | 0.05 | 0.10 | 0.075 |
| claude-sonnet-4 | 0.00 | 0.00 | 0.000 |
This table should not be read as a universal model ranking. The authors are careful to note that these are pilot experiments. Prompt selection is narrow, output length is capped, the tested provider environments are not controlled, and the metric is exact-match based. The Claude result, for example, means that under this particular setup its outputs were identical across the tested prompts. It does not mean “Claude has solved determinism” and should now be issued a small marble statue.
The more valuable finding is methodological: hidden randomness can be measured, compared, and governed. That turns nondeterminism from an anecdote into an operational variable.
A useful way to visualize the paper’s contribution is as a shift in AI evaluation maturity:
| Evaluation maturity level | Typical practice | Problem | Upgrade suggested by the paper |
|---|---|---|---|
| Level 1: Single-shot demo | Run one prompt once | No reproducibility signal | Repeat prompts across runs |
| Level 2: Static benchmark | Report average score | Ignores run-to-run variation | Report variance and exact-match stability |
| Level 3: Deterministic setting | Set $T = 0$ and assume stability | Confuses configuration with behavior | Measure observed variability |
| Level 4: Environment-aware evaluation | Test across load, batching, hardware, regions | More expensive, but real | Estimate $T_n(I)$ by environment |
| Level 5: Governed deployment | Monitor hidden randomness as part of model assurance | Requires standards and tooling | Report $T_{bg}$ alongside accuracy, latency, and cost |
This is where the paper becomes business-relevant. Accuracy tells you how good the model is on average. Latency tells you how fast it responds. Cost tells you how painful the invoice will be. Background temperature tells you whether the same system will behave consistently enough to be trusted in repeatable workflows.
Implementation — How this becomes an enterprise control
For business deployments, $T_{bg}$ should be treated less like an academic curiosity and more like a model operations control metric.
Consider a document-classification workflow. A customer email arrives. The model decides whether it is a billing dispute, cancellation request, technical issue, or regulatory complaint. If repeated runs under the same input sometimes route the message differently, the problem is not just “LLM creativity.” It is operational instability. The downstream effects include SLA breaches, inconsistent customer treatment, duplicate reviews, wrong escalation paths, and unreliable management reporting.
The same applies to agentic workflows. Agents compound small differences. A slight output variation in Step 1 can change retrieval in Step 2, tool calls in Step 3, and final recommendations in Step 4. Background temperature is therefore not only a generation-quality issue. It is a workflow branching-risk issue.
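A hedged sketch of what that looks like as a metric, with `classify_ticket` standing in as a placeholder for whatever model call produces the routing decision: run the same ticket repeatedly at nominal $T = 0$ and count how often the route disagrees with the modal choice.

```python
# Hedged sketch of a branch-flip check for a routing step (not from the paper).
# `classify_ticket` is a placeholder for the model call returning a route label
# at nominal T = 0; a flip is any run that disagrees with the modal route.
from collections import Counter

def branch_flip_rate(classify_ticket, ticket: str, n_runs: int = 20) -> float:
    routes = [classify_ticket(ticket) for _ in range(n_runs)]
    modal_count = Counter(routes).most_common(1)[0][1]
    return 1.0 - modal_count / n_runs   # 0.0 means the routing is perfectly repeatable

# Tickets with a nonzero flip rate are the ones that will send Step 2 onward
# down different paths, and they are the natural candidates for review queues.
```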
A practical Cognaptus-style implementation would look like this:
| Control layer | Practical mechanism | Example metric |
|---|---|---|
| Prompt stability testing | Run repeated prompts across representative cases | Exact-match rate, semantic-match rate |
| Environment sampling | Test under low load, high load, region variation, and batch variation | $T_n(I)$ by runtime condition |
| Reference calibration | Use stable local models or controlled model variants | Equivalent temperature curve |
| Workflow sensitivity mapping | Identify which output fields cause downstream branching | Branch-flip rate |
| Human review threshold | Route unstable cases to review | Low-confidence + high-variance trigger |
| Vendor assurance | Ask providers for reproducibility and serving-stack controls | Reported variance under $T = 0$ |
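As one illustration of the human-review row above, here is a minimal sketch of the trigger logic; the thresholds and inputs are assumptions for the example, not values from the paper.

```python
# Illustration of the human-review trigger in the table above (thresholds and
# inputs are illustrative assumptions): route a case to review when model
# confidence is low or repeated runs disagree too often.
def needs_human_review(confidence: float, exact_match_rate: float,
                       min_confidence: float = 0.80,
                       min_stability: float = 0.95) -> bool:
    return confidence < min_confidence or exact_match_rate < min_stability

# High confidence but only 80% run-to-run agreement still escalates:
print(needs_human_review(confidence=0.92, exact_match_rate=0.80))   # True
```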
This matters particularly in regulated or quasi-regulated workflows:
| Use case | Why hidden randomness matters | Recommended control |
|---|---|---|
| Compliance memo drafting | Same evidence should not produce materially different risk wording | Store repeated-run variance for audit samples |
| Invoice coding | Different category assignment affects accounting records | Use deterministic post-rules and exception queues |
| Customer complaint triage | Inconsistent routing changes response time and liability | Monitor route-flip frequency |
| Contract clause extraction | Missed or changed clause classification alters legal review | Require structured extraction with validation rules |
| AI agent task planning | Small differences cascade across tool calls | Freeze plans before execution and log branch decisions |
There is an uncomfortable implication here: “temperature = 0” is not a governance control. It is merely a configuration input. Governance must measure realized behavior.
Implications — Next steps and significance
The paper has three implications for AI deployment.
First, determinism should be reported, not assumed. Model documentation should not stop at accuracy, context length, latency, and price. For serious deployments, reproducibility metrics belong in the same family as uptime and error rate. A provider saying “use temperature zero” is not enough. The relevant question is: under what batching, hardware, load, and precision conditions was output stability tested?
Second, benchmark comparisons need uncertainty margins. If model A outperforms model B by a tiny benchmark difference, but implementation-induced variability is larger than that difference, the claimed improvement may be noise wearing a lab coat. This is especially important for procurement, vendor comparisons, and internal model upgrades.
Third, architecture choices affect governance risk. Batch-invariant kernels, deterministic reductions, fixed precision, stable hardware configurations, and concurrency controls are not merely infrastructure preferences. They can reduce the hidden randomness of the application. That means AI assurance is no longer only a policy problem; it is also a systems-engineering problem.
The paper also leaves open several hard questions.
| Open issue | Why it remains difficult |
|---|---|
| Reference-model dependence | Different reference LLMs respond differently to temperature, so $T_{bg}$ estimates depend on calibration choices. |
| Prompt-set dependence | A safe-looking prompt set may miss edge cases where small logit perturbations flip outputs. |
| Metric dependence | Exact string match is simple but may overstate or understate business impact. Semantic equivalence matters. |
| Provider opacity | Remote APIs rarely expose batch composition, hardware, kernel choices, or deployment region internals. |
| Drift over time | A provider may silently update infrastructure, changing background temperature without changing the model name. |
These limitations do not weaken the paper’s usefulness. They define the next layer of tooling. The industry does not need mystical confidence in deterministic AI. It needs boring, measured, repeatable controls. Boring is underrated. Boring is how airplanes land.
Conclusion — A thermometer for AI operations
The paper’s best contribution is conceptual discipline. It gives practitioners a way to talk about a real deployment problem without hand-waving. Background temperature reframes “LLMs are sometimes inconsistent” into a measurable property of the inference stack.
For businesses, the lesson is straightforward: do not confuse a model parameter with an operational guarantee. A zero-temperature setting may reduce randomness, but it does not automatically eliminate environment-induced variability. In high-stakes workflows, the correct posture is not faith. It is measurement, calibration, logging, and escalation.
The next generation of AI governance will not only ask whether a model is accurate. It will ask whether the system is stable enough to repeat itself when repetition matters.
And yes, that means enterprise AI now needs a thermometer. Apparently even machines can run a low-grade fever.
Cognaptus: Automate the Present, Incubate the Future.
[^1]: Alberto Messina and Stefano Scotta, “Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models,” arXiv:2604.22411, 2026. https://arxiv.org/abs/2604.22411

[^2]: The paper specifically builds on Horace He and Thinking Machines Lab, “Defeating nondeterminism in LLM inference,” cited by the authors as a systems-level account of batch-size effects, batch-invariant kernels, and floating-point non-associativity.

[^3]: The paper discusses related work including Berk Atil et al. on nondeterminism under deterministic LLM settings, Yifan Song et al. on evaluation instability, and Shuyin Ouyang et al. on nondeterminism in ChatGPT code generation.

[^4]: In the pilot, SmolLM3-3B and Llama-3.2-3B-Instruct are used as reference models. The authors compare exact-match fraction distributions using Kolmogorov–Smirnov distance.