Entropy, My Dear Watson: Finding Hallucinations in the Shape of Uncertainty

A customer-support bot gives a fluent answer. The grammar is clean, the tone is helpful, and the confidence is offensively calm. Then someone checks the underlying fact and discovers the answer is wrong.

The old operating question was: Was the model confident? The better question is: What did the model’s uncertainty look like while it was speaking?

That is the useful shift in “Entropy Distribution as a Fingerprint for Hallucinations in Generative Models,” an arXiv paper by Mattia J. Villani, Pranav Deshpande, Akshay Seshadri, Romina Yalovetzky, and Niraj Kumar.¹ The paper does not simply add another score to the already crowded hallucination-detection shelf. It argues that the distribution of token-level entropy carries a detectable fingerprint of hallucination, including information that average entropy, perplexity, and length-normalised entropy throw away.

That distinction matters. Perplexity asks whether the generation was uncertain on average. CES, the paper’s proposed Calibrated Entropy Score, asks whether the sequence of token uncertainties looks statistically consistent with the uncertainty trace observed in faithful outputs. Same raw material. More disciplined use of it. A small mercy in a field that often calls a dashboard a control system.

The mechanism starts with a trace, not a verdict

When an autoregressive language model generates text, it repeatedly predicts the next token. At each step, the model has a probability distribution over possible next tokens. If that distribution is sharply concentrated, entropy is low. If the probability mass is spread out, entropy is high.

For a generation of length $T$, the model produces an entropy sequence:

$$ h_{1:T} = (h_1, h_2, \ldots, h_T) $$
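
As a minimal sketch of how such a trace could be computed, assuming the serving stack exposes one logit vector per generated token, the snippet below converts logits to probabilities and then to Shannon entropy in nats. The function names are illustrative and do not come from the paper.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_trace(step_logits):
    """Map a list of per-step logit vectors to the sequence h_1, ..., h_T."""
    return [token_entropy(softmax(logits)) for logits in step_logits]

# Toy 4-token vocabulary: one sharply peaked step, then one uniform step.
trace = entropy_trace([[10.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]])
```

The peaked step yields an entropy close to zero, while the uniform step reaches the maximum for four outcomes, $\ln 4$.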

A conventional scalar method compresses this sequence immediately. Perplexity and length-normalised entropy mainly behave like mean-style summaries. They ask whether the answer was generally easy or hard for the model to produce.

The paper’s central observation is that this compression is wasteful. Hallucinated generations may differ from faithful ones not only because their entropy is higher on average, but because the shape of the entropy trace changes. There may be abnormal tails, sudden uncertainty spikes, or a different distributional profile across tokens.
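
A toy illustration of what the compression loses, using two made-up entropy traces (the numbers are invented for this sketch, not taken from the paper): both have the same mean, so a mean-style score cannot tell them apart, yet one contains a single large spike.

```python
from statistics import mean, pstdev

def summarise(trace):
    """Return the summary a scalar method keeps (mean) and two it discards."""
    return {"mean": mean(trace), "max": max(trace), "spread": pstdev(trace)}

# Hypothetical traces with identical average entropy.
steady = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
spiky = [0.2, 0.2, 0.2, 0.2, 0.2, 5.0]

steady_summary = summarise(steady)
spiky_summary = summarise(spiky)
```

Only the maximum and the spread separate the two traces, which is the kind of information a tail-aware statistic can retain.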

That is the first contribution: hallucination is treated not as a single confidence number, but as a shift in the statistical behaviour of the generation process.

A practical analogy: an average temperature reading may tell you a machine is running hot. A temperature trace tells you whether the machine has periodic spikes, drift, instability, or a single dangerous surge. The average is useful. It is not the whole diagnostic.

CES turns token uncertainty into a calibrated control signal

The method has two stages.

First, build a reference distribution from faithful generations. During offline calibration, an oracle labels whether a generated output is hallucinated or faithful. The oracle can be a human expert, a task-specific evaluator, or an LLM judge. From the faithful outputs, CES pools token-level entropy values and constructs an empirical cumulative distribution function, or ECDF.
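
The calibration stage can be sketched as follows, assuming the calibration data arrive as (entropy trace, oracle label) pairs. The data layout, label strings, and function name are assumptions made for illustration.

```python
from bisect import bisect_right

def build_ecdf(faithful_traces):
    """Pool token entropies from faithful generations and return the ECDF."""
    pooled = sorted(h for trace in faithful_traces for h in trace)
    n = len(pooled)
    if n == 0:
        raise ValueError("need at least one faithful token entropy")

    def ecdf(x):
        # Fraction of pooled reference entropies that are <= x.
        return bisect_right(pooled, x) / n

    return ecdf

# Hypothetical calibration set: (entropy trace, oracle label) pairs.
calibration = [
    ([0.1, 0.3, 0.2], "faithful"),
    ([0.4, 0.2], "faithful"),
    ([1.5, 2.0, 0.9], "hallucinated"),
]
F_hat = build_ecdf(trace for trace, label in calibration if label == "faithful")
```

Here `F_hat(0.2)` is 0.6, because three of the five pooled faithful entropies are at or below 0.2.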

Second, for a new generation, compute the entropy trace and compare it against that reference distribution. The paper’s chosen CES statistic combines two pieces of information:

- the calibrated position of the mean entropy;
- the calibrated position of the maximum entropy.

In simplified notation, for entropy sequence $h_{1:T}$ and reference ECDF $\hat F$, the score can be read as:

$$ \mathrm{CES}(h_{1:T}) = \sqrt{\hat F(\bar h) \cdot \hat F(\max_t h_t)} $$

where $\bar h$ is the mean entropy of the generation.
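
A minimal sketch of the simplified formula above, assuming the reference entropies have already been pooled from faithful outputs. The reference values and the two example traces are invented for illustration.

```python
import math
from bisect import bisect_right
from statistics import mean

def make_ecdf(reference_entropies):
    """Return the empirical CDF of the pooled reference entropies."""
    pooled = sorted(reference_entropies)

    def ecdf(x):
        return bisect_right(pooled, x) / len(pooled)

    return ecdf

def ces(trace, ecdf):
    """Geometric mean of the calibrated mean and the calibrated maximum."""
    return math.sqrt(ecdf(mean(trace)) * ecdf(max(trace)))

# Hypothetical reference pool of faithful token entropies.
F_hat = make_ecdf([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])

calm = [0.15, 0.25, 0.35]    # low mean, low maximum
spiky = [0.15, 0.25, 0.95]   # same first two tokens, one late spike

calm_score = ces(calm, F_hat)
spiky_score = ces(spiky, F_hat)
```

The spike raises both the mean and the maximum, so the second trace lands higher on the calibrated scale (about 0.6 versus about 0.24).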

The formula is deliberately modest. It does not ask the model to explain itself. It does not run five more generations and measure semantic disagreement. It does not inspect hidden states. It takes the token probabilities already produced during generation, maps mean and tail uncertainty into a calibrated reference scale, and combines them.

That is why the “calibrated” part matters. Raw entropy values are not naturally comparable across models, tasks, and generation settings. An entropy value that looks unusual for one model-task pair may be ordinary for another. The reference CDF converts the statistic into a relative position against the model’s own faithful-output behaviour.

The output is not truth. It is a risk signal. That difference is not cosmetic; it is the difference between an operational control and a tiny oracle costume.

The paper’s evidence first establishes the fingerprint

Before selling CES as a detector, the paper asks a more basic question: are hallucinated and faithful entropy distributions actually different?

Across 80 model-dataset experiments, the authors pool token-level entropies from faithful and hallucinated generations and compare the distributions using two-sample Kolmogorov-Smirnov tests. The main distributional test finds significant separation in 72 of 80 experiments, or 90%. The reported median KS distance is 0.100, with an interquartile range of 0.059 to 0.130. The median Cohen’s $d$ is 0.192, with an interquartile range of 0.106 to 0.284.
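
For readers who want to reproduce this kind of comparison on their own logs, here is a stdlib-only sketch of the two effect-size quantities mentioned above. It computes the KS distance and Cohen's $d$ but not a p-value, and the toy samples are invented.

```python
import math
from bisect import bisect_right
from statistics import mean, variance

def ks_distance(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two ECDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:  # the supremum is attained at an observed value
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        gap = max(gap, abs(fa - fb))
    return gap

def cohens_d(sample_a, sample_b):
    """Standardised mean difference using the pooled sample variance."""
    na, nb = len(sample_a), len(sample_b)
    pooled_var = ((na - 1) * variance(sample_a)
                  + (nb - 1) * variance(sample_b)) / (na + nb - 2)
    return (mean(sample_b) - mean(sample_a)) / math.sqrt(pooled_var)

# Hypothetical pooled token entropies.
faithful = [0.1, 0.2, 0.3, 0.4]
hallucinated = [0.3, 0.4, 0.5, 0.6]

distance = ks_distance(faithful, hallucinated)
effect = cohens_d(faithful, hallucinated)
```

If SciPy is available, `scipy.stats.ks_2samp` returns the same statistic together with a p-value.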

That is not a giant effect. It is a small but persistent effect. This is exactly the kind of result that is boring in a press release and useful in production engineering.

The more important experiment is the mean-centred test. The authors remove the mean difference from the entropy sequences and test whether the remaining distributional shape still separates hallucinated from faithful outputs. It does: the paper reports significant residual shape differences in 80 of 80 experiments.
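
One possible reading of the mean-centred test is sketched below: subtract each group's pooled mean before comparing the distributions. The exact centring procedure is an assumption here, since the summary above does not spell it out, and the samples are invented.

```python
from bisect import bisect_right
from statistics import mean

def ks_distance(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two ECDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )

def centre(sample):
    """Shift a sample so that its mean is zero (assumed centring step)."""
    m = mean(sample)
    return [x - m for x in sample]

# Hypothetical pools: different means AND different spreads.
faithful = [0.4, 0.5, 0.5, 0.6]
hallucinated = [0.1, 0.2, 1.4, 1.5]

residual_distance = ks_distance(centre(faithful), centre(hallucinated))
```

After centring, both groups have mean zero, so any remaining KS distance reflects shape rather than location.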

This is the point readers can easily miss. CES is not merely “average entropy with better branding.” The paper’s argument depends on the claim that the distributional shape and the tail carry independent information. If the mean-centred result had failed, CES would look much more like a polite rearrangement of perplexity. It did not fail.

The detector is lightweight because it avoids the expensive habits of stronger baselines

The hallucination-detection literature has a familiar cost ladder.

At the cheap end are scalar uncertainty methods: perplexity, length-normalised entropy, generation length, and related heuristics. They are computationally attractive but blunt.

In the middle are elicitation methods, where the model is prompted to evaluate its own answer. These need extra inference calls and inherit the model’s own self-assessment failure modes. Asking a model whether it is hallucinating is sometimes useful. It is also a bit like asking a drunk driver to grade their own lane discipline.

At the expensive end are multi-sample and semantic-consistency methods. Semantic Entropy, KLE, EigenScore, SelfCheckGPT, and related approaches draw multiple generations and measure disagreement or semantic spread. These methods can be strong, but they pay for it with repeated forward passes, supporting models, entailment classifiers, embeddings, or additional API calls.

CES is positioned differently. It is a single-pass detector using token logits or token log-probabilities. For open-weight models, full logits may be available. For API models, the paper tests whether top-logprob-derived entropy is enough. The appendix reports that CES generalises to API settings, with a median AUROC of 0.669 for API models versus 0.642 for open-weight models, though the authors also note that API models hallucinated less often in their setup.
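
When only the top-k log-probabilities are returned, entropy has to be approximated. The sketch below uses one simple convention, lumping the unseen probability mass into a single residual bucket. This convention is an assumption of the sketch, not necessarily the paper's procedure, and the example values are invented.

```python
import math

def topk_entropy(top_logprobs):
    """Approximate entropy (nats) from the top-k log-probabilities of one step.

    Assumption: mass outside the top-k is treated as one residual outcome,
    which can only underestimate the true entropy of the full distribution.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    residual = max(0.0, 1.0 - sum(probs))
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    if residual > 0.0:
        entropy -= residual * math.log(residual)
    return entropy

# Hypothetical top-3 log-probabilities for two generation steps.
confident = [math.log(0.97), math.log(0.02), math.log(0.01)]
unsure = [math.log(0.30), math.log(0.25), math.log(0.20)]  # 0.25 mass unseen

confident_h = topk_entropy(confident)
unsure_h = topk_entropy(unsure)
```

Because the truncation bias depends on k, the reference ECDF should be built from entropies computed the same way as the ones being scored.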

The operational proposition is therefore specific:

Method family	Typical signal	Operational cost	What CES tries to improve
Perplexity / length-normalised entropy	Mean-style token uncertainty	Single pass	Retain shape and tail information
Elicitation methods	Model self-evaluation	Extra prompts	Avoid asking the model to audit itself
Multi-sample semantic methods	Disagreement across generations	Multiple generations plus support models	Approach their performance with much lower inference cost
Hidden-state methods	Internal representations	Requires model internals	Work with black-box logit or logprob access

For a high-volume enterprise workflow, that cost profile is not a footnote. If a detector requires five to ten extra generations per answer, it may be tolerable for offline audits and intolerable for live customer operations. CES belongs in the class of controls that could plausibly run inline, assuming the serving stack exposes enough probability information.

The benchmark result is competitive, not magical

The paper evaluates CES across eight QA benchmarks: BioASQ, CoQA, DROP, GSM8K, NQ-Open, SQuAD, SVAMP, and TriviaQA. It uses ten generator models across open-weight and API-access families, producing 80 model-dataset cells. Each experiment uses 500 generated responses. Hallucination labels are assigned through an LLM-as-judge procedure, with additional robustness checks using deterministic overlap-style metrics.

The headline comparison is strong but should be read carefully.

The unsupervised version of CES wins 854 out of 1,279 pairwise comparisons against 16 benchmark methods, a 66.8% win rate. It beats 12 of 16 methods by majority win rate. In Friedman/Nemenyi rank analysis, unsupervised CES reaches an average rank of 6.29 and belongs to the top statistical clique, statistically indistinguishable from the best-ranked KLE variants. The paper also reports that CES is significantly better than several cheaper baselines, including length-based and mean-entropy-style approaches.

This does not mean CES dominates every method everywhere. The paper is clear that CES loses to, or is not significantly different from, several strong multi-sample or embedding-based methods. Embedding Regression records many first-place finishes. KLE variants remain highly competitive. On some model-dataset pairs, the best method is not CES.

The meaningful claim is narrower and more useful: among methods with low inference-time access requirements, CES is unusually competitive. It sits near multi-sample methods while avoiding their repeated-generation cost.

That is a business-relevant result because deployment rarely optimises only AUROC. It optimises AUROC under latency, cost, privacy, model-access, and engineering constraints. A detector that is statistically tied with the glamour model, or a notch below it, but needs one pass instead of five to ten extra generations may be the one that actually ships.

The appendix is mostly robustness, not a second thesis

The paper’s appendix is unusually important because it tells us which results are central and which are support beams.

Paper component	Likely purpose	What it supports	What it does not prove
KS separation across faithful vs hallucinated entropy distributions	Main evidence	Entropy traces differ distributionally	That CES alone solves factuality
Mean-centred KS tests	Mechanism validation	Shape signal exists beyond average entropy	That shape is always large enough for reliable detection
Benchmark comparison against 16 methods	Performance comparison	CES is competitive under lower access cost	Universal dominance across all datasets
44-statistic ablation	Ablation	Mean-plus-max geometric CES is a strong design choice	That no richer distributional statistic could outperform it
Calibration contamination test	Robustness test	CES remains stable when calibration is polluted by hallucinated tokens	That calibration quality never matters in real domains
Noisy judgment simulation	Robustness test	Calibration labels can be noisy without large measured degradation	That LLM judges are always acceptable truth sources
API vs open-weight comparison	Deployment-relevance test	CES can work when only API logprobs are available	That every commercial API exposes sufficient logprob detail
Synthetic length/error-bound experiment	Theory verification under assumptions	Error decreases with generation length under i.i.d. sampling	That the same decay appears unchanged in natural short QA outputs

Several appendix results are especially useful for practitioners.

First, the 44-statistic ablation checks whether the chosen CES formula is arbitrary. The authors test combinations of mean, median, max, and other summary statistics under arithmetic and geometric aggregation. The selected mean-plus-max geometric version achieves the best average rank among the tested variants and a reported median AUROC around 0.649 in that ablation setup. This supports the design choice: CES is not just a decorative formula chosen because it looked nice on a slide.

Second, the calibration contamination test is practically interesting. The supervised CES version normally builds its reference ECDF from faithful generations. The authors simulate replacing part of that calibration pool with hallucinated tokens. Even at 50% contamination, median AUROC remains essentially unchanged: 0.6531 at 0% contamination versus 0.6533 at 50% contamination. The authors interpret this as evidence that the faithful and hallucinated entropy distributions overlap enough that contamination broadens the reference but does not destroy ranking power.

Third, the noisy-label test flips calibration labels at different rates. The paper reports that even at 50% flipped labels, median AUROC remains virtually flat, with the unsupervised variant serving as a natural performance floor. This is reassuring, but it should not be overread. It says CES is robust in this experimental protocol. It does not license sloppy labelling for every regulated deployment. Model-risk teams may enjoy robustness; auditors still enjoy evidence.

Finally, the synthetic error-bound experiment verifies the theory under controlled i.i.d. sampling. Since real QA generations are often short, the authors construct synthetic sequences by sampling token entropies from pooled faithful and hallucinated entropy distributions. In that setting, error rates decay exponentially with sequence length, and by 100 tokens both empirical error rates fall below 0.01. The paper also notes the practical boundary: for short factoid QA outputs, often 4 to 15 tokens, error rates remain moderate.
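
The length effect can be illustrated with a small simulation. This is a simplified stand-in, not the paper's experiment: it thresholds the mean entropy only (not CES), uses invented uniform pools, and resamples tokens i.i.d. with a fixed seed.

```python
import random
from statistics import mean

def error_rates(faithful_pool, halluc_pool, length, trials, rng):
    """Estimate (Type I, Type II) error of a mean-entropy threshold."""
    threshold = (mean(faithful_pool) + mean(halluc_pool)) / 2.0
    false_alarms = 0
    misses = 0
    for _ in range(trials):
        # Synthetic sequences: i.i.d. draws from each pooled distribution.
        faithful_seq = rng.choices(faithful_pool, k=length)
        halluc_seq = rng.choices(halluc_pool, k=length)
        false_alarms += sum(faithful_seq) / length > threshold
        misses += sum(halluc_seq) / length <= threshold
    return false_alarms / trials, misses / trials

rng = random.Random(0)
# Hypothetical pools; real pools would come from labelled generations.
faithful_pool = [rng.uniform(0.0, 1.0) for _ in range(2000)]
halluc_pool = [rng.uniform(0.0, 1.6) for _ in range(2000)]

short_errors = error_rates(faithful_pool, halluc_pool, length=5, trials=1000, rng=rng)
long_errors = error_rates(faithful_pool, halluc_pool, length=100, trials=1000, rng=rng)
```

With these pools, both error rates shrink sharply between 5 and 100 tokens, which mirrors the qualitative pattern described above.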

That last point is crucial. CES likes more tokens because distributional tests like more observations. A short answer such as “Paris” gives the detector almost nothing to chew on. Not even a clever statistic can dine elegantly on one crumb.

The formal contribution is calibration with error language

The mathematical part of the paper is not ornamental. It is there to turn a heuristic into a testable control.

The authors formalise hallucination detection as a hypothesis test. Under the null, the generation’s token entropies are consistent with the non-hallucinated reference distribution. If the CES score crosses a threshold, the system rejects the null and flags the output as likely hallucinated.
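
One common way to pick such a threshold, sketched here as an assumption rather than as the paper's procedure, is to take an empirical quantile of CES scores on faithful calibration outputs so that at most a target fraction of them would be flagged.

```python
import math

def threshold_for_fpr(faithful_scores, alpha):
    """Smallest observed score such that at most a fraction alpha of the
    faithful calibration scores lie strictly above it."""
    ordered = sorted(faithful_scores)
    k = math.ceil((1.0 - alpha) * len(ordered)) - 1
    k = min(max(k, 0), len(ordered) - 1)
    return ordered[k]

def flag(score, threshold):
    """Reject the null (flag as likely hallucinated) above the threshold."""
    return score > threshold

# Hypothetical CES scores of 20 faithful calibration outputs.
faithful_scores = [i / 20 for i in range(1, 21)]
threshold = threshold_for_fpr(faithful_scores, alpha=0.25)
flagged_fraction = sum(flag(s, threshold) for s in faithful_scores) / len(faithful_scores)
```

In this toy set the threshold is 0.75 and exactly a quarter of the faithful scores are flagged, matching the target false-positive rate on the calibration data.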

The calibration theory uses a random-length version of the Dvoretzky-Kiefer-Wolfowitz inequality. The reason is simple: generated outputs do not all have the same length. A standard empirical CDF concentration result assumes a fixed number of samples; the paper extends the argument to pooled entropy sequences whose lengths vary.
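
For orientation, the classical fixed-sample form of the inequality, with Massart's tight constant, is given below. This is the textbook statement for $n$ i.i.d. samples, not the paper's random-length extension, and the numbers in the worked line are illustrative assumptions.

$$ \Pr\left( \sup_x \left| \hat F_n(x) - F(x) \right| > \varepsilon \right) \le 2 e^{-2 n \varepsilon^2}, \quad \varepsilon > 0 $$

Setting the right-hand side equal to $\delta$ and solving for $\varepsilon$ gives a uniform band half-width:

$$ \varepsilon = \sqrt{\frac{\ln(2/\delta)}{2n}}, \qquad \text{e.g. } n = 10{,}000,\ \delta = 0.05 \ \Rightarrow\ \varepsilon \approx 0.0136 $$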

The power analysis then studies how CES behaves when hallucinated outputs have either higher mean entropy or a more extreme maximum-entropy tail. Under the paper’s assumptions, Type I and Type II errors decay exponentially with generation length.

This matters not because a business user will calculate the bound by hand, but because it changes the governance language. A normal entropy heuristic says, “this answer looks uncertain.” CES says, “relative to the calibrated faithful distribution, this answer lands in a region we can threshold and analyse.”

That is a better object for monitoring, validation, and audit. Not perfect. Better.

The business value is cheaper triage, not automatic truth

The clearest enterprise use case is not replacing retrieval, human review, or domain-specific validation. It is triage.

A production LLM system can use CES as an inline risk score:

1. Generate an answer.
2. Capture token logprobs or logits.
3. Compute the entropy trace.
4. Map mean and max entropy through a calibrated reference ECDF.
5. Route high-CES outputs to a stricter workflow.

That stricter workflow might be retrieval verification, citation enforcement, human review, regeneration with constraints, refusal, or a lower-risk response template.

In other words, CES is a gatekeeper, not a judge with robes.

The method is especially relevant in workflows where hallucination cost is asymmetric: customer-support escalation, financial research summaries, compliance Q&A, procurement policy assistance, legal intake, medical admin support, and internal knowledge-base agents. In these settings, the business question is often not “Can we prove every sentence true in real time?” It is “Can we cheaply identify outputs that deserve extra scrutiny before they create damage?”

The answer suggested by the paper is: sometimes, yes, if token probability information is available and calibration is done against the right task distribution.

A useful deployment architecture would look like this:

Prompt + context
      ↓
LLM generation with logprobs/logits
      ↓
Entropy trace h1:T
      ↓
CES calibrated against faithful reference ECDF
      ↓
Risk band
      ↓
Low risk: normal response
Medium risk: retrieval check or constrained regeneration
High risk: human review, refusal, or escalation
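
The last two stages of the diagram can be sketched as a small routing function. The band thresholds (0.60 and 0.85) are hypothetical placeholders that would need to be set from held-out calibration data, and the function names are illustrative.

```python
def risk_band(ces_score, medium_threshold=0.60, high_threshold=0.85):
    """Map a CES score in [0, 1] to a coarse risk band."""
    if not 0.0 <= ces_score <= 1.0:
        raise ValueError("CES is a geometric mean of ECDF values, so it lies in [0, 1]")
    if ces_score >= high_threshold:
        return "high"
    if ces_score >= medium_threshold:
        return "medium"
    return "low"

# Actions mirror the three branches of the diagram.
ACTIONS = {
    "low": "normal response",
    "medium": "retrieval check or constrained regeneration",
    "high": "human review, refusal, or escalation",
}

def route(ces_score):
    """Return the workflow assigned to a generation with the given CES."""
    return ACTIONS[risk_band(ces_score)]
```

For example, `route(0.2)` returns the normal path, while `route(0.95)` sends the answer to review or escalation.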

The ROI logic is not that CES makes hallucination disappear. It reduces the number of expensive checks applied to low-risk answers and increases scrutiny where uncertainty behaviour looks abnormal. That is a routing function. Routing functions are quietly powerful because they improve the economics of assurance.

Where the result should not be stretched

The paper is refreshingly explicit about several boundaries.

First, CES cannot detect hallucinations that look statistically identical to faithful generations. If a model is confidently wrong because its training data or retrieved context contains incorrect information, the entropy trace may look normal. The detector sees uncertainty behaviour, not reality.

Second, short generations are difficult. Distributional tests need observations. The paper’s own limitation section notes degradation for short outputs, and the synthetic error-bound experiment reinforces why longer sequences provide more detection power.

Third, the empirical AUROC protocol is in-sample. The same samples are used to construct the reference ECDF and compute detection AUROCs, which the paper states clearly. This improves statistical power for reference estimation but means reported AUROCs should not be read as deployment-ready held-out performance. A real production system needs held-out validation, temporal drift checks, and domain-specific calibration.
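
A held-out check could look like the sketch below: the reference ECDF is built only from the calibration split, and AUROC is computed only on the held-out split. The data layout of (trace, is_hallucinated) pairs and all values are assumptions made for illustration.

```python
import math
from bisect import bisect_right
from statistics import mean

def make_ecdf(values):
    pooled = sorted(values)
    return lambda x: bisect_right(pooled, x) / len(pooled)

def ces(trace, ecdf):
    """Geometric mean of the calibrated mean and calibrated maximum."""
    return math.sqrt(ecdf(mean(trace)) * ecdf(max(trace)))

def auroc(pos_scores, neg_scores):
    """Chance that a hallucinated score beats a faithful one (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def held_out_auroc(calibration, held_out):
    """Calibrate on one split, evaluate on another."""
    ecdf = make_ecdf([h for trace, bad in calibration if not bad for h in trace])
    pos = [ces(trace, ecdf) for trace, bad in held_out if bad]
    neg = [ces(trace, ecdf) for trace, bad in held_out if not bad]
    return auroc(pos, neg)

# Hypothetical splits of (entropy trace, is_hallucinated) pairs.
calibration = [([0.1, 0.2, 0.3], False), ([0.2, 0.4], False), ([1.0, 2.0], True)]
held_out = [
    ([0.12, 0.25], False),
    ([0.10, 0.35], False),
    ([0.30, 1.80], True),
    ([0.50, 0.90], True),
    ([0.04, 0.12], True),  # confidently wrong: low entropy throughout
]
score = held_out_auroc(calibration, held_out)
```

In this toy split the low-entropy hallucination is ranked below both faithful answers, pulling AUROC down to about 0.67.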

Fourth, the oracle is delegated, not solved. The paper separates the definition of hallucination from detection by relying on an oracle during calibration. That is elegant, but it pushes part of the hard problem into labelling. If the oracle is misaligned with the business definition of error, CES will faithfully learn the wrong operational boundary. Very efficient nonsense is still nonsense; it just bills less compute.

Fifth, API feasibility depends on logprob access. The paper shows that top-logprob-derived entropy can work for API models in its tested setup. That does not guarantee every commercial model endpoint exposes sufficient probability information, stable tokenisation, or comparable logprob semantics.

What Cognaptus would take from this paper

The strongest business interpretation is not “use CES everywhere.” It is more precise:

Paper finding	Direct meaning	Cognaptus inference	Boundary
Entropy distributions differ in 72/80 experiments	Hallucinated and faithful outputs often have separable uncertainty traces	Token-level telemetry should be logged and analysed, not discarded	Effect sizes are modest and task-dependent
Mean-centred differences appear in 80/80 experiments	Shape carries signal beyond average entropy	Perplexity-only monitoring is under-instrumented	Shape signal does not guarantee high individual accuracy
CES joins the top statistical clique	Single-pass calibrated entropy can compete with costlier methods	Inline triage may be economically viable	Not universal winner across all tasks
Calibration contamination has little measured effect	The method is stable under polluted references in the experiment	Unsupervised or weakly supervised rollout may be feasible	Real domain drift still requires validation
Error declines with generation length under assumptions	More tokens improve distributional detection	CES is better suited to explanations than one-word answers	Synthetic i.i.d. tests are not full production proof

For enterprise AI teams, the paper supports a broader design principle: LLM observability should include the uncertainty trace, not just the generated text and final confidence score.

Many current deployments throw away probability telemetry. That is understandable when teams are racing to ship demos. It is less defensible once the system touches money, compliance, customer commitments, or regulated advice. CES is a reminder that the generation process itself contains risk information. Ignoring it is a choice, not a limitation.

The real contribution is a control layer around uncertainty

The paper’s title says entropy distribution is a fingerprint. That is a good metaphor, with one caveat. Fingerprints identify a person; CES identifies statistical abnormality relative to a reference. It is closer to a production sensor than a detective.

That makes it more useful, not less.

A production LLM stack does not need every component to be a final judge of truth. It needs layered controls: retrieval grounding, constrained generation, output validation, human escalation, monitoring, and feedback loops. CES fits as a cheap uncertainty-control layer inside that stack.

Its best role is early warning. Not because every high-entropy answer is false. Not because every low-entropy answer is safe. But because the shape of uncertainty can tell us when an answer deserves less trust than its prose suggests.

Fluency will always try to look like competence. CES gives operators one more way to check whether the machine’s calm voice was earned.

Cognaptus: Automate the Present, Incubate the Future.

¹ Mattia J. Villani, Pranav Deshpande, Akshay Seshadri, Romina Yalovetzky, and Niraj Kumar, “Entropy Distribution as a Fingerprint for Hallucinations in Generative Models,” arXiv:2605.28264, 2026. https://arxiv.org/abs/2605.28264
