Opening — Why this matters now

LLM-powered agents are no longer a novelty. They calculate loans, file expenses, query databases, orchestrate workflows, and—when things go wrong—quietly fabricate tool calls that look correct but aren’t. Unlike textual hallucinations, tool-calling hallucinations don’t merely misinform users; they bypass security controls, corrupt data, and undermine auditability.

In short: once agents touch real systems, hallucinations become operational risk.

The paper “Internal Representations as Indicators of Hallucinations in Agent Tool Selection” tackles this problem from an unusually pragmatic angle. Instead of asking the model again, sampling multiple outputs, or validating results externally, it asks a more unsettling question:

Does the model already know it’s hallucinating—internally—while it’s doing it?

The answer, it turns out, is yes.

Background — Tool use is brittle by design

Tool calling imposes rigid constraints on language models:

  • Function names must exist
  • Parameters must match schema and type
  • Required arguments cannot be missing
  • The right tool must be chosen for the query

Violations of these constraints show up as:

| Hallucination Type | Description |
|---|---|
| Function selection error | Calling a non-existent tool |
| Function appropriateness error | Calling the wrong tool for the query |
| Parameter error | Invalid or malformed arguments |
| Completeness error | Missing required arguments |
| Tool bypass | Generating outputs instead of calling tools |
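For concreteness, here is a minimal sketch of what a schema-level check against the first two failure modes looks like. The tool names and required fields are invented for this post, not taken from the paper:

```python
# Hypothetical tool registry: names and required arguments are illustrative.
TOOLS = {
    "get_exchange_rate": {"required": {"base_currency", "quote_currency"}},
    "calculate_loan_payment": {"required": {"principal", "rate", "n_periods"}},
}

def check_tool_call(name: str, arguments: dict) -> list[str]:
    """Return a list of constraint violations for a generated tool call."""
    errors = []
    if name not in TOOLS:
        errors.append(f"function selection error: unknown tool '{name}'")
        return errors
    missing = TOOLS[name]["required"] - arguments.keys()
    if missing:
        errors.append(f"completeness error: missing {sorted(missing)}")
    return errors

# A hallucinated call: the model invents a plausible-sounding tool name.
print(check_tool_call("get_fx_rate", {"base_currency": "USD"}))
# A call that satisfies the declared schema.
print(check_tool_call("get_exchange_rate",
                      {"base_currency": "USD", "quote_currency": "EUR"}))
```

Note what such a check cannot catch: appropriateness errors and tool bypass look perfectly well-formed. That gap is exactly where detection based on the model's internal state becomes interesting.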

Existing detection methods—consistency checks, uncertainty estimates, semantic similarity—were built for free-form text. They struggle in structured settings and typically require multiple forward passes, which is poison for latency-sensitive agent systems.

This paper’s core claim is simple but sharp: hallucination signals already live inside the model’s hidden states during generation. You just have to look.

Analysis — Turning internal states into a safety signal

The authors frame tool-calling hallucination detection as a binary classification problem over internal representations of the LLM.

The key move

During normal tool-call generation, the model produces a final-layer hidden state for every token. The method extracts representations from three semantically meaningful locations:

  1. The first token of the function name
  2. The argument span (mean-pooled)
  3. The closing delimiter token

These vectors are concatenated into a compact feature embedding:

$$ \mathbf{z} = \left[\, h_{\text{func}} \;\|\; \operatorname{mean}(h_{\text{args}}) \;\|\; h_{\text{end}} \,\right] $$

A lightweight MLP classifier then predicts whether the tool call is hallucinated—using the same forward pass that generated the call in the first place. No resampling. No external checks.
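A minimal PyTorch sketch of the idea, assuming you already have the final-layer hidden states for a generated tool call and the token indices of the three locations. Variable names, the probe's hidden size, and the toy dimensions are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

def build_features(hidden_states: torch.Tensor,
                   func_idx: int,
                   arg_span: tuple[int, int],
                   end_idx: int) -> torch.Tensor:
    """Concatenate final-layer states at the three positions described above.

    hidden_states: (seq_len, d_model) states from the same forward pass
    that generated the tool call.
    """
    h_func = hidden_states[func_idx]                         # first token of the function name
    h_args = hidden_states[arg_span[0]:arg_span[1]].mean(0)  # mean-pooled argument span
    h_end = hidden_states[end_idx]                           # closing delimiter token
    return torch.cat([h_func, h_args, h_end])                # z, shape (3 * d_model,)

class HallucinationProbe(nn.Module):
    """Lightweight MLP over z; layer sizes are illustrative assumptions."""
    def __init__(self, d_model: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * d_model, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(z))  # P(hallucinated tool call)

# Toy usage with random states standing in for a real model's outputs.
d_model = 4096
states = torch.randn(32, d_model)
z = build_features(states, func_idx=5, arg_span=(6, 20), end_idx=21)
probe = HallucinationProbe(d_model)
print(probe(z).item())
```

The important property is that `build_features` consumes states the model already computed; the probe adds one tiny extra forward pass over a single vector, not another pass through the LLM.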

How labels are generated (without humans)

Manual annotation is avoided entirely. Instead, the paper uses an unsupervised agreement trick:

  1. Mask the ground-truth tool call
  2. Ask the model to regenerate it
  3. Compare predicted vs reference function and arguments
  4. Agreement → non-hallucination; disagreement → hallucination

This produces naturally occurring hallucination labels at scale, especially in math-heavy domains where precision is unforgiving.
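A schematic version of that labeling loop, assuming a `generate_tool_call(prompt)` helper that wraps your model and returns a call of the same shape as the reference. The strict match on name and arguments is a simplification, not necessarily the paper's exact equivalence check:

```python
import json

def label_example(example: dict, generate_tool_call) -> dict:
    """Produce a weak hallucination label by regeneration agreement.

    example: {"prompt": ..., "reference_call": {"name": ..., "arguments": {...}}}
    generate_tool_call: assumed helper around your model; the ground-truth
    call is masked from the prompt it sees.
    """
    reference = example["reference_call"]
    predicted = generate_tool_call(example["prompt"])

    same_function = predicted.get("name") == reference["name"]
    # JSON round-trip so argument key order does not affect the comparison.
    same_arguments = (
        json.dumps(predicted.get("arguments", {}), sort_keys=True)
        == json.dumps(reference["arguments"], sort_keys=True)
    )
    return {
        "prompt": example["prompt"],
        "predicted_call": predicted,
        "hallucination": not (same_function and same_arguments),
    }
```

Strict equality will over-count hallucinations whenever two different calls are both valid, which is the function-equivalence caveat the authors acknowledge later.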

Findings — Performance without latency tax

Across three open-source models (GPT-OSS-20B, Llama-3.1-8B, Qwen-7B), the results are remarkably consistent:

Core results

| Model | Accuracy | Hallucination Recall | Forward Passes |
|---|---|---|---|
| GPT-OSS-20B | 86% | 0.86 | 1 |
| Llama-3.1-8B | 73% | 0.61 | 1 |
| Qwen-7B | 74% | 0.62 | 1 |

Consistency-based baselines (NCP, semantic similarity) achieve higher precision but at a cost: 3–5× inference overhead due to repeated sampling.

The internal-state approach trades a bit of headline accuracy for something far more valuable in production systems: real-time, inline detection.

Ablation insight: complexity is overrated

An extensive ablation study shows that simple feature aggregation (mean pooling of last-layer states) performs as well as more complex schemes. Translation: the signal is strong, not fragile.

This is good news for engineers—less tuning, fewer moving parts, lower latency.
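If that ablation holds for your setup, the feature builder sketched earlier collapses to something even simpler. A sketch, assuming the same `(seq_len, d_model)` hidden-state tensor:

```python
import torch

def build_features_simple(hidden_states: torch.Tensor) -> torch.Tensor:
    """Mean-pool all final-layer states over the generated tool-call span."""
    return hidden_states.mean(dim=0)  # shape (d_model,); probe input shrinks accordingly
```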

Implications — Agents that hesitate before acting

The business relevance here is understated but profound.

This method enables:

  • Execution gating: block risky tool calls before they run
  • User confirmation flows: surface uncertainty when it matters
  • Fallback strategies: reroute to safer tools or re-prompt
  • Audit-aware agents: detect bypass behavior automatically

Crucially, all of this happens inside the agent loop, not as a post-hoc patch.
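Inside the loop, execution gating can be as small as the sketch below. The probe, threshold, and tool registry are assumptions; in practice the threshold would be tuned per tool against your own risk tolerance:

```python
def execute_with_gate(tool_call: dict, z, probe, registry,
                      threshold: float = 0.5) -> dict:
    """Gate a generated tool call on the probe's hallucination score.

    z: feature vector from the same forward pass that produced tool_call.
    probe: trained classifier returning P(hallucinated).
    registry: mapping from tool name to an executable function (assumed).
    """
    p_halluc = float(probe(z))
    if p_halluc >= threshold:
        # Risky call: hold it for user confirmation, re-prompt,
        # or reroute to a safer fallback tool instead of executing.
        return {"status": "held", "score": p_halluc, "call": tool_call}
    fn = registry[tool_call["name"]]
    return {"status": "executed", "score": p_halluc,
            "result": fn(**tool_call["arguments"])}
```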

For organizations deploying LLM agents in finance, healthcare, commerce, or operations, this reframes hallucination handling from a QA problem into a control-system problem.

Conclusion — Hallucinations leave fingerprints

This paper makes a quiet but consequential point: hallucinations are not purely emergent artifacts at the output layer. They are encoded, early and detectably, in the model’s internal state.

By exploiting that fact, the authors deliver something rare in LLM safety research—a method that is:

  • Technically grounded
  • Computationally cheap
  • Deployment-friendly
  • Aligned with real agent architectures

There are limitations: reference-based labeling is imperfect, cross-model generalization remains open, and function equivalence is still a thorny problem. But as a first step toward self-monitoring agents, this work sets a credible direction.

If agents are going to act autonomously, they must also learn when not to act.

Cognaptus: Automate the Present, Incubate the Future.