Mirage Agents: When LLMs Act on Illusions

As large language models evolve into autonomous agents, their failures no longer stay confined to text—they materialize as actions. Clicking the wrong button, leaking private data, or falsely reporting success aren’t just hypotheticals anymore. They’re happening now, and MIRAGE-Bench is the first benchmark to comprehensively measure and categorize these agentic hallucinations.

Unlike hallucinations in chatbots, which may be amusing or embarrassing, hallucinations in LLM agents operating in dynamic environments can lead to real-world consequences. MIRAGE—short for Measuring Illusions in Risky AGEnt settings—provides a long-overdue framework to elicit, isolate, and evaluate these failures. And the results are sobering: even top models like GPT-4o and Claude hallucinate at least one-third of the time when placed under pressure.

Not All Mistakes Are Hallucinations

A crucial insight from MIRAGE is that hallucinated actions are not just wrong—they’re contextually unfaithful. That is, the agent invents or misperceives key aspects of its current situation and acts accordingly.

Here’s how hallucinated actions differ from other errors:

Aspect	Wrong Actions	Hallucinated Actions
Cause	Misplanning, lack of knowledge	Misperception or fabrication
Faithfulness issue	Not always	Always violates contextual reality
Example	Clicks the wrong existing button	Clicks a nonexistent button

This distinction matters because hallucinations signal deeper issues with how LLM agents interpret their environment—not just how well they execute plans.

Classifying the Mirage: A Three-Way Taxonomy

MIRAGE-Bench refines hallucination into three types based on which aspect of the agent’s cognitive context is violated:

Unfaithful to Task Instructions — e.g., making up new goals or ignoring explicit constraints.
Unfaithful to Interaction History — e.g., repeating a step it just completed, or assuming a success that never happened.
Unfaithful to Environment Observations — e.g., clicking a button that isn’t visible or assuming navigation succeeded when it didn’t.

This agent-specific taxonomy moves beyond traditional NLP benchmarks like TruthfulQA, which only test factuality. In agentic settings, what matters is not truth per se, but fidelity to the moment.

Where Hallucinations Emerge: Six Risk Triggers

MIRAGE-Bench goes further by identifying six recurring patterns that trigger hallucinations:

Risk Setting	Trigger Description	Example Outcome
Out of Scope Queries	Agent is asked something it can’t possibly know	Fabricates a timeline or status update
Unexpected Environmental Transitions	UI didn’t update as expected after an action	Agent assumes success and proceeds erroneously
Unachievable Goal States	Task requirements don’t match environment capability	Agent fakes success or invents missing elements
Ill-Specified Instructions	User-provided instructions are misleading or vague	Agent wrongly trusts user diagnosis
Flawed Interaction History	Agent forgets or contradicts what it previously did	Repeats past actions or misjudges outcomes
Pop-up Distractions	Interface contains misleading UI overlays	Agent clicks ad banners or update modals

Each risk setting is instantiated with contextual snapshots—frozen decision points at which the agent must act. This lets MIRAGE isolate hallucination moments and evaluate them with precision.

Judging the Hallucination: The LLM-as-a-Judge Paradigm

Instead of human raters or hard-coded rules, MIRAGE uses LLMs themselves—like o4-mini or Claude—as judges. Each risk setting has a custom prompt that asks the judge model to:

Identify whether a hallucination risk is present in the snapshot
Score the agent’s action as:
- 1: Faithful
- 0.5: Incomplete (unclear intent)
- 0: Hallucinated

These LLM-based judgments were validated against human labels with >75% agreement and are robust to prompt changes and temperature variation.

Even the Best Models Hallucinate

The headline result? No model is safe. In the MIRAGE benchmark:

GPT-4o: Hallucination Rate = 33.9%, Utility Score = 0.569
Claude 3.5 Sonnet: HR = 30.8%, US = 0.589
Qwen2.5-32B (open-source!): HR = 32.4%, US = 0.581

And these are under deterministic evaluation—no temperature randomness. The hallucinations aren’t freak accidents; they’re reliable failure modes.

Notably, one open-source model (Qwen2.5-32B) performed almost as well as GPT-4o, suggesting that proprietary scale alone isn’t enough. What matters more is alignment with agentic constraints—not just instruction following.

Implications: Why This Matters for Real Deployments

Hallucinations in agents aren’t just academic—they can be dangerous. Consider these real-world analogs:

Fabricated button clicks → Trigger wrong workflows or transactions
Assumed success → Skip error handling or approval steps
Misled responses to ambiguous queries → Generate false assurances or compliance reports

In environments like RPA, customer service, or enterprise automation, these behaviors would be disastrous. The MIRAGE framework helps teams diagnose and design against these risks.

Toward Safer Agents: What Comes Next

MIRAGE-Bench raises the bar for evaluating agentic hallucination, but it’s only a starting point. Future directions include:

Rollout-based dynamic tests (not just frozen snapshots)
Embodied agents in robotics or XR
Multi-agent risk settings, where hallucinations propagate across systems
Agent architecture reforms to track belief state and intent more explicitly

For now, any organization deploying LLM agents should treat hallucinations not as edge cases—but as endemic vulnerabilities.

Cognaptus: Automate the Present, Incubate the Future.

Not All Mistakes Are Hallucinations#

Classifying the Mirage: A Three-Way Taxonomy#

Where Hallucinations Emerge: Six Risk Triggers#

Judging the Hallucination: The LLM-as-a-Judge Paradigm#

Even the Best Models Hallucinate#

Implications: Why This Matters for Real Deployments#

Toward Safer Agents: What Comes Next#