As large language models evolve into autonomous agents, their failures no longer stay confined to text—they materialize as actions. Clicking the wrong button, leaking private data, or falsely reporting success aren’t just hypotheticals anymore. They’re happening now, and MIRAGE-Bench is the first benchmark to comprehensively measure and categorize these agentic hallucinations.

Unlike hallucinations in chatbots, which may be amusing or embarrassing, hallucinations in LLM agents operating in dynamic environments can lead to real-world consequences. MIRAGE—short for Measuring Illusions in Risky AGEnt settings—provides a long-overdue framework to elicit, isolate, and evaluate these failures. And the results are sobering: even top models like GPT-4o and Claude hallucinate at least one-third of the time when placed under pressure.


Not All Mistakes Are Hallucinations

A crucial insight from MIRAGE is that hallucinated actions are not just wrong—they’re contextually unfaithful. That is, the agent invents or misperceives key aspects of its current situation and acts accordingly.

Here’s how hallucinated actions differ from other errors:

Aspect Wrong Actions Hallucinated Actions
Cause Misplanning, lack of knowledge Misperception or fabrication
Faithfulness issue Not always Always violates contextual reality
Example Clicks the wrong existing button Clicks a nonexistent button

This distinction matters because hallucinations signal deeper issues with how LLM agents interpret their environment—not just how well they execute plans.


Classifying the Mirage: A Three-Way Taxonomy

MIRAGE-Bench refines hallucination into three types based on which aspect of the agent’s cognitive context is violated:

  1. Unfaithful to Task Instructions — e.g., making up new goals or ignoring explicit constraints.
  2. Unfaithful to Interaction History — e.g., repeating a step it just completed, or assuming a success that never happened.
  3. Unfaithful to Environment Observations — e.g., clicking a button that isn’t visible or assuming navigation succeeded when it didn’t.

This agent-specific taxonomy moves beyond traditional NLP benchmarks like TruthfulQA, which only test factuality. In agentic settings, what matters is not truth per se, but fidelity to the moment.


Where Hallucinations Emerge: Six Risk Triggers

MIRAGE-Bench goes further by identifying six recurring patterns that trigger hallucinations:

Risk Setting Trigger Description Example Outcome
Out of Scope Queries Agent is asked something it can’t possibly know Fabricates a timeline or status update
Unexpected Environmental Transitions UI didn’t update as expected after an action Agent assumes success and proceeds erroneously
Unachievable Goal States Task requirements don’t match environment capability Agent fakes success or invents missing elements
Ill-Specified Instructions User-provided instructions are misleading or vague Agent wrongly trusts user diagnosis
Flawed Interaction History Agent forgets or contradicts what it previously did Repeats past actions or misjudges outcomes
Pop-up Distractions Interface contains misleading UI overlays Agent clicks ad banners or update modals

Each risk setting is instantiated with contextual snapshots—frozen decision points at which the agent must act. This lets MIRAGE isolate hallucination moments and evaluate them with precision.


Judging the Hallucination: The LLM-as-a-Judge Paradigm

Instead of human raters or hard-coded rules, MIRAGE uses LLMs themselves—like o4-mini or Claude—as judges. Each risk setting has a custom prompt that asks the judge model to:

  1. Identify whether a hallucination risk is present in the snapshot

  2. Score the agent’s action as:

    • 1: Faithful
    • 0.5: Incomplete (unclear intent)
    • 0: Hallucinated

These LLM-based judgments were validated against human labels with >75% agreement and are robust to prompt changes and temperature variation.


Even the Best Models Hallucinate

The headline result? No model is safe. In the MIRAGE benchmark:

  • GPT-4o: Hallucination Rate = 33.9%, Utility Score = 0.569
  • Claude 3.5 Sonnet: HR = 30.8%, US = 0.589
  • Qwen2.5-32B (open-source!): HR = 32.4%, US = 0.581

And these are under deterministic evaluation—no temperature randomness. The hallucinations aren’t freak accidents; they’re reliable failure modes.

Notably, one open-source model (Qwen2.5-32B) performed almost as well as GPT-4o, suggesting that proprietary scale alone isn’t enough. What matters more is alignment with agentic constraints—not just instruction following.


Implications: Why This Matters for Real Deployments

Hallucinations in agents aren’t just academic—they can be dangerous. Consider these real-world analogs:

  • Fabricated button clicks → Trigger wrong workflows or transactions
  • Assumed success → Skip error handling or approval steps
  • Misled responses to ambiguous queries → Generate false assurances or compliance reports

In environments like RPA, customer service, or enterprise automation, these behaviors would be disastrous. The MIRAGE framework helps teams diagnose and design against these risks.


Toward Safer Agents: What Comes Next

MIRAGE-Bench raises the bar for evaluating agentic hallucination, but it’s only a starting point. Future directions include:

  • Rollout-based dynamic tests (not just frozen snapshots)
  • Embodied agents in robotics or XR
  • Multi-agent risk settings, where hallucinations propagate across systems
  • Agent architecture reforms to track belief state and intent more explicitly

For now, any organization deploying LLM agents should treat hallucinations not as edge cases—but as endemic vulnerabilities.


Cognaptus: Automate the Present, Incubate the Future.