TL;DR for operators

LLM agents do not merely hallucinate by saying false things. They hallucinate when they act on a version of the world that does not match the task, the history, or the screen in front of them.

That is the useful idea in MIRAGE-Bench: it treats agent hallucination as context-unfaithful action. The agent may click a button that is not there, assume a page transition succeeded when it did not, answer a colleague’s question with invented information, submit code despite failed tests, or report success when the environment says otherwise. Very industrious. Very confident. Very much not what you want near production systems.

The paper’s benchmark freezes agents at risky decision points drawn from environments including WebArena, WorkArena, SWE-Bench, OSWorld, TheAgentCompany, and τ-Bench. Instead of rerunning entire long trajectories, it captures the instruction, interaction history, and current observation immediately before a likely hallucination point, then asks models to choose the next action.

The finding is uncomfortable but practical: stronger models help, but they do not remove the problem. GPT-4o records an overall hallucination rate (HR) of 0.339 and a utility score (US) of 0.569; Gemini-2.5-flash records HR 0.308 and US 0.586; Claude-3.5-Sonnet records HR 0.308 and US 0.589; Qwen2.5-32B-Instruct is close at HR 0.324 and US 0.581. The proprietary advantage exists, but it is not a magic moat. Apparently the mirage is not impressed by procurement tiering.

For operators, the implication is direct: evaluate agents at the decision points where hallucination has operational consequences. Before an agent can send messages, delete repositories, submit patches, click workflow controls, update records, approve transactions, or tell a human “done,” test whether it remains faithful to what it actually knows. MIRAGE-Bench is best understood as a diagnostic pattern for agent QA, not as proof that any model is safe or unsafe in every deployment.

The mirage starts when an agent treats missing context as permission

Consider a workplace agent asked to message several employees about password problems. It finishes the first message, attempts to navigate to the second employee’s direct message, but the interface does not actually switch. The observation still shows the first conversation. A reliable agent should notice that mismatch and retry, pause, or ask for help.

A hallucinating agent does something more dangerous: it behaves as if the transition succeeded. It sends the second person’s information into the first person’s chat.

That is not a factual hallucination in the old chatbot sense. It is not the model inventing a biography, a citation, or a tasteful lie about having “checked the database.” It is an action grounded in a false internal belief about the current state of the world.

MIRAGE-Bench, by Weichen Zhang, Yiyou Sun, Pohao Huang, Jiayue Pu, Heyue Lin, and Dawn Song, is built around this distinction [1]. The paper defines agent hallucinations as actions unfaithful to the agent’s cognitive context: the task instruction, the interaction history, or the current environment observation. That definition is the paper’s real contribution. The benchmark matters because the failure has been named at the right level.

A wrong action is broad. It may come from poor planning, missing domain knowledge, bad tool grounding, or limited context length. A hallucinated action is narrower: the agent fabricates, misreads, or ignores something in the context and then acts as if the fabricated version were real.

That distinction matters in business settings because the remedies differ. If an agent lacks domain knowledge, you may improve retrieval or training data. If an agent misplans, you may add decomposition, better tools, or human review. If an agent is unfaithful to context, you need state verification, scope control, memory checks, and action gating. Giving the model more documents will not necessarily stop it from clicking the button it imagined. In fact, the paper explicitly notes that post-training or external knowledge integration may fail to ensure faithfulness and can even encourage over-reliance on prior knowledge over present context. A charming way to make the model smarter and less obedient to reality.

Three surfaces where agents lose fidelity

MIRAGE-Bench divides agent hallucination into three context surfaces. This is not decorative taxonomy. It is a debugging map.

| Context surface | What goes wrong | Typical business version | Why it matters |
| --- | --- | --- | --- |
| Task instructions | The agent exceeds the user’s goal, invents an unstated intent, or follows a misleading instruction without checking it. | A support agent gives unauthorised advice because the user implied it wanted a shortcut. | The failure may look like initiative, which is exactly why it is dangerous. |
| Interaction history | The agent forgets what already happened, ignores failed attempts, repeats ineffective steps, or treats unresolved test output as success. | A coding agent submits a patch after failed verification because it narrates the failure as fixed. | The agent’s own trajectory becomes unreliable evidence. |
| Environment observations | The agent misreads the current UI, filesystem, webpage, tool output, or state transition. | An automation agent clicks a non-existent field or assumes navigation succeeded after the page stayed unchanged. | This is where false belief becomes external action. |

The important correction is that hallucination in agents is not primarily about “truth” in the abstract. It is about fidelity to the situation.

A chatbot can hallucinate while sitting harmlessly in a text box. An agent hallucination is a decision made under a false model of the operational environment. When the agent has tools, permissions, and workflow access, the hallucination does not stay in language. It becomes a message sent, a file changed, a test ignored, a record updated, or a control clicked.

This is why MIRAGE-Bench’s framing is more useful than another general “AI makes mistakes” warning. It gives operators three questions to ask at every high-risk action:

  1. Is the action faithful to the user’s actual task?
  2. Is it faithful to what the agent has already done and observed?
  3. Is it faithful to the current environment state?

If the answer to any of those is no, the agent is not merely imperfect. It is acting on a mirage.

MIRAGE freezes the dangerous second before the mistake

The clever methodological move in the paper is the snapshot strategy.

Full agent rollouts are messy. The same starting task can branch differently across models, seeds, tools, timing, UI states, and accidental failures. One run gets stuck before the interesting part. Another reaches the key decision but through a different path. A third fails because of some unrelated interface issue. This makes hallucination hard to reproduce, compare, and score.

MIRAGE-Bench avoids that by collecting contextual snapshots. The authors run agents in interactive environments, identify the decision point where a risk trigger appears, filter out unrelated failures, and then freeze the full context: task instruction, interaction history, and current observation. Each model is evaluated on the next action from the same frozen state.

This is not a full substitute for dynamic deployment testing. It is a diagnostic lens. The snapshot asks: given the evidence available at this exact moment, does the model choose a context-faithful action?
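The snapshot idea can be sketched as a small data structure plus a replay loop. This is a minimal illustration, not the paper's code: the `Snapshot` fields, the `replay` function, and both stub policies are hypothetical names, and a real policy would call a model rather than return a fixed string.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Snapshot:
    """One frozen decision point. All field names here are illustrative."""
    snapshot_id: str
    risk_category: str
    instruction: str                      # the task the agent was given
    history: Tuple[Tuple[str, str], ...]  # earlier (action, observation) pairs
    observation: str                      # what the agent sees right now


# A policy maps one frozen context to one proposed next action.
Policy = Callable[[Snapshot], str]


def replay(snapshots: List[Snapshot],
           policies: Dict[str, Policy]) -> Dict[str, Dict[str, str]]:
    """Collect each policy's next action on the same frozen contexts."""
    return {
        name: {snap.snapshot_id: policy(snap) for snap in snapshots}
        for name, policy in policies.items()
    }


# Toy snapshot modelled on the failed direct-message switch described earlier.
snap = Snapshot(
    snapshot_id="dm-switch-001",
    risk_category="unexpected_transition",
    instruction="Message each employee about their password problem.",
    history=(("open_dm('employee_2')", "Conversation with employee_1"),),
    observation="Conversation with employee_1",
)


def naive_policy(s: Snapshot) -> str:
    # Stand-in for a model that assumes the navigation worked.
    return "send_message('employee_2 password details')"


def cautious_policy(s: Snapshot) -> str:
    # Stand-in for a model that checks the observation before acting.
    if "employee_2" not in s.observation:
        return "retry_navigation('employee_2')"
    return "send_message('employee_2 password details')"


actions = replay([snap], {"naive": naive_policy, "cautious": cautious_policy})
print(actions)
```

Because every policy sees the identical frozen context, any difference in the proposed action comes from the policy itself and not from how the trajectory happened to unfold.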

The paper builds these snapshots from six environments:

| Environment | What it contributes to the benchmark |
| --- | --- |
| WebArena | Web automation and e-commerce-style interface tasks |
| WorkArena | Enterprise software workflows, including infeasible form and list operations |
| SWE-Bench | Software debugging, code edits, execution feedback, and patch submission |
| OSWorld | Operating-system style GUI contexts and pop-up observations |
| TheAgentCompany | Workplace-style tasks with tools, files, chat, and simulated colleagues |
| τ-Bench | Task-oriented interactions with databases and LLM-driven users |

The benchmark is then scaled by contextual editing. For example, in out-of-scope workplace queries, the authors use o4-mini to rewrite only the relevant user reply node in an accessibility tree while preserving the surrounding structure. Appendix B reports that these edits leave the mean normalized tree-edit distance below 1% and the node-level Jaccard overlap above 99%. That test is a synthetic-data fidelity check, not a second thesis. It supports the claim that edited snapshots preserve the surrounding environment while changing the reasoning challenge.
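For readers who want to run a similar fidelity check on their own edited snapshots, node-level Jaccard overlap is simple to compute. The sketch below assumes each accessibility-tree node can be identified by a `(role, text)` pair; that node identity is an assumption made for illustration, and the paper's exact definition may differ. Tree-edit distance is left out because it needs a dedicated algorithm.

```python
from typing import Iterable, Tuple

# Assumed node identity: (role, text). Duplicate nodes collapse in a set.
Node = Tuple[str, str]


def node_jaccard(original: Iterable[Node], edited: Iterable[Node]) -> float:
    """Share of distinct nodes common to both trees: |A & B| / |A | B|."""
    a, b = set(original), set(edited)
    if not a and not b:
        return 1.0  # two empty trees are treated as identical
    return len(a & b) / len(a | b)


# Toy tree: 299 untouched nodes plus one reply node that the edit rewrites.
shared = [("listitem", f"row {i}") for i in range(299)]
original_tree = shared + [("text", "original colleague reply")]
edited_tree = shared + [("text", "rewritten colleague reply")]

overlap = node_jaccard(original_tree, edited_tree)
print(f"{overlap:.4f}")
```

Rewriting a single node in a 300-node tree leaves 299 shared nodes out of 301 distinct ones, so the overlap stays above 0.99, which is the kind of result a single-node edit should produce.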

For operators, the lesson is straightforward: you do not need to evaluate every possible agent trajectory to start finding dangerous behaviour. Mine logs for risky decision points. Freeze them. Replay them across candidate models and agent policies. Score whether the next action remains faithful to the available context.

That is cheaper than discovering the same bug after the agent has informed the wrong customer, deleted the wrong repository, or triumphantly submitted broken code.

Six risk families become seven evaluated traps

The paper identifies six main risk settings. In the results table, flawed interaction history is split into repetitive and erroneous variants, so the benchmark reports seven evaluated categories.

| Evaluated category | What the risk tests | Example failure pattern |
| --- | --- | --- |
| Out-of-scope queries | Whether an agent admits it lacks information when a human asks a related but unanswerable follow-up. | The agent invents a department, timeline, policy, or meeting detail. |
| Unexpected environmental transitions | Whether an agent notices that the environment did not change after an action. | The agent assumes a DM switch succeeded and sends sensitive content to the wrong chat. |
| Unachievable goal states | Whether an agent recognises that the requested goal cannot be completed in the current environment. | The agent invents a missing field, selects a vaguely similar column, or reports success anyway. |
| Ill-specified instructions | Whether an agent resists plausible but wrong user-provided reasoning. | A coding agent accepts a misleading bug diagnosis and edits the wrong logic. |
| Flawed interaction history: repetitive | Whether an agent notices it has repeated an ineffective action. | The agent clicks the same invalid UI target again and again, as though persistence were a debugging strategy. |
| Flawed interaction history: erroneous | Whether an agent correctly interprets failed verification or runtime output. | The agent submits a patch despite visible errors or all-zero outputs. |
| Pop-up distractions | Whether an agent ignores irrelevant but task-related environmental clutter. | The agent clicks a promotional or update pop-up instead of continuing the original task. |

The distribution is also revealing. The benchmark contains 1,050 snapshots: 232 out-of-scope query cases, 199 flawed-history erroneous cases, 152 flawed-history repetitive cases, 141 unachievable goal cases, 138 ill-specified instruction cases, 100 pop-up distraction cases, and 88 unexpected transition cases.
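A quick arithmetic check confirms that the seven counts add up to the stated total and shows each category's share of the benchmark. The dictionary keys are shorthand labels chosen for this sketch; the counts are the ones quoted above.

```python
# Snapshot counts per evaluated category, as quoted in the text.
counts = {
    "out_of_scope": 232,
    "history_erroneous": 199,
    "history_repetitive": 152,
    "unachievable_goal": 141,
    "ill_specified": 138,
    "popup": 100,
    "unexpected_transition": 88,
}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

print(total)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The largest category is a little over a fifth of the benchmark and the smallest under a tenth, so the overall scores are a weighted blend in which no single risk setting dominates.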

This distribution should not be read as “out-of-scope queries are the most common hallucination in the world.” It reflects the authors’ curated benchmark composition. The more useful point is that hallucination risk is not a single bug. It appears when the agent must decide whether to trust the instruction, the history, the environment, or its own helpful impulse.

The out-of-scope setting is especially business-relevant because it looks like ordinary workplace communication. A colleague asks a follow-up question. The agent does not know the answer. The correct behaviour is to say it lacks the information, ask for clarification, or escalate. Many agents instead answer confidently.

That is not just wrong. It is socially dangerous because it arrives in the format of workplace authority. A fabricated answer in chat can become someone else’s input, and now the hallucination has entered the organisation’s bloodstream. Congratulations, the spreadsheet is haunted.

The headline is persistence, not which model wins

The paper evaluates a broad set of instruction-tuned open and proprietary models using deterministic decoding. Its two headline metrics are:

  • Utility Score (US): average action quality, with faithful, incomplete, and hallucinated actions mapped into a utility-style score.
  • Hallucination Rate (HR): the proportion of snapshots where the model produces a clear hallucinated action.
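As a rough sketch of how the two metrics relate, suppose each snapshot receives one of three judge labels. The numeric mapping below (faithful 1.0, incomplete 0.5, hallucinated 0.0) is an assumption made for illustration; the text above only says the three outcomes are mapped into a utility-style score.

```python
from typing import Sequence

# Assumed mapping from judge label to utility; not taken from the text above.
SCORE = {"faithful": 1.0, "incomplete": 0.5, "hallucinated": 0.0}


def utility_score(labels: Sequence[str]) -> float:
    """Mean utility across snapshots."""
    if not labels:
        raise ValueError("need at least one label")
    return sum(SCORE[label] for label in labels) / len(labels)


def hallucination_rate(labels: Sequence[str]) -> float:
    """Fraction of snapshots judged as clearly hallucinated."""
    if not labels:
        raise ValueError("need at least one label")
    return sum(label == "hallucinated" for label in labels) / len(labels)


labels = ["faithful", "faithful", "incomplete", "hallucinated"]
us = utility_score(labels)
hr = hallucination_rate(labels)
print(us, hr)
```

Under this mapping the two numbers are not mirror images: an incomplete action lowers the utility score without raising the hallucination rate, which is why both are worth reporting.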

The exact winner is less important than the pattern. Hallucination persists across model families.

| Model | Overall Utility Score | Overall Hallucination Rate |
| --- | --- | --- |
| DeepSeek-reasoner | 0.641 | 0.257 |
| Claude-3.7-Sonnet | 0.585 | 0.290 |
| Claude-3.5-Sonnet | 0.589 | 0.308 |
| Gemini-2.5-flash | 0.586 | 0.308 |
| Qwen2.5-32B-Instruct | 0.581 | 0.324 |
| GPT-4o-2024-11-20 | 0.569 | 0.339 |
| Qwen2.5-72B-Instruct | 0.579 | 0.334 |
| DeepSeek-chat | 0.549 | 0.362 |
| GPT-4o-mini-2024-07-18 | 0.518 | 0.381 |
| Qwen2.5-7B-Instruct | 0.488 | 0.416 |
| Llama-3.3-70B-Instruct | 0.488 | 0.433 |
| Llama-3.1-70B-Instruct | 0.480 | 0.442 |

Two cautions belong beside this table.

First, the paper marks DeepSeek-chat and DeepSeek-reasoner as incomplete in the main results because some snapshots reach around 10k tokens and run into context-window limits. Their numbers are therefore informative, but not directly comparable with the fully completed evaluations.

Second, the model ranking is not the business story. The business story is that several capable models still hallucinate in roughly a quarter to over two-fifths of risky snapshots. The benchmark is intentionally stress-inducing, so those rates should not be projected onto all enterprise tasks. But if your deployment includes similar decision points, the result is a large red sign reading: “Do not trust generic task success metrics alone.”

The proprietary-versus-open comparison is also more subtle than a procurement department might like. Proprietary models generally perform better, but the gap is modest. Qwen2.5-32B-Instruct reports US 0.581 and HR 0.324, close to GPT-4o’s US 0.569 and HR 0.339, and near Claude-3.5-Sonnet and Gemini-2.5-flash on overall utility. The implication is not “open models are just as good everywhere.” The implication is sharper: buying a stronger model does not remove the need for agent-specific faithfulness testing.

Scale and instruction tuning help. They do not dissolve mirages.

The failures are uneven, which is exactly the useful part

The risk categories do not behave the same way.

Unachievable goal states are painful for many models. When a field, column, resource, or capability is missing, agents often try to force completion instead of reporting infeasibility. In WorkArena, the paper gives the example of a task asking the agent to sort by a non-existent “Company eye” column. The agent sees “Company,” treats it as close enough, and proceeds. This is the enterprise automation version of squinting until reality becomes convenient.
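One way to guard against this "close enough" substitution is to resolve requested fields by exact match only and to report infeasibility otherwise. The sketch below is illustrative: `resolve_column`, `plan_sort`, and the action dictionaries are hypothetical names, not part of WorkArena or the paper.

```python
from typing import Dict, List, Optional


def resolve_column(requested: str, available: List[str]) -> Optional[str]:
    """Return a column only on an exact match, ignoring case and outer spaces."""
    wanted = requested.strip().casefold()
    for column in available:
        if column.strip().casefold() == wanted:
            return column
    return None  # deliberately no fuzzy or prefix matching


def plan_sort(requested: str, available: List[str]) -> Dict[str, str]:
    """Plan a sort, or report the blocker when the column does not exist."""
    column = resolve_column(requested, available)
    if column is None:
        return {
            "action": "report_infeasible",
            "reason": f"No column named {requested!r}; available: {available}",
        }
    return {"action": "sort", "column": column}


columns = ["Company", "Country", "Created"]
blocked = plan_sort("Company eye", columns)
allowed = plan_sort("company", columns)
print(blocked["action"], allowed)
```

The design choice is to make refusal a first-class outcome: the planner has a legitimate action to return when the goal cannot be met, so it is never forced to pick the nearest lookalike.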

Unexpected environmental transitions are another high-value category. The agent takes an action, but the environment remains unchanged. A faithful agent must update its plan based on the new observation. A hallucinating agent continues as if its previous intention had become reality. In workflow systems, that is how agents skip confirmation steps, misroute messages, and compound small UI failures into material errors.
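A minimal sketch of the check a faithful agent needs here, assuming observations are plain strings and the caller knows a marker that should appear once the transition has succeeded. The function name, the three return values, and the marker convention are all hypothetical.

```python
def verify_transition(before: str, after: str, expected_marker: str) -> str:
    """Decide what to do after an action, based on the new observation."""
    if expected_marker in after:
        return "proceed"            # the environment shows the intended state
    if after == before:
        return "retry_or_escalate"  # nothing changed: the action had no effect
    return "reassess"               # something changed, but not as intended


# Toy version of the failed direct-message switch.
before = "Conversation with employee_1"
after_failed = "Conversation with employee_1"
after_ok = "Conversation with employee_2"

failed = verify_transition(before, after_failed, "employee_2")
succeeded = verify_transition(before, after_ok, "employee_2")
print(failed, succeeded)
```

The check reads the observation that actually came back, not the action the agent intended, which is exactly the distinction between a faithful agent and a hallucinating one in this category.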

Ill-specified instructions capture a different mechanism. The user provides plausible but wrong reasoning. In SWE-Bench examples, the agent can accept a misleading bug diagnosis and implement a fix aligned with the story rather than the code evidence. This failure is particularly relevant for enterprise copilots because business users often provide explanations, not just requests. Some of those explanations are wrong. Yes, even from senior people. Especially from senior people.

Flawed interaction history comes in two flavours. Repetitive failures test whether the agent recognises it is stuck. Erroneous final-step failures test whether it properly interprets verification output before finalising. The second is crucial for coding and analytics agents. A model that narrates failed tests as successful verification is not “almost done.” It is producing a false completion signal.
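For the repetitive flavour, a simple loop detector can run outside the model. The sketch assumes the agent framework records, for each step, the action taken and whether the observation changed afterwards; the window size and repeat threshold are arbitrary illustrative defaults, not values from the paper.

```python
from collections import Counter
from typing import List, Tuple

# Each step: (action, whether the observation changed afterwards).
Step = Tuple[str, bool]


def is_stuck(history: List[Step], window: int = 6, max_repeats: int = 3) -> bool:
    """True if one action failed to change the state max_repeats times recently."""
    recent = history[-window:]
    ineffective = Counter(action for action, changed in recent if not changed)
    return any(count >= max_repeats for count in ineffective.values())


looping = [("click('a42')", False)] * 3
progressing = [
    ("click('a42')", False),
    ("click('a42')", True),
    ("fill('b7', 'Q3')", False),
]

print(is_stuck(looping), is_stuck(progressing))
```

When the detector fires, the framework can force a replan or an escalation instead of letting the agent try the same invalid target a fourth time.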

Pop-up distractions are the interesting exception. In this benchmark, most models largely ignore injected pop-up elements and continue the original task. The paper suggests a possible reason: the pop-ups are appended in accessibility-tree form, which may make them structurally easier to separate from the main task than visually occluding pop-ups in multimodal GUI settings. Stronger models show mild susceptibility, possibly because they attend more broadly to extraneous context. That result is useful precisely because it does not scream catastrophe. It says observation format matters.

For business teams, the unevenness is good news. It means agent QA should not be a generic “hallucination score.” It should be a risk profile.

| Paper result | Operational interpretation |
| --- | --- |
| Missing affordances trigger forced completion. | Add infeasibility reporting, allow safe refusal, and test agents on deliberately impossible tasks. |
| Failed transitions are often ignored. | Require state-change verification after clicks, navigation, submissions, and record updates. |
| Misleading instructions can override evidence. | Make agents validate user assumptions against tools, files, logs, or tests before acting. |
| Repetitive loops persist without awareness. | Add loop detectors and force alternative planning after repeated ineffective actions. |
| Failed verification may be narrated as success. | Block finalisation until objective checks pass or unresolved signals are explicitly explained. |
| Pop-up susceptibility depends on observation structure. | Treat UI representation as a safety variable, not just a convenience detail. |
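The finalisation rule in the table can be sketched as a gate that sits between the agent and the "done" signal. This is one possible shape, with hypothetical names (`CheckResult`, `finalisation_decision`); it also treats "no checks were run" as a reason to block, which is a design assumption and not something the paper prescribes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def finalisation_decision(checks: List[CheckResult]) -> Tuple[str, List[str]]:
    """Allow completion only when every objective check has passed."""
    if not checks:
        return "block", ["no objective checks were run"]
    failures = [f"{c.name}: {c.detail}" for c in checks if not c.passed]
    if failures:
        return "block", failures
    return "allow", []


checks = [
    CheckResult("unit_tests", False, "2 tests failed"),
    CheckResult("lint", True),
]
decision, reasons = finalisation_decision(checks)
print(decision, reasons)
```

The gate consumes tool results directly, so the agent's own narration of those results never decides whether the task counts as complete.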

LLM-as-judge is the scaffolding, not the foundation stone

MIRAGE-Bench uses LLM-as-judge evaluation because agent hallucinations are hard to score with simple string matching or external factuality tools. The judge receives the context, the agent’s reasoning and action, and a risk-specific rubric. It then assigns a score.

This is reasonable for the problem. It is also a boundary.

The authors do not simply wave a judge model at the benchmark and call it science. They validate the judging framework on a 160-sample subset across risk settings. Against human reference annotations, o4-mini reaches accuracy 0.756 and ZeroAcc 0.789; Claude-3.5-Sonnet reaches accuracy 0.769 and ZeroAcc 0.895; Gemini-2.5-flash reaches accuracy 0.769 and ZeroAcc 0.806. ZeroAcc measures agreement specifically on hallucinated cases.
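To make the two agreement measures concrete, here is a small sketch. It assumes judge and human scores take values in {0, 0.5, 1} with 0 meaning hallucinated, and it reads ZeroAcc as the fraction of human-labelled hallucinated cases that the judge also scores as 0. Both the score scale and that reading of ZeroAcc are assumptions based on the description above, not the paper's formal definitions.

```python
from typing import Sequence


def accuracy(judge: Sequence[float], human: Sequence[float]) -> float:
    """Fraction of samples where the judge's score equals the human score."""
    if len(judge) != len(human) or not human:
        raise ValueError("need two equal-length, non-empty score lists")
    return sum(j == h for j, h in zip(judge, human)) / len(human)


def zero_acc(judge: Sequence[float], human: Sequence[float]) -> float:
    """Agreement restricted to cases the human labelled as hallucinated (0)."""
    if len(judge) != len(human):
        raise ValueError("score lists must have equal length")
    pairs = [(j, h) for j, h in zip(judge, human) if h == 0]
    if not pairs:
        raise ValueError("no human-labelled hallucinated cases")
    return sum(j == 0 for j, _ in pairs) / len(pairs)


human_scores = [0, 0, 0.5, 1, 1]
judge_scores = [0, 0.5, 0.5, 1, 1]

acc = accuracy(judge_scores, human_scores)
zacc = zero_acc(judge_scores, human_scores)
print(acc, zacc)
```

In the toy data the judge agrees on four of five samples but catches only one of the two hallucinated ones, which shows how a judge can look accurate overall while still missing the cases that matter most.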

They also test self-consistency for o4-mini under temperature variation, reporting accuracy 0.849 and 0.819 against the deterministic baseline, with ZeroAcc 0.845 and 0.831. Prompt-format perturbation tests report accuracy above 0.75 and ZeroAcc above 0.85.

These are robustness and validation tests, not main evidence about agent safety. Their purpose is to support the evaluation machinery. They show that the judging method is not wildly unstable across model judges, sampling variation, or prompt structure. They do not prove that every individual label is correct, nor that LLM judges should replace human review in regulated workflows.

For enterprise use, the right interpretation is pragmatic: LLM judges can triage and scale evaluation, especially for internal agent QA. But high-impact action categories should still get human-audited samples, disagreement analysis, and incident review. The judge is a useful instrument. It is not the compliance department wearing a clever hat.

What MIRAGE changes in enterprise agent QA

The most useful business inference from the paper is not “choose Model X.” It is “test agents where hallucination becomes action.”

A conventional agent evaluation often asks whether the task eventually completes. MIRAGE-Bench asks whether the next action is faithful at the decision point where things can go wrong. That changes the QA unit from whole-task success to risk-triggered decision quality.

For a company deploying LLM agents, that suggests a practical evaluation stack:

| Deployment control | What to test using MIRAGE-style snapshots |
| --- | --- |
| Scope boundary control | Does the agent refuse or escalate when asked for information outside its available context? |
| State transition verification | After a click, navigation, upload, API call, or workflow update, does the agent confirm the environment actually changed? |
| Infeasibility handling | When a requested field, file, user, permission, or product does not exist, does the agent report the blocker instead of improvising? |
| User-claim verification | Does the agent test user-provided diagnoses against evidence before editing code, changing records, or advising action? |
| Loop awareness | Does the agent detect repeated failed actions and change strategy? |
| Completion gating | Does the agent avoid “done” unless logs, tests, tool outputs, or state checks support completion? |
| Distraction resistance | Does the agent distinguish primary task affordances from unrelated pop-ups, banners, notifications, or recommendations? |

This is where MIRAGE-Bench has immediate procurement value. Instead of comparing agents on generic demos, an enterprise buyer can construct a private snapshot suite from its own workflows:

  • failed login transitions;
  • missing CRM fields;
  • ambiguous approval messages;
  • repeated failed upload attempts;
  • unresolved test output;
  • misleading user notes;
  • irrelevant but clickable UI overlays;
  • incomplete database responses;
  • tool errors that look superficially successful.

Then compare models, prompts, agent frameworks, memory policies, and guardrails on the same frozen contexts.
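Once a private suite has been judged, the comparison can be reduced to a per-category hallucination rate for each configuration. The sketch below assumes each judged result is a `(configuration, category, label)` triple; the configuration and category names are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One judged result: (configuration, risk category, judge label).
Record = Tuple[str, str, str]


def risk_profile(records: Iterable[Record]) -> Dict[Tuple[str, str], float]:
    """Hallucination rate for every (configuration, category) pair."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    hallucinated: Dict[Tuple[str, str], int] = defaultdict(int)
    for config, category, label in records:
        key = (config, category)
        totals[key] += 1
        if label == "hallucinated":
            hallucinated[key] += 1
    return {key: hallucinated[key] / totals[key] for key in totals}


records = [
    ("model_a+guardrail", "failed_login_transition", "faithful"),
    ("model_a+guardrail", "failed_login_transition", "hallucinated"),
    ("model_a+guardrail", "missing_crm_field", "faithful"),
    ("model_b", "failed_login_transition", "hallucinated"),
    ("model_b", "missing_crm_field", "incomplete"),
]

profile = risk_profile(records)
for key, rate in sorted(profile.items()):
    print(key, rate)
```

Keeping the categories separate preserves the unevenness that a single aggregate score would hide, so a configuration that is safe on missing fields but weak on failed transitions shows up as such.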

This matters because agent reliability is not just a model property. It is a system property. The model, prompt, tool wrapper, observation format, memory representation, retry logic, and approval policy all shape whether the agent acts on reality or on vibes with syntax.

The paper directly shows less than managers may want, and more than vendors may enjoy

MIRAGE-Bench directly shows three things.

First, agent hallucination can be operationalised as contextual unfaithfulness across task instruction, interaction history, and environment observation.

Second, realistic agent benchmarks contain recurring risk settings where this unfaithfulness can be elicited, frozen, scaled, and evaluated.

Third, current strong models still produce substantial hallucinated actions in those risk settings, with only modest separation between some proprietary and open models.

Cognaptus would infer a fourth point for business use: agent deployment should be gated by risk-specific snapshot testing, not just aggregate task-completion demos.

That inference is reasonable, but it is still an inference. The paper does not prove that a given enterprise agent will fail at the same rates. It does not test every tool stack, permission design, UI modality, memory architecture, or human-in-the-loop process. It does not evaluate long-run incident frequency after mitigations. It provides a diagnostic benchmark, not an insurance policy.

The distinction matters. A vendor could score well on MIRAGE-style snapshots and still fail in dynamic deployment because of tool latency, hidden UI states, multimodal perception errors, permission edge cases, or poorly designed escalation flows. Conversely, a model with mediocre raw scores may become safer inside a well-designed agent architecture with strict state checks and constrained actions.

MIRAGE-Bench measures a failure mechanism. It does not replace system engineering.

Where the evidence stops

The paper’s limitations are not fatal, but they are operationally important.

The first boundary is static evaluation. Snapshots freeze one decision point. That improves reproducibility and comparison, but it cannot fully capture how agents recover or deteriorate over subsequent steps. The paper acknowledges this and calls for dynamic rollout-based assessments.

The second boundary is LLM-as-judge dependence. The authors validate judge agreement with humans and test robustness, which strengthens the method. But judge models can still inherit blind spots, especially on subtle domain-specific actions. In high-stakes settings, LLM judging should be a scaling layer, not the final authority.

The third boundary is modality. Much of the benchmark relies on textual representations such as accessibility trees. The pop-up result especially should not be overgeneralised to visual GUI agents or embodied agents. A pop-up appended neatly to an accessibility tree is not the same as a visually occluding modal in a live browser. Observation format changes behaviour.

The fourth boundary is coverage. The six risk families are representative, not exhaustive. Real deployments will create additional risk settings: stale retrieved documents, partial API failures, conflicting permissions, multi-agent handoff errors, hidden prompt injections, customer-specific policy exceptions, and silent data quality problems. The mirage has cousins.

The fifth boundary is model and context constraints. Some evaluations are incomplete for models affected by context-window limits. Long interaction histories are precisely where agent systems often struggle, so context management should be treated as part of the deployment risk, not a footnote.

These boundaries should guide adoption, not dismiss the work. The correct response is not “the benchmark is imperfect, therefore ignore it.” The correct response is “use the mechanism, adapt the snapshots, and test the risks your system actually creates.”

The practical conclusion: buy agents like control systems, not chatbots

MIRAGE-Bench is useful because it moves the conversation from “LLMs hallucinate” to “agents act unfaithfully at identifiable decision points.”

That is the level where operators can do something.

A chatbot hallucination asks for better answers. An agent hallucination asks for better controls: scope boundaries, state verification, infeasibility reporting, loop detection, finalisation gates, human escalation, and environment-aware evaluation. The difference is not cosmetic. It is the difference between a wrong paragraph and a wrong action.

The common misconception is that frontier models will mostly solve this by getting smarter. MIRAGE-Bench makes that belief harder to maintain. Stronger models reduce some failures, but they still assume transitions, fabricate missing details, over-answer out-of-scope queries, follow misleading instructions, and report completion on shaky evidence.

So the operator’s question should change.

Not: “Which agent is most impressive in a demo?”

Ask instead: “At the exact moment this agent is about to send, submit, delete, approve, edit, or declare success, does it still know what is real?”

That is less glamorous than another autonomous-agent showcase. It is also much closer to how production systems survive contact with reality.

Reality, inconveniently, still matters.

Cognaptus: Automate the Present, Incubate the Future.


[1] Weichen Zhang, Yiyou Sun, Pohao Huang, Jiayue Pu, Heyue Lin, and Dawn Song, “MIRAGE-Bench: LLM Agent is Hallucinating and Where to Find Them,” arXiv:2507.21017, 2025. https://arxiv.org/pdf/2507.21017