Seeing Is Deceiving: Diagnosing and Fixing Hallucinations in Multimodal AI

TL;DR for operators

A multimodal model can look at an image and still answer from memory, habit, or linguistic guesswork. That is the uncomfortable core of visual hallucination: the output is fluent, relevant-looking, and sometimes even useful, while being only loosely attached to the pixels it claims to describe.

The practical lesson is not “never use multimodal AI.” That would be tidy, dramatic, and mostly useless. The lesson is narrower and more valuable: visual hallucinations need to be diagnosed by where grounding fails, not merely counted after the model has embarrassed itself.

The strongest operator takeaway is a three-part workflow:

Operator question	Technical translation	Practical action
Does the model hallucinate under visual stress?	Benchmark visually grounded faithfulness, not generic helpfulness	Use adversarial image-question tests, including absence, attribute, relation, OCR, and conflict cases
Where does the hallucination arise?	Check whether generation is actually using image evidence	Inspect image-token attention, modality contribution, or component-level attribution
Can we reduce it without rebuilding the whole model?	Add targeted decoding, reranking, critic, or preference-alignment layers	Deploy verifier gates and abstention paths before high-risk outputs reach users

The key business inference is modest but useful: hallucination control is becoming less like mystical model selection and more like reliability engineering. You still need domain tests. You still need escalation rules. You still need humans where the cost of a false visual claim is high. But you no longer need to pretend that “bigger multimodal model” is a safety strategy. That was never a strategy; it was a procurement mood.

Inspection photos are not evidence if the model treats them as decoration

Picture a property insurer using a multimodal model to review storm-damage photos. The model sees a roof, a water stain, a broken window, and a user question. It replies with a crisp explanation: “The image shows missing shingles and interior water intrusion.” The sentence is plausible. It even sounds professionally useful. The problem is that one of those details may not be in the image.

This is where multimodal hallucination becomes more treacherous than ordinary text hallucination. A text-only model inventing a citation at least fails in a familiar register: it claims knowledge it does not have. A vision-language model that invents a visual detail violates a stronger user assumption. The user thinks the model is reporting what it sees. In reality, the model may be completing a scene from learned priors.

The research thread behind MMHal-Bench and Factually Augmented RLHF makes this failure operationally legible: large multimodal models can produce textual outputs that are not grounded in the supplied multimodal context, and standard helpfulness-style evaluation does not sufficiently penalise that behaviour.¹ That distinction matters. A charming answer is not the same thing as a faithful answer. The industry has spent several years learning this in text. Multimodal AI now offers the same lesson, but with pictures, diagrams, forms, X-rays, receipts, dashboards, and all the other things people trust because they look concrete.

The old framing says: “The model hallucinated.” The better framing asks: “Which part of the system stopped listening to the image?”

MMHal-Bench is a pressure test, not a census

The original version of this article treated MMHal-Bench as though scale were the point. That misses the actual value. MMHal-Bench is not impressive because it is enormous. It is useful because it is pointed.

The benchmark contains 96 challenging image-question pairs based on OpenImages, organised to test whether a model’s response stays grounded in the image rather than drifting toward plausible but unsupported claims.¹ The Hugging Face dataset card describes it as a hallucination-specific benchmark for large multimodal models, with image-question pairs, ground-truth answers, image contents, and example model responses.² In other words, it is closer to an adversarial smoke test than a panoramic audit of every visual failure a production system may encounter.

That difference is not a footnote. It changes how a product team should use the benchmark.

What MMHal-Bench supports	What it does not support
Comparing models on a focused visual-faithfulness challenge	Certifying a model as safe for medical, insurance, legal, or industrial use
Detecting whether an answer introduces unsupported visual content	Measuring every hallucination type across every domain
Stress-testing grounding under selected object and question categories	Proving robustness under video, multi-image, OCR-heavy, or specialised scientific imagery
Penalising confident visual invention	Replacing domain-specific evaluation and human escalation

A small benchmark can still be valuable if it exposes a failure mode cleanly. The mistake is to treat a benchmark score as a deployment passport. Benchmarks are customs officers, not citizenship papers.

The reported result from the Fact-RLHF work is still meaningful: the authors developed MMHal-Bench specifically to penalise hallucinations and reported a 60% improvement on MMHal-Bench over baselines, alongside stronger LLaVA-Bench performance.¹ That shows that alignment and evaluation can be pointed toward visual faithfulness. It does not show that the model has become a reliable visual expert across arbitrary workflows.

Operators should read this result as evidence for a design principle: measure the failure you actually care about. If the risk is visual invention, do not rely on a generic “quality” score. Generic quality is how hallucinations get promoted into production with excellent manners.
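One way to make "measure the failure you actually care about" concrete is to report hallucination rate per test category instead of one blended score. The sketch below is a minimal illustration, not the MMHal-Bench scoring procedure: the `JudgedCase` record, the category names, and the labelled results are all hypothetical, and it assumes each answer has already been judged by a human or a judge model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgedCase:
    """One stress-test item after a human or judge model has labelled it."""
    category: str       # e.g. "absence", "attribute", "relation", "ocr", "conflict"
    hallucinated: bool  # True if the answer asserted unsupported visual content


def hallucination_rate_by_category(cases):
    """Return {category: fraction of cases judged hallucinated}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for case in cases:
        totals[case.category] += 1
        if case.hallucinated:
            failures[case.category] += 1
    return {category: failures[category] / totals[category] for category in totals}


# Hypothetical labelled results, invented purely for illustration.
results = [
    JudgedCase("absence", True),
    JudgedCase("absence", False),
    JudgedCase("attribute", False),
    JudgedCase("attribute", False),
    JudgedCase("ocr", True),
]

rates = hallucination_rate_by_category(results)
overall = sum(case.hallucinated for case in results) / len(results)
```

In this toy data the blended rate looks tolerable while the OCR category fails every time, which is exactly the kind of detail a single quality score hides.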

The costly idea: hallucination is often a modality-routing problem

The intuitive explanation for hallucination is that the model “doesn’t know.” Sometimes that is true. But in multimodal systems, the sharper problem is often that the model has access to visual evidence and still lets language dominate.

Vision-language models typically connect a vision encoder to a language model through some alignment mechanism. The language model remains extremely strong at generating fluent continuations. That is helpful when the image is clear and the query is ordinary. It is dangerous when the image contradicts common priors.

A model asked about a kitchen may mention a refrigerator because kitchens usually contain refrigerators. A model asked about a street may infer cars, signs, or pedestrians because streets often contain them. These priors are not random. They are the residue of training. The difficulty is that plausible scene completion is exactly what we do not want when the user needs visual evidence.

The survey literature on MLLM hallucination separates hallucinations into categories such as object, attribute, and relation errors, and traces causes across data, model, training, and inference stages.³ That taxonomy is useful because it stops teams from treating hallucination as a single blob. An object hallucination in product photography is not the same operational risk as a relation hallucination in a safety inspection, or an OCR hallucination in a customs document.

The mechanism is therefore not “the model is dumb.” It is more irritating: the model is often smart enough to guess well and insufficiently disciplined to say, “I cannot see that.”

Localization changes the question from blame to repair

Once hallucination is viewed as a modality-routing problem, the next question becomes anatomical. Where is the system drifting away from the image?

A 2025 ICLR paper on modular attribution and intervention attacks this directly. The authors use causal mediation analysis and counterfactual edits to identify components that contribute to hallucinated words in large vision-language models. Their main finding is not that “the whole model hallucinates,” which would be scientifically accurate and operationally useless. They find that multi-head attention modules contribute more to hallucination-word probability than MLP modules, and that specific “hallucination heads” tend to concentrate in middle and deeper layers while showing a strong bias toward text tokens.⁴

That is a much more actionable diagnosis. If hallucination-prone components are biased toward text, mitigation can target the imbalance rather than fine-tune the entire model and hope the smoke clears. The same work reports up to a 1.7x reduction in hallucination rate for LLaVA-v1.5-7B on COCO captioning through targeted interventions.⁴

Another line of evidence comes from image-token attention-guided decoding, or iTaD. Xu and colleagues observe that hallucinated output segments tend to show reduced attention from output tokens to image tokens.⁵ Their paper gives a concrete example across four models:

Model	Image-token attention without hallucination	Image-token attention with hallucination
LLaVA-1.5	12.0%	10.9%
InstructBLIP	79.2%	74.8%
MiniGPT-4	40.2%	37.7%
mPLUG-Owl	59.8%	56.0%

The absolute percentages vary wildly across architectures, so the numbers should not be compared as if they were a universal health score. The pattern is the point: within each model, hallucinated segments are associated with lower attention to image tokens. The model is still speaking. It is just speaking with less visual supervision.
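Because the baselines differ so much, a fairer reading is the drop relative to each model's own non-hallucinated share. The short calculation below uses only the four pairs of figures from the table above; the helper name `relative_drop` is introduced here for illustration and is not from the paper.

```python
# Image-token attention shares (percent) copied from the table above:
# (without hallucination, with hallucination).
reported = {
    "LLaVA-1.5": (12.0, 10.9),
    "InstructBLIP": (79.2, 74.8),
    "MiniGPT-4": (40.2, 37.7),
    "mPLUG-Owl": (59.8, 56.0),
}


def relative_drop(without_hallucination, with_hallucination):
    """Fractional drop in image-token attention relative to the model's own baseline."""
    return (without_hallucination - with_hallucination) / without_hallucination


# Relative drop per model, in percent, rounded to one decimal place.
relative_drops = {
    model: round(100 * relative_drop(clean, hallucinated), 1)
    for model, (clean, hallucinated) in reported.items()
}
```

On these figures the relative drops fall between roughly 5.6% and 9.2%. The shift is small but it points in the same direction for all four models.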

This is why localization matters. A product team that only tracks final-answer accuracy is measuring the corpse after the accident. A team that monitors visual grounding signals during generation has a chance to intervene earlier.
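As a rough sketch of what monitoring a grounding signal during generation could look like, the code below computes the share of attention each output token gives to image positions and flags tokens that fall well below a baseline. This is a simple thresholding heuristic written for illustration, not the iTaD algorithm. It assumes you can export per-token attention weights and know which context positions hold image tokens. The function names, the `baseline_share` value, and the `tolerance` parameter are all hypothetical.

```python
def image_attention_share(attention_row, image_positions):
    """Fraction of one output token's attention mass that lands on image tokens."""
    total = sum(attention_row)
    if total == 0:
        return 0.0
    return sum(attention_row[i] for i in image_positions) / total


def flag_low_grounding(attention_rows, image_positions, baseline_share, tolerance=0.9):
    """Return indices of output tokens whose image share is below tolerance * baseline."""
    threshold = tolerance * baseline_share
    return [
        index
        for index, row in enumerate(attention_rows)
        if image_attention_share(row, image_positions) < threshold
    ]


# Toy example: four context positions, the first two are image tokens.
image_positions = {0, 1}
attention_rows = [
    [0.3, 0.3, 0.2, 0.2],      # mostly looking at the image
    [0.1, 0.1, 0.4, 0.4],      # mostly looking at text
    [0.25, 0.25, 0.25, 0.25],  # evenly split
]

# Assumed per-model baseline, calibrated offline on non-hallucinated outputs.
flagged = flag_low_grounding(attention_rows, image_positions, baseline_share=0.5)
```

The baseline has to be calibrated per model, since the absolute shares differ so widely across architectures. A flag is a prompt for a closer check, not proof of hallucination.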

The “Visual Assistant” should be read as a control layer, not a magic chaperone

The original article described a Visual Assistant as a plug-in module that evaluates candidate outputs against image evidence and reranks or filters the answer before release. The useful way to understand this idea is not as a named gadget. It is a system pattern: add a control layer that checks whether the language output remains visually grounded.

That layer can take several forms:

Control layer	How it reduces hallucination risk	Cost profile	Main weakness
Reranker	Scores multiple candidate answers for visual consistency	Moderate latency, no full retraining	Depends on the verifier’s own visual competence
Critic model	Flags unsupported entities, attributes, or relations	Useful for review workflows	May miss subtle domain errors
Attention-guided decoding	Adjusts generation using internal visual-attention signals	Plug-in at inference time	Architecture-dependent and not always available through APIs
Preference alignment	Trains the model to prefer visually faithful responses	Stronger model-level behaviour change	Requires data, training, and regression testing
Abstention gate	Forces “not visible / uncertain” when evidence is weak	Operationally simple	Can frustrate users if overused

This is the difference between research novelty and deployment value. The business value is not that one more clever module exists. It is that hallucination mitigation can be layered. You can combine benchmark tests, grounding monitors, candidate reranking, and escalation rules without rebuilding the base model every quarter like a penitent monk.
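A minimal sketch of two of these layers combined, a reranker with an abstention gate, might look like the following. Everything here is an illustrative assumption: the `Candidate` record, the toy entity-overlap scorer standing in for a real verifier, the `min_score` threshold, and the abstention message. A production verifier would be a detector, OCR engine, or second model, not set intersection.

```python
from dataclasses import dataclass

ABSTAIN = "Not enough visual evidence to answer."


@dataclass(frozen=True)
class Candidate:
    text: str
    claimed_entities: frozenset  # entities the answer asserts are visible


def make_entity_scorer(detected_entities):
    """Toy verifier: fraction of claimed entities confirmed by an external detector."""
    def score(candidate):
        if not candidate.claimed_entities:
            return 0.0  # nothing checkable was claimed, so treat as unsupported
        confirmed = candidate.claimed_entities & detected_entities
        return len(confirmed) / len(candidate.claimed_entities)
    return score


def rerank_with_abstention(candidates, score, min_score=0.5):
    """Return the best-supported answer, or abstain if none clears the threshold."""
    if not candidates:
        return ABSTAIN
    best = max(candidates, key=score)
    return best.text if score(best) >= min_score else ABSTAIN


# Hypothetical detector output for the storm-damage photo.
detected = frozenset({"roof", "water stain", "broken window"})
scorer = make_entity_scorer(detected)

invented = Candidate(
    "The image shows missing shingles and a water stain.",
    frozenset({"missing shingles", "water stain"}),
)
grounded = Candidate(
    "The image shows a water stain and a broken window.",
    frozenset({"water stain", "broken window"}),
)

chosen = rerank_with_abstention([invented, grounded], scorer)
fallback = rerank_with_abstention([invented], scorer, min_score=0.75)
```

With both candidates available the fully supported answer wins. With only the partly invented answer and a stricter threshold, the gate abstains. The weakness named in the table still applies: the result is only as good as the detector behind the scorer.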

Entity-centric preference optimisation offers another route. EMPO argues that existing preference alignment can neglect image-text modality alignment, causing over-reliance on language. It constructs preference data around image, instruction, and response aspects, and reports reductions in hallucination rates of 85.9% on Object-HalBench and 49.8% on MM-HalBench in its experimental setting.⁶ Those are substantial results, but they should be interpreted as benchmark-specific evidence, not a universal discount coupon for risk.

The common thread is clear: hallucination mitigation improves when it forces the model to pay rent to the image.

Main result, robustness check, and business meaning are not the same thing

A recurring failure in AI commentary is to flatten every experiment into “the method works.” That is convenient. It is also how nuance goes to die.

Here is the cleaner reading:

Evidence type	What it shows	Business meaning	Boundary
MMHal-Bench performance	Visual-faithfulness evaluation can expose hallucination-prone behaviour	Add hallucination-specific test sets to model selection	Small benchmark; not domain certification
Fact-RLHF gains	Reward models augmented with factual visual information can reduce hallucination-oriented failures	Preference data should encode grounding, not just user liking	Requires training pipeline and careful regression checks
Modular attribution	Hallucination can be concentrated in identifiable attention components	Internal diagnostics can guide targeted mitigation	Mostly useful where model internals are accessible
Image-token attention patterns	Hallucinated segments may correlate with reduced visual attention	Runtime signals can support gating or decoding controls	Attention is a signal, not a full causal explanation by itself
Entity-centric preference optimisation	Fine-grained image-text alignment can reduce benchmark hallucinations	Preference data should focus on entities, attributes, and relations	Transfer depends on domain, data quality, and task format

The paper-level results directly show that hallucination can be measured, localised, and reduced under defined experimental conditions. Cognaptus’ business inference is that organisations should build multimodal reliability stacks around these principles. What remains uncertain is the degree to which any one mitigation transfers to your particular production domain, especially when the images are messy, proprietary, low-resolution, adversarial, or legally consequential.

That distinction is not academic politeness. It is the line between engineering and sales theatre.

The business value is cheaper diagnosis, not “solved hallucination”

For operators, the useful shift is from model worship to fault isolation.

If you are deploying multimodal AI in claims handling, construction monitoring, warehouse inspection, healthcare triage, retail catalogue enrichment, or document-image workflows, hallucination control should be designed into the operating model. The right question is not “Which model hallucinates least on the internet?” It is “Which errors can we detect before they damage our process?”

A practical deployment stack looks like this:

1. Domain-specific visual stress tests. Start with benchmark categories such as object absence, incorrect attributes, relations, counting, visual text, and conflicting context. Then build your own examples from real workflow failures. Public benchmarks are templates, not substitutes.
2. Grounding-sensitive prompts and output schemas. Force the model to separate “visible evidence,” “inference,” and “uncertain / not visible.” This will not eliminate hallucination, but it makes unsupported claims easier to catch. Ambiguity hidden inside prose is where risk goes to breed.
3. Verifier or critic layer for high-risk outputs. Use a second model, rule-based checker, object detector, OCR engine, or specialised classifier to challenge claims about entities, attributes, and relations. The point is not philosophical certainty. The point is disagreement detection.
4. Abstention and escalation rules. A model that cannot say “not enough visual evidence” is not ready for serious visual workflows. If the product experience requires an answer every time, the product experience is part of the hallucination problem.
5. Post-deployment sampling. Track hallucination categories over time. Models drift less dramatically than business processes do, and the business data keeps changing: new forms, new image angles, new suppliers, new lighting, new user behaviour. Yesterday’s clean validation set becomes tomorrow’s decorative scrapbook.
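The output-schema and verifier steps above can be sketched together. The example below is a minimal illustration under stated assumptions: the `GroundedAnswer` fields mirror the three buckets named in the list, while the function names, the sample claims, and the set of checker labels are hypothetical. It assumes the checker (detector, OCR engine, or second model) returns labels in the same vocabulary as the model's claims, which real systems would need to normalise.

```python
from dataclasses import dataclass, field


@dataclass
class GroundedAnswer:
    """Model output split into the three buckets the schema asks for."""
    visible_evidence: list = field(default_factory=list)  # claimed as directly seen
    inference: list = field(default_factory=list)         # reasoning beyond the pixels
    not_visible: list = field(default_factory=list)       # asked about, not confirmed


def disputed_claims(answer, checker_labels):
    """Claims listed as visible that the independent checker did not confirm."""
    return [claim for claim in answer.visible_evidence if claim not in checker_labels]


def needs_escalation(answer, checker_labels):
    """Route to human review whenever generator and checker disagree."""
    return bool(disputed_claims(answer, checker_labels))


# Hypothetical model output for the storm-damage photo.
answer = GroundedAnswer(
    visible_evidence=["water stain", "missing shingles"],
    inference=["interior water intrusion"],
    not_visible=["roof underlayment"],
)

# Hypothetical labels from an independent checker.
checker_labels = {"water stain", "broken window"}

disputed = disputed_claims(answer, checker_labels)
escalate = needs_escalation(answer, checker_labels)
```

The design choice is to check only the "visible evidence" bucket against the checker. Inferences are kept separate so a reviewer can see which statements the model itself admits go beyond the image.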

The ROI story is therefore not “AI replaces visual review.” It is more specific: AI can pre-process, classify, describe, and triage visual evidence when the system is designed to detect unsupported visual claims and route uncertain cases appropriately. The savings come from reducing routine review load, not from pretending the model has acquired professional liability insurance.

Where this applies, and where it does not

This research direction applies best to workflows where hallucinations can be expressed as mismatches between generated text and supplied visual evidence: object existence, object attributes, spatial relations, counts, visible text, and scene descriptions. It is especially relevant when outputs can be constrained, checked, reranked, or escalated.

It applies less cleanly to tasks requiring specialised expert interpretation. A pathology slide, satellite image, engineering defect photo, or legal document scan may require domain knowledge that is not captured by generic visual grounding. In those settings, a model may be visually faithful but professionally wrong. Faithfulness is necessary; it is not expertise.

There are also technical limits. Attention-based diagnostics are useful, but attention is not a universal truth serum. External verifier models can share the same blind spots as the generator. Preference optimisation can improve benchmark behaviour while introducing new regressions. Small hallucination benchmarks are easy to over-celebrate and eventually easy to overfit. Multi-image and video tasks add temporal consistency problems that single-image tests do not cover neatly.

The boundary is simple: these methods make hallucination more diagnosable and more manageable. They do not make visual AI self-certifying. Any vendor implying otherwise should be asked to upload a picture of their evidence. Then perhaps a second model can check whether the evidence exists.

Conclusion: make the model prove it saw the thing

Multimodal hallucination is dangerous because it borrows authority from the image. The user sees a photo, asks a question, and assumes the answer is grounded in visual inspection. The model, meanwhile, may be blending pixels with priors, prompt cues, dataset habits, and the language model’s ancient desire to complete the sentence.

The better path is not to abandon multimodal AI. It is to stop treating visual output as self-authenticating. Benchmarks such as MMHal-Bench expose the failure. Localization work shows that the failure can be traced to modality imbalance and specific internal components. Mitigation methods show that visual grounding can be improved through targeted intervention, decoding, reranking, and preference alignment.

For business teams, the lesson is refreshingly unglamorous: build the controls around the failure mode. Test for visual faithfulness. Monitor whether image evidence is being used. Add critic layers where claims matter. Escalate uncertainty. Separate what the model sees from what it infers.

Seeing is not believing. In multimodal AI, seeing is a hypothesis. The system still has to prove it looked.

Cognaptus: Automate the Present, Incubate the Future.

1. Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, and Trevor Darrell, “Aligning Large Multimodal Models with Factually Augmented RLHF,” arXiv:2309.14525, 2023; Findings of ACL 2024. https://arxiv.org/abs/2309.14525
2. Shengcao Cao, “MMHal-Bench,” Hugging Face dataset card. https://huggingface.co/datasets/Shengcao1006/MMHal-Bench
3. Zhe Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou, “Hallucination of Multimodal Large Language Models: A Survey,” arXiv:2404.18930, 2024. https://arxiv.org/abs/2404.18930
4. Tianyun Yang, Ziniu Li, Juan Cao, and Chang Xu, “Understanding and Mitigating Hallucination in Large Vision-Language Models via Modular Attribution and Intervention,” ICLR 2025. https://openreview.net/forum?id=Bjq4W7P2Us
5. Xinhao Xu, Hui Chen, Mengyao Lyu, Sicheng Zhao, Yizhe Xiong, Zijia Lin, Jungong Han, and Guiguang Ding, “Mitigating Hallucinations in Multi-modal Large Language Models via Image Token Attention-Guided Decoding,” NAACL 2025. https://aclanthology.org/2025.naacl-long.75/
6. Jiulong Wu, Zhengliang Shi, Shuaiqiang Wang, Jizhou Huang, Dawei Yin, Lingyong Yan, Min Cao, and Min Zhang, “Mitigating Hallucinations in Large Vision-Language Models via Entity-Centric Multimodal Preference Optimization,” arXiv:2506.04039, 2025. https://arxiv.org/abs/2506.04039
