
Seeing Is Deceiving: Diagnosing and Fixing Hallucinations in Multimodal AI

“I See What I Want to See”

Modern multimodal large language models (MLLMs)—like GPT-4V, Gemini, and LLaVA—promise to “understand” images. But what happens when their eyes lie? In many real-world cases, MLLMs generate fluent, plausible-sounding responses that are visually inaccurate or outright hallucinated. That’s a problem not just for safety, but for trust. A new paper titled “Understanding, Localizing, and Mitigating Hallucinations in Multimodal Large Language Models” introduces a systematic approach to this growing issue. It moves beyond merely counting hallucinations and instead offers tools to diagnose where they come from—and, more importantly, how to fix them. ...

August 5, 2025 · 3 min · Zelina

Mirror, Mirror in the Model: How MLLMs Learn from Their Own Mistakes

When multimodal large language models (MLLMs) like Gemini or Janus are asked to generate an image and then assess whether that image matches the prompt, you’d expect the model to agree with itself. But a new study shows this harmony is often missing: the model’s own understanding branch disagrees with what its generation branch creates. This phenomenon—called self-contradiction—isn’t just an embarrassing quirk. As it turns out, it may be the most valuable feedback signal MLLMs have. ...

July 23, 2025 · 4 min · Zelina