AI Hallucination

TL;DR for operators: A multimodal model can receive two exercise videos, describe both convincingly, and still fail to determine which person bent the relevant joint further. Apparently, seeing two videos is not the same as comparing them. A minor distinction, unless the product is marketed as a coach. MotionHalluc tests this gap using 1,540 questions constructed from 553 paired fitness videos. Its most revealing experiment simply reverses the query and reference videos while leaving the proposed corrective instruction unchanged. Several models that perform strongly in the expected order collapse when the order is reversed. LLaVA-OV-1.5-8B, for example, falls from 98.39% accuracy to 1.92%. ...
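For operators who want to run the same kind of swap check on their own evaluation set, here is a minimal Python sketch. It is not MotionHalluc's actual harness. `PairedItem`, `swap`, `accuracy`, the file names, and the toy `always_yes` model are all hypothetical names invented for illustration. The sketch also assumes that each question asks whether the corrective instruction applies to the query video relative to the reference, so that swapping the two videos flips the correct answer.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class PairedItem:
    """One yes/no question about a query video compared with a reference video."""
    query_video: str
    reference_video: str
    instruction: str           # proposed corrective instruction, e.g. "Bend the knee further"
    instruction_applies: bool  # ground truth for this ordering (assumed label format)


def swap(item: PairedItem) -> PairedItem:
    """Swap query and reference, keep the instruction, flip the expected answer.

    Assumption: an instruction that applies in one order does not apply in the other.
    """
    return PairedItem(
        query_video=item.reference_video,
        reference_video=item.query_video,
        instruction=item.instruction,
        instruction_applies=not item.instruction_applies,
    )


# A model here is any callable: (query, reference, instruction) -> bool.
Model = Callable[[str, str, str], bool]


def accuracy(model: Model, items: List[PairedItem]) -> float:
    """Fraction of items where the model's answer matches the ground truth."""
    if not items:
        raise ValueError("items must not be empty")
    correct = sum(
        model(i.query_video, i.reference_video, i.instruction) == i.instruction_applies
        for i in items
    )
    return correct / len(items)


def always_yes(query: str, reference: str, instruction: str) -> bool:
    """Toy model that ignores the videos and agrees with every instruction."""
    return True


# Hypothetical file names; real items would come from your own dataset.
items = [
    PairedItem("squat_shallow.mp4", "squat_deep.mp4", "Bend the knee further", True),
    PairedItem("curl_partial.mp4", "curl_full.mp4", "Flex the elbow further", True),
]

original_acc = accuracy(always_yes, items)
swapped_acc = accuracy(always_yes, [swap(i) for i in items])
print(f"original order: {original_acc:.2%}, swapped order: {swapped_acc:.2%}")
```

The toy model scores perfectly in the original order and fails every swapped item, which is the pattern to look for: a large gap between the two numbers suggests the model is agreeing with the instruction rather than comparing the videos.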

The camera saw something. The caption invented the rest. A vision-language model looks at a landmark and produces a caption. The caption is fluent. The architecture sounds plausible. The location sounds authoritative. The historical detail has just enough specificity to discourage questions. And that is the problem. In many business settings, a wrong visual description is not wrong in the theatrical way people imagine when they hear “AI hallucination.” It is not a neon giraffe in a board meeting. It is a product listed under the wrong category. A heritage photo tagged with the wrong site. A compliance image described with an unsupported claim. A training material that quietly teaches a false relationship between a place, an object, and its context. ...

Swap the Videos, Break the Model

Reasoning in Stereo: Why Vision-Language Models Need Multi‑Hop Sanity Checks