MLLMs

When Seeing Isn’t Understanding: Closing the Multimodal Generation–Understanding Gap

Image generation has become very good at looking confident. That is convenient for demos, investor decks, and social media clips where a dragon, a dashboard, or a product mockup only needs to survive five seconds of human attention. Unfortunately, enterprise systems are less forgiving. A generated image may be beautiful, on-brand, and still wrong. The product is held in the wrong hand. The safety sign is placed behind the hazard. The chart looks plausible but reverses the relationship it was supposed to explain. Charming, as long as nobody uses it. ...

When Models Get Lost in Space: Why MLLMs Still Fail Geometry

Geometry looks clean. A cube has edges. A projection has rules. A missing view should follow from the views already shown. This is not the messy world of occluded street scenes, motion blur, shadows, or a warehouse camera pointed at the wrong shelf. It is the kind of visual reasoning many students learn before they are trusted with anything more dangerous than a compass, a ruler, and mild boredom. ...

When Agents Start Thinking Twice: Teaching Multimodal AI to Doubt Itself

A model that fails its own eye test. Mirror. That is where the problem becomes easy to see. Ask a multimodal model to generate an image of a plush lion toy in front of a mirror. The model may produce something plausible at first glance: lion, mirror, warm lighting, adorable synthetic confidence. Then ask the same model, through its understanding branch, whether the image makes physical sense. Suddenly it notices the issue: if the toy faces the camera, the mirror should mostly show its back, not another front-facing lion. ...

When One Patch Rules Them All: Teaching MLLMs to See What Isn’t There

Image security has an awkward habit of sounding theoretical until the image is inside a business workflow. A product team adds an image-upload feature. A compliance team uses multimodal models to inspect screenshots. A support bot reads photos from customers. A research assistant summarizes figures from PDFs. Everyone understands that the model may occasionally misread an image. That is ordinary error. Annoying, but ordinary. ...

When AI Argues With Itself: Why Self-Contradiction Is Becoming a Feature, Not a Bug

A model generates an image. Then the same model looks at that image and says, in effect, “No, that is not what the prompt asked for.” Awkward? Yes. Useless? Not necessarily. In normal software engineering, a system contradicting itself is usually a defect report with better manners. In modern AI, especially multimodal systems that both generate and understand images, that contradiction may also be a measurement instrument. The embarrassment is the point. A model that can notice that its own generation failed has already exposed a useful asymmetry: its evaluator may be stronger than its producer. ...

Same Content, Different Worlds: Why Multimodal LLMs Still Disagree With Themselves

Screenshot. That is where many business workflows quietly change the problem. A support agent receives a screenshot of a customer bill instead of the billing table as text. A contract review tool receives a scanned clause instead of the clause extracted from the PDF. A procurement assistant receives a rendered purchase order, not the original form fields. Everyone involved assumes the content is the same. The model can read it. The OCR looks correct. The answer should be the same. ...

Seeing Is Believing—Planning Is Not: What SpatialBench Reveals About MLLMs

A robot in a parking lot does not need poetry. It needs to know where the car is, which way the road bends, what happens if it turns right, and how to reach the exit without performing an expensive interpretation of modern sculpture on someone’s bumper. That sounds simple until we ask a multimodal large language model to do it. ...

One Pass to Rule Them All: YOFO and the Rise of Compositional Judging

Search is where nuance goes to die. A customer asks for a long evening dress, preferably not pink. A retrieval model sees “dress,” “evening,” perhaps “pink,” and returns something short, bright, and entirely wrong with the confidence of a clerk who has technically read the sentence but not understood the assignment. The business consequence is familiar: fewer conversions, more irrelevant recommendations, and yet another dashboard where “semantic relevance” looks respectable while customers quietly leave. ...

Mirror, Mirror in the Model: How MLLMs Learn from Their Own Mistakes

TL;DR for operators: Image generators fail in a familiar way: the output looks polished, but the prompt was quietly ignored. A product photo misses the specified texture. A campaign image reverses a spatial relation. A science illustration draws the visually plausible version, not the physically correct one. Everyone then discovers, with appropriate corporate surprise, that “high quality” and “correct” are not synonyms. ...