Multimodal AI

Click Like a Human: Why Avenir-Web Is a Quiet Breakthrough in Web Agents

Click. That is where most web-agent demos become either impressive or mildly tragic. The model reads the instruction, understands the goal, produces a confident plan, and then clicks the wrong thing. Or it clicks the right thing before a modal appears. Or it scrolls, forgets why it scrolled, repeats an action, and quietly turns a three-step workflow into interpretive dance. ...

Seeing Is Not Reasoning: Why Mental Imagery Still Breaks Multimodal AI

A model can generate a pretty sequence of images. Good. So can a slide deck. The harder question is whether those images actually help it think. That is the uncomfortable point behind MentisOculi: Revealing the Limits of Reasoning with Mental Imagery, a new benchmark paper that tests whether frontier multimodal models can do something closer to human mental imagery: form a visual state, keep it stable, transform it step by step, and use the transformed state to decide what to do next.[1] Not merely “look at an image and answer a question.” Not “draw a plausible intermediate picture.” Actual visual reasoning, with consequences. ...

Thinking in Panels: Why Comics Might Beat Video for Multimodal Reasoning

A dashboard screenshot is often too little. A video walkthrough is often too much. Somewhere between the two sits a strangely old-fashioned interface: panels, captions, arrows, speech bubbles, and a sequence that tells the machine what happened before what. Yes, comics. That sounds unserious only if we think comics are a decoration layer: something added after the reasoning is complete to make the output friendlier. The paper Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling makes a more interesting claim: comics can act as the reasoning medium itself, not merely the illustration of reasoning after the fact.[1] ...

When Language Learns to Doubt Itself: Self-Contradiction as an Upgrade Path for Multimodal AI

Image generation has become good enough to be useful and unreliable enough to remain annoying. That is the normal condition of enterprise AI: impressive demos, awkward edge cases, and someone in operations quietly asking whether the model actually understood the instruction or merely produced something that looked plausible from a distance. A user asks for “a red ceramic mug on a wooden desk, next to an open notebook, in morning light.” The model produces a beautiful desk, credible sunlight, maybe even the notebook. The mug is blue. Or metallic. Or missing. If a separate vision model can look at the image and say, “That is not a red ceramic mug,” the failure feels almost rude. The system can see the problem after creating it. Very efficient, in the same way that a committee can discover a typo after approving the brochure. ...

Seeing Is Thinking: When Images Do the Reasoning

Paper is a good trap for artificial intelligence. Fold it, punch it, unfold it, and ask where the holes are. A person may not solve the problem instantly, but the mind knows what to do: imagine the folded sheet opening step by step. The reasoning is not mainly verbal. We do not narrate every cell of the paper grid like a bored accountant reading inventory codes. We see the transformation. ...

When Models Listen but Stop Thinking: Teaching Audio Models to Reason Like They Read

A voice assistant can transcribe your question correctly and still answer like it heard something else. That is the awkward part of modern audio-language models. The obvious diagnosis is usually “better speech recognition.” The less obvious diagnosis is nastier: the model may receive an audio input that is semantically equivalent to the text prompt, but once generation begins, its audio-conditioned reasoning trajectory drifts away from the reasoning trajectory it would have followed if the same question had been typed. ...

PyraTok: When Video Tokens Finally Learn to Speak Human

Video looks easy until a machine has to remember what matters. A human watches a short clip and immediately separates the important layers: the object, the action, the background, the timing, the implied intent, the scene transition. A model sees a much less polite object: frames, pixels, motion, compression artifacts, and a large bill for GPU memory. Then we ask it to generate video, answer questions, segment objects, localize actions, and preserve meaning across time. Naturally, the model responds by becoming expensive. Very relatable. ...

Seeing Is Misleading: When Climate Images Need Receipts

A picture lies differently from a sentence. A sentence can be checked against a source. A picture can be old, cropped, staged, reused, mislabeled, emotionally loaded, or paired with a claim it never supported. This is why climate disinformation is annoying in the precise technical sense: it often does not need to fabricate a new fact. It can simply attach a real-looking image to a slippery claim and let the audience do the rest. Very efficient. Very human. Very platform-native. ...

Fish in the Ocean, Not Needles in the Haystack

Documents are where confident AI demos go to become slightly embarrassing. A model reads a long report. It gives the right answer. The room relaxes. Someone says “great, it understood the document,” and everyone pretends the word understood has not just been smuggled into the meeting without a passport. That is the exact mistake SIN-Bench is designed to catch.[1] The paper is not merely another benchmark asking whether multimodal large language models can answer questions about scientific literature. It asks a more operationally painful question: can the model show the evidence path that makes the answer legitimate? ...

Seeing Is Not Thinking: Teaching Multimodal Models Where to Look

A model can see the image and still miss the point. Inspection is a wonderfully cruel test for AI. Show a multimodal model a product photo, a medical scan, a factory defect, a form, or a dashboard screenshot, and the answer may sound calm, fluent, and technically plausible. The model may even imitate the reasoning style of a stronger teacher model. It may describe objects, infer relationships, and produce the correct-looking sentence. ...