Visual Reasoning

Edit, Actually: Why Visual AI Needs Evidence, Not Eye Candy

A dashboard is rarely confusing because the pixels are ugly. More often, the problem is that the important part is small, crowded, rotated, hidden in a chart corner, split across spatial relations, or buried inside a scene that needs to be mentally transformed before the answer becomes obvious. A human analyst zooms, marks, traces, rearranges, or imagines a new angle. A multimodal model, by contrast, is often asked to stare at the original image and talk harder. ...

Seeing Is Not Reasoning: Why Mental Imagery Still Breaks Multimodal AI

A model can generate a pretty sequence of images. Good. So can a slide deck. The harder question is whether those images actually help it think. That is the uncomfortable point behind “MentisOculi: Revealing the Limits of Reasoning with Mental Imagery,” a new benchmark paper that tests whether frontier multimodal models can do something closer to human mental imagery: form a visual state, keep it stable, transform it step by step, and use the transformed state to decide what to do next.[1] Not merely “look at an image and answer a question.” Not “draw a plausible intermediate picture.” Actual visual reasoning, with consequences. ...

Thinking in Panels: Why Comics Might Beat Video for Multimodal Reasoning

A dashboard screenshot is often too little. A video walkthrough is often too much. Somewhere between the two sits a strangely old-fashioned interface: panels, captions, arrows, speech bubbles, and a sequence that tells the machine what happened before what. Yes, comics. That sounds unserious only if we think comics are a decoration layer: something added after the reasoning is complete to make the output friendlier. The paper “Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling” makes a more interesting claim: comics can act as the reasoning medium itself, not merely the illustration of reasoning after the fact.[1] ...

Seeing Is Not Thinking: Teaching Multimodal Models Where to Look

A model can see the image and still miss the point. Inspection is a wonderfully cruel test for AI. Show a multimodal model a product photo, a medical scan, a factory defect, a form, or a dashboard screenshot, and the answer may sound calm, fluent, and technically plausible. The model may even imitate the reasoning style of a stronger teacher model. It may describe objects, infer relationships, and produce the correct-looking sentence. ...

Ground and Pound: How Iterative Reasoning Quietly Redefines GUI Grounding

Clicks Are Cheap. Wrong Clicks Are Not. Click. That is the unit where many AI agent demos stop being impressive and start becoming expensive. A planning model can write a beautiful instruction sequence: open the settings panel, choose the correct tab, find the export button, confirm the dialog. Lovely. Then the visual grounding model clicks the button two pixels away from the actual target, or chooses the visually similar icon beside it, or mistakes a disabled control for an active one. Suddenly the “agentic workflow” is not a workflow. It is a small robot poking the wrong part of a screen with great confidence. Very modern. Very avoidable, perhaps. ...

Mind the Markov Gap: How a Lightweight Agent Outsmarts Heavy LLMs in Open-Vocabulary Vision

A camera on a factory line does not need to write an essay before deciding whether a part is cracked. That sounds obvious. Yet a surprising amount of recent AI architecture quietly assumes the opposite: when vision systems become uncertain, bring in a large language model, ask it to generate richer descriptions, then run the detector again. Sometimes this works. It also turns a detection problem into a small committee meeting, and committee meetings are rarely known for real-time throughput. ...

Memory, But Make It Multimodal: How ViLoMem Rewires Agentic Learning

Memory is easy to oversell. Give an AI agent a database, a longer context window, and a few inspirational phrases about “learning from experience,” and suddenly everyone in the room starts talking as if the system has developed institutional wisdom. It has not. At best, it has a slightly more organized attic. ...

Tunnel Vision: Why Vision-Language Models Still Miss the Bigger Picture

TL;DR for operators: A vision-language model can describe an image, answer a chart question, and still fail at the kind of seeing that a bored intern would perform before lunch. That is the operational lesson from Shmuel Berman and Jia Deng’s paper, “VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs.”[1] The paper tests whether leading VLMs can do three basic things: compare two visual objects across an image, follow a sequence of visual clues, and trace a continuous line to its endpoint. Humans find these tasks trivial. Current VLMs do not. ...

DeepSeek-V3

A multi-modal foundation model by DeepSeek AI, integrating vision and language for high-performance tasks including OCR, captioning, and visual reasoning.