
Seeing Is Not Thinking: Teaching Multimodal Models Where to Look

Why this matters now: Multimodal models can answer visual questions with alarming confidence. They can also be catastrophically wrong while sounding perfectly reasonable. The uncomfortable truth is that many vision–language models succeed without actually seeing what matters. They talk first. They look later, if at all. The paper behind LaViT puts a name to this failure mode: the Perception Gap. It is the gap between saying the right thing and looking at the right evidence. And once you see it quantified, it becomes hard to ignore. ...

January 18, 2026 · 4 min · Zelina

Tunnel Vision: Why Vision-Language Models Still Miss the Bigger Picture

It’s no secret that Vision-Language Models (VLMs) have dazzled us with their prowess, excelling at image captioning, chart understanding, and even medical diagnostics. But beneath the glitter of benchmark wins, a deeper flaw lurks: these models often suffer from what Berman and Deng (Princeton) have sharply diagnosed as “tunnel vision.” Their new paper, VLMs Have Tunnel Vision, introduces a battery of tasks that humans can breeze through but that leading VLMs, from Gemini 2.5 Pro to Claude Vision 3.7, fail to solve at rates even marginally above chance. These tasks aren’t edge cases or contrived puzzles. They simulate basic human visual competencies like comparing two objects, following a path, and making discrete visual inferences from spatially distributed evidence. The results? A sobering reminder that state-of-the-art perception doesn’t equate to understanding. ...

July 21, 2025 · 4 min · Zelina

DeepSeek-V3

A multimodal foundation model by DeepSeek AI that integrates vision and language for tasks including OCR, captioning, and visual reasoning.

1 min