It’s no secret that Vision-Language Models (VLMs) have dazzled us with their prowess—excelling at image captioning, chart understanding, and even medical diagnostics. But beneath the glitter of benchmark wins, a deeper flaw lurks: these models often suffer from what Berman and Deng (Princeton) have sharply diagnosed as “tunnel vision.”
Their new paper, VLMs Have Tunnel Vision, introduces a battery of tasks that humans breeze through but that leading VLMs, from Gemini 2.5 Pro to Claude 3.7 Sonnet, fail to solve at rates meaningfully above chance. These tasks aren’t edge cases or contrived puzzles. They simulate basic human visual competencies: comparing two objects, following a path, and making discrete visual inferences from spatially distributed evidence. The results? A sobering reminder that state-of-the-art perception doesn’t equate to understanding.
The Three Pillars of Visual Reasoning
Berman and Deng break nonlocal visual reasoning into three distinct capabilities:
| Task Type | Cognitive Analogy | What It Tests |
|---|---|---|
| Comparative Perception | “Do these two faces look alike?” | Holding visual memory and comparing |
| Saccadic Search | “Where is the red triangle now?” | Iterative, evidence-guided search |
| Smooth Visual Search | “Trace this wire to its endpoint” | Continuous contour following |
Each was tested using synthetic images and minimal-context prompts. The models were evaluated on whether they could succeed in scenarios humans consider trivial—and overwhelmingly, they could not.
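To make that setup concrete, here is a minimal sketch of the kind of synthetic stimulus a path-tracing (smooth visual search) item could use. This is our own illustration, not the authors’ generator: the function name `make_trace_stimulus`, the image size, and the waypoint scheme are all assumptions. The one design choice carried over from the paper’s spirit is that both paths share a color, so nothing local gives the answer away.

```python
# Hypothetical generator for a "trace the path" stimulus, loosely inspired by
# the paper's smooth visual search tasks (our sketch, not the authors' code).
# Two same-colored polylines run from labeled starts on the left to endpoints
# on the right; the question for the VLM is which endpoint each line reaches.
import random
from PIL import Image, ImageDraw

def make_trace_stimulus(size=512, waypoints=6, seed=0):
    rng = random.Random(seed)
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)

    starts = [(10, size // 3), (10, 2 * size // 3)]
    end_rows = [size // 3, 2 * size // 3]
    rng.shuffle(end_rows)  # randomly decide which start reaches which endpoint row

    answer = {}
    for label, (x0, y0), y_end in zip("AB", starts, end_rows):
        xs = [int(size * (i + 1) / (waypoints + 1)) for i in range(waypoints)]
        ys = [rng.randint(20, size - 20) for _ in range(waypoints)]
        points = [(x0, y0)] + list(zip(xs, ys)) + [(size - 10, y_end)]
        draw.line(points, fill="black", width=3)   # same color for both: no color shortcut
        draw.text((x0, y0 - 14), label, fill="black")
        answer[label] = "top" if y_end == size // 3 else "bottom"

    return img, answer  # answer maps each start label to its true endpoint row

img, answer = make_trace_stimulus(seed=42)
img.save("trace_stimulus.png")
print(answer)  # e.g. {'A': 'bottom', 'B': 'top'}
```

Answering correctly requires following the contour from the labeled start all the way to its endpoint; neither end of the image, viewed in isolation, reveals the pairing.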
When Benchmarks Lie
VLMs routinely dominate structured benchmarks like AI2D or ChartQA. But the researchers show that these tasks can often be “hacked” with superficial pattern recognition rather than true visual comprehension. For example, a model might answer a chart question by exploiting the fact that a given label usually sits near a particular axis, rather than by actually reading the chart.
This reliance on statistical shortcuts breaks down when the visual structure deviates from expectations—as shown in tasks like:
- Object Re-Identification: Can a model recognize that an object in Image 2 is the same as in Image 1 after rotation or translation?
- Visual Scavenger Hunt: Can it hop across labeled shapes based on a multi-step sequence? (A toy version appears right after this list.)
- Circuit Connections: Can it follow a wire across a breadboard without relying on color cues?
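To see why the scavenger hunt resists shortcuts, here is a toy resolver for a hypothetical instance. The shape names, labels, positions, and hop count are our own invention, not the paper’s exact stimuli; the point is only that the answer is unreachable without executing every hop.

```python
# Hypothetical ground truth for a scavenger-hunt style task: each shape carries
# a label naming the next shape to visit. (Illustrative only; not the paper's data.)
shapes = {
    "circle":   {"label": "square",   "pos": (40, 220)},
    "square":   {"label": "triangle", "pos": (300, 80)},
    "triangle": {"label": "star",     "pos": (150, 400)},
    "star":     {"label": "circle",   "pos": (420, 350)},
}

def resolve(start, hops):
    """Follow the chain of labels for `hops` steps and return the final shape."""
    current = start
    for _ in range(hops):
        current = shapes[current]["label"]  # the label on the current shape points to the next one
    return current

print(resolve("circle", 3))  # star -- requires all three hops, not just the first
```

Getting the first hop right and improvising the rest, the failure pattern described in the next section, lands on the wrong shape almost every time.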
In every case, even the best models—Gemini, Claude, o4-mini—struggled to outperform random guessing when visual inference required chaining together spatially separate clues.
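If you want to hold your own runs to the same bar, the usual sanity check is a one-sided binomial test of accuracy against the chance rate. The sketch below uses SciPy’s `binomtest`; the numbers and the significance threshold are illustrative, and this is our choice of test, not necessarily the paper’s statistic.

```python
# Quick check of whether an accuracy figure is meaningfully above chance.
# (Generic sketch with made-up numbers; the paper's own analysis may differ.)
from scipy.stats import binomtest

def above_chance(correct, total, chance=0.5, alpha=0.05):
    """Return (significant, p-value) for a one-sided test against the chance rate."""
    result = binomtest(correct, total, p=chance, alternative="greater")
    return result.pvalue < alpha, result.pvalue

ok, p = above_chance(correct=56, total=100, chance=0.5)
print(ok, round(p, 3))  # roughly: False 0.136 -- 56% on a binary task isn't yet "above chance"
```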
Heuristics in Disguise
What’s especially revealing is how the models fail. The study shows that VLMs often substitute heuristics for reasoning:
- In Circuit Connections, they don’t trace wires but infer likely endpoints from color similarity (a toy version of this shortcut is sketched just after this list).
- In Scavenger Hunt, they make the first jump correctly but then hallucinate paths, unable to backtrack or self-correct.
- In Object Re-ID, they succeed only when comparisons can be reframed as pixel-matching or verbal description—suggesting they still prefer language space over vision space.
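The first of these failure modes is easy to reproduce without any model in the loop. Here is a toy version of the color-matching shortcut, our illustration rather than a claim about the models’ internals: it pairs terminals by nearest pixel color and never looks at the wire in between.

```python
# A minimal sketch of the color-similarity shortcut (illustrative only):
# match each left terminal to the right terminal with the closest RGB value
# instead of tracing the connecting wire.
import numpy as np

def color_shortcut(left_colors, right_colors):
    """left_colors, right_colors: arrays of shape (n, 3) with RGB values."""
    left = np.asarray(left_colors, dtype=float)
    right = np.asarray(right_colors, dtype=float)
    # Pairwise color distances; pick the nearest right terminal for each left one.
    dists = np.linalg.norm(left[:, None, :] - right[None, :, :], axis=-1)
    return dists.argmin(axis=1)

# Looks competent when every wire has a distinct color...
print(color_shortcut([[250, 10, 10], [10, 10, 250]], [[245, 5, 20], [0, 30, 240]]))  # [0 1]
# ...and collapses when all wires share a color -- exactly the regime where the
# paper reports near-chance performance.
print(color_shortcut([[30, 30, 30], [30, 30, 30]], [[30, 30, 30], [30, 30, 30]]))    # [0 0]
```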
These aren’t just performance flaws. They signal a fundamental architectural bias: modern VLMs are vision wrappers on top of language-first cores. As a result, their “visual reasoning” is often textual reasoning with visual hints.
Why This Matters for Business AI
For anyone building enterprise applications on top of multimodal AI—think warehouse logistics, medical imaging, or autonomous robotics—this paper is a reality check. Your model might ace a Kaggle competition but still fail to:
- Recognize the same part across camera angles
- Trace a cable in a machinery diagram
- Follow a spatial instruction across a GUI
As we noted in a past Cognaptus article on HallusionBench, the danger isn’t just in what models don’t know—but in what they think they know.
Toward Truly Visual Agents
Berman and Deng’s suite is more than an exposé; it’s a blueprint for better models. By isolating tasks that require core visual operations—not just language translation of visual inputs—they pave the way for:
- Architectures with internal visual memory and spatial attention chains
- Training pipelines that discourage overreliance on text priors
- Evaluation methods that look beyond leaderboard scores
At Cognaptus, we see this as a call to action for AI practitioners. If your agent looks but doesn’t see, you’re building intelligence on a fragile foundation.
Cognaptus: Automate the Present, Incubate the Future