It’s no secret that Vision-Language Models (VLMs) have dazzled us with their prowess—excelling at image captioning, chart understanding, and even medical diagnostics. But beneath the glitter of benchmark wins, a deeper flaw lurks: these models often suffer from what Berman and Deng (Princeton) have sharply diagnosed as “tunnel vision.”
Their new paper, VLMs Have Tunnel Vision, introduces a battery of tasks that humans breeze through but that leading VLMs, from Gemini 2.5 Pro to Claude 3.7 Sonnet, fail to solve at rates meaningfully above chance. These tasks aren’t edge cases or contrived puzzles. They simulate basic human visual competencies: comparing two objects, following a path, and making discrete visual inferences from spatially distributed evidence. The results? A sobering reminder that state-of-the-art perception doesn’t equate to understanding.
The Three Pillars of Visual Reasoning
Berman and Deng break nonlocal visual reasoning into three distinct capabilities:
| Task Type | Cognitive Analogy | What It Tests |
|---|---|---|
| Comparative Perception | “Do these two faces look alike?” | Holding visual memory and comparing |
| Saccadic Search | “Where is the red triangle now?” | Iterative, evidence-guided search |
| Smooth Visual Search | “Trace this wire to its endpoint” | Continuous contour following |
Each was tested using synthetic images and minimal-context prompts. The models were evaluated on whether they could succeed in scenarios humans consider trivial—and overwhelmingly, they could not.
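To make that setup concrete, here is a minimal sketch of the kind of synthetic stimulus a path-tracing (smooth visual search) item could use. This is our own illustration, not the authors’ generator: the function name `make_trace_stimulus`, the image size, and the waypoint scheme are all assumptions. The one design choice carried over from the paper’s spirit is that both paths share a color, so nothing local gives the answer away.

```python
# Hypothetical generator for a "trace the path" stimulus, loosely inspired by
# the paper's smooth visual search tasks (our sketch, not the authors' code).
# Two same-colored polylines run from labeled starts on the left to endpoints
# on the right; the question for the VLM is which endpoint each line reaches.
import random
from PIL import Image, ImageDraw

def make_trace_stimulus(size=512, waypoints=6, seed=0):
    rng = random.Random(seed)
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)

    starts = [(10, size // 3), (10, 2 * size // 3)]
    end_rows = [size // 3, 2 * size // 3]
    rng.shuffle(end_rows)  # randomly decide which start reaches which endpoint row

    answer = {}
    for label, (x0, y0), y_end in zip("AB", starts, end_rows):
        xs = [int(size * (i + 1) / (waypoints + 1)) for i in range(waypoints)]
        ys = [rng.randint(20, size - 20) for _ in range(waypoints)]
        points = [(x0, y0)] + list(zip(xs, ys)) + [(size - 10, y_end)]
        draw.line(points, fill="black", width=3)   # same color for both: no color shortcut
        draw.text((x0, y0 - 14), label, fill="black")
        answer[label] = "top" if y_end == size // 3 else "bottom"

    return img, answer  # answer maps each start label to its true endpoint row

img, answer = make_trace_stimulus(seed=42)
img.save("trace_stimulus.png")
print(answer)  # e.g. {'A': 'bottom', 'B': 'top'}
```

Answering correctly requires following the contour from the labeled start all the way to its endpoint; neither end of the image, viewed in isolation, reveals the pairing.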
When Benchmarks Lie
VLMs routinely dominate structured benchmarks like AI2D or ChartQA. But the researchers show that these tasks can often be “hacked” with superficial pattern recognition rather than true visual comprehension. For example, a model might answer a chart question by exploiting the fact that a given label usually sits near a particular axis, rather than by actually reading the chart.
This reliance on statistical shortcuts breaks down when the visual structure deviates from expectations—as shown in tasks like:
- Object Re-Identification: Can a model recognize that an object in Image 2 is the same as in Image 1 after rotation or translation?
- Visual Scavenger Hunt: Can it hop across labeled shapes based on a multi-step sequence? (A toy version appears right after this list.)
- Circuit Connections: Can it follow a wire across a breadboard without relying on color cues?
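To see why the scavenger hunt resists shortcuts, here is a toy resolver for a hypothetical instance. The shape names, labels, positions, and hop count are our own invention, not the paper’s exact stimuli; the point is only that the answer is unreachable without executing every hop.

```python
# Hypothetical ground truth for a scavenger-hunt style task: each shape carries
# a label naming the next shape to visit. (Illustrative only; not the paper's data.)
shapes = {
    "circle":   {"label": "square",   "pos": (40, 220)},
    "square":   {"label": "triangle", "pos": (300, 80)},
    "triangle": {"label": "star",     "pos": (150, 400)},
    "star":     {"label": "circle",   "pos": (420, 350)},
}

def resolve(start, hops):
    """Follow the chain of labels for `hops` steps and return the final shape."""
    current = start
    for _ in range(hops):
        current = shapes[current]["label"]  # the label on the current shape points to the next one
    return current

print(resolve("circle", 3))  # star -- requires all three hops, not just the first
```

Getting the first hop right and improvising the rest, the failure pattern described in the next section, lands on the wrong shape almost every time.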
In every case, even the best models—Gemini, Claude, o4-mini—struggled to outperform random guessing when visual inference required chaining together spatially separate clues.
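If you want to hold your own runs to the same bar, the usual sanity check is a one-sided binomial test of accuracy against the chance rate. The sketch below uses SciPy’s `binomtest`; the numbers and the significance threshold are illustrative, and this is our choice of test, not necessarily the paper’s statistic.

```python
# Quick check of whether an accuracy figure is meaningfully above chance.
# (Generic sketch with made-up numbers; the paper's own analysis may differ.)
from scipy.stats import binomtest

def above_chance(correct, total, chance=0.5, alpha=0.05):
    """Return (significant, p-value) for a one-sided test against the chance rate."""
    result = binomtest(correct, total, p=chance, alternative="greater")
    return result.pvalue < alpha, result.pvalue

ok, p = above_chance(correct=56, total=100, chance=0.5)
print(ok, round(p, 3))  # roughly: False 0.136 -- 56% on a binary task isn't yet "above chance"
```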
Heuristics in Disguise
What’s especially revealing is how the models fail. The study shows that VLMs often substitute heuristics for reasoning:
- In Circuit Connections, they don’t trace wires but infer likely endpoints from color similarity (a toy version of this shortcut is sketched just after this list).
- In Scavenger Hunt, they make the first jump correctly but then hallucinate paths, unable to backtrack or self-correct.
- In Object Re-ID, they succeed only when comparisons can be reframed as pixel-matching or verbal description—suggesting they still prefer language space over vision space.
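The first of these failure modes is easy to reproduce without any model in the loop. Here is a toy version of the color-matching shortcut, our illustration rather than a claim about the models’ internals: it pairs terminals by nearest pixel color and never looks at the wire in between.

```python
# A minimal sketch of the color-similarity shortcut (illustrative only):
# match each left terminal to the right terminal with the closest RGB value
# instead of tracing the connecting wire.
import numpy as np

def color_shortcut(left_colors, right_colors):
    """left_colors, right_colors: arrays of shape (n, 3) with RGB values."""
    left = np.asarray(left_colors, dtype=float)
    right = np.asarray(right_colors, dtype=float)
    # Pairwise color distances; pick the nearest right terminal for each left one.
    dists = np.linalg.norm(left[:, None, :] - right[None, :, :], axis=-1)
    return dists.argmin(axis=1)

# Looks competent when every wire has a distinct color...
print(color_shortcut([[250, 10, 10], [10, 10, 250]], [[245, 5, 20], [0, 30, 240]]))  # [0 1]
# ...and collapses when all wires share a color -- exactly the regime where the
# paper reports near-chance performance.
print(color_shortcut([[30, 30, 30], [30, 30, 30]], [[30, 30, 30], [30, 30, 30]]))    # [0 0]
```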
These aren’t just performance flaws. They signal a fundamental architectural bias: modern VLMs are vision wrappers on top of language-first cores. As a result, their “visual reasoning” is often textual reasoning with visual hints.
Why This Matters for Business AI
For anyone building enterprise applications on top of multimodal AI—think warehouse logistics, medical imaging, or autonomous robotics—this paper is a reality check. Your model might ace a Kaggle competition but still fail to:
- Recognize the same part across camera angles
- Trace a cable in a machinery diagram
- Follow a spatial instruction across a GUI
As we noted in a past Cognaptus article on HallusionBench, the danger isn’t just in what models don’t know—but in what they think they know.
Toward Truly Visual Agents
Berman and Deng’s suite is more than an exposé; it’s a blueprint for better models. By isolating tasks that require core visual operations—not just language translation of visual inputs—they pave the way for:
- Architectures with internal visual memory and spatial attention chains
- Training pipelines that discourage overreliance on text priors
- Evaluation methods that look beyond leaderboard scores
At Cognaptus, we see this as a call to action for AI practitioners. If your agent looks but doesn’t see, you’re building intelligence on a fragile foundation.
Cognaptus: Automate the Present, Incubate the Future