TL;DR for operators
A vision-language model can describe an image, answer a chart question, and still fail at the kind of seeing that a bored intern would perform before lunch.
That is the operational lesson from Shmuel Berman and Jia Deng’s paper, “VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs” [1]. The paper tests whether leading VLMs can do three basic things: compare two visual objects across an image, follow a sequence of visual clues, and trace a continuous line to its endpoint. Humans find these tasks trivial. Current VLMs do not.
The important point is not “models are bad at vision.” That would be too easy, and therefore suspiciously comforting. The sharper finding is that VLMs often see locally, translate what they can into language-like fragments, and then reason over those fragments. When the task demands a visual routine that cannot be conveniently converted into text (compare this object, jump to that clue, follow this wire), performance collapses or becomes strongly dependent on shortcuts.
For business teams building visual agents, the implication is direct:
| Operator question | What the paper suggests |
|---|---|
| Can a VLM inspect whether two parts are the same across angles? | Not reliably without task-specific checks, decomposition, or verification. |
| Can it follow a spatial instruction across a dashboard or UI? | Some models can make individual jumps, but autonomous multi-step visual search remains brittle. |
| Can it trace a cable, line, route, or diagram path? | This is one of the hardest cases; models often use colour or proximity instead of true tracing. |
| Can benchmark strength substitute for workflow testing? | No. Strong multimodal scores do not imply reliable nonlocal visual reasoning. |
| What should deployment teams do? | Treat visual reasoning as a procedure, not a prompt. Break it into auditable steps, use specialised vision tools, and keep verification where the visual chain matters. |
The paper’s benchmark is synthetic, so it should not be read as a direct forecast for every factory floor, claims desk, hospital, or engineering workflow. But synthetic is not the same as irrelevant. The point of these tasks is to remove domain knowledge, layout convention, and semantic guessing. What remains is the visual machinery itself. And that machinery is still rather less impressive than the demo reel would prefer.
The real weakness is not eyesight; it is visual procedure
Most business conversations about multimodal AI quietly assume that vision is a single capability. A model either “understands images” or it does not. That is convenient, procurement-friendly, and wrong.
Berman and Deng split the problem into three visual routines:
- Comparative perception: holding two visual objects in working memory and judging whether they are the same under allowed transformations.
- Saccadic search: making discrete, evidence-driven jumps across an image, where each observation tells the model where to look next.
- Smooth visual search: tracing a continuous contour, such as a wire, line, boundary, or route.
This distinction matters because enterprise vision tasks rarely consist of a single glance. A quality-control check may require comparing a part against a template. A dashboard assistant may need to read one label, jump to a corresponding chart element, then check an axis. A maintenance agent may need to trace a wire from a terminal to a component. These are not just “image questions.” They are visual procedures.
That is the mechanism-first reading of the paper: VLMs often do not fail because the image is obscure. They fail because the task requires a sequence of visual operations that must remain grounded in the image.
The paper’s three task families are deliberately simple:
| Task | Visual routine being tested | Main evidence or diagnostic role |
|---|---|---|
| Object Re-Identification | Comparative perception | Main evidence for whether models can compare structured objects across transformations. |
| Visual Scavenger Hunt | Saccadic search | Main evidence for whether models can make sequential visual jumps. |
| Circuit Connections | Smooth visual search | Main evidence for whether models can trace a continuous path. |
| Object Re-ID variants | Binding and shortcut diagnosis | Diagnostic variants testing whether performance changes when connectedness or pixel matching changes. |
| Decomposed Scavenger Hunt | Ablation of autonomous chaining | Tests whether failure comes from poor single-step perception or poor multi-step control. |
| Circuit colour variants | Shortcut and sensitivity tests | Test whether models trace wires or exploit colour/proximity cues. |
| Log-odds analysis on Circuit Connections | Failure-mode analysis | Checks whether distance and crossings affect success in ways consistent with tracing. |
| Appendix prompts and examples | Implementation detail | Clarify task construction and prompt conditions; useful for interpreting results but not a second thesis. |
This is why the paper is more useful than another leaderboard scolding. It does not merely say that models miss pixels. It asks whether they can run the visual algorithms that humans run almost automatically.
The answer is: not reliably. A pity, really. The machines had one job, and it involved looking.
Comparative perception fails when “similar enough” becomes good enough
The Object Re-Identification task asks the model to decide whether an object in one image appears in another image. The object may be rotated, translated, or scaled as a whole. In negative examples, one or more component shapes are changed independently, making the object structurally different. Distractors may appear in the second image, but they do not occlude the target.
This is a clean test of visual comparison. The model is not being asked to recognise a cat, infer a social scene, or know what a resistor does. It only needs to compare shapes.
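To make "compare shapes" concrete, here is a minimal sketch of a deterministic check for the same kind of judgment. It assumes each object has already been reduced to a list of `(kind, x, y)` component centres, which is an illustrative representation and not how the paper encodes its objects. The function names and the tolerance are likewise hypothetical. The signature stays the same when the whole object is rotated, translated, or uniformly scaled, and changes when one component moves on its own.

```python
import math
from itertools import combinations

def signature(components):
    """Pairwise distances between components, normalised by the longest one.

    components: list of (kind, x, y) tuples (a hypothetical representation).
    """
    pairs = []
    for (k1, x1, y1), (k2, x2, y2) in combinations(components, 2):
        distance = math.hypot(x1 - x2, y1 - y2)
        pairs.append((tuple(sorted((k1, k2))), distance))
    longest = max((d for _, d in pairs), default=0.0) or 1.0
    return sorted((kinds, d / longest) for kinds, d in pairs)

def same_object(a, b, tol=1e-6):
    """True if b looks like a rotated, shifted, or rescaled copy of a."""
    if sorted(k for k, _, _ in a) != sorted(k for k, _, _ in b):
        return False
    sig_a, sig_b = signature(a), signature(b)
    return all(
        kinds_a == kinds_b and abs(d_a - d_b) <= tol
        for (kinds_a, d_a), (kinds_b, d_b) in zip(sig_a, sig_b)
    )

def transform(components, angle, scale, dx, dy):
    """Rotate, scale, and translate the object as a whole."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (k, scale * (c * x - s * y) + dx, scale * (s * x + c * y) + dy)
        for k, x, y in components
    ]

original = [("circle", 0.0, 0.0), ("square", 2.0, 0.0), ("triangle", 0.0, 1.0)]
moved_whole = transform(original, angle=0.7, scale=1.8, dx=5.0, dy=-3.0)
# Negative example: one component is shifted independently of the others.
altered = [("circle", 0.0, 0.0), ("square", 2.0, 0.0), ("triangle", 0.0, 1.5)]

print(same_object(original, moved_whole))
print(same_object(original, altered))
```

One caveat on the design: a pairwise-distance signature is also unchanged by mirror reflection, so a stricter check would need orientation information as well.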
The paper uses three variants:
| Variant | Likely purpose | What it tests |
|---|---|---|
| Standard | Main comparative-perception test | Can models compare connected composite objects under transformation? |
| Unconnected | Diagnostic variant | Does removing physical connectedness make the object easier or harder to compare? |
| Pixel-Perfect | Shortcut/sensitivity variant | Can models exploit exact matching when the target is not transformed as a whole? |
The result is awkward for the “VLMs basically see like humans now” crowd. On the Standard variant, no model significantly outperforms random chance. The highest reported score is 60%. Humans scored 100% on manually evaluated Object Re-Identification examples.
The diagnostic variants make the failure more interesting. Stronger models improve on the Unconnected and Pixel-Perfect versions. The paper reports that GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4 score 77%, 65%, and 79%, respectively, on the Unconnected variant, and 71%, 65%, and 70% on Pixel-Perfect. Those are real improvements, but even the best of these scores remains more than 20 percentage points below the human baseline.
The mechanism is the point. Connected objects should be easier for humans because human perception naturally groups connected parts into a whole. For these models, connectedness appears to create the opposite problem. When the objects look globally similar, the model may not inspect the component-level details carefully enough. It seems to accept “similar-looking composite object” as a sufficient answer.
That is not a small issue for business use. Many operational inspections depend on recognising that two things are nearly the same but not actually the same:
| Business task | Why this result matters |
|---|---|
| Part matching across camera angles | A model may accept a near-match if the whole object looks plausible. |
| Document layout comparison | It may miss local structural differences when the global layout is familiar. |
| Product defect review | It may describe the object correctly while missing a component-level deviation. |
| UI regression testing | It may recognise the screen but fail to compare exact local changes. |
The paper’s interpretation is not that these models have no visual acuity. The sharper claim is that they do not consistently choose to spend that acuity. They inspect when the task becomes easier to convert into a discrete comparison, and skim when the image feels familiar enough.
In operations, that is dangerous because the model will not politely announce, “I am now skimming.” It will simply answer.
Saccadic search shows the difference between one good look and a visual workflow
The Visual Scavenger Hunt task is more procedural. The model sees a grid of coloured shapes. Each tile has a label pointing to another colour-shape pair. Starting from a given shape, the model must follow the labels for two, three, or four steps and return the final colour.
This tests saccadic search: the ability to jump from one region to another based on evidence collected along the way. It is the visual equivalent of “read this clue, find that object, read the next clue, repeat.”
The random baseline is about 9%, because there are 11 possible colours. Humans scored 100%. Only GPT-5, o4-mini, and Gemini 2.5 Pro significantly outperform random chance. Most other models hover near guessing, with some marginal gains that the authors suggest may come from simple heuristics such as choosing frequent colours.
The most useful part of this experiment is the follow-up decomposition. The authors split a chain-length-3 task into three sequential one-step queries. The model’s answer to one step becomes the starting point for the next. This is an ablation: it tests whether models fail because they cannot perceive a single step, or because they cannot manage the multi-step visual routine autonomously.
The results separate the field cleanly:
| Model | Decomposed three-step accuracy | Final-step error rate | Interpretation |
|---|---|---|---|
| o4-mini | 90.67% | 6% | Strong at guided atomic visual steps. |
| Gemini 2.5 Pro | 88.00% | 8% | Strong at guided atomic visual steps. |
| Llama 3.2 Vision 11B | 24.00% | 85% | Poor atomic perception in this task. |
| Qwen 2.5 VL 32B | 16.00% | 86% | Poor atomic perception in this task. |
| Qwen 2.5 VL 7B | 14.67% | 85% | Poor atomic perception in this task. |
For the top models, the bottleneck is not simply seeing a tile. They can execute the atomic step when guided. The failure appears when the model must autonomously chain the steps inside one visual task. The paper notes that simple error multiplication would predict roughly 83% accuracy for o4-mini if its guided single-step ability transferred cleanly to autonomous multi-step search. Instead, its observed chain-length-3 performance is 44.5%.
That gap is the business lesson.
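The arithmetic behind that gap can be sketched in a few lines. The sketch assumes steps succeed independently and uses an illustrative per-step accuracy of 0.94, chosen only because it reproduces the roughly 83% figure. The paper's exact inputs may differ.

```python
def chained_accuracy(per_step_accuracy: float, steps: int) -> float:
    """Expected accuracy of a chain when every step succeeds independently."""
    return per_step_accuracy ** steps

# Illustrative per-step accuracy (an assumption, not a number quoted from the paper).
predicted = chained_accuracy(0.94, 3)

# Autonomous chain-length-3 accuracy for o4-mini, as reported above.
observed = 0.445

# Per-step accuracy that would explain the observed result under independence.
implied_per_step = observed ** (1 / 3)

print(f"predicted={predicted:.3f} observed={observed:.3f}")
print(f"gap={predicted - observed:.3f} implied_per_step={implied_per_step:.3f}")
```

Under these assumptions, the autonomous chain behaves as if each look were far less reliable than the guided look, which is the point: the loss comes from chaining, not from the individual step.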
A model may be excellent when a workflow forces it to look in exactly the right place at exactly the right time. The same model may become brittle when it must decide where to look next without external scaffolding.
This matters for visual agents because many enterprise workflows look like scavenger hunts disguised as software:
- read the warning indicator, then find the matching machine zone;
- locate the chart legend, then find the corresponding line, then read the axis;
- inspect a form field, then follow an error message to the affected section;
- identify a component label, then find the matching item in a diagram;
- follow a UI instruction from one panel to another.
The decomposed experiment suggests a practical design rule: do not ask the model to “look around and figure it out” when the visual chain is important. Turn the visual chain into explicit steps. Capture intermediate outputs. Verify each transition. Yes, that is less glamorous than a single magic prompt. It also has the minor advantage of being less likely to fail silently.
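A minimal sketch of that design rule follows. Here `look` stands in for a single-step model call and `verify` for whatever check the workflow can afford. Both are hypothetical callables, not part of any real API, and the dictionary grid is a toy stand-in for an image.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class StepLog:
    step: int
    start: str
    answer: Optional[str]
    verified: bool

def run_chain(
    start: str,
    steps: int,
    look: Callable[[str], Optional[str]],
    verify: Callable[[str, str], bool],
) -> Tuple[Optional[str], List[StepLog]]:
    """Run one visual step at a time, logging and checking every transition."""
    log: List[StepLog] = []
    current = start
    for i in range(1, steps + 1):
        answer = look(current)  # one atomic look, never the whole chain
        ok = answer is not None and verify(current, answer)
        log.append(StepLog(i, current, answer, ok))
        if not ok:
            return None, log  # stop loudly instead of guessing onward
        current = answer
    return current, log

# Toy stand-in for the image: each tile's label names the next tile.
grid = {
    "red circle": "blue square",
    "blue square": "green star",
    "green star": "red circle",
}

final, log = run_chain(
    "red circle", 3, look=grid.get, verify=lambda src, ans: ans in grid
)

# A faulty reader that names a tile which is not on the grid at step two.
faulty = lambda tile: "pink hexagon" if tile == "blue square" else grid.get(tile)
bad_final, bad_log = run_chain(
    "red circle", 3, look=faulty, verify=lambda src, ans: ans in grid
)

print(final, len(log))
print(bad_final, len(bad_log))
```

The useful property is not the loop but the log: every intermediate target is recorded, so a wrong final answer can be traced to the step that produced it.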
The self-correction problem is worse than ordinary error accumulation
The Scavenger Hunt task includes an interesting property: labels are random, so about 30% of mistaken paths lead to a shape that does not exist on the grid. For a human, that is a recovery signal. If the next target is absent, you probably made an earlier mistake.
The models do not seem to use that signal systematically.
This is not just “multi-step tasks accumulate errors.” Everyone knows that. The deeper issue is that the model receives evidence that its visual path has gone wrong and still does not reliably backtrack.
That distinction matters operationally. In real workflows, visual agents will often hit contradictions:
- a referenced UI element is missing;
- a cable appears to terminate nowhere;
- a form instruction points to a field that is not visible;
- a chart label does not match the apparent plotted line;
- a part number in the diagram does not correspond to the detected part.
A robust visual agent should treat such contradictions as a reason to re-check prior visual assumptions. The paper’s qualitative analysis suggests current models often hallucinate paths, cite non-existent shapes, or continue despite earlier confusion. That behaviour is manageable in a toy benchmark. In a maintenance workflow, it becomes expensive. In a safety workflow, it becomes the sort of thing that makes auditors develop facial twitches.
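One way to treat a contradiction as a recovery signal is sketched below. It is an illustration of the idea, not a method from the paper. The tile names, the scripted misread, and the retry limit are all invented for the example.

```python
def follow(start, steps, read_label, on_grid, max_rechecks=2):
    """Follow clues, backing up one jump whenever a clue points at nothing."""
    path = [start]
    rechecks = 0
    while len(path) - 1 < steps:
        target = read_label(path[-1])
        if on_grid(target):
            path.append(target)
            continue
        # Contradiction: the clue names a shape that is not on the grid.
        rechecks += 1
        if rechecks > max_rechecks:
            raise RuntimeError("contradiction persists; escalate to review")
        if len(path) > 1:
            path.pop()  # distrust the previous jump and read it again
    return path

# Toy ground truth: each tile's label names the next tile.
truth = {
    "red circle": "blue square",
    "blue square": "green star",
    "green star": "red circle",
    "yellow moon": "pink hexagon",  # this label points at an absent shape
}

# Scripted misread: the first look at "red circle" returns the wrong target.
misreads = {"red circle": ["yellow moon"]}

def read_label(tile):
    queue = misreads.get(tile)
    if queue:
        return queue.pop(0)
    return truth[tile]

path = follow("red circle", 3, read_label, on_grid=lambda t: t in truth)
print(path)
```

The first jump lands on the wrong tile without any visible error. The mistake only surfaces one step later, when the next clue names a shape that is not there, and that is exactly the moment to back up rather than press on.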
Smooth visual search is where language-shaped vision runs out of road
Circuit Connections is the paper’s hardest and most revealing task. The model sees a synthetic breadboard-style diagram and must identify which component a given port connects to. To answer correctly, it must trace a wire from source to endpoint.
The task has three variants:
| Variant | Likely purpose | What it reveals |
|---|---|---|
| Standard | Main smooth-search test | Can models trace wires when colours may repeat? |
| Single Color | Shortcut-prevention test | Can models trace when colour cues are removed? |
| Unique Colors | Shortcut-enabling control | How much performance improves when endpoints can be matched by colour? |
The random baseline is 14.29% (one in seven), based on the number of possible components. Humans scored 99.5%. All tested VLMs struggle with smooth visual search. The hardest setting is Single Color, where the best reported accuracy is only 27%, achieved by Gemini 2.5 Pro.
The colour variants expose the shortcut problem. If every wire has a unique colour, a model can often avoid tracing by matching endpoint colours. If all wires share one colour, that trick disappears. Gemini 2.5 Pro drops from 48% on Unique Colors to 27% on Single Color. The improvement under Unique Colors is useful, but it is not evidence of human-like tracing. It is evidence that a shortcut has been made available and the model knows how to take it. Congratulations to the model: it has discovered office politics.
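The shortcut is easy to state in code. The sketch below matches a port to components by colour alone, with no tracing at all. The dictionaries are an invented toy representation of a diagram, not the benchmark's data format.

```python
def match_by_colour(port_colour, endpoint_colours):
    """Return every component whose incoming wire has the port's colour."""
    return sorted(
        component
        for component, colour in endpoint_colours.items()
        if colour == port_colour
    )

# Unique Colors: the colour at the port identifies exactly one endpoint.
unique = {"C1": "red", "C2": "blue", "C3": "green"}
print(match_by_colour("red", unique))

# Single Color: every endpoint matches, so colour carries no information.
single = {"C1": "black", "C2": "black", "C3": "black"}
print(match_by_colour("black", single))
```

With unique colours the shortcut returns a single candidate and the path never needs to be followed. With a single colour it returns every component, and the only remaining way to answer is to trace.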
The paper’s log-odds analysis adds another diagnostic layer. If models were tracing wires like humans, wire crossings should matter because crossings create visual ambiguity. Distance should also matter, but not merely as a crude proximity cue. The authors find that distance is a strong negative predictor across many settings, while crossing effects are most meaningful among the stronger models, especially in the Single Color condition.
This supports a nuanced interpretation. Most models do not appear to trace lines at all. The stronger models may perform something closer to a choppy sequence of visual jumps along the wire, rather than smooth contour tracing. That would explain why crossings interfere and why performance remains far below human levels.
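For readers unfamiliar with the quantity, the sketch below computes log-odds for two buckets of trials and the shift between them. It is a simplified, bucketed stand-in for the kind of analysis described above, not a reproduction of the paper's method, and the counts are invented purely to show the arithmetic.

```python
import math

def log_odds(successes: int, trials: int) -> float:
    """Natural log of the odds of success, ln(successes / failures)."""
    failures = trials - successes
    if successes <= 0 or failures <= 0:
        raise ValueError("log-odds is undefined for an all-or-nothing bucket")
    return math.log(successes / failures)

# Invented counts for illustration only; they are not results from the paper.
few_crossings = log_odds(30, 60)
many_crossings = log_odds(10, 60)

# A negative shift means success became less likely as crossings increased.
shift = many_crossings - few_crossings
print(f"few={few_crossings:.3f} many={many_crossings:.3f} shift={shift:.3f}")
```

A factor such as crossings or distance is a negative predictor when this shift is reliably below zero across comparable trials.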
For enterprise AI, this is the section that should make diagram-heavy automation teams sit up. Cable tracing, process-flow reading, route following, schematic interpretation, supply-chain path diagrams, network maps, and UI connector diagrams all rely on smooth or near-smooth visual search. The model may understand the legend. It may identify components. It may describe the diagram. Then it may still fail to follow the line.
That is not a semantic misunderstanding. It is a missing visual routine.
Benchmark success can hide dependence on conventions
The paper begins from a tension: VLMs perform well on many high-level multimodal benchmarks, including diagram and chart understanding benchmarks, yet struggle on primitive visual reasoning tests. Berman and Deng argue that many tasks can be solved by extracting local visual facts into text and then doing the reasoning in language space.
That is not inherently bad. If the task is convention-heavy, the shortcut may work. Charts often have axes, legends, labels, and familiar layouts. Diagrams often follow common visual grammar. A model can exploit those regularities without building a robust visual procedure.
The problem appears when the image stops obeying familiar conventions or when the answer requires nonlocal evidence that cannot be compressed into a few textual facts.
This is the misconception worth killing cleanly: strong performance on VQA, chart, or document benchmarks does not mean the model can inspect, compare, search, and trace visual evidence like a human. It may mean the model has learned to use just enough visual extraction to feed a very capable language model.
That difference is not academic. It changes how deployments should be tested.
| Common evaluation habit | Better evaluation question |
|---|---|
| “Can the model answer questions about our images?” | Can it execute the visual routine our workflow requires? |
| “Does it describe the image correctly?” | Does it compare the right regions and preserve the relevant differences? |
| “Can it read the chart?” | Can it trace from legend to line to axis under layout variation? |
| “Can it identify components?” | Can it follow relationships between components? |
| “Did it get the final answer right?” | Were the intermediate visual steps grounded and recoverable? |
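The last row of that table can be turned into a scoring rule. The sketch below reports final-answer accuracy alongside a stricter "grounded" accuracy that also requires every intermediate step to match. The record layout and the three example records are hypothetical.

```python
def score(records):
    """Compare final-answer accuracy with accuracy that requires grounded steps."""
    total = len(records)
    final_ok = sum(r["final"] == r["gold_final"] for r in records)
    grounded_ok = sum(
        r["final"] == r["gold_final"] and r["steps"] == r["gold_steps"]
        for r in records
    )
    return {
        "final_accuracy": final_ok / total,
        "grounded_accuracy": grounded_ok / total,
    }

gold_steps = ["blue square", "green star"]

# Hypothetical evaluation records, invented for illustration.
records = [
    # Right answer, right path.
    {"steps": ["blue square", "green star"], "final": "red",
     "gold_steps": gold_steps, "gold_final": "red"},
    # Right answer, wrong path: a lucky guess that plain accuracy rewards.
    {"steps": ["yellow moon", "green star"], "final": "red",
     "gold_steps": gold_steps, "gold_final": "red"},
    # Wrong answer.
    {"steps": ["blue square", "pink hexagon"], "final": "blue",
     "gold_steps": gold_steps, "gold_final": "red"},
]

result = score(records)
print(result)
```

The difference between the two numbers is a rough estimate of how much of the headline accuracy rests on paths that were never actually followed.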
The paper’s synthetic design is useful precisely because it removes many familiar crutches. There is little world knowledge to use. The shapes are simple. The prompts are minimal. The tasks are visually easy for humans. If a model fails here, the failure is hard to excuse as domain complexity.
What Cognaptus infers for business use
The paper directly shows that leading VLMs struggle with nonlocal visual reasoning in controlled synthetic tasks. Cognaptus infers that businesses should be careful when deploying general VLMs into workflows requiring visual comparison, multi-step visual search, or path tracing.
Those are not the same claim. The first is evidence. The second is operational interpretation.
The practical implication is not “do not use VLMs for visual work.” That would be lazy. The better conclusion is that VLMs should be embedded inside visual systems that force procedure, expose intermediate states, and verify brittle steps.
A useful deployment pattern looks like this:
| Visual workflow risk | Safer design pattern |
|---|---|
| The task requires comparing two regions | Crop or isolate both regions, run explicit comparison, and use deterministic difference checks where possible. |
| The task requires multiple visual jumps | Convert the workflow into stepwise calls with logged intermediate targets. |
| The task requires tracing a line or path | Use specialised computer-vision or graph-extraction tools before asking the VLM to reason. |
| The task has high cost of false confidence | Require human review or rule-based validation at the final decision point. |
| The image type has strong conventions | Test on convention-breaking examples, not only ordinary samples. |
| The model gives fluent rationales | Treat rationales as hypotheses, not evidence, unless intermediate visual grounding is verified. |
This is especially relevant in four categories of enterprise use.
Inspection and quality assurance
Object comparison is not optional in inspection. The model must notice when a component has been shifted, rotated, omitted, or substituted. The Object Re-Identification results suggest that global similarity can mask local differences. For QA workflows, a VLM should not be the only comparator. It should be paired with segmentation, template matching, geometric checks, or structured inspection protocols.
UI and software testing
UI testing often involves nonlocal visual reasoning: find a button, follow an instruction, compare a modal to a reference, check whether a layout changed. A general VLM may recognise the screen and still miss the exact visual difference. Stepwise grounding and screenshot-region comparison are safer than broad “does this look right?” prompts.
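Screenshot-region comparison can be as plain as the sketch below, which crops the same rectangle from two screenshots and lists the pixels that differ. Images are modelled as nested lists of integers to keep the example dependency-free. A real pipeline would use an imaging library and probably a tolerance.

```python
def crop(image, top, left, height, width):
    """Cut a rectangular region out of an image stored as rows of pixels."""
    return [row[left:left + width] for row in image[top:top + height]]

def changed_pixels(a, b):
    """List (row, col) positions, relative to the region, where pixels differ."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("regions must have the same shape")
    return [
        (r, c)
        for r, (row_a, row_b) in enumerate(zip(a, b))
        for c, (pa, pb) in enumerate(zip(row_a, row_b))
        if pa != pb
    ]

# Two toy 4x4 "screenshots"; the candidate differs from the reference at (2, 3).
reference = [[0, 0, 0, 0] for _ in range(4)]
candidate = [row[:] for row in reference]
candidate[2][3] = 1

region = dict(top=1, left=2, height=2, width=2)
diff = changed_pixels(crop(reference, **region), crop(candidate, **region))
print(diff)
```

The model can then be asked a narrow question about a region that is already known to have changed, instead of a broad question about whether the screen looks right.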
Diagram and chart review
Chart and diagram understanding is precisely where benchmark confidence can mislead. Many charts are solvable by conventions. Real business charts, regrettably, are designed by humans, which means they often contain strange layouts, overloaded legends, inconsistent colours, and crimes against spacing. If a model must trace a line, match a legend, or follow a flow, test that routine directly.
Maintenance and engineering workflows
Circuit tracing is not a toy problem if your workflow involves cables, pipes, process diagrams, network maps, or wiring schematics. The paper’s Circuit Connections task shows that models may use endpoint proximity or colour rather than path continuity. In maintenance settings, that distinction is not philosophical. It is the difference between finding the right component and confidently pointing at the wrong one.
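The difference between path continuity and mere connectivity can be shown on a toy grid. The sketch below is an illustration, not the paper's task generator. It assumes wires are drawn as `#` on a character grid, endpoints are single letters, and a wire continues straight through a crossing. All of these are simplifying assumptions.

```python
DIRS = [(0, 1), (1, 0), (0, -1), (-1, 0)]

GRID = [
    "..B....",
    "..#....",
    "A#####X",
    "..#....",
    "..Y....",
]

def cell(grid, r, c):
    """Return the character at (r, c), or '.' when outside the grid."""
    if 0 <= r < len(grid) and 0 <= c < len(grid[r]):
        return grid[r][c]
    return "."

def find(grid, label):
    for r, row in enumerate(grid):
        c = row.find(label)
        if c != -1:
            return r, c
    raise ValueError(f"label {label!r} not found")

def trace(grid, source):
    """Follow one wire, preferring to keep the current heading at crossings."""
    r, c = find(grid, source)
    heading = None
    for _ in range(sum(len(row) for row in grid)):  # guard against loops
        options = ([heading] if heading else []) + [d for d in DIRS if d != heading]
        for dr, dc in options:
            if heading and (dr, dc) == (-heading[0], -heading[1]):
                continue  # never double back along the wire
            if cell(grid, r + dr, c + dc) == ".":
                continue
            r, c, heading = r + dr, c + dc, (dr, dc)
            break
        else:
            return None  # dead end
        if grid[r][c] != "#":
            return grid[r][c]  # reached an endpoint label
    return None

def reachable_labels(grid, source):
    """Naive connectivity: every endpoint touching the same blob of wire."""
    start = find(grid, source)
    seen, stack, labels = {start}, [start], set()
    while stack:
        r, c = stack.pop()
        for dr, dc in DIRS:
            nr, nc = r + dr, c + dc
            ch = cell(grid, nr, nc)
            if ch == "." or (nr, nc) in seen:
                continue
            seen.add((nr, nc))
            if ch == "#":
                stack.append((nr, nc))
            else:
                labels.add(ch)
    return labels

print(trace(GRID, "A"), trace(GRID, "B"))
print(sorted(reachable_labels(GRID, "A")))
```

Connectivity alone cannot separate the two wires once they cross, because every endpoint touches the same blob of pixels. Following the heading through the crossing does separate them, which is the routine the benchmark is probing.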
The limitations are real, but they do not rescue the models
The paper’s main limitation is that the benchmark is synthetic. Synthetic tasks isolate primitive visual reasoning, but they do not capture the full range of natural images. Some natural-image tasks can be solved by skimming, using semantic priors, or relying on familiar layout. That is not cheating; it is often efficient.
The evaluation is also cost-limited. The authors use 200 or 125 examples per variant, depending on the task. That is enough to reveal strong patterns, but not enough to map every model’s full capability envelope under every prompting, cropping, tool-use, or agentic configuration.
There is another boundary: the benchmark evaluates VLMs in the tested settings, not every possible multimodal system. A system with explicit visual tools, segmentation, graph extraction, iterative zooming, or programmatic image operations might perform better. In fact, that is part of the point. The paper is strongest as an argument against treating the base VLM as if it already contains all necessary visual routines.
The limitations therefore narrow the interpretation, but they do not dissolve it.
The right reading is:
| Claim | Status |
|---|---|
| Current tested VLMs fail certain controlled nonlocal visual reasoning tasks. | Directly shown by the paper. |
| Strong benchmark scores do not guarantee human-like visual routines. | Strongly supported by the paper’s task design and results. |
| Enterprise workflows requiring comparison, search, or tracing need extra safeguards. | Business inference from the evidence. |
| VLMs cannot be useful for visual enterprise work. | Not shown, and not the right conclusion. |
| Tool-augmented visual agents may overcome some failures. | Plausible, but outside the paper’s direct evidence. |
This is the rare limitation section with good news: the fix is not mystical. The paper does not imply that teams must wait for a new theory of consciousness, a trillion more parameters, or a motivational poster about multimodal alignment. It implies that visual workflows should be engineered as workflows.
The future visual agent will need eyes, memory, and procedure
The most useful mental model from this paper is that VLMs are not blind. They are visually under-procedural.
They can extract. They can describe. They can sometimes compare. They can sometimes make a visual jump. But when the task requires sustained grounding across multiple regions, they often fall back to shortcuts: global similarity, colour cues, proximity, natural-language descriptions, or hallucinated intermediate steps.
That is why the mechanism-first framing matters. The failure is not a single benchmark weakness. It is a missing family of visual routines.
For operators, the lesson is simple enough to be annoying: do not ask a general VLM to perform an unverified visual procedure and then treat the final answer as inspection-grade evidence. Build the procedure around it. Force the intermediate looks. Use specialised tools for geometry and tracing. Test on cases that break familiar conventions. And when the task is costly, keep a human or deterministic validator in the loop.
The bigger picture is that multimodal AI is moving from image answering to visual agency. Agents will not merely caption screenshots. They will inspect, navigate, compare, repair, and operate. That shift makes nonlocal visual reasoning central.
A model that can identify every object in a room but cannot follow the cable between them has not solved vision. It has solved inventory.
Useful, yes. But let us not confuse the stockroom with the nervous system.
Cognaptus: Automate the Present, Incubate the Future.
[1] Shmuel Berman and Jia Deng, “VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs,” arXiv:2507.13361, 2025, https://arxiv.org/abs/2507.13361.