Tunnel Vision: Why Vision-Language Models Still Miss the Bigger Picture
TL;DR for operators
A vision-language model can describe an image, answer a chart question, and still fail at the kind of seeing that a bored intern would perform before lunch. That is the operational lesson from Shmuel Berman and Jia Deng’s paper, "VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs" [1]. The paper tests whether leading VLMs can do three basic things: compare two visual objects across an image, follow a sequence of visual clues, and trace a continuous line to its endpoint. Humans find these tasks trivial. Current VLMs do not. ...