Tunnel Vision, Literally: When Cropping Makes Multimodal Models Blind
A receipt is not hard to understand because it is philosophical. It is hard because the answer may live in one corner, the label in another, and the meaning in the relationship between them. That is exactly the kind of thing multimodal large language models are supposed to be getting better at. Give the model an image. Ask a question. Let the model inspect the pixels and reason over the scene. The product demo looks magical until the model reads the wrong number, misses the column header, mistakes a parking space for a lane, or confidently answers a chart question from the wrong local patch. Then the magic becomes a support ticket. ...