LVLMs

Sight Unseen: How LVLM Alignment Can Teach Models to Ignore Images

Image inspection has one rude requirement: the model should look at the image. That sounds too obvious to be an article thesis, which is usually a warning sign. In real deployments, a large vision-language model may describe a damaged package, summarize a product photo, inspect a dashboard screenshot, answer a question about an invoice, or guide a visual agent through a web interface. When it gets something wrong, the default diagnosis is familiar: the vision encoder missed the object, the dataset was noisy, the benchmark was weak, or the model simply hallucinated because models hallucinate. Very tidy. Also incomplete. ...