MI-ZO: Teaching Vision-Language Models Where to Look
Camera placement is an unglamorous way to lose an AI project. A vision-language model may recognize doors, ladders, rocks, chairs, and surface textures perfectly well in ordinary images. Point the camera at the wrong side of an object, however, and the relevant feature disappears. Show the model eight similarly unhelpful views and it has received more data without receiving more evidence. ...