Active Perception

GUI-Eyes: When Agents Learn Where to Look

Screenshots look simple until they are not. A human opening a dense professional application does not inspect every pixel with equal seriousness. We glance, zoom in mentally, ignore decorative clutter, search for the likely region, then focus. In other words, we do not merely “see” the interface. We decide where to look. ...

MI-ZO: Teaching Vision-Language Models Where to Look

Camera placement is an unglamorous way to lose an AI project. A vision-language model may recognize doors, ladders, rocks, chairs, and surface textures perfectly well in ordinary images. Point the camera at the wrong side of an object, however, and the relevant feature disappears. Show the model eight similarly unhelpful views and it has received more data without receiving more evidence. ...