3D Vision

Photon or Not: When AI Learns to See in 3D Without Burning Your GPU

CT scans are not photographs. This is a small fact with expensive consequences. A normal image model can pretend that visual understanding is mostly a matter of looking at a flat picture. A CT volume does not offer that courtesy. It is dense, three-dimensional, and full of clinically relevant details that may occupy only a small part of the scan. Feed the whole thing into a multimodal large language model, and the model faces a choice: compress the volume aggressively, sample a few slices, or ask the GPU to become a radiologist with a power bill. ...

When Robots Guess, People Bleed: Teaching AI to Say ‘This Is Ambiguous’

Vial. That is the easy version of the problem. A robot stands near a surgical tray. A person says, “Pass me the vial.” There are two vials. One is harmless. One is not. The robot does not need a better smile, a warmer voice, or a more fluent explanation of how helpful it intends to be. It needs to know that the instruction should not be executed yet. ...

MI-ZO: Teaching Vision-Language Models Where to Look

Camera placement is an unglamorous way to lose an AI project. A vision-language model may recognize doors, ladders, rocks, chairs, and surface textures perfectly well in ordinary images. Point the camera at the wrong side of an object, however, and the relevant feature disappears. Show the model eight similarly unhelpful views, and it has received more data without receiving more evidence. ...