MI-ZO: Teaching Vision-Language Models Where to Look

Opening: Why this matters now

Vision-Language Models (VLMs) are everywhere, judging images, narrating videos, and increasingly acting as reasoning engines layered atop perception. But there is a quiet embarrassment in the room: most state-of-the-art VLMs are trained almost entirely on 2D data, then expected to reason about 3D worlds as if depth, occlusion, and viewpoint were minor details. ...

January 2, 2026 · 4 min · Zelina