Seeing Is Not Reasoning: Why Mental Imagery Still Breaks Multimodal AI
Opening — Why this matters now Multimodal AI is having its cinematic moment. Video generation, image rollouts, and interleaved vision–language reasoning are being marketed as steps toward models that can think visually. The implicit promise is seductive: if models can generate images while reasoning, perhaps they can finally reason with them. This paper delivers a colder verdict. When tested under controlled conditions, today’s strongest multimodal models fail at something deceptively basic: maintaining and manipulating internal visual representations over time. In short, they can see—but they cannot mentally imagine in any robust, task‑reliable way. ...