Seeing Is Not Solving: Why AI Still Gets Stuck in 3D Worlds
Opening — Why this matters now For the past two years, Vision-Language Models (VLMs) have been quietly promoted as the next step toward generalist agents—systems that can see, reason, and act. The demos are impressive: navigating apps, interpreting screens, even playing games. And yet, place these same models into a messy, real-time 3D environment—and something breaks. ...