When Models Guess the Verb by Looking at the Drawer
Opening — Why this matters now If you have ever watched a video model confidently predict opening drawer when the person is clearly closing it, you have already encountered the core problem of modern compositional video understanding: the model isn’t really watching the action. It is guessing. As video models are increasingly deployed in robotics, industrial monitoring, and human–AI interaction, the ability to correctly generalize unseen verb–object combinations is no longer academic. A robot that confuses opening with closing is not merely inaccurate—it is dangerous. ...