Compositional Generalization

Drawer. That is the easy part. A model sees a drawer, and it knows that drawers are often opened. Then it watches a video where someone is closing the drawer and predicts opening anyway. This is not the kind of error that makes a demo look silly for five seconds and then disappear into the benchmark appendix. It is the kind of error that reveals what the system is really using as evidence. The model is not necessarily watching the motion. It may be recognizing the object, remembering the most common verb attached to that object during training, and calling that “video understanding.” Very efficient. Also wrong. ...
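One way to make the shortcut concrete is to build a predictor that never looks at motion at all: it memorizes the most frequent verb for each object in training and returns that verb every time. The sketch below is a toy illustration, not any particular benchmark's protocol. The (verb, object) labels are made up, and the names `fit_object_prior` and `object_prior_accuracy` are hypothetical. If a real video model scores about the same as this baseline on verb-object pairs it never saw in training, that suggests it may be leaning on the same evidence.

```python
from collections import Counter, defaultdict


def fit_object_prior(train_pairs):
    """Map each object to the verb it was most often paired with in training."""
    counts = defaultdict(Counter)
    for verb, noun in train_pairs:
        counts[noun][verb] += 1
    return {noun: c.most_common(1)[0][0] for noun, c in counts.items()}


def object_prior_accuracy(prior, test_pairs):
    """Verb accuracy of a predictor that sees only the object, never the motion."""
    hits = sum(prior.get(noun) == verb for verb, noun in test_pairs)
    return hits / len(test_pairs)


# Toy, made-up labels (assumption): drawers are mostly opened in training.
train = [("open", "drawer")] * 8 + [("close", "drawer")] * 2 + [("close", "door")] * 5

# Test set that follows the training distribution of verb-object pairs.
seen_test = [("open", "drawer")] * 4 + [("close", "drawer")] * 1

# Test set built from the less common or unseen combinations.
unseen_test = [("close", "drawer")] * 3 + [("open", "door")] * 2

prior = fit_object_prior(train)
seen_acc = object_prior_accuracy(prior, seen_test)
unseen_acc = object_prior_accuracy(prior, unseen_test)

print(prior)       # the memorized object -> verb table
print(seen_acc)    # accuracy when test pairs resemble training
print(unseen_acc)  # accuracy when the verb goes against the object's habit
```

On this toy data the object-only predictor looks respectable on the familiar split and fails completely once the drawer is being closed. The gap between those two numbers, not either number alone, is the signal worth reporting.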