Video LLMs

A car approaches a crosswalk. The frames look simple: car, road, direction, movement. A human can still ask the useful question: is the car speeding up, slowing down, or merely moving at a steady pace? A video language model may answer with the confidence of a dashboard camera that has read too many captions and learned too little physics. It sees a car getting closer. It infers “accelerating.” The problem is not that the model missed the car. The problem is that it saw the same visual pattern and failed to model the hidden change in motion. ...