Video-Models

A robot in a factory does not need a beautiful video of itself almost doing the job. It needs the gripper to close at the right moment, the wrist to rotate by the right amount, and the next two seconds of motion not to turn a simple pick-and-place task into modern sculpture. This is where many foundation-model stories become less glamorous. Vision-language models can recognize the scene. Video models can imagine motion. Neither of those achievements automatically gives you a usable control policy. ...