When One Clip Isn’t Enough: Teaching LLMs to Watch Long Videos Like Adults
Opening — Why this matters now Large language models have learned to see. Unfortunately, they still have the attention span of a distracted intern when the video runs longer than a minute. As multimodal LLMs expand their context windows and promise “end-to-end” video understanding, a hard reality remains: long videos are not just longer inputs—they are fundamentally different reasoning problems. Information is sparse, temporally distant, multimodal, and often only meaningful when grounded precisely in time and space. Compress everything up front, and you lose the evidence. Don’t compress, and you blow the context budget. ...