
PyraTok: When Video Tokens Finally Learn to Speak Human

Opening — Why this matters now
Text-to-video models are scaling at an alarming pace. Resolution is no longer the bottleneck—semantic fidelity is. As generators push into 4K and even 8K regimes, a quieter but more consequential problem emerges underneath: the tokenizer. If visual tokens do not align with language, no number of diffusion steps will save downstream reasoning, control, or zero-shot transfer. ...

January 24, 2026 · 3 min · Zelina

When Models Guess the Verb by Looking at the Drawer

Opening — Why this matters now
If you have ever watched a video model confidently predict “opening drawer” when the person is clearly closing it, you have already encountered the core problem of modern compositional video understanding: the model isn’t really watching the action. It is guessing. As video models are increasingly deployed in robotics, industrial monitoring, and human–AI interaction, the ability to generalize correctly to unseen verb–object combinations is no longer academic. A robot that confuses opening with closing is not merely inaccurate—it is dangerous. ...

January 24, 2026 · 4 min · Zelina

When One Clip Isn’t Enough: Teaching LLMs to Watch Long Videos Like Adults

Opening — Why this matters now
Large language models have learned to see. Unfortunately, they still have the attention span of a distracted intern when the video runs longer than a minute. As multimodal LLMs expand their context windows and promise “end-to-end” video understanding, a hard reality remains: long videos are not just longer inputs—they are fundamentally different reasoning problems. Information is sparse, temporally distant, multimodal, and often only meaningful when grounded precisely in time and space. Compress everything up front, and you lose the evidence. Don’t compress, and you blow the context budget. ...

December 24, 2025 · 4 min · Zelina