PyraTok: When Video Tokens Finally Learn to Speak Human
Opening — Why this matters now Text-to-video models are scaling at an alarming pace. Resolution is no longer the bottleneck—semantic fidelity is. As generators push into 4K and even 8K regimes, a quieter but more consequential problem emerges underneath: the tokenizer. If visual tokens do not align with language, no amount of diffusion steps will save downstream reasoning, control, or zero-shot transfer. ...