Video Generation

Context Collapse: Why AI’s Next Bottleneck Is Knowing What Matters

TL;DR for operators: AI is getting fluent enough to be dangerous in boring ways. It can describe a scene, generate a video, and write a policy memo with impressive confidence. The problem is that real operations rarely fail at the level of generic fluency. They fail when the system confuses which person did what, blends event one into event two, or treats a documented atrocity as a debate club prompt because a user asked for “balance.” ...

Four Bits, One Identity Crisis: What W4A4 Video Quantization Actually Breaks

TL;DR for operators: The useful surprise in Tail-Aware HiFloat4 is not that a 4-bit video model gets worse. That part is not exactly a Nobel-level plot twist. The useful surprise is where it gets worse. The paper reports a W4A4 HiFloat4 post-training quantization pipeline for Wan2.2-I2V-A14B, and under matched generation settings the unweighted mean score drops from 0.6800 to 0.5880. But the collapse is concentrated: subject consistency falls from 0.9331 to 0.5324, while aesthetic quality is effectively unchanged, overall consistency is comparable, and motion smoothness drops only slightly from 0.9923 to 0.9803 [1]. ...
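To make "concentrated" concrete, here is a minimal sketch that computes the absolute and relative drop for each metric quoted in the excerpt above. It uses only the three baseline and quantized score pairs given there; the dictionary keys and the `score_drop` helper are illustrative names, not anything from the paper, and the excerpt gives no numbers for aesthetic quality or overall consistency, so those are left out.

```python
# Reported (baseline, W4A4 HiFloat4) scores, copied from the excerpt above.
# Key names are illustrative labels, not identifiers from the paper.
reported = {
    "unweighted_mean": (0.6800, 0.5880),
    "subject_consistency": (0.9331, 0.5324),
    "motion_smoothness": (0.9923, 0.9803),
}


def score_drop(baseline, quantized):
    """Return (absolute drop, drop as a fraction of the baseline score)."""
    absolute = baseline - quantized
    relative = absolute / baseline
    return absolute, relative


drops = {name: score_drop(*pair) for name, pair in reported.items()}

# The metric that loses the largest share of its baseline score.
worst_metric = max(drops, key=lambda name: drops[name][1])

for name, (absolute, relative) in drops.items():
    print(f"{name}: -{absolute:.4f} ({relative:.1%} relative)")
print("largest relative drop:", worst_metric)
```

On these numbers, subject consistency loses about 0.40 points (roughly 43% of its baseline), against about 0.09 for the unweighted mean and about 0.01 for motion smoothness, which is the gap the excerpt is pointing at.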

Storyboard, Not Slot Machine: Why AI Video Needs Control Infrastructure

Storyboard. That is the easiest way to understand what SmartDirector is trying to bring into AI video generation. Not a better prompt box. Not a prettier demo reel. Not another mystical “cinematic” adjective sprinkled onto a text prompt like cheap paprika. In normal production, a storyboard does two things at once. It specifies visual anchors (who appears, where they stand, what the camera sees) and it controls pacing (when the story moves, when it cuts, when the viewer should notice a change). Current video generation systems are reasonably good at producing attractive short clips, but they are still awkward when a user wants to say: start here, pass through this middle beat, end there, and do not turn my cat into a different cat halfway through the scene. ...

PyraTok: When Video Tokens Finally Learn to Speak Human

Video looks easy until a machine has to remember what matters. A human watches a short clip and immediately separates the important layers: the object, the action, the background, the timing, the implied intent, the scene transition. A model sees a much less polite object: frames, pixels, motion, compression artifacts, and a large bill for GPU memory. Then we ask it to generate video, answer questions, segment objects, localize actions, and preserve meaning across time. Naturally, the model responds by becoming expensive. Very relatable. ...

When Videos Grow Hands: How PhysWorld Teaches Robots to Stop Hallucinating Physics

Robots are not impressed by nice videos. A generated clip can show a hand placing a book into a shelf, pouring tomatoes from a pan, or sweeping scraps into a dustpan. It can look coherent enough to fool a casual viewer and perhaps even a product demo audience, which is not exactly the highest bar in technology. But a robot does not execute “looks coherent.” It executes poses, contacts, forces, trajectories, collisions, and failures. ...