Cosmos Policy: When Video Models Stop Watching and Start Acting

A robot in a factory does not need a beautiful video of itself almost doing the job.

It needs the gripper to close at the right moment, the wrist to rotate by the right amount, and the next two seconds of motion not to turn a simple pick-and-place task into modern sculpture. This is where many foundation-model stories become less glamorous. Vision-language models can recognize the scene. Video models can imagine motion. Neither of those achievements automatically gives you a usable control policy.

Cosmos Policy is interesting because it does not treat that gap as a reason to bolt yet another action head onto a large model. It asks a more direct question: what if a video diffusion model could treat robot actions, proprioception, future observations, and values as if they were all just additional latent “frames” in the same sequence?¹

That is the paper’s central move. Not the benchmark leaderboard, although the numbers are good. Not the “world model” label, although that part matters. The real mechanism is latent frame injection: a way to make a pretrained video model stop being merely a watcher of motion and start becoming a generator of action.

The misconception: this is not just another VLA robot policy

The obvious way to classify Cosmos Policy is to put it next to recent vision-language-action models. That would be convenient, and also slightly misleading.

A VLA-style robot model usually starts from the fact that pretrained vision-language systems know a lot about objects, tasks, and language. That is useful when the robot must understand instructions such as “put the eggplant on the plate” or recognize that a zipper pull is the thing it should grasp. But the paper argues that robot control is not only a semantic problem. It is also a temporal and physical one: where will the scene go next, how does contact unfold, and what motion distribution is plausible?

Cosmos Policy starts from a different prior. Its base model is Cosmos-Predict2-2B-Video2World, a pretrained latent video diffusion model. The authors’ bet is that a video model has already learned something useful about scene evolution, motion, and implicit physical dynamics from video prediction. The robot policy then fine-tunes this model on target robot demonstrations.

The difference matters. A static image-language backbone is good at naming what is in front of the robot. A video backbone may be better positioned to model how things move when acted upon. That does not magically solve robotics, because the world remains annoyingly physical, but it changes where the transferred knowledge enters the system.

Latent frame injection is the small trick doing the large job

The paper’s core technical idea is almost suspiciously simple: instead of changing the architecture, represent non-video modalities as latent frames.

Cosmos-Predict2 originally takes an image and a text description, then generates future video frames. It does not naturally know what to do with robot joint angles, action chunks, multiple camera views, future proprioception, or scalar values. Cosmos Policy handles this by inserting these items into the model’s latent sequence.

For a multi-camera robot setup, the latent sequence may include current robot proprioception, wrist-camera images, third-person camera images, an action chunk, future proprioception, future camera observations, and a future state value. The non-image variables are normalized, duplicated to fill latent volumes, and injected directly into placeholder latent frames. During training, the model learns to denoise the target parts of this mixed latent sequence while conditioning on the clean parts.
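The normalize, duplicate, and inject step can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' code: it assumes min-max normalization to [-1, 1] and a tiny 2 x 2 x 4 latent volume, and the helper names (`normalize`, `to_latent_frame`, `read_back`) are hypothetical.

```python
import math

def normalize(values, lo, hi):
    """Scale each value from [lo, hi] to [-1, 1] (assumed normalization range)."""
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]

def to_latent_frame(values, channels, height, width):
    """Tile a flat vector until it fills one channels x height x width volume."""
    size = channels * height * width
    reps = math.ceil(size / len(values))
    flat = (values * reps)[:size]
    # Reshape the flat list into nested [channel][row][column] lists.
    return [[flat[(c * height + h) * width:(c * height + h + 1) * width]
             for h in range(height)]
            for c in range(channels)]

def read_back(frame, n):
    """Recover an n-dimensional vector by averaging its tiled copies."""
    flat = [x for plane in frame for row in plane for x in row]
    return [sum(flat[i::n]) / len(flat[i::n]) for i in range(n)]

# A toy 3-dimensional proprioception vector with joint limits assumed to be [0, 1].
proprio = normalize([0.0, 0.5, 1.0], lo=0.0, hi=1.0)
frame = to_latent_frame(proprio, channels=2, height=2, width=4)
recovered = read_back(frame, n=len(proprio))
```

Averaging the tiled copies on the way out is one plausible way to decode a generated frame back into a vector; treat it as an illustration rather than the paper's decoding rule.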

The useful mental model is this:

Element in the latent sequence	What it represents	Why it matters
Current images	What the robot sees now	Visual grounding
Current proprioception	Where the robot body is now	Embodiment state
Action chunk	What the robot should do next	Direct control output
Future images and proprioception	What the world may look like after action	World model signal
Future value	Expected task progress or return	Planning signal

This is more than a packaging trick, because here the packaging is the interface between the pretrained model and the robot control problem. By treating actions and values as latent frames, the system can reuse the same diffusion machinery that previously generated video. The denoising process becomes the common language for predicting actions, imagined outcomes, and state values.

The paper emphasizes that no architectural modification is made to the base video model. That is important, but it should not be romanticized. The system still requires careful fine-tuning, modality normalization, noise schedule adjustment, and robot-specific data. The elegance is not “zero engineering.” The elegance is that the engineering keeps the pretrained video model’s representational machinery intact.

One model plays three roles: policy, world model, and value estimator

Cosmos Policy is trained to serve three related functions.

First, it is a policy: given the current observation and task description, it generates an action chunk. The authors use action chunks rather than single-step actions, which improves smoothness and avoids re-querying the model at every low-level control tick.

Second, it is a world model: given current state and action, it predicts future observations, including future camera views and future proprioception.

Third, it is a value estimator: it predicts the expected return or task progress associated with a future state.

During initial training, the batch is split across these objectives. Half of the batch trains the policy. The other half is divided between world model and value function training. The conditioning mask determines which part of the latent sequence is treated as input and which part the model must generate.
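A rough sketch of how the batch split and the conditioning mask could fit together. The slot names, the per-objective conditioning sets, and the even split of the second half of the batch are assumptions made for illustration; the text above only fixes the one-half share for the policy objective.

```python
# Order of slots in the latent sequence (hypothetical names).
SLOTS = ["proprio", "wrist_image", "third_person_image", "action_chunk",
         "future_proprio", "future_images", "future_value"]

CURRENT = {"proprio", "wrist_image", "third_person_image"}

# Slots given to the model as clean inputs for each objective (assumed layout).
# Every other slot is noised and must be generated.
CONDITIONED = {
    "policy": CURRENT,
    "world_model": CURRENT | {"action_chunk"},
    "value": CURRENT | {"action_chunk", "future_proprio", "future_images"},
}

def conditioning_mask(objective):
    """True marks a clean conditioning slot, False marks a denoising target."""
    return [slot in CONDITIONED[objective] for slot in SLOTS]

def assign_objectives(batch_size):
    """Half the batch trains the policy; the rest is split between the others."""
    n_policy = batch_size // 2
    n_world = (batch_size - n_policy) // 2
    n_value = batch_size - n_policy - n_world
    return (["policy"] * n_policy + ["world_model"] * n_world
            + ["value"] * n_value)

# One mask per sample in a batch of eight.
masks = [conditioning_mask(objective) for objective in assign_objectives(8)]
```

Because only the mask changes between objectives, the same network and the same denoising loss can serve all three roles, which is the point of the shared latent sequence.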

This joint setup has a practical consequence. The model does not learn “action” as an isolated regression target. It is asked to learn action together with what the action should cause. That auxiliary supervision turns out to matter.

In LIBERO ablations, removing auxiliary targets drops average success from 98.5% to 97.0%. Training the same system without the pretrained model drops it further to 94.6%. These are not catastrophic collapses, but the direction is clear: both the video prior and the extra future/value supervision help.

The RoboCasa appendix makes the point more sharply. When the authors progressively remove value-function training, world-model training, and auxiliary future-state/value targets, the largest performance collapse appears in the barebones action-only variant: average success falls from 67.1% to 44.4%. That is the paper quietly saying: “No, just asking a giant model for actions is not the whole trick.” Rude, but useful.

Planning works only after the robot has seen failure

The direct policy results are already strong, but the more conceptually interesting part is planning.

Cosmos Policy can generate multiple candidate action chunks. A separate rollout-refined planning model can then imagine the future state for each candidate, estimate the value of that future state, and select the candidate with the highest predicted value. This is best-of-N planning, not a deep search tree. The authors use a one-step search over action chunks, with ensembles for future-state and value predictions.
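The selection loop is simple enough to sketch with toy stand-ins. Everything here is hypothetical scaffolding: the function name `plan_best_of_n`, the callables it receives, and the choice to average the ensemble predictions are assumptions, since the text does not say how the ensemble outputs are combined.

```python
import random
from statistics import mean

def plan_best_of_n(observation, sample_action, world_models, value_models,
                   n_candidates, rng):
    """Return the sampled action chunk with the highest ensemble-mean value."""
    best_action, best_score = None, float("-inf")
    for _ in range(n_candidates):
        action = sample_action(observation, rng)
        # Imagine one future per world model, then score each imagined future
        # with every value model and average the results (assumed aggregation).
        scores = [value(world(observation, action))
                  for world in world_models for value in value_models]
        score = mean(scores)
        if score > best_score:
            best_action, best_score = action, score
    return best_action, best_score

# Toy 1-D stand-ins: the "scene" is a position and the goal is to reach 1.0.
GOAL = 1.0

def toy_policy(obs, rng):
    return rng.uniform(-1.0, 2.0)

def toy_world_a(obs, action):
    return obs + action

def toy_world_b(obs, action):
    return obs + 0.9 * action

def toy_value(future):
    return -abs(future - GOAL)

action, score = plan_best_of_n(0.0, toy_policy, [toy_world_a, toy_world_b],
                               [toy_value], n_candidates=8,
                               rng=random.Random(0))
print(f"chosen action {action:.3f}, predicted value {score:.3f}")
```

Note that the loop looks only one action chunk ahead: there is no tree and no backtracking, which is why the quality of the world and value predictions carries so much of the result.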

But there is a catch, and it is the kind of catch that matters in real deployment. Demonstrations usually overrepresent success. If the world model only sees successful trajectories, it may not understand what failure looks like. A robot trained only on neat demonstrations may not know that missing the zipper slider by a millimeter can doom the episode. The world, tragically, has no obligation to stay inside the demo distribution.

So the paper fine-tunes the planning model on rollout data. The authors collect policy rollouts, including successes and failures, and use them to refine the world model and value function. For the planning experiments, they aggregate 648 rollouts across policies and additional Cosmos Policy runs, focusing on two hard ALOHA tasks: “put candies in bowl” and “put candy in ziploc bag.”

This detail is not incidental. It defines what the planning result means. The paper is not claiming that the base demonstration-trained policy can plan effectively out of the box. It is claiming that, after collecting rollout experience, the same model family can refine its world/value predictions and use them to select better actions.

In business language: the robot improves not merely by watching experts, but by learning from its own messy attempts. Apparently robots, like junior analysts, need to be allowed to fail in a controlled environment before being trusted with the expensive equipment.

The evidence is strong, but it supports different claims at different levels

The paper’s experiments answer several different questions. Mixing them together would make the result look cleaner than it is. The cleaner interpretation is to separate the tests by purpose.

Evidence source	Likely purpose	What it supports	What it does not prove
LIBERO benchmark	Main direct-policy comparison	Cosmos Policy is highly competitive as an imitation policy across spatial, object, goal, and long-horizon suites	Does not prove real-world deployment robustness
RoboCasa benchmark	Data-efficiency and generalization comparison	Cosmos Policy reaches 67.1% average success using 50 human demos per task, outperforming methods trained with larger or augmented datasets	Does not show universal kitchen automation readiness
ALOHA direct policy tasks	Real-robot main evidence	Cosmos Policy achieves the highest average score across four bimanual tasks	Evaluation is limited to 101 trials and four selected tasks
LIBERO and RoboCasa ablations	Ablation evidence	Pretraining, auxiliary objectives, world/value training, and future-state supervision contribute to performance	Does not isolate every possible confound
ALOHA planning experiment	Planning extension	Rollout-refined model-based planning improves two difficult tasks by 12.5 points on average	Does not establish low-latency or dynamic-task suitability
Latency appendix	Implementation boundary	Direct inference can be under one second per action chunk; planning takes 4.9 seconds with 8 H100 GPUs	Does not solve deployment cost or reaction-time constraints

On LIBERO, Cosmos Policy reports 98.5% average success across the four main task suites, higher than the listed baselines. On RoboCasa, it reports 67.1% average success across 24 kitchen tasks while using only 50 human demonstrations per task. That comparison is especially relevant because several baselines use more demonstrations or synthetic data.

The real-robot ALOHA results are more operationally suggestive. Cosmos Policy reaches a 93.6 average score across four bimanual manipulation tasks. In the detailed table, it scores 100.0 on “put X on plate,” 99.5 on “fold shirt,” 89.6 on “put candies in bowl,” and 85.4 on “put candy in ziploc bag.” These are not toy movements. The last two tasks involve multimodal grasp choices and high-precision manipulation.

The authors’ qualitative failure analysis is also useful. Some competing policies struggle with high-precision zipper handling or with multimodal candy-grasp sequences. This aligns with the mechanism: a diffusion-style model can represent complex action distributions more naturally than a simple point-regression policy. That is a plausible explanation, not a universal theorem. Still, it is the kind of explanation that connects the result back to the architecture instead of leaving us staring at a leaderboard and pretending the leaderboard is an argument.

The appendix is not decoration; it tells us where the engineering risk sits

The appendix contains several details that are easy to skip and expensive to ignore.

First, the authors modify the diffusion noise distribution. The base video model’s sampling scheme is suitable for video generation, but action generation requires precision. The paper reports that the original log-normal noise distribution gives too little weight to high-noise regimes, causing poor initial denoising and cascading errors for action prediction. Cosmos Policy therefore uses a hybrid log-normal-uniform distribution during training, adding more weight to larger noise levels.
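To make the idea concrete, here is a minimal sketch of sampling a training noise level from a log-normal and uniform mixture. The mixture weight, the log-normal parameters, and the uniform range are all placeholder values chosen for illustration; the paper's actual settings are not given in this article.

```python
import math
import random

def sample_sigma(rng, p_uniform=0.3, log_mean=0.0, log_std=1.0,
                 sigma_lo=1.0, sigma_hi=80.0):
    """Draw one noise level from a log-normal / uniform mixture.

    All default values are hypothetical placeholders.
    """
    if rng.random() < p_uniform:
        # Uniform branch: spreads probability mass over large noise levels.
        return rng.uniform(sigma_lo, sigma_hi)
    # Log-normal branch: concentrates mass around exp(log_mean).
    return math.exp(rng.gauss(log_mean, log_std))

def high_noise_fraction(p_uniform, threshold=10.0, n=10_000, seed=0):
    """Share of sampled noise levels above the threshold."""
    rng = random.Random(seed)
    draws = [sample_sigma(rng, p_uniform=p_uniform) for _ in range(n)]
    return sum(s > threshold for s in draws) / n

# Compare the pure log-normal sampler with the hybrid one.
print("log-normal only:", high_noise_fraction(0.0))
print("hybrid mixture: ", high_noise_fraction(0.3))
```

With these placeholder values the hybrid sampler sees high-noise levels far more often than the pure log-normal one, which is the qualitative effect the paper is after: more practice at the early, high-noise denoising steps where action errors start.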

This is an implementation detail, but not a boring one. It says the video model’s native generation behavior does not transfer perfectly to control. A visually acceptable generated video can tolerate small imperfections. A robot action cannot always tolerate them. The robot does not care that the video would have looked plausible on social media.

Second, the inference-latency appendix clarifies the deployment trade-off. Direct Cosmos Policy inference takes 0.61 seconds per action chunk with 5 denoising steps on one H100 GPU for LIBERO and RoboCasa settings, and 0.95 seconds with 10 denoising steps for ALOHA. The paper also reports a one-step RoboCasa variant with 66.4% success and 0.16 seconds latency, only 0.7 points below the 67.1% five-step result.

That one-step result is practically important. It suggests that direct-policy deployment may have a speed/quality knob. The planning setup is very different. Model-based planning takes 4.9 seconds using 8 parallel H100 GPUs for best-of-N search. That may be acceptable for slow manipulation tasks where the robot can pause between action chunks. It is not obviously acceptable for dynamic manipulation, fast sorting, locomotion, or safety-critical real-time adaptation.

Third, the training compute is not small. LIBERO training uses 64 H100 GPUs for 48 hours. RoboCasa uses 32 H100 GPUs for 48 hours. ALOHA uses 8 H100 GPUs for 48 hours. This is not “download model, collect a few demos, automate your warehouse by Friday.” That sentence belongs in a vendor deck, preferably one nobody reads.
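The compute figures quoted in this section are easier to compare once converted to GPU-hours and GPU-seconds. The short calculation below uses only numbers stated in the text, with two assumptions: every GPU is busy for the full duration, and the one-step RoboCasa variant runs on a single H100 like the other direct-inference settings.

```python
# (H100 GPUs, hours) for each fine-tuning run, as quoted in the text.
TRAINING = {"LIBERO": (64, 48), "RoboCasa": (32, 48), "ALOHA": (8, 48)}

# (H100 GPUs, seconds per action chunk) for each inference mode.
INFERENCE = {
    "direct, 1 step (RoboCasa)": (1, 0.16),  # single GPU is an assumption
    "direct, 5 steps (LIBERO/RoboCasa)": (1, 0.61),
    "direct, 10 steps (ALOHA)": (1, 0.95),
    "planning, best-of-N (ALOHA)": (8, 4.9),
}

def device_time(gpus, duration):
    """Total accelerator time, assuming every GPU is busy the whole time."""
    return gpus * duration

training_cost = {name: device_time(*spec) for name, spec in TRAINING.items()}
inference_cost = {name: device_time(*spec) for name, spec in INFERENCE.items()}

for name, cost in training_cost.items():
    print(f"{name}: {cost} GPU-hours")
for name, cost in inference_cost.items():
    print(f"{name}: {cost:.2f} GPU-seconds per action chunk")
```

Under these assumptions, LIBERO fine-tuning comes to 3,072 GPU-hours, and the planning path costs roughly 40 times more GPU-seconds per action chunk than direct ALOHA inference (39.2 versus 0.95), on top of being about five times slower in wall-clock terms.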

The business value is not cheaper robots; it is cheaper adaptation to hard manipulation

The business relevance of Cosmos Policy is not that it immediately makes robots cheap. The paper does not show that. The more realistic pathway is narrower and more useful: video-model priors may reduce the amount of task-specific robot action data needed for complex manipulation.

That matters because robot demonstrations are expensive. A human operator must teleoperate the robot. The setup must be safe. Edge cases must be collected. Failures must be analyzed. For contact-rich tasks, tiny variations in object pose, friction, deformability, and occlusion can change the outcome. The data problem is not just “we need more rows.” It is “we need the right failures, collected on the right platform, under the right variation.”

Cosmos Policy suggests a route where a pretrained video model supplies temporal and physical priors, then robot demonstrations specialize those priors to a platform. In RoboCasa, the model’s result with 50 human demonstrations per task is the cleanest business signal. If this pattern generalizes, the ROI is not merely better benchmark performance. It is a shorter path from prototype to workable policy for tasks where collecting thousands of high-quality demonstrations would be painful.

The relevant business cases are not generic “robots everywhere.” They are more specific:

Operational setting	Why Cosmos Policy is relevant	Why caution remains necessary
Warehouse manipulation	Many tasks involve object placement, sorting, and repeated grasping	Dynamic throughput constraints may punish slow inference
Kitchen or food-service automation	Contact-rich manipulation and object variation are common	Hygiene, safety, and object deformability add complexity beyond benchmarks
Lab automation	Precise handling and repeatable workflows suit controlled rollout collection	Rare failure modes may be costly
Light manufacturing	Fine manipulation and fixtures can be structured	Integration with legacy motion planning and safety systems remains nontrivial
Data-collection platforms for robotics vendors	Rollout learning can turn failed attempts into useful training signal	Requires disciplined evaluation and labeling infrastructure

The most important practical inference is this: robot learning systems may increasingly be evaluated not just by how well they imitate demonstrations, but by how efficiently they convert rollout failures into better planners.

That is a shift from “train once, deploy hopefully” toward “deploy in controlled conditions, collect failures, refine world/value prediction, then plan better.” For businesses, this implies that robotics ROI will depend as much on the feedback loop as on the initial model.

The planning result is promising because it is narrow, not because it is grand

The planning result should be read carefully.

The authors focus on the two harder ALOHA tasks because the base policy already performs strongly on LIBERO and on the first two ALOHA tasks. They collect rollout data, refine the planning model, and compare model-based planning against the base direct policy and a model-free Q-value variant. The model-based approach performs best and improves average score by 12.5 points across the two difficult tasks.

That is a meaningful result. It is also not a blanket claim that model-based planning is now solved for robotics. The search is shallow. The planning model requires rollout data. Inference is slow. The environment is real, but the task set is limited.

This is exactly why the result is useful. It does not ask us to believe in general robot intelligence. It shows a more concrete pattern:

1. Use a video foundation model as the policy backbone.
2. Encode actions, future observations, and values as latent frames.
3. Train the model jointly so it predicts action and consequence.
4. Collect rollout failures.
5. Refine the world/value model.
6. Search over action candidates using predicted future value.

That pipeline is understandable. It has engineering costs. It has bottlenecks. It can be improved. In other words, it is a research result that an operations-minded reader can actually reason about.

Where the result should not be overextended

There are four boundaries worth keeping separate.

First, the paper’s strongest direct-policy evidence comes from selected manipulation benchmarks and a four-task real ALOHA suite. The real-robot results are impressive, but they do not establish broad industrial readiness.

Second, planning needs rollout experience. Demonstrations alone are not enough for the world model and value function to understand failure. For business deployment, this means a safe rollout-collection process is part of the product, not an optional research luxury.

Third, latency remains a deployment constraint. Direct policy inference may be manageable for slow action chunks. Model-based planning at 4.9 seconds per action chunk on 8 H100 GPUs is a very different cost profile. Slow, deliberate manipulation may tolerate it. Fast dynamic tasks probably will not.

Fourth, full fine-tuning is compute-heavy. Even if model weights and code are released, reproducing or adapting the system at scale requires substantial GPU resources and robotics expertise. The barrier is lower than designing everything from scratch, but not low in absolute terms.

The practical conclusion is not “video models will run factories.” The better conclusion is: video models may become reusable control substrates when the task is sufficiently visual, temporal, and data-constrained, and when the operator can afford controlled rollout learning.

The strategic lesson: robotics foundation models need a memory of consequences

Cosmos Policy is useful because it reframes what a pretrained video model can contribute to robotics. The model is not just a source of visual features. It becomes a shared latent workspace where observations, actions, futures, and values can be modeled together.

That matters because manipulation is not merely about recognizing the world. It is about acting in a way that changes the world predictably enough. Cosmos Policy’s mechanism is valuable precisely because it connects action to imagined consequence inside the same diffusion sequence.

For companies watching embodied AI, the takeaway is not to chase every robot foundation model announcement with equal enthusiasm. The better question is more diagnostic:

Can the model represent the action distribution? Can it predict the consequence of candidate actions? Can it learn from failed rollouts? Can it do all of this fast enough and cheaply enough for the operating environment?

Cosmos Policy gives a strong research answer to the first three questions in controlled manipulation settings. The fourth remains the uncomfortable business question. Naturally, it is the one procurement will ask after the demo video ends.

The paper’s contribution is therefore not just another entry in the robotics benchmark race. It is a mechanism for turning video generation into visuomotor control: actions as latent frames, futures as latent frames, values as latent frames, all trained through the same diffusion backbone.

A video model that can imagine motion is interesting. A video model that can choose an action because it imagines the consequence is more interesting. Also more expensive. Progress usually sends an invoice.

Cognaptus: Automate the Present, Incubate the Future.

¹ Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, and Jinwei Gu, “Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning,” arXiv:2601.16163, 2026, https://arxiv.org/abs/2601.16163.
