On-Policy Distillation

TL;DR for operators The paper’s main message is simple: if a reasoning model has already walked into a dead end, per-token distillation often keeps supervising it from inside the dead end. A clever loss cap is not a map. A top-k filter is not a tow truck. Trajectory-Refined Distillation, or TRD, repairs the student’s own rollout before using it for distillation. The pipeline is: sample the student’s attempt, ask a teacher or privileged self-teacher to rewrite the trajectory into a better one, then train on the refined trajectory rather than on the original failed rollout. The technical contribution is not “better prompting”, although prompts are used. It is the shift from token-level correction to trajectory-level correction. ...