Replay the Losses, Win the Game: When Failed Instructions Become Your Best Training Data
Opening: Why this matters now

Reinforcement learning for large language models has a dirty secret: most of the time, nothing happens. When tasks demand perfect instruction adherence (formatting, style, length, logical constraints), the model either nails everything or gets a zero. Binary rewards feel principled, but in practice they starve learning. Aggregated rewards try to help, but they blur causality: different mistakes, same score, same gradient. The result is slow, noisy, and often misdirected optimization. ...
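To make the reward-shape problem concrete, here is a minimal Python sketch contrasting the two schemes. The constraint checkers (`check_format`, `check_length`, `check_style`) are hypothetical stand-ins for real instruction verifiers, not part of any particular method: two responses that fail different constraints end up with the same aggregated score, so the learner receives an identical scalar signal for distinct mistakes.

```python
# Minimal sketch of the reward-shape problem. The checkers below are
# hypothetical stand-ins for real per-constraint instruction verifiers.
from typing import Callable

Checker = Callable[[str], bool]

def check_format(resp: str) -> bool:   # e.g. "output must be JSON-like"
    return resp.strip().startswith("{")

def check_length(resp: str) -> bool:   # e.g. "at most 50 characters"
    return len(resp) <= 50

def check_style(resp: str) -> bool:    # e.g. "no exclamation marks"
    return "!" not in resp

CHECKS: list[Checker] = [check_format, check_length, check_style]

def binary_reward(resp: str) -> float:
    """All-or-nothing: one failed constraint zeroes the whole reward."""
    return 1.0 if all(c(resp) for c in CHECKS) else 0.0

def aggregated_reward(resp: str) -> float:
    """Fraction of constraints satisfied: denser, but blurs causality."""
    return sum(c(resp) for c in CHECKS) / len(CHECKS)

# Two responses with *different* mistakes:
a = '{"answer": "ok"}!'        # fails only the style constraint
b = "plain text, not JSON"     # fails only the format constraint

print(binary_reward(a), binary_reward(b))          # 0.0 0.0
print(aggregated_reward(a), aggregated_reward(b))  # 0.667 0.667
```

Under the binary reward both responses score zero and contribute no learning signal; under the averaged reward they score identically, so neither scheme tells the optimizer which constraint actually caused the failure.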