Replay the Losses, Win the Game: When Failed Instructions Become Your Best Training Data
Failure logs are usually treated as evidence for the prosecution. A model is asked to produce a concise compliance summary with three bullet points, mention two risks, avoid prohibited claims, and end with a recommendation. It produces three bullets, correctly identifies the risks, avoids the prohibited claims—and forgets the recommendation. Under a strict binary reward, the response receives a zero. Under a partial-credit reward, it might receive 0.75. The first signal says nothing useful happened. The second says something useful happened, but not precisely what. ...