When Failure Pays Dividends: Recycling Reasoning in RLVR with SCOPE
Failure logs are usually where AI teams put the evidence that training was expensive. A reasoning model tries a problem. It gets most of the chain right. Then, near the end, it makes one bad algebraic turn, chooses the wrong case, or quietly invents a rule that mathematics did not approve. Under standard reinforcement learning from verifiable rewards, that rollout receives the same score as nonsense: zero. The model may have climbed nine floors and tripped on the final step; the reward system marks it as indistinguishable from someone who never entered the building. ...