Verifiable Rewards

TL;DR for operators: NVIDIA’s paper is not saying “train longer and reasoning magically appears.” That would be comforting, simple, and wrong, a classic enterprise AI trifecta. The practical lesson is more surgical: prolonged reinforcement learning can keep improving a small reasoning model, but only when the training loop actively prevents collapse. The model needs verifiable rewards, diverse tasks, enough rollout diversity, careful clipping, a small KL penalty, reward shaping when behaviour goes off the rails, and periodic resets of both the reference policy and optimiser state. In other words, long-horizon RL behaves less like a single training job and more like operating a live system under stress. ...
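
To make those ingredients concrete, here is a minimal sketch of how a verifiable reward, clipping, a small KL penalty, and periodic resets might fit together for a single rollout. It assumes a PPO-style clipped surrogate with group-normalised advantages; the excerpt does not name the exact objective. Every function name and every numeric value (`clip_low`, `clip_high`, `kl_coef`, `reset_every`) is a hypothetical placeholder, not a setting from the paper.

```python
import math


def verifiable_reward(answer: str, expected: str) -> float:
    """Binary reward: 1.0 only when the answer matches a checkable reference."""
    return 1.0 if answer.strip() == expected.strip() else 0.0


def group_advantages(rewards: list[float]) -> list[float]:
    """Centre and scale rewards across several rollouts of the same prompt."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + 1e-6) for r in rewards]


def clipped_objective(
    logp_new: float,
    logp_old: float,
    logp_ref: float,
    advantage: float,
    clip_low: float = 0.2,    # assumed value, for illustration only
    clip_high: float = 0.2,   # assumed value, for illustration only
    kl_coef: float = 0.001,   # "small KL penalty"; assumed value
) -> float:
    """Per-sample objective to maximise: clipped surrogate minus KL penalty."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    surrogate = min(ratio * advantage, clipped * advantage)
    # Single-sample estimate of the divergence from the reference policy.
    kl_estimate = logp_new - logp_ref
    return surrogate - kl_coef * kl_estimate


def should_reset(step: int, reset_every: int = 1000) -> bool:
    """Hypothetical schedule: when True, copy the current policy into the
    reference policy and reinitialise the optimiser state."""
    return step > 0 and step % reset_every == 0
```

In a real loop the log-probabilities would come from the current policy, the rollout-time policy, and the reference policy, and the objective would be averaged over tokens and rollouts. The sketch only shows where each safeguard enters the calculation.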

Bending the Beam, Not the Brain: What RL with Perfect Rewards Still Can’t Teach LLMs

Train of Thought: How Long-Haul RL Unlocks LLM Reasoning Diversity