RLVR

TL;DR for operators: CAPO is not mainly a paper about “making models reason better” in the usual fog-machine sense. It is about fixing a specific training failure: outcome-only reinforcement learning tells a model whether the final answer was right, but not which part of the reasoning earned or destroyed that outcome. The method uses a stronger off-the-shelf LLM as a generative process reward model, or GenPRM, to inspect a rollout and identify wrong reasoning steps in one pass. Those step-level critiques are then converted into token-level penalties, so the policy update can suppress flawed reasoning segments instead of treating the whole answer as one indivisible blob. The authors test this across Llama-3-1B/3B and Qwen2.5-1.5B/7B backbones, with results showing consistent average gains over SFT, GRPO with rule-based verification, and GRPO with generative outcome reward modelling.1 ...
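To make the step-to-token conversion concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's actual formula: the function name `token_level_rewards`, the `step_penalty` value, and the simple "subtract a fixed penalty from every token in a flagged step" rule are all hypothetical choices made for this example.

```python
from typing import List, Set, Tuple


def token_level_rewards(
    num_tokens: int,
    step_spans: List[Tuple[int, int]],
    wrong_steps: Set[int],
    outcome_reward: float,
    step_penalty: float = 0.5,  # hypothetical value, not taken from the paper
) -> List[float]:
    """Spread an outcome reward over tokens, penalizing flagged steps.

    step_spans[i] is the half-open token range (start, end) of reasoning step i.
    wrong_steps holds the indices of steps a critic (e.g. a GenPRM) marked wrong.
    """
    # Outcome-only baseline: every token gets the same sequence-level reward.
    rewards = [outcome_reward] * num_tokens
    # Step-level critique: subtract a penalty only inside flagged steps.
    for step_idx in wrong_steps:
        start, end = step_spans[step_idx]
        for t in range(start, end):
            rewards[t] -= step_penalty
    return rewards


# A 6-token rollout with three 2-token steps; the critic flags step 1.
spans = [(0, 2), (2, 4), (4, 6)]
print(token_level_rewards(6, spans, {1}, outcome_reward=1.0))
```

With no flagged steps the function reduces to the outcome-only case, where every token shares one reward. Flagging a step changes only the tokens inside that step, which is the sense in which the update can target a flawed segment rather than the whole answer.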

Credit Where It's Due: How CAPO Brings Verifiable Precision to LLM Reasoning

Red Flag on the Track: Why LLMs Still Struggle with Real Algorithmic Reasoning