Process-Reward

TL;DR: Generative judges that reason before they judge, and that are trained with online RL on stepwise labels, beat classic discriminative process reward models (PRMs). The StepWiser approach brings three wins: (1) higher accuracy at spotting the first bad step, (2) cleaner, more reliable inference via a “chunk-reset” search that prunes bad steps while keeping overall length similar, and (3) better data selection for fine-tuning.

Why this matters (for builders and buyers)

Most enterprise chain-of-thought (CoT) systems fail not because they can’t produce long reasoning, but because they can’t police their own steps. Traditional PRMs act like a yes/no bouncer at each step: fast, but shallow. StepWiser reframes judging as its own reasoning task: the judge writes an analysis first, then issues a verdict. That small shift has big, practical consequences: ...
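
To make the two mechanisms named above concrete (a judge that writes an analysis before its verdict, and a chunk-reset search that discards rejected steps), here is a minimal Python sketch. It is an illustration under stated assumptions, not StepWiser's actual implementation: the `propose` and `judge` interfaces, the retry budget, the fallback when every retry is rejected, and the toy arithmetic judge are all hypothetical stand-ins for real model calls.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical interfaces (assumptions, not from the source):
#   propose(prefix) -> next reasoning chunk, or None when the solution is done
#   judge(prefix, chunk) -> (analysis text, verdict), analysis written first
Proposer = Callable[[List[str]], Optional[str]]
Judge = Callable[[List[str], str], Tuple[str, bool]]


def chunk_reset_search(propose: Proposer, judge: Judge,
                       max_chunks: int = 8, max_retries: int = 3) -> List[str]:
    """Build a solution chunk by chunk, resampling any chunk the judge rejects."""
    accepted: List[str] = []
    for _ in range(max_chunks):
        candidate: Optional[str] = None
        ok = False
        for _ in range(max_retries + 1):
            candidate = propose(accepted)
            if candidate is None:          # generator signals it is finished
                return accepted
            _analysis, ok = judge(accepted, candidate)
            if ok:
                break                      # keep this chunk, move on
            # Rejected: the chunk is dropped ("reset") and never enters the
            # prefix, so the final chain stays about as long as an unpruned one.
        # Assumption: if every retry is rejected, keep the last candidate
        # so the search cannot stall.
        accepted.append(candidate)
    return accepted


def make_stream_proposer(candidates: Iterable[Optional[str]]) -> Proposer:
    """Toy proposer that replays a fixed list of chunks, ignoring the prefix."""
    stream = iter(candidates)
    return lambda prefix: next(stream, None)


def toy_judge(prefix: List[str], chunk: str) -> Tuple[str, bool]:
    """Toy judge for chunks like '2 + 3 = 5': analyze first, then give a verdict."""
    lhs, rhs = chunk.split("=")
    total = sum(int(term) for term in lhs.split("+"))
    analysis = f"Left side sums to {total}; the chunk claims {int(rhs)}."
    return analysis, total == int(rhs)


propose = make_stream_proposer(["2 + 3 = 5", "5 + 4 = 10", "5 + 4 = 9", None])
result = chunk_reset_search(propose, toy_judge)
print(result)  # ['2 + 3 = 5', '5 + 4 = 9']
```

In this toy run the faulty chunk `5 + 4 = 10` is rejected and resampled, so it never appears in the final chain. In a real system both callables would be model calls, and the analysis string would be the judge's generated reasoning rather than a template.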