Process Reward Models

Process Reward Agents — When Reasoning Learns to Judge Itself (Before It’s Too Late)

Reasoning systems have a familiar failure mode: they can sound calm while quietly walking off a cliff. A model begins with a plausible assumption, adds a second plausible sentence, then a third. By the time the final answer arrives, the mistake is no longer obvious because it has been wrapped in a competent-looking explanation. In low-stakes writing, this is annoying. In medicine, finance, compliance, or legal reasoning, it is a process failure masquerading as intelligence. ...

When Failure Pays Dividends: Recycling Reasoning in RLVR with SCOPE

Failure logs are usually where AI teams put the evidence that training was expensive. A reasoning model tries a problem. It gets most of the chain right. Then, near the end, it makes one bad algebraic turn, chooses the wrong case, or quietly invents a rule that mathematics did not approve. Under standard reinforcement learning from verifiable rewards, that rollout receives the same score as nonsense: zero. The model may have climbed nine floors and tripped on the final step; the reward system marks it as indistinguishable from someone who never entered the building. ...

Prints Charming: How Reward Models Finally Got Serious About Long-Horizon Reasoning

Search looks simple until it becomes a workflow. A human analyst can open ten tabs, notice which source contradicts which, remember that one earlier search result changed the meaning of the question, and decide whether the next move should be another search, a calculation, or a final answer. An LLM agent can also open tabs, call tools, browse pages, run code, and produce a final answer. The difference is that the agent often does all of this with the discipline of a caffeinated intern who has been told that “more context” is the same thing as “better memory.” ...

Judge, Jury, and Chain‑of‑Thought: Making Models StepWiser

TL;DR for operators: StepWiser is a judge for multi-step reasoning systems. Its practical claim is simple: do not wait until the final answer is wrong before discovering that the model fell off a cliff three paragraphs earlier. The paper turns process supervision into a three-part mechanism. First, the solver is taught to divide its reasoning into coherent “chunks-of-thought” rather than arbitrary line breaks. Second, each chunk is labelled by estimating whether continuing after that chunk improves or harms the probability of eventually reaching a correct answer. Third, a separate judge is trained with online reinforcement learning to reason about each chunk before deciding whether it is valid.1 ...
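The second step, labelling each chunk by its effect on the chance of a correct final answer, can be illustrated with a small Monte Carlo sketch. This is one plausible reading of the description above, not the paper's implementation: the names `estimate_success`, `label_chunks`, and `toy_rollout` are hypothetical, the "label 1 if the estimate does not drop" rule is an assumption, and the toy rollout stands in for sampling completions from a real solver and checking them against a verifier.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical signature: given a reasoning prefix, sample one completion
# and report whether its final answer was verified as correct.
Rollout = Callable[[Sequence[str], random.Random], bool]


def estimate_success(prefix: Sequence[str], rollout: Rollout,
                     n_samples: int, rng: random.Random) -> float:
    """Monte Carlo estimate of P(correct final answer | reasoning prefix)."""
    wins = sum(rollout(prefix, rng) for _ in range(n_samples))
    return wins / n_samples


def label_chunks(chunks: Sequence[str], rollout: Rollout,
                 n_samples: int = 2000, seed: int = 0) -> List[int]:
    """Label chunk i with 1 if adding it does not lower the estimated
    success probability, else 0 (assumed rule, for illustration only)."""
    rng = random.Random(seed)
    labels: List[int] = []
    before = estimate_success([], rollout, n_samples, rng)
    for i in range(len(chunks)):
        after = estimate_success(chunks[: i + 1], rollout, n_samples, rng)
        labels.append(1 if after >= before else 0)
        before = after
    return labels


def toy_rollout(prefix: Sequence[str], rng: random.Random) -> bool:
    """Toy stand-in for a solver plus verifier: a chunk starting with
    'BAD' collapses the success probability, any other chunk raises it."""
    p = 0.5
    for chunk in prefix:
        p = 0.05 if chunk.startswith("BAD") else min(0.95, p + 0.2)
    return rng.random() < p


if __name__ == "__main__":
    print(label_chunks(["good step", "BAD step", "good step"], toy_rollout))
```

With this toy setup the middle chunk should be labelled 0 and its neighbours 1. The last chunk earns a positive label even though the chain is already damaged, because the label here measures relative change rather than absolute correctness. Whether that is the desired behaviour is a design choice, and a real pipeline would pay for each estimate in solver rollouts.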