Process Supervision

Wait, Let Me Check: Why Long-CoT AI Can Still Verify the Wrong Thing

Checking is supposed to calm people down. In business, a second review makes a financial model feel safer. A compliance checklist makes a release feel governed. A senior analyst saying “let me double-check that” gives the room a small dopamine hit of procedural seriousness. Long Chain-of-Thought models have learned the same theatre. They pause. They reconsider. They say “wait.” They verify arithmetic. They sometimes generate reasoning traces so long that one begins to feel the model must be thinking deeply, if only because wasting that many tokens while being shallow seems rude. ...

Credit Where It’s Due: The New Reasoning Stack for Agentic AI

Opening: Why this matters now. The current agentic AI conversation has a very convenient myth: if an AI agent fails, give it a better model, a longer context window, more tool calls, and perhaps a heroic prompt containing the phrase “think step by step” in several places. Then wait for magic. Preferably billable magic. ...

When Solvers Become Judges (and Fail): Why LLMs Still Struggle to Critique Reasoning

Correction is the expensive part. Answer generation is already the familiar magic trick. Give a model a problem, ask for a solution, and admire the fluent staircase of reasoning. Sometimes the staircase even reaches the right floor. That is nice. Investors clap. Product managers update the roadmap. Somewhere, a slide says “AI tutor,” “AI reviewer,” or “autonomous verification layer.” ...

When Puzzles Become Process: Benchmarking the Agentic Mind

More thinking is not the same as better work. A manager asks an AI agent to reconcile invoices, check a procurement exception, or review a regulatory document. The agent pauses, consumes a heroic number of tokens, and returns a polished answer. Very impressive. Very modern. Also, perhaps, completely wrong. The industry has become comfortable with a simple story: give models more reasoning budget and they will reason better. That story is not false. It is merely incomplete, which is where most expensive mistakes prefer to live. ...

Adversaries, Slices, and the Art of Teaching LLMs to Think

A math tutor does not wait until the end of a two-page solution, circle the final answer, and say “wrong.” At least, not a good one. The useful tutor interrupts earlier. This line follows. That parity condition does not. This factorization is legal, but the conclusion you drew from it is not. The feedback is local, not theatrical. It tells the student where the reasoning began to rot, before the final answer becomes merely the visible corpse. ...

Credit Where It's Due: How CAPO Brings Verifiable Precision to LLM Reasoning

TL;DR for operators: CAPO is not mainly a paper about “making models reason better” in the usual fog-machine sense. It is about fixing a specific training failure: outcome-only reinforcement learning tells a model whether the final answer was right, but not which part of the reasoning earned or destroyed that outcome. The method uses a stronger off-the-shelf LLM as a generative process reward model, or GenPRM, to inspect a rollout and identify wrong reasoning steps in one pass. Those step-level critiques are then converted into token-level penalties, so the policy update can suppress flawed reasoning segments instead of treating the whole answer as one indivisible blob. The authors test this across Llama-3-1B/3B and Qwen2.5-1.5B/7B backbones, with results showing consistent average gains over SFT, GRPO with rule-based verification, and GRPO with generative outcome reward modelling.[1] ...
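To make the step-to-token idea in the excerpt above concrete, here is a minimal sketch of how step-level verdicts might be spread over tokens. It is an illustration under stated assumptions, not CAPO's actual formula: the function name `token_level_advantages`, the flat `penalty` value, and the simple subtraction rule are all hypothetical, and the excerpt does not specify how the authors weight or combine the signals.

```python
from typing import List, Sequence, Set


def token_level_advantages(
    step_lengths: Sequence[int],
    wrong_steps: Set[int],
    outcome_advantage: float,
    penalty: float = 1.0,
) -> List[float]:
    """Expand step-level critiques into one advantage value per token.

    Assumptions (not taken from the paper): every token starts with the
    rollout-level outcome advantage, and tokens inside a step flagged as
    wrong have a flat `penalty` subtracted.
    """
    advantages: List[float] = []
    for step_index, length in enumerate(step_lengths):
        step_value = outcome_advantage
        if step_index in wrong_steps:
            # The critic flagged this step, so push its tokens down.
            step_value -= penalty
        advantages.extend([step_value] * length)
    return advantages


# A rollout with three reasoning steps of 3, 2, and 4 tokens.
# The (hypothetical) critic flags only the middle step as wrong.
adv = token_level_advantages([3, 2, 4], {1}, outcome_advantage=0.5, penalty=1.0)
print(adv)
```

In this toy rollout only the two tokens of the flagged step end up with a negative value, while the surrounding steps keep the shared outcome signal. That is the contrast with outcome-only training, where all nine tokens would receive the same number.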

Good Bot, Bad Reward: Fixing Feedback Loops in Vision-Language Reasoning

TL;DR for operators: The useful lesson is not that vision-language models need longer reasoning traces. They already produce plenty of words. Some of them are even adjacent to thought. The useful lesson is that multimodal systems need feedback that can tell where a reasoning path breaks, not merely whether the final answer looks acceptable. ...