
Credit Where It's Due: How CAPO Brings Verifiable Precision to LLM Reasoning

TL;DR for operators: CAPO is not mainly a paper about “making models reason better” in the usual fog-machine sense. It is about fixing a specific training failure: outcome-only reinforcement learning tells a model whether the final answer was right, but not which part of the reasoning earned or destroyed that outcome. The method uses a stronger off-the-shelf LLM as a generative process reward model, or GenPRM, to inspect a rollout and identify wrong reasoning steps in one pass. Those step-level critiques are then converted into token-level penalties, so the policy update can suppress flawed reasoning segments instead of treating the whole answer as one indivisible blob. The authors test this across Llama-3-1B/3B and Qwen2.5-1.5B/7B backbones, with results showing consistent average gains over SFT, GRPO with rule-based verification, and GRPO with generative outcome reward modelling.[1] ...

August 5, 2025 · 14 min · Zelina
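
The step-to-token idea in the CAPO summary above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `token_advantages`, the fixed `step_penalty`, and the simple subtraction rule are hypothetical and are not taken from the paper, whose exact formula is not given in this summary. The sketch only shows the general shape of the mechanism: one rollout-level advantage is copied to every token, then tokens belonging to steps flagged as wrong are pushed down.

```python
from typing import List, Sequence, Set


def token_advantages(
    step_token_spans: Sequence[range],
    wrong_steps: Set[int],
    outcome_advantage: float,
    step_penalty: float = 1.0,
) -> List[float]:
    """Spread one rollout-level advantage over tokens, penalising flagged steps.

    Assumes the spans are contiguous and together cover token indices
    0..N-1 of the rollout. The subtraction rule is illustrative only.
    """
    num_tokens = sum(len(span) for span in step_token_spans)
    # Outcome-only baseline: every token shares the same scalar advantage.
    advantages = [outcome_advantage] * num_tokens
    for step_index, span in enumerate(step_token_spans):
        if step_index in wrong_steps:
            # Step-level critique becomes a token-level penalty.
            for token_index in span:
                advantages[token_index] -= step_penalty
    return advantages


# Three reasoning steps covering nine tokens; the critic flags step 1.
spans = [range(0, 3), range(3, 5), range(5, 9)]
print(token_advantages(spans, wrong_steps={1}, outcome_advantage=0.5))
# → [0.5, 0.5, 0.5, -0.5, -0.5, 0.5, 0.5, 0.5, 0.5]
```

In this toy case the rollout as a whole is still rewarded, but the two tokens of the flagged step receive a negative signal, which is the contrast with treating the answer as one indivisible blob.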

Red Flag on the Track: Why LLMs Still Struggle with Real Algorithmic Reasoning

TL;DR for operators: FormulaOne is a useful red flag because it tests something many businesses quietly assume LLMs already possess: the ability to design deep algorithms, not merely write plausible code around familiar patterns.[1] The benchmark contains 120 hard dynamic-programming problems on tree-like graphs, plus 100 easier FormulaOne-Warmup problems. The hard tasks are generated from Monadic Second-Order logic, come with verifiable evaluation, and sit near the kind of combinatorial reasoning used in routing, scheduling, network design and other optimisation-heavy domains. ...

July 18, 2025 · 17 min · Zelina