LLM Reasoning

Train Long, Think Short: How Curriculum Learning Makes LLMs Think Smarter, Not Longer

TL;DR for operators: The paper behind this article proposes Curriculum GRPO, a reinforcement-learning training method that starts a reasoning model with a larger token budget, then gradually shrinks that budget until the model learns to solve problems in shorter traces [1]. The important point is not “ask the model to be brief.” We have tried that. It works roughly as well as asking a committee to be concise, which is to say: occasionally, under duress. The paper instead changes the training trajectory. The model is first allowed to explore longer reasoning paths, and is then forced to compress successful strategies into a tighter token budget. ...
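A minimal sketch of the shrinking-budget idea, for orientation only. The exponential decay schedule, the constants, the reward shape, and the names `token_budget` and `length_aware_reward` are all assumptions made for illustration; the teaser above does not specify how the paper schedules or enforces its budget.

```python
def token_budget(step, start_budget=256, final_budget=64, decay=0.9, interval=100):
    """Hypothetical schedule: multiply the budget by `decay` every `interval`
    training steps, never dropping below `final_budget`."""
    shrunk = start_budget * (decay ** (step // interval))
    return max(final_budget, int(shrunk))


def length_aware_reward(correct, n_tokens, budget):
    """Hypothetical reward: nothing for a wrong answer, full credit for a
    correct answer within budget, proportionally less when over budget."""
    if not correct:
        return 0.0
    if n_tokens <= budget:
        return 1.0
    return budget / n_tokens


if __name__ == "__main__":
    for step in (0, 100, 500, 10_000):
        print(step, token_budget(step))
```

The point of the sketch is the ordering: early steps tolerate long traces, later steps pay less for the same trace, so the pressure to compress arrives only after the model has had room to find working strategies.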

Credit Where It's Due: How CAPO Brings Verifiable Precision to LLM Reasoning

TL;DR for operators: CAPO is not mainly a paper about “making models reason better” in the usual fog-machine sense. It is about fixing a specific training failure: outcome-only reinforcement learning tells a model whether the final answer was right, but not which part of the reasoning earned or destroyed that outcome. The method uses a stronger off-the-shelf LLM as a generative process reward model, or GenPRM, to inspect a rollout and identify wrong reasoning steps in one pass. Those step-level critiques are then converted into token-level penalties, so the policy update can suppress flawed reasoning segments instead of treating the whole answer as one indivisible blob. The authors test this across Llama-3-1B/3B and Qwen2.5-1.5B/7B backbones, with results showing consistent average gains over SFT, GRPO with rule-based verification, and GRPO with generative outcome reward modelling [1]. ...
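To make the "step-level critique to token-level penalty" conversion concrete, here is a hedged sketch. It assumes each reasoning step is known as a half-open token span, that the critic returns the indices of the steps it judged wrong, and that a flat penalty is subtracted from those tokens. The function name, the span format, and the flat penalty are hypothetical; the paper's actual weighting is not given in the teaser above.

```python
from typing import List, Sequence, Set, Tuple


def token_level_advantages(
    n_tokens: int,
    step_spans: Sequence[Tuple[int, int]],  # (start, end) token indices, end exclusive
    wrong_steps: Set[int],                  # step indices flagged by the critic
    outcome_advantage: float,               # sequence-level signal from the final answer
    penalty: float = 1.0,                   # assumed flat penalty per flagged token
) -> List[float]:
    """Start every token at the outcome-level advantage, then subtract a
    penalty from the tokens that belong to steps flagged as wrong."""
    advantages = [outcome_advantage] * n_tokens
    for idx, (start, end) in enumerate(step_spans):
        if idx in wrong_steps:
            for t in range(start, end):
                advantages[t] -= penalty
    return advantages


if __name__ == "__main__":
    spans = [(0, 2), (2, 4), (4, 6)]
    print(token_level_advantages(6, spans, {1}, outcome_advantage=0.5))
```

With three two-token steps and only the middle one flagged, the outer tokens keep the outcome signal while the middle tokens are pushed negative, which is the "not one indivisible blob" behaviour the teaser describes.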

How Sparse is Your Thought? Cracking the Inner Logic of Chain-of-Thought Prompts

TL;DR for operators: Chain-of-thought prompting is often sold as a window into model reasoning. This paper is more useful because it treats CoT as something less mystical and more testable: a prompt-induced change in internal representations [1]. The researchers train sparse autoencoders on hidden activations from two Pythia models solving GSM8K math problems under CoT and NoCoT prompts. They then patch CoT-derived sparse features into NoCoT runs and ask a sharper question: does inserting those internal features increase the log-probability of the correct answer? ...

Tool Up or Tap Out: How Multi-TAG Elevates Math Reasoning with Smarter LLM Workflows

TL;DR for operators: Most tool-using LLM workflows still behave like an intern with a favourite spreadsheet: they call one tool, trust the result, and hope the formatting does not catch fire. Multi-TAG proposes a more disciplined pattern. At each reasoning step, the model does not simply choose between chain-of-thought, Python, or WolframAlpha. It asks several tool-backed executors to propose candidate next steps, checks which candidates lead to the same estimated final answer, and then selects the shortest completion among the candidates that agree. That is the useful idea: not “give the model tools,” but “make tools disagree in a controlled way, then use agreement as a verification signal.” ...
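The selection rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the `Candidate` fields, the use of the most common estimated answer as "agreement", and token count as the measure of "shortest" are all hypothetical choices.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    executor: str           # e.g. "cot", "python", "wolfram" (illustrative labels)
    step: str               # the proposed next reasoning step
    estimated_answer: str   # final answer reached when this step is completed
    completion_tokens: int  # length of that completion


def select_step(candidates: List[Candidate]) -> Candidate:
    """Pick the estimated answer most candidates agree on, then return the
    agreeing candidate with the shortest completion."""
    counts = Counter(c.estimated_answer for c in candidates)
    consensus, _ = counts.most_common(1)[0]
    agreeing = [c for c in candidates if c.estimated_answer == consensus]
    return min(agreeing, key=lambda c: c.completion_tokens)


if __name__ == "__main__":
    proposals = [
        Candidate("cot", "Expand the square by hand", "42", 120),
        Candidate("python", "Evaluate the expression numerically", "42", 80),
        Candidate("wolfram", "Query the closed form", "41", 30),
    ]
    print(select_step(proposals).executor)
```

Note that the shortest candidate overall (the 30-token one) loses here because it disagrees with the other two; brevity only counts among candidates that already agree.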

Train of Thought: How Long-Haul RL Unlocks LLM Reasoning Diversity

TL;DR for operators: NVIDIA’s paper is not saying “train longer and reasoning magically appears.” That would be comforting, simple, and wrong: a classic enterprise AI trifecta. The practical lesson is more surgical: prolonged reinforcement learning can keep improving a small reasoning model, but only when the training loop actively prevents collapse. The model needs verifiable rewards, diverse tasks, enough rollout diversity, careful clipping, a small KL penalty, reward shaping when behaviour goes off the rails, and periodic resets of both the reference policy and optimiser state. In other words, long-horizon RL behaves less like a single training job and more like operating a live system under stress. ...

Backtrack to the Future: How ASTRO Teaches LLMs to Think Like Search Algorithms

TL;DR for operators: ASTRO is not another paper saying “make the model think longer” and then acting surprised when token bills become a lifestyle choice. It is more specific: the authors train a non-reasoner Llama model to imitate the procedure of search. The model is taught to explore a wrong path, notice uncertainty, backtrack, and continue from an earlier step, all inside one generated answer. ...

Enhancing Privately Deployed AI Models: A Sampling-Based Search Approach

TL;DR for operators: Private AI pilots usually fail in a familiar place: the model gives one confident answer, everyone pretends the confidence means something, and then a human quietly redoes the work. Sampling-based search offers a more disciplined alternative. Instead of asking a privately deployed model for one answer, the system asks for many candidate answers, verifies them, compares the strongest contenders, and returns the answer with the best support. The target paper, “Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification,” studies this pattern at meaningful scale and shows that a minimalist version can materially improve reasoning performance without retraining the base model [1]. ...
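The sample-then-verify loop can be sketched as follows. It is a simplified illustration, not the paper's pipeline: `generate` and `verify` are hypothetical callables standing in for calls to the privately deployed model, the verification score is a plain average over repeated checks, and the pairwise comparison of top contenders is left out.

```python
import random
from statistics import mean
from typing import Callable, Dict, Tuple


def sampling_search(
    generate: Callable[[random.Random], str],       # hypothetical: one candidate answer per call
    verify: Callable[[str, random.Random], float],  # hypothetical: score in [0, 1] per check
    n_samples: int = 8,
    n_checks: int = 3,
    seed: int = 0,
) -> Tuple[str, Dict[str, float]]:
    """Sample several candidates, score each distinct answer by averaging
    repeated verification checks, and return the best-supported answer."""
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n_samples)]
    scores: Dict[str, float] = {}
    # dict.fromkeys de-duplicates while keeping first-seen order.
    for answer in dict.fromkeys(candidates):
        scores[answer] = mean([verify(answer, rng) for _ in range(n_checks)])
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    pool = iter(["10", "12", "12", "11"])
    best, scores = sampling_search(
        generate=lambda rng: next(pool),
        verify=lambda ans, rng: 1.0 if ans == "12" else 0.0,
        n_samples=4,
        n_checks=2,
    )
    print(best, scores)
```

Nothing in the loop touches model weights, which is the operational appeal: the extra cost is inference calls, and the quality of the result depends on how trustworthy the `verify` step is.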