
Plan>Then>Profit: Reinforcement Learning That Teaches LLMs to Outline Before They Think

TL;DR: Most LLMs reason token‑by‑token and get lost in the weeds. PTA‑GRPO is a two‑stage method that (1) distills short, high‑level plans from a stronger teacher and (2) reinforces both the final answer and the plan’s quality. Across math benchmarks, it reliably outperforms GRPO/DAPO while producing shorter, cleaner solutions. For AI builders, the principle is simple: force an outline, then reward it.

Why this paper matters for builders (not just benchmark chasers)

From local greed to global guidance. Traditional CoT is myopic: it optimizes each next token. PTA‑GRPO adds a global outline that trims detours and reduces reasoning drift.

Aligns with how teams actually work. Great analysts draft an outline before the memo; great agents should too. PTA‑GRPO operationalizes that habit.

Product leverage: If your agents make multi‑step decisions (pricing, triage, troubleshooting), rewarding plan quality prevents hallucinated subgoals and makes reasoning auditable.

Compute sanity: Instead of expensive tree search at inference, PTA‑GRPO trains planning skill so you can keep runtime simple.

The core idea in one picture (words)

Plan → Think → Answer. ...
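As a rough illustration of the "reward the plan" idea, here is a minimal Python sketch of a combined reward plus GRPO‑style group‑normalized advantages. The 0.7/0.3 weights, the `plan_quality_score` judge, and all function names are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch (not the paper's code): score both the final answer and the
# outline, in the spirit of PTA-GRPO's "reward the plan" principle.

def combined_reward(plan: str, answer: str, reference: str, plan_quality_score,
                    w_answer: float = 0.7, w_plan: float = 0.3) -> float:
    """Reward = weighted mix of answer correctness and plan quality."""
    answer_reward = 1.0 if answer.strip() == reference.strip() else 0.0  # verifiable check
    plan_reward = plan_quality_score(plan)  # e.g., a teacher/judge score in [0, 1]
    return w_answer * answer_reward + w_plan * plan_reward

def group_advantages(rewards):
    """GRPO-style group-relative advantage: normalize rewards within one prompt's group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    std = std if std > 0 else 1.0
    return [(r - mean) / std for r in rewards]

# Example: two sampled (plan, answer) completions for one prompt.
judge = lambda plan: min(len(plan.split()) / 20, 1.0)  # toy stand-in for a plan judge
rewards = [combined_reward("outline: isolate x, then substitute back", "7", "7", judge),
           combined_reward("outline: guess and check", "5", "7", judge)]
print(group_advantages(rewards))
```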

October 9, 2025 · 4 min · Zelina

How Sparse is Your Thought? Cracking the Inner Logic of Chain-of-Thought Prompts

Chain-of-Thought (CoT) prompting has become a go-to technique for improving multi-step reasoning in large language models (LLMs). But is it really helping models think better—or just encouraging them to bluff more convincingly? A new paper from Leiden University, “How does Chain of Thought Think?”, delivers a mechanistic deep dive into this question. By combining sparse autoencoders (SAEs) with activation patching, the authors dissect whether CoT actually changes what a model internally computes—or merely helps its outputs look better. ...
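For readers unfamiliar with the tooling, below is a minimal PyTorch sketch of activation patching on a toy MLP. The paper applies this technique (together with SAE features) to LLM internals, so the model, layer choice, and comparison here are purely illustrative.

```python
# Minimal activation-patching sketch with PyTorch forward hooks. The toy MLP is
# a stand-in for an LLM; the mechanics (cache a "clean" activation, splice it
# into a second run) are the same.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
layer = model[1]  # the activation site we want to patch
cached = {}

def cache_hook(module, inputs, output):
    cached["act"] = output.detach().clone()

def patch_hook(module, inputs, output):
    return cached["act"]  # overwrite this run's activation with the cached one

x_clean, x_corrupt = torch.randn(1, 8), torch.randn(1, 8)

# 1) Run the "clean" input and cache the activation at the chosen layer.
handle = layer.register_forward_hook(cache_hook)
clean_out = model(x_clean)
handle.remove()

# 2) Run the "corrupted" input with the clean activation patched in.
handle = layer.register_forward_hook(patch_hook)
patched_out = model(x_corrupt)
handle.remove()

# If patching restores the clean behavior, that layer's activation carries the
# information being tested (here the downstream computation is identical, so
# the difference is zero).
print((patched_out - clean_out).abs().max())
```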

August 1, 2025 · 3 min · Zelina

The Two Minds of Finance: Testing LLMs for Divergence and Discipline

How do we judge whether an AI is thinking like a human—or at least like a financial analyst? A new benchmark, ConDiFi, offers a compelling answer: test not just whether an LLM gets the right answer, but whether it can explore possible ones. That’s because true financial intelligence lies not only in converging on precise conclusions but in diverging into speculative futures. Most benchmarks test convergent thinking: answer selection, chain-of-thought, or multi-hop reasoning. But strategic fields like finance also demand divergent thinking—creative, open-ended scenario modeling that considers fat-tail risks and policy surprises. ConDiFi (short for Convergent-Divergent for Finance) is the first serious attempt to capture both dimensions in one domain-specific benchmark. ...
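As a toy illustration of scoring both dimensions (not ConDiFi's actual methodology), the sketch below pairs a convergent accuracy check with a crude divergent-coverage proxy; the deduplication heuristic stands in for judge-based rating of scenario quality.

```python
# Toy sketch: evaluate convergent and divergent performance separately.

def convergent_score(predictions, answers):
    """Fraction of exact matches on questions with a single correct answer."""
    return sum(p.strip() == a.strip() for p, a in zip(predictions, answers)) / len(answers)

def divergent_score(scenarios, max_scenarios: int = 10):
    """Crude proxy: count of distinct scenarios proposed for an open-ended prompt."""
    unique = {s.strip().lower() for s in scenarios if s.strip()}
    return min(len(unique), max_scenarios) / max_scenarios

report = {
    "convergent": convergent_score(["42"], ["42"]),
    "divergent": divergent_score(["rate cut", "rate hike", "rate cut"]),
}
print(report)
```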

July 25, 2025 · 4 min · Zelina

Backtrack to the Future: How ASTRO Teaches LLMs to Think Like Search Algorithms

A persistent mystery in the recent surge of reasoning-augmented LLMs—like OpenAI’s o1 or DeepSeek-R1—is whether these models learn to reason through post hoc reinforcement fine-tuning, or if they were already good at it to begin with. ASTRO offers a rare counter-example: a method that imbues non-reasoner LLMs (like vanilla Llama 3) with structured reasoning behavior from scratch. Rather than rely on emergent capabilities or distillation from models that already search well, ASTRO teaches LLMs to think like search algorithms themselves, using a hybrid approach combining Monte Carlo Tree Search (MCTS), procedure cloning, chain-of-thought generation, and reinforcement learning with verifiable rewards. ...
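To make the "imitate the search" idea concrete, here is a minimal sketch, with hypothetical node fields, of flattening a search trace (dead ends and backtracks included) into a linear chain of thought a model could be fine-tuned on. ASTRO's actual pipeline builds such traces from MCTS rollouts scored with verifiable rewards; this is only a schematic.

```python
# Minimal sketch (not ASTRO's code) of the procedure-cloning step: turn a tree
# of explored thoughts, including failed branches, into one linear text trace.
from dataclasses import dataclass, field

@dataclass
class Node:
    thought: str
    promising: bool = True
    children: list = field(default_factory=list)

def linearize(node: Node, depth: int = 0) -> list[str]:
    """Depth-first walk that writes out explorations and explicit backtracks."""
    lines = [f"{'  ' * depth}Try: {node.thought}"]
    if not node.promising:
        lines.append(f"{'  ' * depth}This path fails; backtrack.")
        return lines
    for child in node.children:
        lines.extend(linearize(child, depth + 1))
    return lines

trace = Node("factor the quadratic", children=[
    Node("guess roots 2 and 3", promising=False),
    Node("use the quadratic formula"),
])
print("\n".join(linearize(trace)))
```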

July 7, 2025 · 3 min · Zelina