
Plan>Then>Profit: Reinforcement Learning That Teaches LLMs to Outline Before They Think
TL;DR: Most LLMs reason token by token and get lost in the weeds. PTA‑GRPO is a two‑stage method that (1) distills short, high‑level plans from a stronger teacher and (2) reinforces both the final answer and the plan's quality. Across math benchmarks, it reliably outperforms GRPO/DAPO while producing shorter, cleaner solutions. For AI builders, the principle is simple: force an outline, then reward it.

Why this paper matters for builders (not just benchmark chasers)

- From local greed to global guidance. Traditional CoT is myopic: it optimizes each next token. PTA‑GRPO adds a global outline that trims detours and reduces reasoning drift.
- Aligns with how teams actually work. Great analysts draft an outline before the memo; great agents should too. PTA‑GRPO operationalizes that habit.
- Product leverage: If your agents make multi‑step decisions (pricing, triage, troubleshooting), rewarding plan quality prevents hallucinated subgoals and makes reasoning auditable.
- Compute sanity: Instead of expensive tree search at inference, PTA‑GRPO trains planning skill so you can keep runtime simple.

The core idea in one picture (words)

Plan → Think → Answer. ...
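To make "reward the answer and the plan" concrete, here is a minimal Python sketch under stated assumptions: a combined reward with an assumed plan‑quality weight, normalized GRPO‑style within a group of sampled solutions. The names (`combined_reward`, `plan_quality`, `ALPHA`) and the exact reward form are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the "reward the plan, not just the answer" idea.
# ALPHA, plan_quality, and the additive reward form are assumptions for illustration.
from statistics import mean, pstdev

ALPHA = 0.3  # assumed weight on plan quality relative to answer correctness

def combined_reward(answer_correct: bool, plan_quality: float) -> float:
    """Answer correctness plus a weighted plan-quality bonus (plan_quality in [0, 1])."""
    return float(answer_correct) + ALPHA * plan_quality

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style normalization: each sampled solution's advantage is its reward
    standardized against the group sampled for the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

# Example: 4 sampled solutions for one prompt. Two reach the right answer,
# but the one with the cleaner outline earns the larger advantage.
samples = [(True, 0.9), (True, 0.4), (False, 0.7), (False, 0.1)]
rewards = [combined_reward(correct, plan) for correct, plan in samples]
print(group_relative_advantages(rewards))
```

The point of the sketch: a solution that reaches the right answer via a clean outline gets a stronger update than one that stumbles into it, which is the pressure that produces shorter, cleaner reasoning at inference time.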