TL;DR
Most LLMs reason token‑by‑token and get lost in the weeds. PTA‑GRPO is a two‑stage method that (1) distills short, high‑level plans from a stronger teacher and (2) reinforces both the final answer and the plan’s quality. Across math benchmarks, it reliably outperforms GRPO/DAPO while producing shorter, cleaner solutions. For AI builders, the principle is simple: force an outline, then reward it.
Why this paper matters for builders (not just benchmark chasers)
- From local greed to global guidance. Traditional CoT is myopic: it optimizes each next token. PTA‑GRPO adds a global outline that trims detours and reduces reasoning drift.
- Aligns with how teams actually work. Great analysts draft an outline before the memo; great agents should too. PTA‑GRPO operationalizes that habit.
- Product leverage: If your agents make multi‑step decisions (pricing, triage, troubleshooting), rewarding plan quality prevents hallucinated subgoals and makes reasoning auditable.
- Compute sanity: Instead of expensive tree search at inference, PTA‑GRPO trains planning skill so you can keep runtime simple.
The core idea in one picture (words)
Plan → Think → Answer.
- Plan: Generate a compact analytic plan that lists phases/subgoals—no arithmetic, no step‑by‑step yet.
- Think: Execute detailed Chain‑of‑Thought conditioned on the plan, revising it if needed.
- Answer: Produce a concise final result.
The twist: RL doesn’t only score the answer. It also scores how well the plan tends to yield correct solutions and whether the response follows a clean, parsable format.
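To make the contract concrete, here is an illustrative plan-first output on a toy problem (our own example, not from the paper), written as a Python string you could drop into a few-shot prompt:

```python
# Illustrative plan-first exemplar; the toy problem and wording are our
# assumptions, but the three-tag contract mirrors the format described above.
PLAN_FIRST_EXEMPLAR = """\
Question: How many positive divisors does 36 have?

<plan>
1. Factor 36 into primes.
2. Apply the divisor-count formula to the exponents.
3. Report the count.
</plan>
<think>
36 = 2^2 * 3^2, so the number of divisors is (2+1) * (2+1) = 9.
</think>
<answer>9</answer>
"""
```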
How PTA‑GRPO actually works
Stage 1 — Cold‑start with planning‑structured SFT
- Take a reasoning dataset with CoT and ask a stronger model to summarize each solution into a short plan.
- Fine‑tune the target model to emit <plan>…</plan>, then <think>…</think>, then <answer>…</answer>.
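A minimal sketch of the Stage 1 data construction, assuming you already have (question, CoT, answer) triples and some callable `teacher(prompt) -> str` for the stronger model; the helper names and prompt wording are our assumptions, not the paper's exact recipe:

```python
from typing import Callable, Iterable

# Hypothetical instruction for the teacher model to compress a CoT into a plan.
PLAN_PROMPT = (
    "Summarize the following worked solution into a short, high-level plan "
    "(phases/subgoals only, no arithmetic, no step-by-step work):\n\n{cot}"
)

def build_sft_example(question: str, cot: str, answer: str,
                      teacher: Callable[[str], str]) -> dict:
    """Turn one (question, CoT, answer) triple into a plan-structured SFT target."""
    plan = teacher(PLAN_PROMPT.format(cot=cot)).strip()
    target = (f"<plan>\n{plan}\n</plan>\n"
              f"<think>\n{cot.strip()}\n</think>\n"
              f"<answer>{answer.strip()}</answer>")
    return {"prompt": question, "completion": target}

def build_sft_dataset(triples: Iterable[tuple[str, str, str]],
                      teacher: Callable[[str], str]) -> list[dict]:
    return [build_sft_example(q, cot, a, teacher) for q, cot, a in triples]
```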
Stage 2 — Planning‑guided RL (GRPO, upgraded)
At rollout time, sample multiple plans per question; for each plan, sample several CoTs. Compute a composite reward:
| Reward component | What it measures | Why it helps | Implementation cue |
|---|---|---|---|
| Analytic plan reward | For a given plan, the empirical accuracy of the CoTs sampled under it | Direct pressure to craft useful plans, not pretty ones | Softmaxed pass‑rate per plan (amplifies separation) |
| Outcome reward | Correctness of the final answer | Keeps the eye on the prize | Standard RLVR signal |
| Format/length reward | Output adheres to <plan><think><answer> and stays concise | Yields stable, short, parsable traces | Binary format bonus + length bonus toward the shortest correct trace |
Net effect: models learn what to plan, how to follow it, and when to stop.
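A minimal sketch of how such a composite reward could be wired up, assuming verifiable answers; the weights, temperature, and exact normalization are illustrative assumptions, not the paper's settings:

```python
import math
import re
from dataclasses import dataclass

TAG_PATTERN = re.compile(
    r"<plan>.*?</plan>\s*<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL
)

@dataclass
class Rollout:
    plan_id: int      # which sampled plan this CoT was conditioned on
    text: str         # full <plan>...<think>...<answer>... trace
    correct: bool     # verified against the reference answer
    num_tokens: int

def composite_rewards(rollouts: list[Rollout],
                      w_plan: float = 0.5, w_outcome: float = 1.0,
                      w_format: float = 0.1, w_length: float = 0.1,
                      tau: float = 0.2) -> list[float]:
    # 1) Per-plan empirical pass rate, softmaxed across plans to amplify separation.
    plans = sorted({r.plan_id for r in rollouts})
    pass_rate = {p: sum(r.correct for r in rollouts if r.plan_id == p)
                    / max(1, sum(r.plan_id == p for r in rollouts)) for p in plans}
    exps = {p: math.exp(pass_rate[p] / tau) for p in plans}
    z = sum(exps.values())
    plan_reward = {p: exps[p] / z for p in plans}

    # 2) Length bonus: push correct traces toward the shortest correct one.
    correct_lens = [r.num_tokens for r in rollouts if r.correct]
    shortest = min(correct_lens) if correct_lens else None

    rewards = []
    for r in rollouts:
        outcome = 1.0 if r.correct else 0.0
        fmt = 1.0 if TAG_PATTERN.search(r.text) else 0.0
        length = (shortest / r.num_tokens) if (r.correct and shortest) else 0.0
        rewards.append(w_plan * plan_reward[r.plan_id] + w_outcome * outcome
                       + w_format * fmt + w_length * length)
    return rewards
```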
Results at a glance
| Backbone | Method | AIME24 | AIME25 | MATH500 | AMC23 | Avg |
|---|---|---|---|---|---|---|
| Qwen2.5‑7B‑Instruct | Base | 12.2 | 3.5 | 62.4 | 52.8 | 32.7 |
| Qwen2.5‑7B‑Instruct | GRPO | 27.5 | 22.3 | 82.7 | 63.6 | 49.0 |
| Qwen2.5‑7B‑Instruct | PTA‑GRPO | 30.3 | 26.0 | 85.6 | 70.2 | 53.0 |
| LLaMA3.2‑3B | GRPO | 16.3 | 14.2 | 55.2 | 38.3 | 31.0 |
| LLaMA3.2‑3B | PTA‑GRPO | 20.5 | 14.3 | 60.3 | 40.4 | 33.9 |
| Qwen3‑8B | GRPO | 68.3 | 54.2 | 92.9 | 92.0 | 76.9 |
| Qwen3‑8B | PTA‑GRPO | 68.9 | 54.3 | 93.3 | 92.3 | 77.2 |
| Qwen3‑14B | GRPO | 71.3 | 71.3 | 90.3 | 94.9 | 82.0 |
| Qwen3‑14B | PTA‑GRPO | 73.9 | 71.6 | 91.9 | 95.0 | 83.1 |
Takeaways
- Consistent wins across small and large models; the biggest lifts show up on weaker backbones.
- Data scale helps: scaling the RL data from 4k to 14k steadily boosts average scores.
- Ablations: removing planning or SFT hurts; format reward trades a tiny bit of score for stability and brevity.
Business translation: Where this helps today
Customer support playbooks
- Problem: LLM agents ramble or miss key steps in refund/verification flows.
- PTA move: Force <plan> sections to enumerate required checks; reward plans whose downstream transcripts resolve tickets with fewer back‑and‑forths.
Pricing & discount policies
- Problem: Token‑by‑token greed yields inconsistent justification.
- PTA move: Plans must list applicable rules and constraints first; RL rewards plans that produce consistent, policy‑compliant offers with minimal tokens.
DevOps incident response
- Problem: Agents fix a symptom, not the cause.
- PTA move: Plans must state hypothesis tree → diagnostics → rollback criteria; reward plans that reduce mean tokens‑to‑resolution and human escalations.
Implementation notes for your stack
- Add an outline gate at inference even before you retrain: prompt your current model to emit <plan> → <think> → <answer>. You'll often see fewer mistakes right away (see the sketch after this list).
- Log per‑plan pass‑rates: during evaluation, sample multiple CoTs under each plan and keep plan‑level metrics. It's incredibly diagnostic.
- Design format schemas: make <plan> strictly non‑computational. Plans that sneak calculations tend to overfit and become brittle.
- Reward brevity: encourage the shortest correct traces. It trims cost without sacrificing accuracy.
- Guardrails: reject outputs without all three tags; your downstream parsers will thank you.
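A minimal sketch of the outline gate and guardrail from the notes above: a plan-first system prompt plus a strict parser that rejects anything missing the three tags. The prompt wording and helper names are our assumptions, not a published preset.

```python
import re
from typing import Optional

# Hypothetical plan-first preset; wording is ours.
PLAN_FIRST_SYSTEM_PROMPT = (
    "First write a short high-level plan inside <plan>...</plan> (subgoals only, "
    "no calculations). Then reason in detail inside <think>...</think>. "
    "Finally give only the final result inside <answer>...</answer>."
)

SECTION_RE = re.compile(
    r"<plan>(?P<plan>.*?)</plan>\s*<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>",
    re.DOTALL,
)

def parse_plan_think_answer(output: str) -> Optional[dict]:
    """Return the three sections, or None if the output violates the format (guardrail)."""
    m = SECTION_RE.search(output)
    if m is None:
        return None  # reject: downstream parsers never see malformed traces
    return {k: v.strip() for k, v in m.groupdict().items()}
```

Outputs that fail `parse_plan_think_answer` can be retried or dropped before they reach downstream logic.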
Open questions we’re tracking
- Transfer to non‑math domains: Legal, medical, and multi‑turn biz tasks need plan quality signals beyond exact‑match answers. What’s the right surrogate (e.g., rubric‑graded outcomes, policy conformance, user‑rated utility)?
- Plan diversity vs. convergence: Softmaxing plan success magnifies winners—great for quality, but could reduce diversity. How to keep exploration alive?
- Plan brittleness: If the outline is wrong, can the model debug the plan mid‑flight without spiraling? The paper allows revision in <think>, but real systems may need explicit plan‑repair steps.
What we’ll do at Cognaptus
- Ship a plan‑first prompting preset in our internal agents.
- Add per‑plan analytics to our evaluation harness (pass@K under a fixed plan; see the sketch after this list).
- Prototype a policy‑conformance reward for ops agents mirroring the paper’s format/length bonus.
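For the per‑plan analytics, one way to estimate pass@K under a fixed plan is the standard unbiased estimator over n sampled CoTs with c verified correct (function and variable names here are ours):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k over n samples with c correct, with the plan held fixed."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 CoTs sampled under one plan, 6 verified correct.
print(round(pass_at_k(n=16, c=6, k=4), 3))
```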
Cognaptus: Automate the Present, Incubate the Future.