Plan>Then>Profit: Reinforcement Learning That Teaches LLMs to Outline Before They Think

Planning is usually the part of work everybody claims to value and nobody wants to inspect. The deck has a roadmap. The project has a strategy. The model has a chain of thought. Splendid. Now, does the plan actually make the execution better, or is it just theatre with bullet points?

That is the useful question behind Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning, which introduces PTA-GRPO, a reinforcement-learning method that trains language models to generate an explicit analytic plan before detailed reasoning and then rewards the quality of that plan, not merely the final answer.¹

The paper is not interesting because it says “planning helps.” We have heard that line before, usually from people selling frameworks, templates, or workshops with beige diagrams. Its more specific contribution is sharper: if a model is going to plan before it reasons, the plan has to become part of the optimisation problem. Otherwise, the plan is just another generated prefix, and a bad prefix can make the model worse.

That distinction matters. Prompting a model to “think step by step” or “make a plan first” may improve some outputs, but it does not necessarily teach the model what a useful plan is. PTA-GRPO tries to close that gap by turning plan quality into a reward signal inside reinforcement learning.

The business translation is simple enough to be dangerous: enterprises should not copy the surface behaviour and force every AI workflow to emit a pretty outline. The lesson is deeper. If the intermediate structure matters, measure it. If it is not measured, do not pretend it is governance.

The real target is local reasoning myopia

The paper starts from a familiar weakness in chain-of-thought reasoning. LLMs generate text token by token. Even when the output looks like a coherent reasoning chain, the generation process is still local: each next token is conditioned on the previous context. That can preserve local fluency while failing at global direction.

For easy tasks, this is often fine. For long-horizon reasoning, it becomes expensive. The model may drift, repeat itself, follow a wrong early step, or “reflect” on a corrupted partial solution and confidently deepen the hole. Self-correction is charming until the thing being corrected is already nonsense. Then it becomes an internal audit department approving its own fraud.

Existing approaches attack this problem in several ways. Tree-search methods widen exploration by sampling and evaluating multiple reasoning paths. Reinforcement-learning methods such as GRPO reward correct final answers and use group-relative comparisons instead of a separate value model. Plan-and-act prompting separates high-level planning from execution.

PTA-GRPO is positioned between these families. It does not rely primarily on external search at inference time. It also does not settle for final-answer reward alone. Instead, it tries to internalise planning as a trainable behaviour.

The key move is to treat the plan as an optimisation target.

PTA-GRPO has two stages, and the first one is not optional decoration

PTA-GRPO begins with a supervised cold start called Planning Structured Reasoning Cold-Start, or PSR-CS. The authors sample 10,000 examples from OpenThoughts. Each example contains a problem and a chain-of-thought solution. A stronger model, identified in the paper as Qwen3-235B, summarises the reasoning into a compact analytic plan.

The resulting training example has three layers:

Layer	Role in the training example	Operational meaning
`<plan>`	A compact high-level solution outline	What the model intends to do
`<think>`	The detailed reasoning trajectory	How the model executes the plan
`<answer>`	The final result	What the model claims

This matters because the model is not asked to discover planning from final rewards alone. It is first shown what the plan-reason-answer structure should look like. The cold start gives the policy an initial behavioural shape before reinforcement learning begins.
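To make the three-layer structure concrete, here is a minimal sketch of a parser for it. The closing tags (`</plan>` and so on), the strict ordering, and the names `parse_response` and `PlanThinkAnswer` are assumptions made for illustration; the paper's own template handling may differ.

```python
import re
from typing import NamedTuple, Optional

TAGS = ("plan", "think", "answer")


class PlanThinkAnswer(NamedTuple):
    plan: str
    think: str
    answer: str


# Assumed layout: the three blocks in order, separated only by whitespace.
_TEMPLATE = re.compile(
    r"\s*<plan>(?P<plan>.*?)</plan>\s*"
    r"<think>(?P<think>.*?)</think>\s*"
    r"<answer>(?P<answer>.*?)</answer>\s*",
    re.DOTALL,
)


def parse_response(text: str) -> Optional[PlanThinkAnswer]:
    """Return the three layers, or None if the response is malformed."""
    # Reject missing or repeated tags before matching the overall layout.
    for tag in TAGS:
        if text.count(f"<{tag}>") != 1 or text.count(f"</{tag}>") != 1:
            return None
    match = _TEMPLATE.fullmatch(text)
    if match is None:
        return None
    return PlanThinkAnswer(*(match.group(tag).strip() for tag in TAGS))
```

A response that skips the plan, repeats a tag, or puts the layers out of order comes back as `None`, which is the kind of signal a structural check can build on.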

The paper’s SFT comparison supports this. For Qwen2.5-7B-Instruct, SFT without analytic planning averages 45.03 across the four math benchmarks, while SFT with planning reaches 47.43. For Qwen3-8B, the same comparison moves from 75.92 to 77.46. These are not enormous gains, but they are directionally consistent across the reported benchmarks.

That is the first important interpretation: the plan format is not magic. It is scaffolding. It gives the later RL stage something meaningful to optimise.

The reward asks a blunt question: did this plan lead to correct reasoning?

The second stage is Plan Structure-Guided Reinforcement Learning, or PSG-RL. This is where PTA-GRPO modifies GRPO.

Standard GRPO samples multiple responses for a question, scores them using verifiable rewards, and updates the policy using group-relative advantages. In reasoning tasks, the reward is usually concentrated on final correctness. Correct answer: good. Incorrect answer: bad. The model can improve, but the reward does not directly distinguish a clean reasoning trajectory from a lucky stumble.
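As a rough illustration of "group-relative", the sketch below normalises each response's reward against the other responses sampled for the same question. This is the commonly described form of the GRPO advantage; the use of the population standard deviation and the `eps` stabiliser are assumptions here, not details taken from the paper.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Score each response relative to its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (assumption)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question: two correct, two incorrect.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

With a 0/1 correctness reward, the two correct responses receive an advantage close to +1 and the two incorrect ones close to -1. If every response in the group earns the same reward, all advantages collapse to zero and the group carries no learning signal.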

PTA-GRPO adds a plan-quality reward. The clever part is that the paper does not pretend to directly grade the semantic beauty of a plan. That would require another judge, which would introduce its own problems. Instead, the method uses a practical surrogate.

For each question, the policy samples candidate analytic plans. For each plan, it samples multiple detailed chain-of-thought trajectories conditioned on that plan. The plan reward is based on the empirical accuracy of the outcomes generated under that plan. In plainer English: a plan is treated as good if it repeatedly helps the model reach the correct answer.

That is a useful idea because it converts plan quality from vibes into observed downstream usefulness. It is still a proxy, but it is at least a proxy with teeth.
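In code, the surrogate is little more than a success rate. The sketch below assumes each trajectory has already been checked against the reference answer; the function name and the toy data are illustrative, not the paper's.

```python
from typing import Dict, List


def plan_reward(outcomes: List[bool]) -> float:
    """Empirical accuracy of the trajectories sampled under one plan."""
    if not outcomes:
        raise ValueError("need at least one sampled trajectory per plan")
    return sum(outcomes) / len(outcomes)


# Toy data: correctness of four trajectories sampled under each candidate plan.
rollouts: Dict[str, List[bool]] = {
    "plan_a": [True, True, True, False],
    "plan_b": [False, True, False, False],
}
plan_rewards = {name: plan_reward(outcomes) for name, outcomes in rollouts.items()}
```

Here `plan_a` scores 0.75 and `plan_b` scores 0.25, so the first plan is preferred without anyone judging how elegant either plan reads.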

The total reward combines three components:

Reward component	What it encourages	Why it matters
Plan reward	Plans that guide successful reasoning trajectories	Makes planning trainable rather than decorative
Outcome reward	Correct final answers	Keeps the model anchored to task success
Format and length reward	Structured, parsable, concise outputs	Reduces malformed or bloated reasoning

The format reward is not just cosmetic. The appendix describes a structural reward for the required `<plan>`, `<think>`, and `<answer>` template, plus a length-related reward that nudges correct responses toward concise execution. In business terms, this is the difference between a system that merely “reasons” and a system whose intermediate artefacts are easy to parse, compare, and supervise.
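One plausible way to combine the components is a weighted sum, sketched below. Everything specific in it is hypothetical: the weights, the linear length bonus, the token budget, and the function names are placeholders for illustration, since this article does not reproduce the paper's exact formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights, chosen only to make the example readable.
    plan: float = 0.5
    outcome: float = 1.0
    format: float = 0.1
    length: float = 0.1


def length_bonus(num_tokens: int, max_tokens: int, correct: bool) -> float:
    """Assumed shape: only correct responses earn a bonus, and shorter ones earn more."""
    if not correct:
        return 0.0
    return max(0.0, 1.0 - num_tokens / max_tokens)


def total_reward(
    plan_accuracy: float,
    correct: bool,
    well_formed: bool,
    num_tokens: int,
    max_tokens: int = 4096,
    w: RewardWeights = RewardWeights(),
) -> float:
    """Weighted sum of plan, outcome, format, and length terms."""
    return (
        w.plan * plan_accuracy
        + w.outcome * float(correct)
        + w.format * float(well_formed)
        + w.length * length_bonus(num_tokens, max_tokens, correct)
    )
```

Under these placeholder weights, a correct, well-formed, 1,024-token response generated under a plan with 75% accuracy scores about 1.55, while an incorrect but well-formed response under a 25%-accuracy plan scores about 0.225.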

That said, the plan reward is the conceptual centre. PTA-GRPO is not simply “GRPO with tags.” It is GRPO with a second object of optimisation: the plan that guides the reasoning.

The main benchmark result is stronger on weaker models and harder tasks

The paper evaluates PTA-GRPO on four mathematical reasoning benchmarks: MATH500, AIME24, AIME25, and AMC23. It uses four language-model backbones: Qwen2.5-7B-Instruct, LLaMA3.2-3B, Qwen3-8B, and Qwen3-14B. The main comparison includes the base models, GRPO, DAPO, and several other post-training methods depending on the backbone.

The average results are the cleanest way to read the main table:

Base model block	Base average	GRPO average	DAPO average	PTA-GRPO average	Best interpretation
Qwen2.5-7B-Instruct	34.23	54.81	55.04	58.46	Strong gain over outcome-only RL baselines
LLaMA3.2-3B	15.79	36.63	36.98	40.37	Strongest relative story for weaker backbones
Qwen3-8B	74.21	78.59	76.85	79.10	Modest average gain; some saturation
Qwen3-14B	82.17	83.08	83.31	85.01	Positive but smaller incremental gain

The result is not a perfect sweep across every individual cell. For Qwen3-8B on MATH500, GRPO slightly exceeds PTA-GRPO. For Qwen3-14B on AMC23, DAPO is higher. This is not a fatal issue; it is the sort of detail that prevents a benchmark table from turning into fan fiction.

The more useful pattern is that PTA-GRPO’s gains are larger where the model has more room to improve. Smaller or weaker models benefit more. Harder long-horizon tasks benefit more consistently. The appendix’s statistical-significance analysis reports 13 significant improvements out of 16 model-task pairs when comparing PTA-GRPO with GRPO. It also notes that gains are especially pronounced for Qwen2.5-7B and LLaMA3.2-3B, while larger models show more modest or occasionally insignificant improvements.
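The pattern is easy to check from the averages in the table above. The snippet below only redoes the subtraction; the dictionary layout and variable names are this article's own.

```python
# Average scores copied from the main results table above.
averages = {
    "Qwen2.5-7B-Instruct": {"grpo": 54.81, "dapo": 55.04, "pta_grpo": 58.46},
    "LLaMA3.2-3B": {"grpo": 36.63, "dapo": 36.98, "pta_grpo": 40.37},
    "Qwen3-8B": {"grpo": 78.59, "dapo": 76.85, "pta_grpo": 79.10},
    "Qwen3-14B": {"grpo": 83.08, "dapo": 83.31, "pta_grpo": 85.01},
}

# Absolute gain of PTA-GRPO over GRPO, in average points.
gain_over_grpo = {
    model: scores["pta_grpo"] - scores["grpo"] for model, scores in averages.items()
}

for model, gain in gain_over_grpo.items():
    print(f"{model}: +{gain:.2f}")
```

The gains over GRPO come out at 3.65 and 3.74 points for Qwen2.5-7B-Instruct and LLaMA3.2-3B, against 0.51 and 1.93 points for Qwen3-8B and Qwen3-14B.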

That is exactly what we should expect if planning guidance is reducing search and execution errors. Once a model is already strong and a benchmark is closer to saturation, the marginal value of explicit planning shrinks. There is only so much glory in teaching a competent model to write a better shopping list for a task it already knows how to do.

The ablations say the method is a system, not a single trick

The ablation table is important because it asks whether the gains come from the full pipeline or from one convenient ingredient.

The reported full PTA-GRPO configuration reaches an average of 58.46 on the four math benchmarks using Qwen2.5-7B-Instruct. Removing the SFT cold start drops the average to 46.33. Two reward-disabled variants report lower averages of 57.32 and 54.56.

The exact table labels are visually compressed in the HTML rendering, but the directional evidence is clear: the cold start matters, and the reward design matters. The system works best when planning is first demonstrated and then reinforced.

That is a more credible claim than “we added a planning prompt and scores went up.” The paper is making a pipeline argument:

1. Teach the model the structure of plan-then-reason outputs.
2. Generate multiple candidate plans.
3. Evaluate plans by the success rate of reasoning trajectories conditioned on them.
4. Reward final correctness and structured execution.
5. Update the policy so planning becomes part of the model’s learned behaviour.
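The middle of that pipeline, from sampling plans to attaching reward signals, can be sketched as a single rollout loop. This is a toy under stated assumptions: `sample_plan` and `sample_cot` stand in for calls to the policy model, the record fields are invented for illustration, and the cold start and the policy update are left out entirely.

```python
from typing import Callable, Dict, List, Tuple


def rollout_group(
    question: str,
    sample_plan: Callable[[str, int], str],
    sample_cot: Callable[[str, str, int], Tuple[str, bool]],
    num_plans: int,
    cots_per_plan: int,
) -> List[Dict[str, object]]:
    """Sample plans, sample trajectories under each plan, and label every trajectory."""
    records: List[Dict[str, object]] = []
    for i in range(num_plans):
        plan = sample_plan(question, i)
        cots = [sample_cot(question, plan, j) for j in range(cots_per_plan)]
        # Plan signal: share of this plan's trajectories that ended correctly.
        plan_accuracy = sum(ok for _, ok in cots) / cots_per_plan
        for text, ok in cots:
            records.append(
                {"plan": plan, "cot": text, "correct": ok, "plan_reward": plan_accuracy}
            )
    return records


# Deterministic stand-ins for the policy model (hypothetical).
def toy_plan(question: str, i: int) -> str:
    return f"plan-{i}"


def toy_cot(question: str, plan: str, j: int) -> Tuple[str, bool]:
    successes = 3 if plan == "plan-0" else 1  # plan-0 is the better plan
    return f"{plan}/cot-{j}", j < successes


records = rollout_group("toy question", toy_plan, toy_cot, num_plans=2, cots_per_plan=4)
```

The total number of sampled trajectories is `num_plans * cots_per_plan`, and each one comes out as a record carrying both its own outcome and its plan's score, so the same samples can serve the plan reward and the policy update.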

Each piece has a role. Remove enough structure and the mechanism weakens.

Scaling helps, but it is not linear fairy dust

The paper also tests RL data scaling for Qwen2.5-7B-Instruct. As the RL training set grows from 4,000 to 60,000 examples, average performance rises from 48.94 to 58.46.

RL data scale	Average score
4k	48.94
8k	50.60
11k	51.86
14k	53.01
30k	55.37
60k	58.46

The trend supports the idea that PTA-GRPO continues to benefit from more RL data. But the gains are not perfectly linear, and they should not be read as a universal scaling law. This is one model family, one training setup, and a math-heavy evaluation regime.
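A quick way to see the non-linearity is to compute the gain per additional thousand examples between adjacent rows of the table. The snippet below is plain arithmetic on the reported numbers; the variable names are this article's own.

```python
# (RL examples in thousands, average score) from the scaling table above.
scaling = [(4, 48.94), (8, 50.60), (11, 51.86), (14, 53.01), (30, 55.37), (60, 58.46)]

# Average points gained per extra 1k examples within each interval.
marginal = {
    f"{k0}k->{k1}k": (s1 - s0) / (k1 - k0)
    for (k0, s0), (k1, s1) in zip(scaling, scaling[1:])
}

for interval, gain in marginal.items():
    print(f"{interval}: {gain:.3f} points per 1k examples")
```

The rate falls from roughly 0.4 points per thousand examples between 4k and 14k to roughly 0.1 between 30k and 60k. Performance keeps improving, but each additional example buys less.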

For practitioners, the lesson is not “throw more data at planning.” It is that process-level rewards may scale usefully when the task has verifiable outcomes. That last clause is doing a lot of work. Math benchmarks give clean correctness signals. Enterprise work often does not.

A compliance memo, procurement recommendation, or investment memo may not have a single verifiable answer sitting conveniently in the dataset. The more ambiguous the target, the harder it becomes to reward plan quality without importing a fragile evaluator.

The appendix mostly tests robustness, not a second thesis

Several appendix results are worth separating by purpose. They do not all prove the same thing.

Test	Likely purpose	What it supports	What it does not prove
SFT with vs without planning	Ablation / supervision test	Planning examples improve cold-start reasoning structure	Planning alone is enough
RL data scaling	Scaling robustness	More RL data improves this setup	Universal scaling across domains
Training-time comparison	Implementation-cost check	PTA-GRPO has similar reported training time to GRPO/DAPO under the tested rollout budget	Lower total engineering cost in production
Self-generated plans	Robustness test	The method remains competitive without advanced-model-generated plans	Advanced plan generation is never useful
Sensitivity to reward weight	Hyperparameter sensitivity	Moderate weighting performs best on average	One reward setting fits all tasks
Sampled CoTs per plan	Sampling sensitivity	More CoTs per plan improve plan-reward estimation in this setup	Unlimited sampling would keep paying off
Statistical-significance table	Reliability check	Gains over GRPO are often significant, especially for weaker models and harder tasks	Every benchmark cell is materially superior
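The "Sampled CoTs per plan" row has a simple statistical reading, offered here as this article's interpretation rather than the paper's analysis. If each trajectory under a plan is treated as an independent pass/fail draw with the same success probability (a simplifying assumption), the plan reward is a sample proportion, and its standard error shrinks with the square root of the sample count.

```python
import math


def plan_reward_std_error(p: float, n: int) -> float:
    """Standard error of a sample proportion under an assumed binomial model."""
    return math.sqrt(p * (1.0 - p) / n)


# A plan whose true success rate is 50%, estimated from n sampled trajectories.
for n in (2, 4, 8, 16):
    print(f"n={n}: standard error {plan_reward_std_error(0.5, n):.3f}")
```

Going from 4 to 16 trajectories per plan halves the error, from 0.25 to 0.125, at four times the sampling cost. That is one reason more CoTs per plan can help while unlimited sampling is unlikely to keep paying off.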

The training-time result is particularly relevant. The paper reports that for Qwen2.5-7B-Instruct, GRPO takes 44.7 hours, DAPO 47.5 hours, and PTA-GRPO 44.9 hours. For Qwen3-8B, GRPO takes 59.7 hours, DAPO 66.9 hours, and PTA-GRPO 61.4 hours. The authors attribute this to using the same rollout budget and reusing the sampled CoTs for policy updates.
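Expressed as relative overhead, the reported hours look like this. The snippet is arithmetic on the figures above; the helper name `overhead_pct` is illustrative.

```python
# Reported training time in hours, copied from the paragraph above.
hours = {
    "Qwen2.5-7B-Instruct": {"GRPO": 44.7, "DAPO": 47.5, "PTA-GRPO": 44.9},
    "Qwen3-8B": {"GRPO": 59.7, "DAPO": 66.9, "PTA-GRPO": 61.4},
}


def overhead_pct(model: str, baseline: str = "GRPO") -> float:
    """Extra PTA-GRPO training time relative to a baseline, in percent."""
    h = hours[model]
    return 100.0 * (h["PTA-GRPO"] - h[baseline]) / h[baseline]


for model in hours:
    print(
        f"{model}: {overhead_pct(model):+.1f}% vs GRPO, "
        f"{overhead_pct(model, 'DAPO'):+.1f}% vs DAPO"
    )
```

PTA-GRPO comes out at roughly 0.4% and 2.8% slower than GRPO on the two backbones, and faster than DAPO on both.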

That is an important operational claim: the plan reward is not presented as a major extra sampling burden under their setup. Still, “similar training time” is not the same as “same implementation complexity.” PTA-GRPO requires data construction, plan formatting, reward plumbing, sampling design, and hyperparameter tuning. The compute clock is only one line item. Conveniently, it is also the one line item engineers like to quote when pretending integration is free.

The multimodal and science results extend the claim, cautiously

The paper also evaluates PTA-GRPO on multimodal and science benchmarks using Qwen2.5-7B-VL. It compares the base model, SRPO, and PTA-GRPO across MMMU-Pro, MMMU, EMMA, and physics, chemistry, and biology subsets.

PTA-GRPO outperforms SRPO in the reported table:

Benchmark	Base	SRPO	PTA-GRPO
MMMU-Pro	36.9	42.3	44.7
MMMU	54.3	57.1	59.0
EMMA	21.5	29.6	31.9
Physics	45.4	56.2	58.5
Chemistry	56.4	65.2	68.7
Biology	54.0	65.2	66.8

This broadens the evidence beyond pure text math. It suggests that plan-guided reinforcement can help reasoning tasks where inputs include visual or scientific structure. But it should still be read as benchmark generalisation, not enterprise generalisation.

There is a difference between “works on multimodal benchmarks” and “is ready to inspect factory defects, legal exhibits, insurance claims, or clinical images.” The former is promising. The latter requires domain-specific validation, liability design, data governance, and a very different tolerance for plausible nonsense.

The self-plan result matters for deployment economics

One appendix result deserves more business attention than it will probably get. PTA-GRPO normally uses plans summarised by a stronger model during data construction. That raises an obvious question: is the method dependent on expensive teacher-model planning?

The paper tests a self-generated-plan variant. The full PTA-GRPO setup averages 58.46. The self-plan variant averages 57.37. The gap is visible but not dramatic.

That matters because a production training pipeline that depends heavily on a frontier teacher model may be more expensive, slower, and harder to control. If a policy model can generate usable plans for training with only a limited performance drop, the method becomes more operationally plausible.

The caveat is equally important. The self-plan result is reported within the paper’s experimental conditions. It does not prove that smaller models can self-bootstrap high-quality plans in every domain. It says the dependency on an advanced planner may be weaker than feared in this benchmark setting.

Good news, then, but not a blank cheque. Finance departments may now resume breathing.

The business lesson is process-aware optimisation

The business value of PTA-GRPO is not that future enterprise models should always expose their plans to users. In many settings, exposing raw internal reasoning is undesirable, noisy, or unsafe. The value is that intermediate process structure can be trained, scored, and audited as part of model development.

That changes how organisations should think about AI workflows.

A typical enterprise AI deployment evaluates outputs: did the answer match the expected classification, summary, code patch, recommendation, or ticket resolution? This is necessary, but often insufficient. For complex work, two outputs can be equally correct while one is brittle, verbose, poorly grounded, or impossible to inspect. Conversely, a wrong final answer may come from a mostly sound process with one repairable failure point.

PTA-GRPO points toward a more mature evaluation pattern:

Business workflow	Process artefact worth scoring	Why final-answer scoring is not enough
Compliance triage	Issue taxonomy, rule mapping, evidence plan	Correct labels without traceability are hard to defend
Data analysis	Analysis plan, variable selection, test sequence	A correct chart can hide a fragile method
Coding agents	Patch plan, dependency impact map, test plan	Passing visible tests may miss architectural damage
Scientific review	Hypothesis decomposition, evidence hierarchy	A fluent conclusion may ignore weak experimental support
Procurement or strategy support	Option framing, constraint mapping, risk sequence	A recommendation without decision structure is boardroom confetti

The inference is not that PTA-GRPO directly solves these business workflows. The paper does not test enterprise agents, real compliance systems, messy databases, or human approval loops. The inference is that plan-quality rewards are a useful design pattern where intermediate structure predicts downstream success and where the organisation can define verifiable or at least defensible evaluation criteria.

That is the bridge from benchmark method to business practice: not “use this exact algorithm tomorrow,” but “stop treating intermediate reasoning as invisible exhaust.”

What the paper directly shows, and what Cognaptus infers

It is worth separating the evidence from the interpretation.

Layer	Statement	Confidence
Direct paper result	PTA-GRPO improves average benchmark performance over GRPO/DAPO and other baselines across the reported math model blocks	High within the reported setup
Direct paper result	Planning-aware SFT improves over SFT without planning for the two reported model blocks	Moderate to high within the reported setup
Direct paper result	The method shows multimodal and science benchmark gains over SRPO using Qwen2.5-7B-VL	Moderate within the reported setup
Direct paper result	Training time is close to GRPO/DAPO under the authors’ sampled-response budget	Moderate; implementation cost is not fully captured
Cognaptus inference	Process-level rewards may help enterprise AI systems where intermediate plans predict output reliability	Plausible but unproven
Cognaptus inference	Smaller or weaker models may benefit more from explicit planning supervision	Supported by the paper’s pattern, but domain-dependent
Uncertain	Whether the method transfers to open-ended business tasks without clean labels	Not established
Uncertain	Whether plan rewards improve factuality, safety, tool use, or accountability by themselves	Not established

This separation matters because AI papers often tempt readers into cargo-cult adoption. A method improves AIME scores, therefore your procurement assistant should produce a `<plan>` tag before buying steel. No. Please do not run a company like a benchmark leaderboard with stationery.

The paper supports a narrower but more valuable point: if a workflow depends on multi-step reasoning, and if intermediate structure can be evaluated, then training that structure may outperform merely rewarding final outputs.

The limitation is not planning. It is verifiability.

PTA-GRPO works best where plan usefulness can be estimated by downstream correctness. This is natural for mathematical reasoning because the final answer can be checked. It is less natural for tasks where correctness is delayed, subjective, contested, or multi-stakeholder.

A business plan can be coherent and still fail because the market moved. A legal argument can be well structured and still lose because the judge disagreed. A due-diligence memo can identify the right risks and still be commercially inconvenient, which is usually when everyone suddenly discovers “strategic alignment.”

This creates three boundaries.

First, PTA-GRPO does not remove the need for domain-specific reward design. If the reward is shallow, the model will optimise shallowly. A beautiful plan leading to a superficially approved answer is still a failure.

Second, explicit plans are not automatically faithful explanations. A generated plan can be useful as a control surface, but it is still model output. It should be validated against behaviour, evidence, and outcomes.

Third, stronger planning capability is dual-use. The paper’s impact statement notes that improved structured reasoning can help education, programming, science, and multimodal reasoning, but may also support misuse in complex problem-solving contexts. That is not a reason to avoid planning research. It is a reason to avoid pretending capability gains are governance gains.

The practical takeaway: reward the outline, or stop worshipping it

The paper’s most useful message is not “LLMs should plan.” It is that planning must be operationalised.

For AI builders, that means designing training and evaluation systems where intermediate artefacts are not merely displayed but tested. A plan should earn trust because it repeatedly improves execution, not because it looks managerial.

For enterprises, the implication is even more pointed. If an AI system is used for complex work, ask what intermediate structure predicts success. Then ask whether that structure is measured. If the answer is no, the organisation does not have process-aware AI. It has answer generation with nicer formatting.

PTA-GRPO is an academic method tested mostly on benchmark reasoning, not a turnkey enterprise architecture. But its mechanism is commercially legible: outcome-only optimisation leaves money on the table when the route matters. Process-aware optimisation can make models more reliable, more concise, and easier to diagnose.

The old prompt-engineering ritual was: “Think step by step.”

The better training question is: “Which steps actually help?”

That is where the profit is hiding. Not in the outline. In scoring whether the outline earns its keep.

Cognaptus: Automate the Present, Incubate the Future.

¹ Zhihao Dou, Qinjian Zhao, Zhongwei Wan, Dinggen Zhang, Weida Wang, Benteng Chen, Towsif Raiyan, Qingtao Pan, Yang Ouyang, Chaoda Song, Zhiqiang Gao, Shufei Zhang, and Sumon Biswas, “Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning,” arXiv:2510.01833v2, 25 May 2026, https://arxiv.org/abs/2510.01833.
