A busy agent is not necessarily a thinking agent.
Anyone who has watched an LLM agent narrate every tiny move knows the feeling. It reviews the goal. It drafts a plan. It revises the plan. It reconsiders the revision. Then, with exquisite deliberation, it clicks the wrong button. The transcript looks intelligent; the behaviour looks like a consultant trapped in a revolving door.
The paper “Learning When to Plan: Efficiently Allocating Test-Time Compute for LLM Agents” takes aim at this habit directly.1 Its central finding is simple enough to be dangerous: in long-horizon agent tasks, more explicit planning is not monotonically better. Agents that never plan underperform. Agents that plan before every action also underperform. Somewhere in between sits a “Goldilocks” frequency where planning helps without turning into self-inflicted friction.
That matters because much of the current agent stack still treats reasoning traces as a cheap garnish. Add chain-of-thought. Add ReAct. Add reflection. Add a planner. Add a second planner to judge the first planner, because apparently the first committee was insufficiently theatrical. The paper’s useful correction is that planning is not free intelligence. It is test-time compute, paid in tokens, latency, instability, and context pollution.
The business lesson is not “make agents think more.” It is “make agents spend thinking where it buys control.”
Planning has an advantage, but also a carrying cost
The paper’s most useful contribution is not a benchmark score. It is the mechanism it proposes for deciding when an agent should plan.
The authors frame an LLM agent operating in a partially observable sequential environment. At each step, the agent sees an observation, carries history, may have an existing plan, and must choose an action. In their conceptual decomposition, the agent has three roles:
| Component | What it decides | Operational interpretation |
|---|---|---|
| Decision policy | Whether to generate a new plan now | Should the agent spend extra compute? |
| Planning policy | What the plan should be | What strategic context should guide action? |
| Acting policy | Which action to take | What should be done next? |
Importantly, these are not three separate models. The implementation is a single monolithic LLM. If the model starts its output with a <plan> block, the system treats that as a decision to plan. If it does not, the model simply acts using the current context and any existing plan. This is elegant in the understated way that useful engineering often is: the “planner” is not bolted on as a decorative organ. It is a learned output mode.
The framework says an agent should replan only when the expected gain exceeds the cost. The gain is the planning advantage: the improvement in expected future reward from adopting a new plan rather than continuing with the old one. The costs have several flavours.
The first is obvious: more planning means more generated tokens. This is measurable and can be penalised directly.
The second is latency. The paper’s environments are turn-based, so latency is not the main experimental bottleneck. In business workflows, however, latency is rarely theoretical. A support agent, trading assistant, dispatch planner, or warehouse robot that spends too long “thinking” may produce a better paragraph and a worse service level.
The third cost is the one executives tend to underestimate: instability. Frequent replanning can make behaviour worse. The agent keeps changing its mind, revisiting previous states, abandoning good partial strategies, or oscillating between subgoals. The transcript looks reflective. The trajectory looks drunk.
This is where the paper’s mechanism-first framing earns its keep. Planning should not be measured only by whether the text of the plan sounds sensible. It should be measured by whether the plan improves the action trajectory net of its costs. That is a much colder test, and therefore a better one.
The Goldilocks result is a curve, not a slogan
The authors first test planning frequency without fine-tuning. They evaluate Llama-3.3-70B-Instruct in two environments: POGS, a custom partially observable graph-search task designed to isolate planning under uncertainty, and Crafter, a Minecraft-like grid-world requiring long-horizon survival, resource gathering, and crafting.
They compare a never-plan agent with fixed-frequency planners that generate a plan every $k$ steps. Each setting is evaluated over 100 seeds. This is main evidence: it tests whether planning frequency itself changes performance before introducing the learned dynamic planner.
The result is non-monotonic. Planning at intermediate intervals beats both extremes. Never planning leaves the agent too reactive. Always planning produces too much instability and token cost. In POGS, the appendix adds a useful diagnostic: always-planning agents show the highest backtracking. They do not merely spend more tokens; they revisit states more often, suggesting inefficient oscillation rather than disciplined exploration.
This matters because the naive theory of test-time compute says performance should rise as reasoning budget rises, perhaps with diminishing returns. The paper’s agent setting breaks that intuition. Long-horizon interaction adds plan drift, environmental surprises, and commitment costs. A plan can be stale, but replanning every second can be worse than stale. It is strategy by fidgeting.
The appendix extends the zero-shot analysis across Llama-3.1-8B, Llama-3.3-70B, Gemini 2.5 Flash, and TextWorld tasks. These are robustness and sensitivity tests, not the paper’s core proof. They support the broader claim that Goldilocks-like behaviour appears across architectures and domains, while also showing that the curve’s shape depends on the model and environment. Gemini, for example, appears strong enough on some simpler settings that more planning can become mostly overhead. On harder or noisier settings, planning provides moderate benefits but does not magically solve the task.
That is the right interpretation: the paper does not discover one universal planning interval. It discovers that planning frequency is an optimisation variable.
Prompting “plan only when needed” is not enough
One of the more practical findings is also one of the least glamorous: complex prompting did not reliably produce dynamic planning. The authors report that attempts to make models adaptively plan through instructions such as “plan only when necessary” were unreliable. Models tended to collapse into fixed patterns: always plan, never plan, or follow some regular rhythm.
That should sound familiar. Many agent systems currently rely on prompt-level governance for behaviours that are actually policy-level skills. “Be concise.” “Use tools only when helpful.” “Ask for clarification when necessary.” “Plan only when needed.” These are neat instructions. They are not necessarily learned control policies.
The paper’s replacement is a two-stage training pipeline.
First, supervised fine-tuning primes an 8B Llama-3.1-Instruct model on 1,024 Crafter trajectories generated by a 70B teacher. The teacher uses 16 different planning prompts and varied planning frequencies. The dynamic SFT target includes explicit <plan> blocks at planning steps and action-only outputs at non-planning steps. A matched no-plan version uses the same underlying action sequences but strips out the plan text.
That comparison is important. It isolates the role of plans during imitation learning. The action data are held constant; the difference is whether the model sees plans as part of the behavioural trace.
The dynamic SFT model improves task progression relative to the action-only model and shows lower divergence from the base model. The authors interpret plans as useful rationales that make actions easier to imitate, expose the model to planning structure, and regularise learning. That is plausible, and the ablations add nuance: appendix experiments suggest that asking the model to perform fuller world modelling during SFT is a distraction, increasing divergence and harming performance. The model benefits from planning traces, not from being asked to reconstruct everything about the environment like an overachieving intern with unlimited stationery.
Second, the authors apply reinforcement learning, using PPO, to refine the agent in Crafter. This is where the pipeline becomes more than imitation. RL is supposed to teach the agent when planning pays off in actual task reward.
The key contrast is sharp. A base model trained with RL and a dynamic planning prompt performs worse than a base RL no-plan agent. In other words, RL alone does not reliably invent good dynamic planning from a cold start. With SFT priming, however, the dynamic planning agent becomes more sample-efficient and outperforms the no-plan alternative. The planning behaviour needs scaffolding before RL can optimise it.
This is a useful warning for agent builders. “Let RL figure it out” is not a product strategy. It is a bill from your GPU provider with philosophical footnotes.
The best result is smaller, trained, and cheaper to run
The most business-readable result appears in the efficiency comparison.
The strongest zero-shot 70B configuration in Crafter plans every four steps and reaches a reward of 0.379 while generating 11,510.9 output tokens. The fine-tuned 8B SFT+RL dynamic planning agent reaches 0.387 reward while generating 1,714.3 output tokens. The paper summarises this as an 8B trained dynamic agent outperforming the best 70B zero-shot baseline while using far fewer tokens.
| Method | Size | Strategy | Reward | Output tokens |
|---|---|---|---|---|
| Zero-shot | 70B | Never plan | 0.343 | 559.6 |
| Zero-shot | 70B | Plan every 4 steps | 0.379 | 11,510.9 |
| Zero-shot | 70B | Always plan | 0.349 | 38,836.2 |
| SFT+RL | 8B | Never plan | 0.298 | 878.0 |
| SFT+RL | 8B | Dynamic planning | 0.387 | 1,714.3 |
The comparison should not be abused. This is not a general proof that every fine-tuned 8B agent beats every 70B agent. It is a result in Crafter under this setup. But it is still commercially interesting because it highlights a more mature view of agent economics.
The cheap version of the lesson is “small models can beat big models.” That is not quite it.
The better lesson is: training a smaller model to allocate cognition can beat prompting a larger model to spend cognition indiscriminately.
For businesses, that distinction matters. Model size is only one line item. Token volume, latency, reliability, observability, and intervention cost also matter. A large model that reasons constantly may appear robust in demos but become expensive and brittle in production. A smaller agent that knows when to surface explicit reasoning can be cheaper, faster, and easier to supervise.
That is the real ROI pathway: not merely fewer tokens, but fewer useless tokens.
Explicit plans are useful even when the agent learns to internalise them
One of the more subtle findings comes from the planning-cost penalty ablation. The authors experiment with penalties for plan token cost. As penalties increase, agents reduce planning frequency and plan length, but final normalised score is largely unaffected after sufficient training.
This result is easy to misread. It does not mean explicit planning is useless. It suggests that after training, some planning behaviour may become internalised in the policy. The agent no longer needs to externalise every strategic step to perform acceptably. In human terms, the beginner writes the checklist. The expert still knows the checklist, but does not recite it before tying shoelaces.
For business systems, that creates a design tension.
Implicit planning is efficient. Explicit planning is governable.
If an agent is acting inside a low-risk sandbox, implicit competence may be enough. If it is approving refunds, changing ad budgets, routing shipments, or recommending maintenance actions, operators may want a visible control surface. A plan is not just cognition. It is an interface.
The paper’s human-steering results make this point concrete. Autonomous agents did not fully solve Crafter under the reported compute constraints. But the SFT+RL dynamic planning agent could be guided by human-written plans and, in a human-in-the-loop setup, successfully complete Crafter by collecting diamonds. The appendix notes this took approximately 20 attempts in the human-collaboration setting and outperformed the best autonomous runs in the Best-of-N comparison.
The claim should be kept in proportion. This is not enterprise-grade human-AI collaboration solved in a Minecraft-like world. It is evidence that a model trained to produce and follow plans becomes more steerable through the same plan interface. The agent learns not only to write plans, but to condition action on plan-like structure supplied from outside.
That is the part product teams should notice. Explicit planning can serve three roles at once:
| Role of explicit plans | What the paper directly shows | Business interpretation |
|---|---|---|
| Performance scaffold | SFT with plan traces improves downstream learning compared with action-only traces | Planning examples may make agent behaviour easier to teach |
| Compute allocator | RL-trained dynamic agents learn to plan less frequently and more effectively | Reasoning budget can be policy-controlled rather than prompt-spammed |
| Control interface | Human-written plans steer the trained planning agent better than weaker baselines | Plans can become a human-operable handle for supervision and escalation |
A plan is therefore not merely “thought text.” It can be training signal, runtime budget decision, and governance surface. Conveniently, this is also a decent test for whether an agent architecture is serious or just wearing a lab coat.
The evidence stack: what each experiment is doing
The paper contains several experimental layers. They should not all be read as making the same claim.
| Evidence layer | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Zero-shot fixed-frequency planning in POGS and Crafter | Main evidence for the Goldilocks curve | Planning frequency has a non-monotonic effect on performance and cost | That a single universal planning interval exists |
| POGS backtracking analysis | Mechanism diagnostic | Excessive planning can produce oscillation and inefficient exploration | That backtracking explains all failures in richer environments |
| SFT dynamic vs SFT no-plan | Main evidence for plan traces as learning scaffold | Explicit plans improve imitation and reduce harmful drift from the base model | That any reasoning trace will help |
| World-modelling SFT ablation | Ablation | Extra environment reconstruction can distract from useful planning/action learning | That world models are generally bad for agents |
| RL after SFT | Main evidence for learned dynamic planning | SFT-primed agents can refine when and how to plan through RL | That RL alone can reliably discover planning from scratch |
| Planning-cost penalty ablation | Sensitivity/behavioural test | Agents can reduce explicit planning under cost pressure after training | That explicit plans are unnecessary for supervision |
| KL sensitivity analysis | Robustness test | The SFT+RL method works across intermediate KL settings, with failure modes at extremes | That optimisation is effortless across all models |
| Gemini/TextWorld extensions | Robustness extension | Goldilocks-like effects are not confined to one model/environment | That the same curve shape will appear everywhere |
| Human-plan steering | Exploratory extension and upper-bound probe | Trained planning agents can be guided by external plans to exceed autonomous runs | That real-world human-agent governance is solved |
This matters because the paper’s strongest contribution is not any one number. It is the convergence of mechanism and evidence: planning helps, planning costs, planning drifts, and planning can be learned as a runtime decision.
What this means for agent products
The immediate product implication is that agent systems need a planning policy, not merely a planning prompt.
In production, a well-designed agent should not expose only a binary setting: “reasoning on” or “reasoning off.” It should decide when explicit planning is justified. That decision can depend on uncertainty, novelty, task horizon, user stakes, tool cost, previous plan age, observed plan drift, and the need for human auditability.
A practical design pattern emerges:
-
Start with explicit plans for difficult or unfamiliar workflows. During early deployment, plans provide visibility and training data. They help operators see why the agent is doing what it is doing, and they create examples for later refinement.
-
Measure plan value, not plan eloquence. Track whether planning improves downstream task success, reduces retries, lowers tool misuse, or prevents escalation. A beautifully written plan that increases backtracking is not strategy. It is stationery.
-
Introduce plan-cost budgets. Tokens, latency, and tool calls should be budgeted. Planning should be allowed to expand when uncertainty rises and shrink when routines become predictable.
-
Treat replanning as a state-change response. Replan when the old plan is completed, contradicted, or made obsolete by new observations. Do not replan because another timestep has passed and the agent feels spiritually restless.
-
Keep explicit plans available as a control interface. Even when a trained agent can act implicitly, high-risk workflows may require visible plans for human steering, review, or override.
The business inference is clear but bounded: dynamic planning can reduce compute waste and improve controllability in long-horizon agent workflows. The paper directly demonstrates this in POGS, Crafter, and related text-game settings with Llama and Gemini models. It does not prove that the same training recipe will transfer unchanged to procurement approvals, clinical triage, financial execution, factory scheduling, or robotics.
Still, the shape of the problem transfers rather well. Enterprise workflows are full of partial observability, stale assumptions, interruptions, and long-horizon dependencies. That is exactly where always-planning becomes annoying and never-planning becomes reckless.
The boundary: game worlds are not businesses, but the cost curve is real
The limitations are worth stating cleanly.
First, the environments are artificial. POGS is deliberately synthetic. Crafter is richer but still a grid-world game. TextWorld adds another domain, but not an enterprise process with messy APIs, compliance constraints, angry users, and Friday-afternoon spreadsheets.
Second, the models and scales are specific. Fine-tuning focuses on Llama-3.1-8B-Instruct, while zero-shot comparisons use Llama-3.3-70B-Instruct and other models in appendix extensions. Scaling laws for optimal planning frequency remain open.
Third, the human-steering result is promising but exploratory. The successful diamond-collection run under human plans shows that the interface can matter. It does not tell us how to design governance protocols, approval thresholds, or accountability systems in production.
Fourth, evaluation of plan completion and adaptive replanning uses an LLM-as-a-judge. That is reasonable for qualitative plan analysis, and the paper discloses the setup, but it should not be treated as the same kind of evidence as environment reward.
Finally, the cost model is incomplete for real-time systems. The framework includes latency conceptually, but the experiments are turn-based. In live operations, latency is not a footnote. It is often the difference between automation and theatre.
These boundaries do not weaken the paper’s main point. They prevent us from laundering it into a universal agent recipe. A refreshing restraint, really.
The new default should be adaptive deliberation
The paper’s quiet provocation is that “reasoning” is too crude a product setting. Long-horizon agents need adaptive deliberation: the ability to decide when explicit thought is worth the bill.
That reframes test-time compute. It is not a magic lever where turning it up always helps. It is a resource allocation problem. Spend too little and the agent thrashes reactively. Spend too much and the agent drowns in its own revisions. Spend it at the right moments and a smaller trained model can outperform a larger prompted one at far lower token cost.
For Cognaptus readers, the practical takeaway is direct: do not buy agent intelligence by the kilogram. Architect when the agent should plan, when it should act, and when it should expose its reasoning for human control.
Planning is useful. Planning spam is not.
The future agent stack will not be the one that thinks the loudest. It will be the one that knows when thinking should become action.
Notes
Cognaptus: Automate the Present, Incubate the Future.
-
Davide Paglieri, Bartłomiej Cupiał, Jonathan Cook, Ulyana Piterbarg, Jens Tuyls, Edward Grefenstette, Jakob Nicolaus Foerster, Jack Parker-Holder, and Tim Rocktäschel, “Learning When to Plan: Efficiently Allocating Test-Time Compute for LLM Agents,” arXiv:2509.03581, 2025. https://arxiv.org/abs/2509.03581 ↩︎