TL;DR for operators

When an AI system has to execute a multi-step operational plan, the tempting move is to ask the LLM for the plan. This paper argues for a less glamorous and more useful pattern: let the LLM help shrink the search problem, then let a classical planner verify and compose the actual action sequence.[1]

The authors compare two ways of doing that. LLM4Inspire asks the model to choose from currently executable actions. It is the “what should I do next?” pattern. LLM4Predict asks the model to propose a constrained intermediate state between the current state and the goal. It is the “what milestone would make this easier to solve?” pattern. The second approach usually wins because it changes the shape of the search problem rather than merely nudging the next step.

The strongest result is not “LLMs can plan.” That would be the lazy headline, and also the wrong one. The paper’s system remains planner-first: LLM outputs are hints, candidate actions, or intermediate states, and accepted plans are still checked by Fast Downward inside a conflict-aware decomposition framework. The model is not promoted to operations director. It is hired as a search-space intern, which is probably where it belongs.

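Across IPC benchmark domains, LLM4Predict is the strongest planner-integrated method on Blocks, Logistics, and Depot, with 49/50, 42/42, and 19/22 successful instances respectively. It ties plain decomposition and LLM4Inspire on Logistics, and GPT-5 prompted on its own edges it on Blocks at 50/50 before falling well behind in the other domains. LLM4Predict does not improve over plain decomposition in Mystery, where action, predicate, and object names are randomised, which is exactly the point: language-model guidance is most useful when names still carry domain meaning.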

For enterprise use, the practical translation is straightforward. In logistics, warehouse planning, scheduling, field-service routing, compliance workflows, and other structured domains, LLMs should sit inside a verified planning stack. Let them suggest decompositions, milestones, or search hints. Do not let them own executability, constraint satisfaction, or final workflow approval unless the organisation enjoys debugging optimism at scale.

The familiar failure: the plan is valid-looking, not valid

Operations teams already know the difference between a plan that sounds sensible and a plan that can actually be executed.

A warehouse plan can say “load pallet A, dispatch truck B, then restock bay C.” That sentence may be fluent, tidy, and entirely impossible if truck B is already booked, pallet A is blocked behind another shipment, and bay C cannot receive inventory until a quality check clears. Natural language is forgiving. Operations are not.

Classical planners were built for the unforgiving part. Given an initial state, a goal state, and a domain model of valid actions, they search for an action sequence that actually transforms one into the other. The problem is scale. Add objects, goals, dependencies, and long action chains, and the search space expands brutally. Old-school planning does not fail because it lacks discipline. It fails because discipline becomes expensive.
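To make "initial state, goal state, and a domain model of valid actions" concrete, here is a minimal sketch of a forward breadth-first planner over STRIPS-style actions. It is not Fast Downward and not the paper's code. The toy warehouse domain, the atom names, and the helper names (`Action`, `bfs_plan`) are all hypothetical, chosen to echo the pallet example above.

```python
from collections import deque
from typing import NamedTuple

class Action(NamedTuple):
    name: str
    pre: frozenset      # atoms that must hold before the action
    add: frozenset      # atoms the action makes true
    delete: frozenset   # atoms the action makes false

def bfs_plan(init, goal, actions):
    """Return a shortest list of action names that reaches the goal, or None."""
    start, goal = frozenset(init), frozenset(goal)
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:                      # every goal atom holds
            return plan
        for a in actions:
            if a.pre <= state:                 # action is executable here
                nxt = (state - a.delete) | a.add
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, plan + [a.name]))
    return None                                # search space exhausted

# Hypothetical toy domain: restocking is blocked until a quality check clears.
ACTIONS = [
    Action("clear_qc", frozenset({"qc_pending"}),
           frozenset({"bay_open"}), frozenset({"qc_pending"})),
    Action("load_pallet", frozenset({"pallet_at_dock", "truck_free"}),
           frozenset({"pallet_on_truck"}), frozenset({"pallet_at_dock"})),
    Action("dispatch", frozenset({"pallet_on_truck"}),
           frozenset({"dispatched"}), frozenset({"truck_free"})),
    Action("restock", frozenset({"bay_open"}),
           frozenset({"bay_restocked"}), frozenset()),
]
INIT = {"qc_pending", "pallet_at_dock", "truck_free"}
GOAL = {"dispatched", "bay_restocked"}

plan = bfs_plan(INIT, GOAL, ACTIONS)
print(plan)  # a four-step plan; the interleaving depends on search order
```

The `seen` set is also where the scale problem lives: every extra object or goal multiplies the number of distinct states that breadth-first search may have to visit.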

The paper’s move is to avoid the false choice between “LLMs as magical planners” and “symbolic planners forever alone.” Instead, it asks where an LLM can reduce planner effort without corrupting planner guarantees. That placement question matters more than the branding. An LLM can be asked to generate a plan, select an action, propose a subgoal, rewrite a problem, rank options, or explain failure. These are not interchangeable roles. Some make the system safer. Some merely make the demo smoother.

This paper compares two roles that look similar from far away but behave differently inside the planning loop.

Two roles for the model: pick the next action, or split the problem

The paper’s comparison is between LLM4Inspire and LLM4Predict.

LLM4Inspire is the more familiar pattern. The system enumerates currently applicable actions from the domain model, gives those options to the LLM along with the current state, goal, and action history, and asks it to select a promising executable action. This makes the model a heuristic guide. It does not invent arbitrary moves. It chooses among actions that are already valid in the current state.
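A rough sketch of that action-selection loop follows. The LLM is replaced by a stub that follows a fixed preference list, and the sketch omits the hand-back to the symbolic planner that the paper's framework performs. The logistics toy domain and every name here (`inspire_step`, `make_stub_chooser`) are assumptions for illustration, not the authors' implementation.

```python
from typing import Callable, NamedTuple

class Action(NamedTuple):
    name: str
    pre: frozenset      # preconditions
    add: frozenset      # add effects
    delete: frozenset   # delete effects

def applicable_actions(state, actions):
    """Enumerate the actions whose preconditions hold in the current state."""
    return [a for a in actions if a.pre <= state]

def inspire_step(state, goal, history, actions, choose: Callable):
    """One step: the chooser only ever picks among currently valid actions."""
    options = applicable_actions(state, actions)
    if not options:
        return state, history                # dead end: nothing to select
    names = [a.name for a in options]
    picked = choose(state, goal, history, names)
    if picked not in names:                  # invalid or malformed reply
        picked = names[0]                    # fall back to a valid option
    action = next(a for a in options if a.name == picked)
    return (state - action.delete) | action.add, history + [action.name]

def make_stub_chooser(preferences):
    """Stand-in for an LLM call: returns the first preferred name on offer."""
    def choose(state, goal, history, names):
        for name in preferences:
            if name in names:
                return name
        return "not_a_real_action"           # simulates a bad reply
    return choose

ACTIONS = [
    Action("load", frozenset({"pkg_at_A", "truck_at_A"}),
           frozenset({"pkg_in_truck"}), frozenset({"pkg_at_A"})),
    Action("drive_A_B", frozenset({"truck_at_A"}),
           frozenset({"truck_at_B"}), frozenset({"truck_at_A"})),
    Action("unload", frozenset({"pkg_in_truck", "truck_at_B"}),
           frozenset({"pkg_at_B"}), frozenset({"pkg_in_truck"})),
]
GOAL = frozenset({"pkg_at_B"})
state, history = frozenset({"pkg_at_A", "truck_at_A"}), []
chooser = make_stub_chooser(["unload", "load", "drive_A_B"])

for _ in range(10):                          # hard step budget
    if GOAL <= state:
        break
    state, history = inspire_step(state, GOAL, history, ACTIONS, chooser)

print(history)  # ['load', 'drive_A_B', 'unload']
```

The design point is the membership check on `picked`: whatever the model replies, only an action that the domain model already considers executable can change the state.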

LLM4Predict is different. It asks the LLM to generate a small set of key predicates representing an intermediate state between the current state and the final goal. The planner then solves from the current state to that intermediate state, and from there toward the original goal. In other words, the LLM is not choosing the next step. It is proposing a milestone that may partition a hard problem into easier subproblems.
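The milestone pattern can be sketched the same way. In this hedged example a stub `propose` function stands in for the LLM, a plain breadth-first search stands in for the symbolic planner, and a bad milestone falls back to the undivided problem. The fallback policy, the toy domain, and all names are assumptions made for illustration; the paper's framework handles failure inside its own decomposition loop.

```python
from collections import deque
from typing import NamedTuple

class Action(NamedTuple):
    name: str
    pre: frozenset
    add: frozenset
    delete: frozenset

def search(start, target, actions):
    """Breadth-first search; returns (plan, reached_state) or None."""
    start, target = frozenset(start), frozenset(target)
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        state, plan = frontier.popleft()
        if target <= state:
            return plan, state
        for a in actions:
            if a.pre <= state:
                nxt = (state - a.delete) | a.add
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, plan + [a.name]))
    return None

def solve_via_milestone(init, goal, actions, propose):
    """Solve init -> milestone -> goal, keeping only searched (valid) segments."""
    milestone = propose(frozenset(init), frozenset(goal))
    first = search(init, milestone, actions)
    if first is not None:
        plan_a, mid_state = first
        second = search(mid_state, goal, actions)
        if second is not None:
            return plan_a + second[0]
    # Milestone unreachable or a dead end: solve the original problem directly.
    direct = search(init, goal, actions)
    return direct[0] if direct else None

ACTIONS = [
    Action("load", frozenset({"pkg_at_A", "truck_at_A"}),
           frozenset({"pkg_in_truck"}), frozenset({"pkg_at_A"})),
    Action("drive_A_B", frozenset({"truck_at_A"}),
           frozenset({"truck_at_B"}), frozenset({"truck_at_A"})),
    Action("unload", frozenset({"pkg_in_truck", "truck_at_B"}),
           frozenset({"pkg_at_B"}), frozenset({"pkg_in_truck"})),
]
INIT, GOAL = {"pkg_at_A", "truck_at_A"}, {"pkg_at_B"}

def good_milestone(state, goal):
    # Stand-in for the LLM: "get the loaded truck to B first".
    return frozenset({"pkg_in_truck", "truck_at_B"})

print(solve_via_milestone(INIT, GOAL, ACTIONS, good_milestone))
# ['load', 'drive_A_B', 'unload']
```

Note what the stub returns: a handful of key predicates, not a full state and not an action. The planner fills in everything else.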

The distinction is worth spelling out because it maps cleanly onto enterprise agent design.

| Pattern | What the LLM does | What the planner does | Operational interpretation |
| --- | --- | --- | --- |
| LLM4Inspire | Selects an executable action from available actions | Verifies and continues solving from the updated state | “Suggest the next valid move.” |
| LLM4Predict | Proposes an intermediate state or subgoal | Solves verified subproblems around that state | “Suggest a useful milestone.” |
| LLM-only baseline | Generates an action sequence directly | No symbolic planning authority in the loop | “Trust the fluent plan.” Usually adorable. Also risky. |

The comparison matters because action selection and problem splitting attack different sources of difficulty. If the search tree is too wide, a good action hint can help. If the plan is too deep or structurally tangled, a good intermediate state can be more valuable because it changes the route through the search space.

The authors describe this as a difference between skipping levels in the search tree and partitioning the search space. LLM4Inspire compresses the path by choosing actions that may jump the system toward a better region. LLM4Predict tries to create a shorter bridge: solve to a midpoint, then solve from the midpoint onward. When the midpoint is well chosen, the complexity reduction is not cosmetic. It can be exponential because two shorter searches can be far cheaper than one long search.
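A back-of-envelope calculation shows why "exponential" is not an exaggeration. The cost model below is an idealised assumption for illustration (uniform branching factor b, solution depth d, uninformed search, a milestone exactly halfway), not the paper's own analysis.

```latex
% Idealised cost model: uniform branching factor b, solution depth d,
% uninformed search, and a well-chosen milestone at depth d/2.
\begin{align*}
\text{one long search:} \quad & N_{\text{direct}} \approx b^{d} \\
\text{two half-length searches:} \quad & N_{\text{split}} \approx 2\, b^{d/2} \\
\text{ratio:} \quad & \frac{N_{\text{direct}}}{N_{\text{split}}} \approx \tfrac{1}{2}\, b^{d/2} \\
\text{for example, } b = 10,\ d = 20: \quad & 10^{20} \ \text{versus}\ 2 \times 10^{10}
\end{align*}
```

Real planners use heuristics, so actual node counts are far lower on both sides, but the shape of the saving is the same: the exponent is what gets cut.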

That is the central business lesson hiding inside the technical machinery: the value of an LLM in planning may come less from “reasoning harder” and more from making the formal solver’s job smaller.

The planner still does the adult supervision

Before the LLM enters the loop, the framework performs conflict-aware decomposition. This is the part that keeps the paper from becoming another “we prompted the model harder” exercise.

The system parses the PDDL planning problem, builds dependency graphs over the initial state and goal state, identifies ordering constraints, detects cycles or conflicts, and generates mutually exclusive conditions to resolve them. The planner then decomposes goals into ordered subproblems. Fast Downward is used as the underlying symbolic solver.

The motivation is simple. Splitting a planning problem by goal atoms is not enough. Some subgoals interfere with each other. In Blocks, for example, placing one block correctly may require moving another block that was already part of a completed subgoal. Achieve goals in the wrong order, and the system undoes its own work. Very human, yes, but not ideal in software.

The conflict-detection module tries to prevent that by reasoning over dependency graphs. It updates the initial state and orders subgoals so later planning does not casually demolish earlier achievements. This structural layer is important because it means the LLM is not operating in a vacuum. It is embedded in a decomposition-and-composition workflow.
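The ordering half of that idea can be sketched with a topological sort over subgoal dependencies. This is a deliberately simplified illustration using Python's standard `graphlib` module (Python 3.9+): it only detects cycles and reports them, whereas the paper's module also generates mutually exclusive conditions to resolve them. The dependency map and the function name are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

def order_subgoals(must_precede):
    """Order goal atoms so that every prerequisite comes first.

    must_precede maps each goal atom to the set of atoms that must be
    achieved before it. Raises ValueError if the dependencies are cyclic.
    """
    try:
        return list(TopologicalSorter(must_precede).static_order())
    except CycleError as err:
        # err.args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(f"cyclic subgoal dependency: {err.args[1]}") from err

# Hypothetical Blocks tower A-on-B-on-C: build from the bottom up, otherwise
# placing B on C later would mean unstacking A again.
tower = {
    "on(A,B)": {"on(B,C)"},
    "on(B,C)": {"ontable(C)"},
    "ontable(C)": set(),
}
print(order_subgoals(tower))  # ['ontable(C)', 'on(B,C)', 'on(A,B)']
```

A cycle is the signal that plain goal ordering is not enough, which is the case the paper's conflict-resolution step exists to handle.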

The paper’s soundness claim rests on that design. The framework does not accept LLM-generated plans as final plans. It accepts only planner-verified subplans and composes them. The authors also claim conditional completeness: if conflict detection removes cyclic dependencies and recursive decomposition eventually yields finitely many subproblems solvable by the underlying planner, the framework can solve the original problem.
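The "accept only verified subplans" rule is easy to state in code. The sketch below simulates each candidate subplan against the action model, rejects the whole composition if any step is not executable or any subgoal is missed, and checks the final goal at the end. It is an assumption-laden illustration (the names `simulate` and `compose` and the toy domain are hypothetical), not the paper's verification procedure, which relies on Fast Downward.

```python
from typing import NamedTuple, Optional

class Action(NamedTuple):
    name: str
    pre: frozenset
    add: frozenset
    delete: frozenset

def simulate(state, plan, by_name) -> Optional[frozenset]:
    """Execute a plan step by step; None means some step was not executable."""
    for name in plan:
        action = by_name.get(name)
        if action is None or not action.pre <= state:
            return None
        state = (state - action.delete) | action.add
    return state

def compose(init, goal, subplans, actions):
    """Chain (subgoal, plan) pairs, accepting each only if it verifies."""
    by_name = {a.name: a for a in actions}
    state, full = frozenset(init), []
    for subgoal, plan in subplans:
        nxt = simulate(state, plan, by_name)
        if nxt is None or not frozenset(subgoal) <= nxt:
            return None                      # reject: invalid or misses subgoal
        state, full = nxt, full + list(plan)
    return full if frozenset(goal) <= state else None

ACTIONS = [
    Action("load", frozenset({"pkg_at_A", "truck_at_A"}),
           frozenset({"pkg_in_truck"}), frozenset({"pkg_at_A"})),
    Action("drive_A_B", frozenset({"truck_at_A"}),
           frozenset({"truck_at_B"}), frozenset({"truck_at_A"})),
    Action("unload", frozenset({"pkg_in_truck", "truck_at_B"}),
           frozenset({"pkg_at_B"}), frozenset({"pkg_in_truck"})),
]
INIT, GOAL = {"pkg_at_A", "truck_at_A"}, {"pkg_at_B"}

verified = compose(INIT, GOAL, [
    ({"pkg_in_truck"}, ["load"]),
    ({"pkg_at_B"}, ["drive_A_B", "unload"]),
], ACTIONS)
print(verified)  # ['load', 'drive_A_B', 'unload']

fluent_but_invalid = compose(INIT, GOAL, [({"pkg_at_B"}, ["unload"])], ACTIONS)
print(fluent_but_invalid)  # None
```

The second call is the valid-looking plan from earlier in miniature: "unload" reads fine and fails the first precondition check.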

That “if” is doing real work. Conditional completeness is not a universal guarantee. It says the architecture preserves solvability under certain termination and decomposability assumptions. Still, it is a useful discipline: the LLM can suggest, but the symbolic layer decides what becomes executable.

The evidence: prediction beats inspiration when semantics still mean something

The experiments use four International Planning Competition domains: Blocks, Logistics, Depot, and Mystery Round 1. These are not business workflows, but they are standard planning benchmarks with explicit domain models, actions, predicates, and goals.

The methods compared are Fast Downward; decomposition without LLM assistance; three LLM-only baselines (DeepSeek-R1, GPT-5, and Claude Sonnet 4); RAP; LLM4Inspire; and LLM4Predict. LLM4Inspire and LLM4Predict use DeepSeek-R1 as the LLM component. Success is measured as a valid plan produced within a three-minute planner cutoff, with an important caveat: LLM-call time is excluded.

Here is the main success-rate table.

| Method | Blocks | Logistics | Depot | Mystery |
| --- | --- | --- | --- | --- |
| Fast Downward | 26/50 | 17/42 | 5/22 | 15/30 |
| Decomposition | 40/50 | 42/42 | 15/22 | 15/30 |
| DeepSeek-R1 only | 35/50 | 13/42 | 4/22 | 0/30 |
| GPT-5 only | 50/50 | 9/42 | 10/22 | 0/30 |
| Claude Sonnet 4 only | 46/50 | 3/42 | 7/22 | 0/30 |
| RAP | 1/50 | 0/42 | 1/22 | 0/30 |
| LLM4Inspire | 37/50 | 42/42 | 17/22 | 15/30 |
| LLM4Predict | 49/50 | 42/42 | 19/22 | 15/30 |

The headline result is that LLM4Predict is the strongest planner-integrated method across the semantically meaningful domains. In Blocks, it solves 49 out of 50 instances, compared with 40 for decomposition alone and 37 for LLM4Inspire; only GPT-5 prompted directly does better there, at 50/50, and it does not hold that level in Logistics or Depot. In Logistics, both LLM-assisted modes and decomposition solve all 42 instances, meaning the decomposition structure already captures much of what matters. In Depot, LLM4Predict solves 19 out of 22, ahead of decomposition’s 15 and LLM4Inspire’s 17.

The Mystery result is the built-in nonsense detector. Mystery replaces action, predicate, and object names with random words. Classical planners do not care whether an action is named “load” or “glimflorp,” because the domain model specifies its preconditions and effects. LLMs do care, because semantic names are part of their guidance signal. In Mystery, LLM-only methods fail completely, and both LLM4Inspire and LLM4Predict stay at 15/30, no better than decomposition or Fast Downward.

That is not a footnote. It is the mechanism revealing itself. The LLM contribution is domain-sensitive guidance, not universal planning competence. When the names stop meaning anything, the model stops helping.

The ablations tell a more sober story than the leaderboard

A shallow reading of the table would crown LLM4Predict and move on. That would miss the more useful evidence.

First, decomposition alone is already powerful. Fast Downward solves 26/50 Blocks, 17/42 Logistics, 5/22 Depot, and 15/30 Mystery. Add decomposition without LLM assistance, and those numbers become 40/50, 42/42, 15/22, and 15/30. That is a large part of the improvement before any model-generated hint enters the system.
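The size of each layer's contribution is easier to see as percentage points. The short script below does only arithmetic on the success counts reported in the table above; the function and variable names are mine.

```python
# (solved, total) per domain, copied from the success-rate table.
RESULTS = {
    "Fast Downward": {"Blocks": (26, 50), "Logistics": (17, 42),
                      "Depot": (5, 22), "Mystery": (15, 30)},
    "Decomposition": {"Blocks": (40, 50), "Logistics": (42, 42),
                      "Depot": (15, 22), "Mystery": (15, 30)},
    "LLM4Inspire":   {"Blocks": (37, 50), "Logistics": (42, 42),
                      "Depot": (17, 22), "Mystery": (15, 30)},
    "LLM4Predict":   {"Blocks": (49, 50), "Logistics": (42, 42),
                      "Depot": (19, 22), "Mystery": (15, 30)},
}

def gain_pp(better, baseline, domain):
    """Difference in success rate, in percentage points, rounded to 0.1."""
    solved_b, total = RESULTS[better][domain]
    solved_a, _ = RESULTS[baseline][domain]
    return round(100 * (solved_b - solved_a) / total, 1)

for domain in ("Blocks", "Logistics", "Depot", "Mystery"):
    print(
        f"{domain:<10}"
        f" decomposition vs Fast Downward: {gain_pp('Decomposition', 'Fast Downward', domain):+5.1f} pp;"
        f" LLM4Predict vs decomposition: {gain_pp('LLM4Predict', 'Decomposition', domain):+5.1f} pp"
    )
```

On these counts, decomposition adds 28.0, 59.5, and 45.5 points over Fast Downward in Blocks, Logistics, and Depot, while LLM4Predict adds a further 18.0, 0.0, and 18.2 points on top of decomposition. Both layers add nothing in Mystery.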

Second, LLM-only planning is inconsistent across domains. GPT-5 reaches 50/50 in Blocks, but only 9/42 in Logistics and 10/22 in Depot. DeepSeek-R1 does reasonably in Blocks but poorly in Logistics and Depot. Claude Sonnet 4 also drops sharply outside Blocks. All three score 0/30 in Mystery. The paper notes that GPT-5 appears robust in direct planning but that DeepSeek-R1 was chosen for the integrated variants because its outputs were better suited to format validation in these tasks. That is an implementation detail with real operational weight: a “smarter” model is not automatically the best component if it is less reliable as a machine-readable module.
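What "format validation" might look like in practice: the sketch below parses a model reply as one parenthesised predicate per line and rejects anything that does not match the domain's vocabulary. The reply format, the arity table, and the function name are assumptions for illustration; the article does not specify the paper's actual output format or validator.

```python
def parse_predicates(reply, arity, objects):
    """Parse lines such as '(on a b)' into tuples, or raise ValueError.

    arity maps each known predicate name to its argument count;
    objects is the set of object names declared in the problem.
    """
    atoms = []
    for raw in reply.splitlines():
        line = raw.strip()
        if not line:
            continue
        if not (line.startswith("(") and line.endswith(")")):
            raise ValueError(f"not a parenthesised atom: {line!r}")
        parts = line[1:-1].split()
        if not parts or parts[0] not in arity:
            raise ValueError(f"unknown predicate in: {line!r}")
        name, args = parts[0], parts[1:]
        if len(args) != arity[name]:
            raise ValueError(f"{name} expects {arity[name]} argument(s): {line!r}")
        unknown = [a for a in args if a not in objects]
        if unknown:
            raise ValueError(f"unknown object(s) {unknown} in: {line!r}")
        atoms.append((name, *args))
    if not atoms:
        raise ValueError("reply contained no predicates")
    return atoms

ARITY = {"on": 2, "clear": 1}       # hypothetical Blocks-style vocabulary
OBJECTS = {"a", "b", "c"}

print(parse_predicates("(on a b)\n(clear a)", ARITY, OBJECTS))
# [('on', 'a', 'b'), ('clear', 'a')]
```

A model that reasons brilliantly but wraps its answer in three paragraphs of commentary fails at the first check, which is the operational point of the paper's model choice.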

Third, RAP performs poorly under the benchmark constraints. The authors attribute this to the resource cost of Monte Carlo Tree Search in complex problems and the number of LLM accesses required. In this paper, RAP is not a failed idea in general; it is a comparison showing that not every LLM-planning loop is practical under tight search budgets. Sampling more futures can sound impressive until the clock starts behaving like a CFO.

Fourth, LLM4Predict is not magic. In Depot, the paper reports that LLM4Predict solves instances 5 and 20 where LLM4Inspire fails, while LLM4Inspire solves instances 15 and 18 where LLM4Predict fails. The authors suggest that predicted intermediate states can sometimes interfere with previously achieved subgoals, producing invalid composition. This is exactly the kind of failure operators should care about: a milestone can simplify the search, but a bad milestone can pull the workflow through territory that violates prior commitments.

So the evidence supports a layered conclusion, not a victory lap. The best-performing system is not “LLM plus planner.” It is conflict-aware decomposition, planner verification, and LLM-assisted search-space reduction, with intermediate-state prediction often better than action inspiration when the domain semantics remain meaningful.

Why intermediate-state prediction is usually the better business pattern

In enterprise terms, LLM4Inspire resembles a copilot sitting beside a workflow engine and saying, “Choose this next valid action.” That can help, but it is local. It optimises one decision at a time.

LLM4Predict resembles a process analyst saying, “Before trying to reach the final goal, get the system into this intermediate configuration.” That is more strategic. It can turn a difficult end-to-end workflow into smaller verified workflow segments.

This distinction matters in business domains where the plan is not a single sentence but a constrained sequence:

  • route these goods while respecting vehicle capacity, fuel, customs windows, and delivery order;
  • schedule maintenance while preserving service coverage and parts availability;
  • allocate warehouse tasks while avoiding blocked aisles and conflicting forklift assignments;
  • run a financial operations workflow while respecting approvals, cutoffs, and audit states;
  • coordinate field-service jobs where skills, geography, equipment, and SLAs interact.

In those settings, the LLM should not be asked to “make the plan” in the loose conversational sense. It should be asked to propose useful intermediate representations: milestones, partial states, candidate decompositions, exception clusters, or likely subgoal orderings. Then a formal engine checks feasibility.

The ROI logic is also different. LLM4Inspire saves effort if it reduces poor branching choices. LLM4Predict saves effort if it reduces the depth and complexity of the problem itself. In operational systems, the latter is often more valuable because hard planning failures usually come from combinatorial explosion, not from a lack of friendly suggestions.

A useful enterprise design principle falls out:

Use the model where judgment narrows the search. Use the planner where rules determine truth.

That sounds obvious, which is how good architecture often looks after someone else has done the unpleasant work.

What the paper directly shows, and what Cognaptus infers

The distinction between evidence and inference is important here.

| Layer | What it says | How strong it is |
| --- | --- | --- |
| Direct paper result | Conflict-aware decomposition improves planning success over Fast Downward in several IPC domains. | Strong within the tested benchmark setup. |
| Direct paper result | LLM4Predict generally outperforms LLM4Inspire in Blocks and Depot, while both tie in Logistics and Mystery. | Strong for the reported experiments; not a universal theorem. |
| Direct paper result | Mystery removes semantic names, and LLM guidance loses usefulness there. | Strong evidence that model assistance depends on meaningful domain semantics. |
| Cognaptus inference | Enterprise systems should use LLMs as constrained decomposition aids rather than final plan authorities. | Plausible and operationally useful, but requires validation per workflow. |
| Cognaptus inference | Intermediate-state prediction is often a better pattern than next-action selection for complex workflows. | Supported by the paper’s mechanism and results, but not proven across all business domains. |
| Still uncertain | Whether the approach works in dynamic, uncertain, partially observable, or human-in-the-loop environments. | Not established by this paper. |

This is not a minor distinction. If an organisation reads the paper as proof that LLM agents can autonomously plan complex operations, it will build the wrong system. If it reads the paper as evidence that LLMs can help formal planners by proposing constrained search reductions, it gets a useful architectural pattern.

The second reading is less cinematic. It is also less likely to strand a delivery truck in a logically impossible state.

The limitation that matters: the world here is still PDDL-shaped

The paper’s boundaries are clear and operationally significant.

The experiments are based on classical planning domains with explicit action models. The world state is represented in predicates. Actions have defined preconditions, add effects, and delete effects. That is a cleaner world than most business processes, where states are incomplete, systems disagree, people override rules, data arrives late, and “urgent” means someone important has stopped reading the policy.

The benchmark time limit is also narrower than a production cost model. The success metric uses a three-minute cutoff for valid plans, but LLM-call time is excluded because network speed and device performance can vary. That is reasonable for isolating solver behaviour, but operators cannot exclude model latency from reality. In production, LLM calls cost money, time, retries, monitoring, and occasionally patience.

The LLM-assisted variants use DeepSeek-R1 as the model component. The LLM-only comparison includes GPT-5 and Claude Sonnet 4, but the integrated architecture is not a full model sweep across all possible deployments. The paper’s own model-selection discussion is useful: output format reliability matters, not just raw planning ability. Still, production teams would need to test model choice, prompt stability, parsing robustness, and failure handling under their own data.

The Mystery domain is both a limitation and a warning label. When names are randomised, model guidance collapses. Many enterprise systems contain their own version of Mystery: cryptic SKU codes, legacy abbreviations, inconsistent labels, overloaded status fields, and process names invented by committees who mistook ambiguity for governance. If the semantic layer is poor, the LLM’s contribution will be poor unless it is grounded in richer metadata or domain-specific training.

Finally, the paper does not solve planning in uncertain, dynamic, or partially observable environments. The authors explicitly mark those as future work. For many real operations, that is not an edge case. It is Tuesday.

How to use this pattern without worshipping it

For an enterprise team building agentic workflow automation, the practical architecture would look something like this:

  1. Encode the business process in a formal or semi-formal action model wherever possible.
  2. Use deterministic validation for permissions, preconditions, state transitions, and constraints.
  3. Add decomposition logic to split large goals into ordered subgoals and detect conflicts.
  4. Let the LLM propose intermediate states, milestones, or candidate decompositions.
  5. Verify every accepted subplan with the planner or rule engine.
  6. Log failed intermediate states, not just failed final plans.
  7. Treat semantic quality as infrastructure: clean labels, domain glossaries, and process metadata are not documentation niceties; they are model fuel.

The last point deserves emphasis. LLM-assisted planning works best when the model can map names to plausible domain structure. That means the operational vocabulary matters. If internal systems describe a shipment as OBJ_17_STATUS_X9, do not expect the model to display divine insight. Give it meaningful representations or stop pretending language intelligence survives without language.
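One low-tech way to "give it meaningful representations" is a glossary layer between internal codes and whatever the model sees. The sketch below is purely illustrative: the meaning assigned to OBJ_17_STATUS_X9 is invented for the example, and the function names are hypothetical.

```python
def readable_atom(code, glossary):
    """Translate a cryptic field code into a named predicate, or fail loudly."""
    try:
        return glossary[code]
    except KeyError:
        raise KeyError(
            f"no glossary entry for {code!r}; add one before prompting"
        ) from None

def build_prompt_state(codes, glossary):
    """Render the current state as readable predicates, one per line."""
    return "\n".join(readable_atom(code, glossary) for code in codes)

# Hypothetical mapping: the real meaning of such a code lives in your systems.
GLOSSARY = {
    "OBJ_17_STATUS_X9": "(awaiting_customs_clearance shipment_17)",
}

print(build_prompt_state(["OBJ_17_STATUS_X9"], GLOSSARY))
# (awaiting_customs_clearance shipment_17)
```

Failing on a missing entry is deliberate. A silent pass-through would hand the model a private Mystery domain and call it context.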

The broader lesson is that agentic AI should not be designed as a single brain issuing commands. It should be a stack: language for interpretation and decomposition, symbolic machinery for validity, monitoring for drift, and governance for authority. The LLM can help the stack move faster. It should not become the stack.

Conclusion: the best planner is not the loudest model

The paper’s title asks whether LLMs should inspire or predict. The experiments answer: prediction is usually the more powerful role, but only when it is constrained, domain-aware, and verified.

That answer is more useful than the usual binary argument about whether LLMs can plan. In serious operational systems, the question is not whether the model can produce a plausible plan. It often can. The question is whether its contribution reduces complexity without taking over responsibility for correctness.

This paper’s strongest contribution is architectural. It shows that LLMs can help classical planners not by replacing their formal machinery, but by making their search problems smaller. LLM4Inspire shows the value of next-action guidance. LLM4Predict shows the larger value of intermediate-state decomposition. Mystery shows the limit: without meaningful domain semantics, the model’s guidance loses its grip.

For business operators, that is the sober takeaway. Use the model to split the problem. Use the planner to certify the path. Let each component do the job it is actually good at.

A fluent plan is nice. A valid plan is better. A valid plan found faster is where the money starts paying attention.

Cognaptus: Automate the Present, Incubate the Future.


  1. Wenkai Yu, Jianhang Tang, Yang Zhang, Yixiong Feng, Celimuge Wu, Kebing Jin, and Hankz Hankui Zhuo, “Inspire or Predict? Exploring New Paradigms in Assisting Classical Planners with Large Language Models,” arXiv:2508.11524v2, 2026, https://arxiv.org/abs/2508.11524