Plans, Tokens, and Turing Dreams: Why LLMs Still Can’t Out-Plan a 15-Year-Old Classical Planner

TL;DR for operators

A new benchmark does not say that LLMs are hopeless at planning. That would be too easy, and also false. It says something more useful: frontier models are now strong enough to solve many formal planning tasks, but their competence still weakens when the task stops giving them semantically meaningful labels.¹

On standard PDDL tasks, GPT-5 solves 205 out of 360 problems, essentially matching LAMA’s 204. That is a real milestone. On obfuscated versions of the same tasks, where names are replaced by random strings, GPT-5 drops to 152. Gemini 2.5 Pro drops only slightly, from 155 to 146, which is interesting. DeepSeek R1 drops from 157 to 93, which is also interesting, though less flattering.

The business meaning is not “throw away planners and let the chatbot run the warehouse.” The better reading is architectural. Use LLMs to translate messy goals, propose candidate plans, explain constraints, and help users interact with planning systems. Use classical planners, validators, simulators, workflow engines, and policy checks to decide what is executable.

The useful question is no longer “can LLMs plan?” They plainly can, sometimes. The question is: when the workflow becomes symbolic, long-horizon, and unforgiving, who gets final authority? The benchmark’s answer is: not the token model alone. Not yet. Possibly not in that form.

Planning is where charming language goes to be audited

A plan sounds easy in business conversation. “Move inventory from A to B.” “Schedule technicians around constraints.” “Patch the vulnerable systems without breaking production.” “Route vehicles, satisfy SLAs, and please avoid inventing a forklift.”

The difficulty is that a real plan is not a paragraph with confidence. It is a sequence of actions where every action changes the state of the world, and the next action is only legal if the previous ones made its preconditions true. One wrong step does not make the plan “mostly useful.” It makes the plan invalid. Planning is brutal that way. It has no appreciation for persuasive prose.

That is why classical planning has remained stubbornly relevant. PDDL, the Planning Domain Definition Language, lets researchers define objects, predicates, actions, initial states, and goals in a formal way. A planner then searches for an action sequence that gets from the initial state to the goal. If the sequence is valid, a validator can check it. If it is not valid, the machine does not offer a tasteful apology. It fails.
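To make the "one wrong step invalidates the plan" point concrete, here is a minimal sketch of plan checking in a simplified STRIPS-style setting. It is not VAL and it does not parse PDDL. The `Action` class, the `validate_plan` function, and the toy truck-and-package facts are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A ground action in a simplified STRIPS style (hypothetical structure)."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset


def validate_plan(initial_state, goal, plan):
    """Return (True, None) if every step is legal and the goal holds at the end,
    otherwise (False, reason)."""
    state = set(initial_state)
    for step, action in enumerate(plan, start=1):
        missing = action.preconditions - state
        if missing:
            # A single illegal step invalidates the whole plan.
            return False, f"step {step} ({action.name}): unmet {sorted(missing)}"
        state = (state - action.delete_effects) | action.add_effects
    unmet = set(goal) - state
    if unmet:
        return False, f"goal not reached: {sorted(unmet)}"
    return True, None


# Toy domain, invented for this example only.
load = Action("load", frozenset({"truck-at-a", "pkg-at-a"}),
              frozenset({"pkg-in-truck"}), frozenset({"pkg-at-a"}))
drive = Action("drive-a-b", frozenset({"truck-at-a"}),
               frozenset({"truck-at-b"}), frozenset({"truck-at-a"}))
unload = Action("unload", frozenset({"truck-at-b", "pkg-in-truck"}),
                frozenset({"pkg-at-b"}), frozenset({"pkg-in-truck"}))

init = {"truck-at-a", "pkg-at-a"}
goal = {"pkg-at-b"}

good = validate_plan(init, goal, [load, drive, unload])
bad = validate_plan(init, goal, [drive, load, unload])
print(good)
print(bad)
```

The second plan contains the same three actions in a different order. It fails at step 2 because driving away removed the fact that loading depends on, which is exactly the kind of error a fluent paragraph can hide and a validator cannot.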

The paper at the centre of this article evaluates DeepSeek R1, Gemini 2.5 Pro, and GPT-5 against LAMA, a landmark-guided planner introduced in 2010.² LAMA is not fashionable. It does not have a product keynote. It does not claim to “reason like a PhD.” It searches. Rather effectively, as it turns out.

What the benchmark actually tests

The authors evaluate end-to-end planning: the model receives a PDDL domain and task description, then must output a plan. No tools are allowed. No planner is called behind the curtain. The generated plans are validated with VAL, so the score is not based on whether the output looks plausible to a human reader. It is based on whether the plan actually works.

The experiment uses 360 newly generated tasks from eight domains in the IPC 2023 Learning Track: Blocksworld, Childsnack, Floortile, Miconic, Rovers, Sokoban, Spanner, and Transport. Each domain has 45 tasks. The authors generate fresh instances using IPC-style parameter distributions to reduce data contamination risk. That matters because planning benchmarks are old enough to have been digested by the internet, and therefore possibly by the models. The models may not have memorised the exact tasks, but the shadows of old benchmarks have a way of turning up in training data. Funny how that happens.

The clever part is the obfuscation test. The authors run both standard and obfuscated versions of the tasks. In the obfuscated version, actions, predicates, and objects are renamed with random strings. A human-readable predicate such as in-package-truck stops whispering its meaning to the model. The structure remains the same. The semantics, from the planner’s perspective, remain the same. The friendly English handles disappear.

That distinction is the whole article.

A classical planner does not care whether the action is called load-truck or xq17_zz9. It reasons over formal preconditions and effects. A language model, by contrast, is trained on tokens whose names carry statistical meaning. Rename the world into nonsense and you learn how much of the model’s “reasoning” was riding on linguistic familiarity.
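The renaming idea can be sketched in a few lines. This is an assumption about how such obfuscation might be done, not the authors' tooling: `make_mapping`, `rename`, the alias length, and the PDDL fragment are all hypothetical, and the identifier pattern is a simplification of real PDDL syntax.

```python
import random
import re
import string

# Simplified PDDL-style identifier: a letter followed by letters, digits, '-' or '_'.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")


def make_mapping(symbols, seed=0, length=8):
    """Assign each symbol a distinct random lowercase alias (hypothetical scheme)."""
    rng = random.Random(seed)
    originals = set(symbols)
    mapping, used = {}, set()
    for symbol in sorted(originals):
        alias = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        while alias in used or alias in originals:
            alias = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        used.add(alias)
        mapping[symbol] = alias
    return mapping


def rename(text, mapping):
    """Replace whole identifiers found in `mapping`; leave everything else untouched."""
    return IDENTIFIER.sub(lambda m: mapping.get(m.group(0), m.group(0)), text)


# Invented fragment for illustration; not taken from the benchmark.
snippet = (
    "(:action load-truck :parameters (?p ?t ?l) "
    ":precondition (and (at ?p ?l) (at ?t ?l)) "
    ":effect (and (in ?p ?t) (not (at ?p ?l))))"
)
mapping = make_mapping(["load-truck", "at", "in"])
obfuscated = rename(snippet, mapping)
inverse = {alias: symbol for symbol, alias in mapping.items()}

print(obfuscated)
print(rename(obfuscated, inverse) == snippet)
```

Because the mapping is one-to-one and applied consistently, the parentheses, argument positions, and precondition/effect structure are untouched. A search-based planner sees the same problem; only the hints a reader of English would use are gone.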

The headline result is progress, not replacement

The most impressive result is simple: on standard tasks, GPT-5 solves 205 out of 360 problems, while LAMA solves 204. That is not a rounding-error achievement. Earlier LLM planning studies often showed models struggling badly on formal planning, especially when semantic cues were removed or when tasks scaled beyond toy instances.³

Here is the compact version of the paper’s main coverage result:

| System | Standard tasks solved (of 360) | Obfuscated tasks solved (of 360) | Interpretation |
| --- | --- | --- | --- |
| LAMA | 204 | 204 | Invariant to symbol renaming |
| GPT-5 | 205 | 152 | Strong standard performance, weaker under pure-symbol pressure |
| Gemini 2.5 Pro | 155 | 146 | Lower standard score than GPT-5, unusually robust to obfuscation |
| DeepSeek R1 | 157 | 93 | Competitive on standard tasks, much more fragile when labels disappear |

The table is easy to misread. GPT-5 beating LAMA by one task on standard domains does not mean GPT-5 is now a better planner. It means GPT-5 can solve a comparable number of these prompted, standard, labelled benchmark instances under the paper’s setup. That is still significant. It is not a coronation.

The obfuscated column is the tax bill.

GPT-5 loses 53 solved tasks under obfuscation. DeepSeek R1 loses 64. Gemini 2.5 Pro loses only 9, which is the most intriguing result in the paper. Gemini’s lower standard score makes the robustness result less theatrical, but more technically interesting. It suggests that not all frontier models depend on token semantics in the same way.

Still, LAMA does not move. It solves 204 in both settings. The old planner sits there, unfazed by nonsense names, like an accountant receiving a rebranded invoice.
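The arithmetic behind those drops is simple enough to check by hand, but a short script makes the comparison reusable. The solved counts come from the table above; the "retained" share (obfuscated divided by standard) is a derived figure added here for illustration, not a metric the paper is quoted as reporting.

```python
TOTAL_TASKS = 360

# (standard solved, obfuscated solved), as reported in the table above.
solved = {
    "LAMA": (204, 204),
    "GPT-5": (205, 152),
    "Gemini 2.5 Pro": (155, 146),
    "DeepSeek R1": (157, 93),
}


def summarise(standard, obfuscated):
    """Return (tasks lost, share of standard-setting solves retained)."""
    return standard - obfuscated, obfuscated / standard


summary = {name: summarise(std, obf) for name, (std, obf) in solved.items()}

for name, (drop, retained) in summary.items():
    obf = solved[name][1]
    print(f"{name:15s} lost={drop:3d}  retained={retained:.1%}  "
          f"obfuscated coverage={obf / TOTAL_TASKS:.1%}")
```

Read this way, GPT-5 keeps roughly three quarters of its standard-setting solves, DeepSeek R1 keeps under 60 percent, and Gemini 2.5 Pro keeps about 94 percent.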

Obfuscation is not a party trick; it tests where the model is leaning

The point of obfuscation is not to be adversarial for sport. It separates two sources of performance that are usually entangled.

A model may solve a planning task because it has learned something structurally useful about actions, preconditions, goals, and state transitions. Or it may solve the task because the names are helpful. pickup, drive, passenger, floor, truck, and package carry a lot of prior knowledge. Those words make a symbolic problem look like a story.

In business settings, this distinction matters because many operational systems do not use clean language. They use internal codes, legacy labels, abbreviated fields, vendor-specific terms, and occasionally naming schemes designed by someone who has since left the company and taken mercy with them. If an agent’s planning ability depends heavily on meaningful labels, then the implementation risk is not abstract. It is sitting in the ERP schema.

The paper’s result implies a more disciplined view:

| What the paper shows | What Cognaptus infers for business use | Boundary |
| --- | --- | --- |
| GPT-5 is competitive with LAMA on standard PDDL coverage | LLMs are becoming useful candidate-plan generators for structured workflows | Coverage is not the same as guaranteed executability in production |
| All LLMs degrade under obfuscation | Token semantics still help models more than vendors may enjoy admitting | Obfuscation is a proxy, not a full model of enterprise data messiness |
| Gemini 2.5 Pro is unusually robust to obfuscation | Model selection should test symbolic robustness, not just benchmark headlines | Robustness here appears in one benchmark setup, not all planning domains |
| LAMA is invariant to symbol renaming | Formal planners remain valuable when correctness depends on structure, not language | Classical planning still requires a formal model |

The misconception to kill is that “reasoning model” means “planner.” It does not. A reasoning-optimised LLM may spend more tokens exploring, checking, or explaining a path. That can improve performance. It does not automatically give the model a sound transition system, a complete search procedure, or a verifier. Calling the output “reasoning” does not make every intermediate state legal. Language remains annoyingly non-contractual.

Long plans are the strongest evidence of real progress

The paper reports that LLMs can generate valid plans with more than 500 steps. That deserves more attention than the one-task GPT-5 headline.

Long-horizon planning is hard because validity compounds. If a 20-step plan has one illegal action, it fails. If a 500-step plan has one illegal action, it also fails, but it had 480 extra opportunities to embarrass itself first. A valid long plan therefore suggests that the model is doing more than producing a few locally plausible actions.
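A back-of-envelope model shows how quickly validity compounds. It assumes every step is legal independently with the same probability, which is a deliberate simplification introduced here and not a claim from the paper; real errors are correlated and domain-dependent. The function names are hypothetical.

```python
def chance_plan_is_valid(per_step_legal: float, steps: int) -> float:
    """Probability that all steps are legal, ASSUMING independent, identical steps."""
    return per_step_legal ** steps


def per_step_needed(target: float, steps: int) -> float:
    """Per-step reliability needed for the whole plan to be valid with probability `target`."""
    return target ** (1.0 / steps)


for steps in (20, 500):
    print(steps, round(chance_plan_is_valid(0.99, steps), 4))

print(round(per_step_needed(0.5, 500), 4))
```

Under this toy model, a generator that is right 99 percent of the time per step produces a valid 20-step plan about 82 percent of the time and a valid 500-step plan well under 1 percent of the time. Even odds at 500 steps would need per-step reliability of roughly 99.86 percent, which is why a valid plan of that length is informative.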

This is where the result becomes genuinely interesting. The models are not merely completing tiny textbook examples. They can maintain structured sequences over hundreds of actions in at least some domains. That is the optimistic side of the paper, and it should not be lazily dismissed by anyone still running a 2023 mental model of LLM planning.

But plan length also sharpens the reliability problem. In production, a long generated plan creates audit burden. Every step needs validation, simulation, or execution monitoring. The more ambitious the agent becomes, the more expensive it is to check. That is not a philosophical objection; it is an operations budget.

A useful enterprise planning stack therefore does not ask the LLM to be both generator and judge. It separates roles:

User goal / messy request
        ↓
LLM interprets intent and drafts structured constraints
        ↓
Planner / optimiser searches over formal state space
        ↓
Validator checks executability and policy compliance
        ↓
Workflow engine executes with monitoring and rollback
        ↓
LLM explains status, exceptions, and alternatives

That is not as glamorous as “autonomous agent.” It is also much less likely to ship nonsense into the warehouse.
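The separation of roles can be expressed as a small function in which each stage is a pluggable callable and execution is gated on validation. This is a sketch under stated assumptions: `run_request` and the stub functions are hypothetical, and the stubs stand in for a real LLM call, planner, validator, and workflow engine.

```python
from typing import Callable, List, Optional

Plan = List[str]


def run_request(
    request: str,
    formalise: Callable[[str], dict],          # LLM role: messy text -> structured task
    search: Callable[[dict], Optional[Plan]],  # planner role: task -> plan or None
    validate: Callable[[dict, Plan], bool],    # validator role: hard gate
    execute: Callable[[Plan], None],           # workflow engine role
) -> str:
    """Run one request; nothing is executed unless the validator accepts the plan."""
    task = formalise(request)
    plan = search(task)
    if plan is None:
        return "no plan found"
    if not validate(task, plan):
        return "plan rejected by validator"
    execute(plan)
    return f"executed {len(plan)} steps"


# Stubs invented for illustration only.
def formalise(request: str) -> dict:
    return {"goal": "pkg-at-b", "request": request}


def search(task: dict) -> Optional[Plan]:
    return ["load", "drive-a-b", "unload"]


def validate(task: dict, plan: Plan) -> bool:
    return bool(plan) and plan[-1] == "unload"


executed: Plan = []
accepted = run_request("Move the package to B", formalise, search, validate, executed.extend)

untouched: Plan = []
rejected = run_request("Move the package to B", formalise, search,
                       lambda task, plan: False, untouched.extend)
print(accepted, "|", rejected)
```

The design choice worth noticing is that the generator never calls `execute` itself. A rejected plan leaves the world untouched, whatever the reasoning trace said about it.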

Cost is not a footnote; it is part of the result

The paper also reports experimental cost and resource asymmetry. LAMA was run on a single CPU core with 8 GiB of memory and a 30-minute limit per task. The LLM runs used official APIs with default parameters and no tools. The authors report approximately $100.47 for GPT-5 experiments and $9.13 for DeepSeek R1; Gemini 2.5 Pro used the free tier. They also note that DeepSeek R1, as a 671B-parameter mixture-of-experts model with 37B active parameters, is estimated to require over 1,000 GiB of GPU memory for inference.

That comparison is not perfectly apples-to-apples. API pricing, provider load, internal hardware, batching, and hidden inference optimisations all muddy the water. Still, the direction is not subtle. A specialised planner is cheap and reliable inside its formal lane. A frontier model is broad and expensive, and still needs checking.

This changes the ROI question. The relevant comparison is not “which system solved more tasks?” It is:

| Decision question | Wrong procurement answer | Better engineering answer |
| --- | --- | --- |
| Can an LLM produce candidate plans? | “Yes, therefore deploy it as planner.” | “Yes, but validate before execution.” |
| Can a classical planner solve the formal task? | “It is old technology.” | “Good. Old technology with guarantees is called infrastructure.” |
| Does the business need natural-language flexibility? | “Use only symbolic systems.” | “Use LLMs to translate and negotiate constraints.” |
| Are tasks safety-critical or financially material? | “Monitor with logs after the fact.” | “Use hard validators before execution.” |
| Is latency or compute constrained? | “Bigger model.” | “Smaller model plus planner, validator, or cached policy.” |

In other words, the business value is not that LLMs replace planners. The value is that LLMs may reduce the human friction of using planners. They can help turn messy requests into formal models, suggest constraints users forgot to state, explain why no plan exists, or generate heuristics that improve search. That last path is already emerging in research on LLM-generated heuristics for classical planning, where the LLM writes code that helps a planner search rather than pretending to be the planner itself.⁴

That is the more mature architecture: less theatre, more machinery.

The planner is not obsolete; the interface is changing

Classical planners have an obvious weakness: someone has to model the domain. PDDL files are not written by vibes. They require exact predicates, actions, preconditions, effects, objects, and goals. In real companies, that modelling cost is often the barrier. Operations teams understand the process. IT systems contain the data. Few people want to sit between them and lovingly encode the ontology of a pallet.

This is where LLMs are genuinely useful. They are strong at language interfaces, schema interpretation, example-driven translation, and interactive clarification. Systems such as NL2Plan point in this direction: use the LLM to extract planning structure from minimal text, then hand the resulting formal representation to a classical planner.⁵ That does not make the LLM the final planner. It makes the LLM the assistant to the modelling process.

Kambhampati and colleagues describe a related view through LLM-Modulo frameworks: let the language model propose, translate, and refine candidates, while external critics or verifiers enforce hard constraints.⁶ This is exactly the pattern business systems need. Not “trust the model.” Not “ban the model.” Use the model where approximation is useful; use formal machinery where approximation is expensive.
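A loose reading of that propose-and-verify pattern is a loop in which critics return complaints and the proposer gets them back as feedback. This is not the LLM-Modulo authors' code or API. `llm_modulo_loop`, the two critics, the `equipment:verb` step format, and the stub proposer are all hypothetical, and the stub stands in for a real model call.

```python
from typing import Callable, List, Optional, Tuple

Plan = List[str]
Critic = Callable[[Plan], List[str]]  # returns complaints; an empty list means "pass"


def llm_modulo_loop(
    propose: Callable[[List[str]], Plan],
    critics: List[Critic],
    max_rounds: int = 3,
) -> Tuple[Optional[Plan], int]:
    """Ask for candidates until every critic passes or the round budget runs out."""
    feedback: List[str] = []
    for round_no in range(1, max_rounds + 1):
        candidate = propose(feedback)
        feedback = [msg for critic in critics for msg in critic(candidate)]
        if not feedback:
            return candidate, round_no
    return None, max_rounds


KNOWN_EQUIPMENT = {"truck", "pallet-jack"}  # invented inventory for the example


def equipment_critic(plan: Plan) -> List[str]:
    # Steps use a made-up "equipment:verb" format.
    return [f"unknown equipment in step: {step}" for step in plan
            if step.split(":")[0] not in KNOWN_EQUIPMENT]


def length_critic(plan: Plan) -> List[str]:
    return ["plan exceeds 5 steps"] if len(plan) > 5 else []


def stub_proposer(feedback: List[str]) -> Plan:
    # First attempt invents a forklift; the retry responds to the complaint.
    if not feedback:
        return ["forklift:lift", "truck:drive"]
    return ["pallet-jack:lift", "truck:drive"]


plan, rounds = llm_modulo_loop(stub_proposer, [equipment_critic, length_critic])
print(plan, rounds)
```

The critics here are trivial on purpose. In a real system they would be a plan validator, a policy engine, or a simulator, and the loop would return nothing rather than an unverified plan when the budget runs out.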

The agentic automation market will eventually learn this, probably after paying for several dashboards showing a green check mark beside an invalid plan.

Where this result applies — and where it does not

The benchmark studies classical planning: deterministic actions, full observability, discrete symbolic states, and single-agent plan generation. That is a clean setting. Enterprise reality is often dirtier: partial information, changing objectives, human approvals, stochastic execution, missing data, and incentives no PDDL file should be forced to represent before coffee.

So the paper does not prove that LLMs fail at all real-world planning. Nor does it prove that classical planners can solve every enterprise workflow. The boundary is narrower and more useful.

The paper directly shows that frontier LLMs have improved substantially on formal PDDL planning, especially on standard labelled tasks. It also shows that when semantic labels are removed, the models still degrade, while LAMA remains unchanged.

Cognaptus infers that production systems should treat LLM planning as advisory unless a separate mechanism validates the plan. This inference is strongest for domains where actions have explicit preconditions, consequences, safety rules, resource limits, or compliance constraints. It is weaker for softer planning tasks such as brainstorming a meeting agenda, sketching a sales outreach sequence, or drafting a project checklist, where partial usefulness may be acceptable.

What remains uncertain is whether future models will internalise enough formal structure to close the obfuscation gap without external planners. Gemini 2.5 Pro’s robustness suggests that progress is not uniform across model families. But even if the gap narrows, the verification question remains. In operations, correctness is not a personality trait. It is a system property.

The practical takeaway: give the LLM a badge, not the keys

This paper is good news for builders of agentic systems. It means LLMs are getting closer to useful formal planning behaviour. It also means the serious design pattern is hybrid.

For operators, the rules are straightforward:

Let LLMs interpret goals, ask clarifying questions, draft constraints, and generate candidate plans.
Let planners, solvers, validators, simulators, and workflow engines decide what can actually run.
Measure success by executable outcomes, not by elegant reasoning traces.
Test robustness with renamed variables, synthetic tasks, and ugly internal schemas, because production rarely names things nicely.
Treat cost and latency as part of planning performance, not as procurement’s little surprise.

The 15-year-old planner in the title is not a museum piece. It is a reminder that intelligence in business systems is not always the newest model. Sometimes it is the boring component that refuses to hallucinate.

LLMs are becoming better planners. They are also becoming better at revealing exactly why planning was never just text generation in a nicer jacket.

Cognaptus: Automate the Present, Incubate the Future.

¹ Augusto B. Corrêa, André G. Pereira, and Jendrik Seipp, “The 2025 Planning Performance of Frontier Large Language Models,” arXiv:2511.09378v1, 2025. https://arxiv.org/abs/2511.09378
² Silvia Richter and Matthias Westphal, “The LAMA Planner: Guiding Cost-Based Anytime Planning with Landmarks,” Journal of Artificial Intelligence Research, 39:127–177, 2010. DOI: 10.1613/jair.2972.
³ Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati, “PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change,” arXiv:2206.10498, 2022; NeurIPS 2023. https://arxiv.org/abs/2206.10498
⁴ Augusto B. Corrêa, André G. Pereira, and Jendrik Seipp, “Classical Planning with LLM-Generated Heuristics: Challenging the State of the Art with Python Code,” arXiv:2503.18809, 2025. https://arxiv.org/abs/2503.18809
⁵ Elliot Gestrin, Marco Kuhlmann, and Jendrik Seipp, “NL2Plan: Robust LLM-Driven Planning from Minimal Text Descriptions,” arXiv:2405.04215, 2024. https://arxiv.org/abs/2405.04215
⁶ Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Mudit Verma, Kaya Stechly, Siddhant Bhambri, Lucas Paul Saldyt, and Anil B. Murthy, “Position: LLMs Can’t Plan, But Can Help Planning in LLM-Modulo Frameworks,” arXiv:2402.01817, 2024; ICML 2024. https://arxiv.org/abs/2402.01817