Opening — Why this matters now
The AI world is getting bolder — talking about agentic workflows, self-directed automation, multimodal copilots, and the eventual merging of reasoning engines with operational systems. Yet beneath the hype lies a sobering question: Can today’s most powerful LLMs actually plan? Not philosophically, but in the cold, formal sense — step-by-step, verifiable, PDDL-style planning.
A new 2025 benchmark quietly answers: closer, but not quite. GPT‑5 is impressively competent, Gemini 2.5 is surprisingly robust, and DeepSeek R1 is ambitious but inconsistent. Together, the three show that the gap between symbolic planners and probabilistic models is narrowing, though not disappearing.
For businesses considering agentic automation, this is a timely wake-up call. Long-horizon planning is where LLMs stumble — and where stakes are highest.
Background — The long arm of classical planners
In the planning community, PDDL domains (airport routing, pipeline logistics, elevator control, quantum-circuit layout, etc.) have been the gold-standard benchmark for decades. These environments are:
- Formal — every symbol has precise semantics
- Deterministic — no wiggle room for creative storytelling
- Verifiable — every proposed plan must pass an unforgiving validator (VAL)
LLMs, meanwhile, are probabilistic pattern-matchers. For them, “reasoning” is a learned behaviour, not a guaranteed property. Past studies showed LLMs failing catastrophically on planning when confronted with symbolic noise or large instance sizes.
This new paper revisits the test with frontier models — GPT‑5, Gemini 2.5 Pro, and DeepSeek R1 — and fresh, contamination-resistant tasks from the 2023 IPC Learning Track.
Analysis — What the paper actually does
The authors evaluate three LLMs and one classical planner (LAMA) on 360 planning tasks across eight domains, each in two versions:
- Standard: readable PDDL names
- Obfuscated: all symbols replaced with random strings to eliminate semantic crutches
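To make the obfuscation condition concrete, here is a minimal sketch of the idea, not the authors' actual tooling: domain, predicate, and action names in a toy PDDL fragment are swapped for random strings so the model cannot lean on their English meaning.

```python
import random
import re
import string

# A toy PDDL fragment (illustrative only, not one of the benchmark domains).
DOMAIN = """
(define (domain delivery)
  (:predicates (at ?t ?loc) (carrying ?t ?pkg))
  (:action drive
    :parameters (?t ?from ?to)
    :precondition (at ?t ?from)
    :effect (and (not (at ?t ?from)) (at ?t ?to))))
"""

# Names to strip of meaning; PDDL keywords and ?variables stay intact.
MEANINGFUL = ["delivery", "at", "carrying", "drive"]

def random_symbol(length: int = 8) -> str:
    """Return a random lowercase identifier, e.g. 'qzkfwate'."""
    return "".join(random.choices(string.ascii_lowercase, k=length))

def obfuscate(pddl: str, names: list[str]) -> str:
    """Replace every whole-word occurrence of each name with a random string."""
    mapping = {name: random_symbol() for name in names}
    for name, blob in mapping.items():
        pddl = re.sub(rf"(?<![\w-]){re.escape(name)}(?![\w-])", blob, pddl)
    return pddl

print(obfuscate(DOMAIN, MEANINGFUL))
```

The structure of the problem is untouched; only the semantic hints disappear, which is exactly what separates the two columns of Table 1 below.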
LLMs are prompted with:
- domain file
- task file
- a checklist of common mistakes
- two example solved domains
Outputs are fed into VAL to confirm correctness — ensuring the evaluation measures actual planning success, not plausible-looking text.
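That validation step is what keeps the benchmark honest: a plan only counts if VAL accepts it. A minimal sketch of such a check, assuming the validator binary is installed and on the PATH as `Validate` (the name varies between builds):

```python
import subprocess

def plan_is_valid(domain_file: str, problem_file: str, plan_file: str) -> bool:
    """Run VAL on a candidate plan and report whether it is accepted.

    Assumes the VAL binary is callable as 'Validate'; adjust for your install.
    """
    result = subprocess.run(
        ["Validate", domain_file, problem_file, plan_file],
        capture_output=True,
        text=True,
    )
    # VAL typically prints "Plan valid" for accepted plans; adapt the check
    # to your build's exact output if it differs.
    return result.returncode == 0 and "Plan valid" in result.stdout

# Example: count an LLM-produced plan as solved only if VAL agrees.
# if plan_is_valid("domain.pddl", "task07.pddl", "gpt5_plan.txt"):
#     print("Counted as solved")
```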
Findings — The results that actually matter
The headline result: GPT‑5 ties LAMA on standard tasks (205 vs. 204 solved tasks) — a milestone for LLMs. But the picture becomes more nuanced with obfuscation.
Table 1 — Total Tasks Solved
| Model | Standard | Obfuscated |
|---|---|---|
| LAMA (baseline) | 204 | 204 |
| GPT‑5 | 205 | 152 |
| Gemini 2.5 Pro | 155 | 146 |
| DeepSeek R1 | 157 | 93 |
LLM optimism meets its adversary: symbolic reasoning without semantic cues.
- GPT‑5 remains strongest overall but drops 26% under obfuscation.
- Gemini 2.5 loses far less — its architecture seems less dependent on token meaning.
- DeepSeek R1 collapses. The RL-trained reasoning loops help in structured environments but fall apart under symbol permutations.
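The percentages above follow directly from Table 1; a quick back-of-the-envelope check:

```python
# Solved-task counts from Table 1: (standard, obfuscated).
results = {
    "LAMA":           (204, 204),
    "GPT-5":          (205, 152),
    "Gemini 2.5 Pro": (155, 146),
    "DeepSeek R1":    (157,  93),
}

for model, (standard, obfuscated) in results.items():
    drop = 100 * (standard - obfuscated) / standard
    print(f"{model:>15}: {drop:5.1f}% drop under obfuscation")

# LAMA: 0.0%, GPT-5: ~25.9%, Gemini 2.5 Pro: ~5.8%, DeepSeek R1: ~40.8%
```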
Chart — Plan Lengths (LLM Capabilities)
LLMs produce plans exceeding 500 steps, and in some domains they surpass LAMA’s longest solutions. This is notable: a single incorrect action invalidates the entire plan, so a long VAL-accepted plan is a stringent proxy for multi-step reliability.
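To see why, treat plan validity as compounding per-step correctness: if each action is right with independent probability p, an n-step plan survives validation with probability p^n. A toy illustration (hypothetical accuracies, not figures from the paper):

```python
# Probability that an entire plan is valid when each of n steps is
# independently correct with probability p.
for p in (0.999, 0.995, 0.99):
    for n in (50, 200, 500):
        print(f"p={p}, n={n}: whole-plan success ~ {p**n:.1%}")

# Even at 99.5% per-step accuracy, a 500-step plan survives only ~8% of the
# time, which is why VAL-accepted plans of this length are worth noting.
```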
Chart — Reasoning Token Explosion (Gemini 2.5)
Gemini’s token usage spikes dramatically for obfuscated tasks, implying:
- it is working harder
- reasoning is less efficient
- success depends on brute-force internal search rather than structural understanding
Interpretation — The gap is narrower, not closed
The models are good enough to impress, but not yet good enough to trust: with readable symbols, GPT‑5 can match a classical planner, yet once the semantic crutches are removed its performance degrades in a way LAMA’s never does.
Implications — What this means for real-world automation
1. Agentic automation is not “solved,” but now within strategic reach
GPT‑5-level models can handle medium-complexity planning, especially when tasks are labelled, structured, or learned from examples. But fully symbolic, high-stakes operational planning still belongs to classical systems.
2. Hybrid architectures will dominate
Future enterprise stacks will look like this:
- LLMs for interpretation, decomposition, constraint rewriting, speculative heuristics
- Symbolic planners for guaranteed correctness
- Validators for runtime assurance
- Orchestrators to resolve conflicts and agentic “hallucinations”
LLMs are becoming the glue, not the engine.
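As a sketch of what “glue, not engine” can look like in code: the model proposes, VAL disposes, and a classical planner backstops. Every helper below is a hypothetical stand-in supplied by the caller (for example, `plan_is_valid` could be the VAL wrapper sketched earlier), not any vendor’s API.

```python
from typing import Callable

def plan_with_fallback(
    domain_file: str,
    problem_file: str,
    llm_propose_plan: Callable[[str, str], str],       # hypothetical: writes a candidate plan file, returns its path
    run_classical_planner: Callable[[str, str], str],  # hypothetical: e.g. a LAMA wrapper, returns a plan file path
    plan_is_valid: Callable[[str, str, str], bool],    # e.g. the VAL wrapper sketched earlier
    max_llm_attempts: int = 2,
) -> str:
    """LLM proposals are advisory; only validated plans are returned."""
    for _ in range(max_llm_attempts):
        candidate = llm_propose_plan(domain_file, problem_file)   # cheap but unverified
        if plan_is_valid(domain_file, problem_file, candidate):
            return candidate                                      # verified plan, safe to hand downstream
    # No LLM proposal survived validation; fall back to a sound classical planner.
    return run_classical_planner(domain_file, problem_file)
```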
3. Cost structures matter
Running LAMA requires a single CPU core and 8 GiB of memory. Running GPT‑5 requires a datacenter. Running DeepSeek R1 requires a small moon.
This cost asymmetry will slow down adoption in embedded or resource-constrained applications.
4. Obfuscation is a proxy for robustness
If your workflow relies on strict symbolic semantics — security automation, logistics routing, manufacturing constraints — LLM-based planning remains brittle.
Gemini’s resilience is a promising exception, but not yet a paradigm shift.
5. Business takeaway: treat LLM planning as advisory, not authoritative
For now:
- LLMs can propose plans
- Classical engines must validate or replace them
- Humans should supervise whenever the stakes exceed mere inconvenience
Conclusion — The quiet truth behind the benchmark
Frontier LLMs are finally competitive on standard planning tasks, and that alone is a remarkable technical milestone. But symbolic reasoning remains an unforgiving domain, and the best models still show observable fragility under adversarial renaming.
In short:
- LLMs can plan — when the world is labelled neatly for them.
- Classical planners still rule when it isn’t.
- The battle ahead is hybridization, not replacement.
Cognaptus: Automate the Present, Incubate the Future.