Opening — Why this matters now
The AI world is getting bolder — talking about agentic workflows, self-directed automation, multimodal copilots, and the eventual merging of reasoning engines with operational systems. Yet beneath the hype lies a sobering question: Can today’s most powerful LLMs actually plan? Not philosophically, but in the cold, formal sense — step-by-step, verifiable, PDDL-style planning.
A new 2025 benchmark quietly answers: closer, but not quite. GPT‑5 is impressively competent, Gemini 2.5 is surprisingly robust, and DeepSeek R1 is ambitious but inconsistent. Together, the three show that the gap between symbolic planners and probabilistic models is narrowing, though not disappearing.
For businesses considering agentic automation, this is a timely wake-up call. Long-horizon planning is where LLMs stumble — and where stakes are highest.
Background — The long arm of classical planners
In the planning community, PDDL domains (airport routing, pipeline logistics, elevator control, quantum-circuit layout, etc.) have been the gold-standard benchmark for decades. These environments are:
- Formal — every symbol has precise semantics
- Deterministic — no wiggle room for creative storytelling
- Verifiable — every proposed plan must pass an unforgiving validator (VAL)
LLMs, meanwhile, are probabilistic pattern-matchers. For them, “reasoning” is a learned behaviour, not a guaranteed property. Past studies showed LLMs failing catastrophically on planning when confronted with symbolic noise or large instance sizes.
This new paper revisits the test with frontier models — GPT‑5, Gemini 2.5 Pro, and DeepSeek R1 — and fresh, contamination-resistant tasks from the 2023 IPC Learning Track.
Analysis — What the paper actually does
The authors evaluate three LLMs and one classical planner (LAMA) on 360 planning tasks across eight domains, each in two versions:
- Standard: readable PDDL names
- Obfuscated: all symbols replaced with random strings to eliminate semantic crutches
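To make the obfuscation condition concrete, here is a minimal sketch of the idea, not the authors' actual tooling: domain, predicate, and action names in a toy PDDL fragment are swapped for random strings so the model cannot lean on their English meaning.

```python
import random
import re
import string

# A toy PDDL fragment (illustrative only, not one of the benchmark domains).
DOMAIN = """
(define (domain delivery)
  (:predicates (at ?t ?loc) (carrying ?t ?pkg))
  (:action drive
    :parameters (?t ?from ?to)
    :precondition (at ?t ?from)
    :effect (and (not (at ?t ?from)) (at ?t ?to))))
"""

# Names to strip of meaning; PDDL keywords and ?variables stay intact.
MEANINGFUL = ["delivery", "at", "carrying", "drive"]

def random_symbol(length: int = 8) -> str:
    """Return a random lowercase identifier, e.g. 'qzkfwate'."""
    return "".join(random.choices(string.ascii_lowercase, k=length))

def obfuscate(pddl: str, names: list[str]) -> str:
    """Replace every whole-word occurrence of each name with a random string."""
    mapping = {name: random_symbol() for name in names}
    for name, blob in mapping.items():
        pddl = re.sub(rf"(?<![\w-]){re.escape(name)}(?![\w-])", blob, pddl)
    return pddl

print(obfuscate(DOMAIN, MEANINGFUL))
```

The structure of the problem is untouched; only the semantic hints disappear, which is exactly what separates the two columns of Table 1 below.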
LLMs are prompted with:
- domain file
- task file
- a checklist of common mistakes
- two example solved domains
Outputs are fed into VAL to confirm correctness — ensuring the evaluation measures actual planning success, not plausible-looking text.
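That validation step is what keeps the benchmark honest: a plan only counts if VAL accepts it. A minimal sketch of such a check, assuming the validator binary is installed and on the PATH as `Validate` (the name varies between builds):

```python
import subprocess

def plan_is_valid(domain_file: str, problem_file: str, plan_file: str) -> bool:
    """Run VAL on a candidate plan and report whether it is accepted.

    Assumes the VAL binary is callable as 'Validate'; adjust for your install.
    """
    result = subprocess.run(
        ["Validate", domain_file, problem_file, plan_file],
        capture_output=True,
        text=True,
    )
    # VAL typically prints "Plan valid" for accepted plans; adapt the check
    # to your build's exact output if it differs.
    return result.returncode == 0 and "Plan valid" in result.stdout

# Example: count an LLM-produced plan as solved only if VAL agrees.
# if plan_is_valid("domain.pddl", "task07.pddl", "gpt5_plan.txt"):
#     print("Counted as solved")
```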
Findings — The results that actually matter
The headline result: GPT‑5 ties LAMA on standard tasks (205 vs. 204 solved tasks) — a milestone for LLMs. But the picture becomes more nuanced with obfuscation.
Table 1 — Total Tasks Solved
| Model | Standard | Obfuscated |
|---|---|---|
| LAMA (baseline) | 204 | 204 |
| GPT‑5 | 205 | 152 |
| Gemini 2.5 Pro | 155 | 146 |
| DeepSeek R1 | 157 | 93 |
LLM optimism meets its adversary: symbolic reasoning without semantic cues.
- GPT‑5 remains strongest overall but drops 26% under obfuscation.
- Gemini 2.5 loses far less — its architecture seems less dependent on token meaning.
- DeepSeek R1 collapses. The RL-trained reasoning loops help in structured environments but fall apart under symbol permutations.
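The percentages above follow directly from Table 1; a quick back-of-the-envelope check:

```python
# Solved-task counts from Table 1: (standard, obfuscated).
results = {
    "LAMA":           (204, 204),
    "GPT-5":          (205, 152),
    "Gemini 2.5 Pro": (155, 146),
    "DeepSeek R1":    (157,  93),
}

for model, (standard, obfuscated) in results.items():
    drop = 100 * (standard - obfuscated) / standard
    print(f"{model:>15}: {drop:5.1f}% drop under obfuscation")

# LAMA: 0.0%, GPT-5: ~25.9%, Gemini 2.5 Pro: ~5.8%, DeepSeek R1: ~40.8%
```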
Chart — Plan Lengths (LLM Capabilities)
LLMs produce plans exceeding 500 steps, and in some domains they surpass LAMA’s longest solutions. This is notable: a single incorrect action invalidates the entire plan, so a long VAL-accepted plan is a stringent proxy for multi-step reliability.
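To see why, treat plan validity as compounding per-step correctness: if each action is right with independent probability p, an n-step plan survives validation with probability p^n. A toy illustration (hypothetical accuracies, not figures from the paper):

```python
# Probability that an entire plan is valid when each of n steps is
# independently correct with probability p.
for p in (0.999, 0.995, 0.99):
    for n in (50, 200, 500):
        print(f"p={p}, n={n}: whole-plan success ~ {p**n:.1%}")

# Even at 99.5% per-step accuracy, a 500-step plan survives only ~8% of the
# time, which is why VAL-accepted plans of this length are worth noting.
```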
Chart — Reasoning Token Explosion (Gemini 2.5)
Gemini’s token usage spikes dramatically for obfuscated tasks, implying:
- it is working harder
- reasoning is less efficient
- success depends on brute-force internal search rather than structural understanding
Interpretation — The gap is narrower, not closed
The models are good enough to impress, but not yet good enough to trust: with readable symbols, GPT‑5 can match a classical planner, yet once the semantic crutches are removed its performance degrades in a way LAMA’s never does.
Implications — What this means for real-world automation
1. Agentic automation is not “solved,” but now within strategic reach
GPT‑5-level models can handle medium-complexity planning, especially when tasks are labelled, structured, or learned from examples. But fully symbolic, high-stakes operational planning still belongs to classical systems.
2. Hybrid architectures will dominate
Future enterprise stacks will look like this:
- LLMs for interpretation, decomposition, constraint rewriting, speculative heuristics
- Symbolic planners for guaranteed correctness
- Validators for runtime assurance
- Orchestrators to resolve conflicts and agentic “hallucinations”
LLMs are becoming the glue, not the engine.
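As a sketch of what “glue, not engine” can look like in code: the model proposes, VAL disposes, and a classical planner backstops. Every helper below is a hypothetical stand-in supplied by the caller (for example, `plan_is_valid` could be the VAL wrapper sketched earlier), not any vendor’s API.

```python
from typing import Callable

def plan_with_fallback(
    domain_file: str,
    problem_file: str,
    llm_propose_plan: Callable[[str, str], str],       # hypothetical: writes a candidate plan file, returns its path
    run_classical_planner: Callable[[str, str], str],  # hypothetical: e.g. a LAMA wrapper, returns a plan file path
    plan_is_valid: Callable[[str, str, str], bool],    # e.g. the VAL wrapper sketched earlier
    max_llm_attempts: int = 2,
) -> str:
    """LLM proposals are advisory; only validated plans are returned."""
    for _ in range(max_llm_attempts):
        candidate = llm_propose_plan(domain_file, problem_file)   # cheap but unverified
        if plan_is_valid(domain_file, problem_file, candidate):
            return candidate                                      # verified plan, safe to hand downstream
    # No LLM proposal survived validation; fall back to a sound classical planner.
    return run_classical_planner(domain_file, problem_file)
```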
3. Cost structures matter
Running LAMA requires a single CPU core and 8 GiB of memory. Running GPT‑5 requires a datacenter. Running DeepSeek R1 requires a small moon.
This cost asymmetry will slow down adoption in embedded or resource-constrained applications.
4. Obfuscation is a proxy for robustness
If your workflow relies on strict symbolic semantics — security automation, logistics routing, manufacturing constraints — LLM-based planning remains brittle.
Gemini’s resilience is a promising exception, but not yet a paradigm shift.
5. Business takeaway: treat LLM planning as advisory, not authoritative
For now:
- LLMs can propose plans
- Classical engines must validate or replace them
- Humans should supervise whenever the stakes exceed mere inconvenience
Conclusion — The quiet truth behind the benchmark
Frontier LLMs are finally competitive on standard planning tasks, and that alone is a remarkable technical milestone. But symbolic reasoning remains an unforgiving domain, and the best models still show observable fragility under adversarial renaming.
In short:
- LLMs can plan — when the world is labelled neatly for them.
- Classical planners still rule when it isn’t.
- The battle ahead is hybridization, not replacement.
Cognaptus: Automate the Present, Incubate the Future.