Opening — Why this matters now

The AI world is getting bolder — talking about agentic workflows, self-directed automation, multimodal copilots, and the eventual merging of reasoning engines with operational systems. Yet beneath the hype lies a sobering question: Can today’s most powerful LLMs actually plan? Not philosophically, but in the cold, formal sense — step-by-step, verifiable, PDDL-style planning.

A new 2025 benchmark quietly answers: closer, but not quite. GPT‑5 is impressively competent, Gemini 2.5 is surprisingly robust, and DeepSeek R1 is ambitious but inconsistent — together, the three demonstrate that the gap between symbolic planners and probabilistic models is narrowing, though not disappearing.

For businesses considering agentic automation, this is a timely wake-up call. Long-horizon planning is where LLMs stumble — and where stakes are highest.

Background — The long arm of classical planners

In the planning community, PDDL domains (airport routing, pipeline logistics, elevator control, quantum-circuit layout, etc.) have been the gold-standard benchmark for decades. These environments are:

  • Formal — every symbol has precise semantics
  • Deterministic — no wiggle room for creative storytelling
  • Verifiable — every proposed plan must pass an unforgiving validator (VAL); a toy version of that check is sketched below
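
To make "verifiable" concrete, here is a minimal sketch of the kind of check VAL performs: a toy STRIPS-style replay on a hypothetical mini-domain, not the actual validator.

```python
# Toy VAL-style plan check: replay each action, fail on any unmet
# precondition. Hypothetical mini-domain; real VAL handles full PDDL.

def validate(plan, actions, init, goal):
    state = set(init)
    for name in plan:
        pre, add, delete = actions[name]
        if not pre <= state:                # one unmet precondition kills the plan
            return False, f"precondition of '{name}' not met"
        state = (state - delete) | add      # apply the action's effects
    return goal <= state, "goal check"

# Two-action blocks-world fragment: pick up block a, stack it on b.
actions = {
    "pickup-a":  ({"clear-a", "ontable-a", "handempty"},
                  {"holding-a"},
                  {"clear-a", "ontable-a", "handempty"}),
    "stack-a-b": ({"holding-a", "clear-b"},
                  {"on-a-b", "clear-a", "handempty"},
                  {"holding-a", "clear-b"}),
}
ok, why = validate(["pickup-a", "stack-a-b"], actions,
                   init={"clear-a", "ontable-a", "clear-b", "handempty"},
                   goal={"on-a-b"})
print(ok, why)  # -> True goal check: every step verified, no wiggle room
```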

LLMs, meanwhile, are probabilistic pattern-matchers. For them, “reasoning” is a learned behaviour, not a guaranteed property. Past studies showed LLMs failing catastrophically on planning when confronted with symbolic noise or large instance sizes.

This new paper revisits the test with frontier models — GPT‑5, Gemini 2.5 Pro, and DeepSeek R1 — and fresh, contamination-resistant tasks from the 2023 IPC Learning Track.

Analysis — What the paper actually does

The authors evaluate three LLMs and one classical planner (LAMA) on 360 planning tasks across eight domains, each in two versions:

  1. Standard: readable PDDL names
  2. Obfuscated: all symbols replaced with random strings to eliminate semantic crutches (a sketch of this renaming follows the list)
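
The renaming step is simple to picture. A minimal sketch, assuming a regex-based rewrite that spares PDDL keywords and variables; the benchmark's exact scheme may differ:

```python
import random
import re
import string

# PDDL keywords that must survive renaming (illustrative, not exhaustive).
KEYWORDS = {"define", "domain", "problem", "and", "or", "not",
            ":requirements", ":strips", ":typing", ":types", ":predicates",
            ":action", ":parameters", ":precondition", ":effect",
            ":objects", ":init", ":goal"}

def obfuscate(pddl_text, seed=0):
    rng, mapping = random.Random(seed), {}

    def rename(match):
        sym = match.group(0)
        if sym.lower() in KEYWORDS or sym.startswith("?"):
            return sym                    # keep keywords and ?variables intact
        if sym not in mapping:            # same symbol -> same random name
            mapping[sym] = "x" + "".join(rng.choices(string.ascii_lowercase, k=8))
        return mapping[sym]

    return re.sub(r"[:?]?[A-Za-z][A-Za-z0-9_-]*", rename, pddl_text)

print(obfuscate("(:action pick-up :parameters (?b - block))"))
# e.g. (:action xgjsvkuyp :parameters (?b - xqlnwadhe))
```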

LLMs are prompted with four components (assembled roughly as in the sketch after this list):

  • domain file
  • task file
  • a checklist of common mistakes
  • two solved example domains
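
A plausible shape for that prompt, as a sketch; the template, file handling, and wording here are hypothetical, not reproduced from the paper:

```python
from pathlib import Path

def build_prompt(domain_path, task_path, checklist, examples):
    """Assemble the four prompt components listed above.
    Hypothetical template; the paper's exact phrasing differs."""
    parts = [
        "You are a PDDL planner. Output only a plan, one action per line.",
        "Common mistakes to avoid:\n" + "\n".join(f"- {m}" for m in checklist),
    ]
    for ex_domain, ex_task, ex_plan in examples:   # two solved examples
        parts.append(f"Example domain:\n{ex_domain}\n\n"
                     f"Example task:\n{ex_task}\n\n"
                     f"Example plan:\n{ex_plan}")
    parts.append("Domain:\n" + Path(domain_path).read_text())
    parts.append("Task:\n" + Path(task_path).read_text())
    return "\n\n".join(parts)
```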

Outputs are fed into VAL to confirm correctness — ensuring the evaluation measures actual planning success, not plausible-looking text.
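
Mechanically, that check can be a subprocess call. A sketch, assuming a locally built VAL binary named Validate on the PATH that reports "Plan valid" on success:

```python
import subprocess
import tempfile

def plan_is_valid(domain_file, problem_file, plan_text):
    """Write the model's plan to disk and ask VAL to verify it.
    Assumes a locally built VAL binary named 'Validate' on the PATH."""
    with tempfile.NamedTemporaryFile("w", suffix=".plan", delete=False) as f:
        f.write(plan_text)
        plan_file = f.name
    result = subprocess.run(["Validate", domain_file, problem_file, plan_file],
                            capture_output=True, text=True)
    # Treat anything other than an explicit "Plan valid" as failure.
    return result.returncode == 0 and "Plan valid" in result.stdout
```

Plausible-looking text that fails this gate counts as a failure, full stop.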

Findings — The results that actually matter

The headline result: GPT‑5 effectively ties LAMA on standard tasks (205 vs. 204 solved tasks) — a milestone for LLMs. But the picture becomes more nuanced with obfuscation.

Table 1 — Total Tasks Solved (out of 360)

Model             Standard   Obfuscated
LAMA (baseline)   204        204
GPT‑5             205        152
Gemini 2.5 Pro    155        146
DeepSeek R1       157        93

LLM optimism meets its adversary: symbolic reasoning without semantic cues.

  • GPT‑5 remains strongest overall but drops 26% under obfuscation (205 → 152).
  • Gemini 2.5 Pro loses far less (155 → 146, roughly 6%) — it seems less dependent on the surface meaning of tokens.
  • DeepSeek R1 collapses (157 → 93, about 41%). Its RL-trained reasoning loops help in structured environments but fall apart once symbols are randomly renamed.

Chart — Plan Lengths (LLM Capabilities)

LLMs produce plans exceeding 500 steps, and in some domains surpass LAMA's longest solutions. This is notable: a single incorrect action invalidates the entire plan, so long-horizon correctness is a demanding proxy for multi-step reliability.
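
A back-of-the-envelope calculation (our illustration, not a figure from the paper) shows why: if each action is independently correct with probability p, an n-step plan survives validation with probability p^n, which decays fast.

```python
# Per-step reliability compounds over a long horizon.
# Illustrative probabilities; not figures from the paper.
for p in (0.999, 0.995, 0.99):
    print(f"per-step {p:.3f} -> 500-step plan valid with prob {p**500:.3%}")
# per-step 0.999 -> 500-step plan valid with prob 60.638%
# per-step 0.995 -> 500-step plan valid with prob 8.157%
# per-step 0.990 -> 500-step plan valid with prob 0.657%
```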

Chart — Reasoning Token Explosion (Gemini 2.5)

Gemini’s token usage spikes dramatically for obfuscated tasks, implying:

  • it is working harder
  • reasoning is less efficient
  • success depends on brute-force internal search rather than structural understanding

Interpretation — The gap is narrower, not closed

The models are good enough to impress, not good enough to trust. On standard tasks the best LLM matches a classical baseline; strip away meaningful symbol names and performance falls by roughly 6% to 41%, suggesting that part of the apparent competence rests on lexical priors rather than genuine search.

Implications — What this means for real-world automation

1. Agentic automation is not “solved,” but now within strategic reach

GPT‑5-level models can handle medium-complexity planning, especially when tasks are labelled, structured, or learned from examples. But fully symbolic, high-stakes operational planning still belongs to classical systems.

2. Hybrid architectures will dominate

Future enterprise stacks will look like this (a minimal control loop is sketched after the list):

  • LLMs for interpretation, decomposition, constraint rewriting, speculative heuristics
  • Symbolic planners for guaranteed correctness
  • Validators for runtime assurance
  • Orchestrators to resolve conflicts and agentic “hallucinations”
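
A minimal sketch of that division of labour; every function name here is a hypothetical placeholder, not a real API:

```python
# LLM proposes, the validator gates, the classical planner is the
# fallback. All names are hypothetical placeholders.

def plan_with_guardrails(domain, task, llm_propose, classical_plan, validate):
    """Only validated plans leave this function; LLM output is advisory."""
    candidate = llm_propose(domain, task)             # fast, cheap, fallible
    if candidate and validate(domain, task, candidate):
        return candidate, "llm"                       # validated shortcut
    return classical_plan(domain, task), "classical"  # guaranteed correctness
```

The design choice is that the validator, not the model, is the arbiter of correctness; the LLM merely supplies cheap candidate plans.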

LLMs are becoming the glue, not the engine.

3. Cost structures matter

Running LAMA requires a single CPU core and 8 GiB of memory. Running GPT‑5 requires a datacenter. Running DeepSeek R1 requires a small moon.

This cost asymmetry will slow down adoption in embedded or resource-constrained applications.

4. Obfuscation is a proxy for robustness

If your workflow relies on strict symbolic semantics — security automation, logistics routing, manufacturing constraints — LLM-based planning remains brittle.

Gemini’s resilience is a promising exception, but not yet a paradigm shift.

5. Business takeaway: treat LLM planning as advisory, not authoritative

For now:

  • LLMs can propose plans
  • Classical engines must validate or replace them
  • Humans should supervise whenever the stakes exceed mere inconvenience

Conclusion — The quiet truth behind the benchmark

Frontier LLMs are finally competitive on standard planning tasks, and that alone is a remarkable technical milestone. But symbolic reasoning remains an unforgiving domain, and the best models still show observable fragility under adversarial renaming.

In short:

  • LLMs can plan — when the world is labelled neatly for them.
  • Classical planners still rule when it isn’t.
  • The battle ahead is hybridization, not replacement.

Cognaptus: Automate the Present, Incubate the Future.