Opening — Why this matters now

For years, the received wisdom in AI planning was blunt: language models can’t really plan. Early benchmarks—especially Blocksworld—made that verdict look almost charitable. Models hallucinated invalid actions, violated preconditions, and confidently declared failure states as success. The field responded by bolting on external verifiers, symbolic planners, or human-in-the-loop corrections.

This paper asks a quieter, more uncomfortable question: what if the problem wasn’t that LLMs can’t plan—but that they weren’t taught to doubt themselves correctly?

Background — The long shadow of “LLMs can’t plan”

Classic planning lives in a world of explicit states, actions, and preconditions. Language models, by contrast, live in probability space. Prior work showed that when LLMs were asked to critique their own plans, they produced an alarming number of false positives—declaring broken plans correct and missing obvious violations. The conclusion was pessimistic: intrinsic self‑critique simply doesn’t work.

But that conclusion quietly bundled together capability and prompting. If you ask a model to vaguely “check its work,” you mostly get vibes. This paper separates the two.

Analysis — What the paper actually does

The core contribution is not a new model, dataset, or training regime. It’s a method: an iterative loop where the same LLM alternates between planning and formally critiquing its own output.

At each iteration:

  1. Plan Generation – The model proposes a full plan given a problem definition.

  2. Self‑Critique – The model re‑reads the plan using an explicit verification procedure:

    • enumerate each action
    • check its preconditions against the current state
    • apply effects step by step

  3. Revision – If the model declares the plan wrong, both the plan and the critique are appended to the next prompt.

No oracle. No external checker. No human hints.

Crucially, the critique prompt is not philosophical. It is procedural. The model is instructed to behave like a symbolic validator—one action at a time, no skipping, no summarizing.
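
To make that concrete, a critique prompt in this spirit might look like the sketch below. The wording, the field names, and the `VERDICT:` convention are illustrative assumptions, not the paper's actual prompt.

```python
# Hypothetical critique-prompt template (illustrative; not the paper's exact prompt).
CRITIQUE_PROMPT = """You are a plan validator.
Verify the candidate plan one action at a time. Do not skip or summarize steps.
For each action:
  1. List its preconditions and check every one against the current state.
  2. If a precondition is violated, report it and stop.
  3. Otherwise, apply the action's effects to obtain the next state.
After the final action, check whether the goal conditions hold.
Finish with exactly one line: VERDICT: CORRECT or VERDICT: INCORRECT.

Domain:
{domain}

Initial state:
{initial_state}

Goal:
{goal}

Candidate plan:
{plan}
"""


def build_critique_prompt(domain: str, initial_state: str, goal: str, plan: str) -> str:
    """Fill in the template for a single critique call."""
    return CRITIQUE_PROMPT.format(
        domain=domain, initial_state=initial_state, goal=goal, plan=plan
    )
```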

Algorithm sketch

| Component | Role |
|---|---|
| Plan prompt | Produces a candidate plan |
| Critique prompt | Enforces precondition-by-precondition verification |
| Memory (τ) | Accumulates failed plans and critiques |
| Early stopping | Halts when the model itself declares correctness |
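
A minimal sketch of the whole loop, under the same assumptions: `generate` stands in for whatever LLM call you use, `critique_prompt_fn` builds a procedural critique prompt like the one above, and the `VERDICT:` check follows that assumed convention. None of these names come from the paper.

```python
from typing import Callable, List, Tuple


def plan_with_self_critique(
    generate: Callable[[str], str],            # LLM call: prompt -> completion (assumed interface)
    plan_prompt: str,                          # problem definition for the planning step
    critique_prompt_fn: Callable[[str], str],  # builds a procedural critique prompt for a plan
    max_iters: int = 8,
) -> Tuple[str, List[Tuple[str, str]]]:
    """Alternate plan generation and self-critique until the model accepts its own plan."""
    memory: List[Tuple[str, str]] = []         # τ: accumulated (failed plan, critique) pairs

    for _ in range(max_iters):
        # 1. Plan generation, conditioned on every previous failure and its critique.
        history = "\n\n".join(
            f"Previous plan:\n{p}\n\nCritique:\n{c}" for p, c in memory
        )
        plan = generate(plan_prompt + (("\n\n" + history) if history else ""))

        # 2. Self-critique: the same model verifies the plan action by action.
        critique = generate(critique_prompt_fn(plan))

        # 3. Early stopping: accept once the model itself declares the plan correct.
        #    (The substring "VERDICT: CORRECT" does not match "VERDICT: INCORRECT".)
        if "VERDICT: CORRECT" in critique.upper():
            return plan, memory

        # 4. Revision: remember the failed plan and its critique for the next attempt.
        memory.append((plan, critique))

    return plan, memory  # best effort after max_iters
```

The design point to notice is that nothing outside `generate` ever inspects the plan: acceptance hinges entirely on the model's own verdict.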

This is intrinsic self‑improvement, not self‑reflection theatre.

Findings — The numbers that matter

Across classical planning benchmarks, the gains are not marginal—they are structural.

Blocksworld (3–5 blocks)

| Method | Accuracy |
|---|---|
| Baseline (no critique) | ~50% |
| Intrinsic self-critique | ~89% |
| Oracle verifier | ~92% |

That gap between intrinsic critique and a perfect oracle is surprisingly small.

Other domains

| Domain | No Critique | Self-Critique |
|---|---|---|
| Logistics (easy) | ~61% | ~93% |
| Mini-Grid | ~58% | ~75% |
| Blocksworld (3–7) | ~57% | ~80% |

Even on Mystery Blocksworld—where predicates are intentionally obfuscated—self‑critique nearly doubles accuracy.

Why self‑consistency matters

When multiple critique samples vote on correctness, false positives drop sharply. Precision remains the weaker metric, but recall becomes remarkably strong. In other words: models still sometimes believe bad plans, but they almost never miss good ones.
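
A minimal sketch of that voting step, assuming each sampled critique ends with the same hypothetical `VERDICT:` line used above:

```python
from typing import Callable


def accept_by_majority_vote(
    generate: Callable[[str], str],   # LLM call sampled with nonzero temperature (assumed)
    critique_prompt: str,
    n_samples: int = 5,
) -> bool:
    """Sample several independent critiques; accept the plan only on a majority 'correct' vote."""
    votes = sum(
        "VERDICT: CORRECT" in generate(critique_prompt).upper()
        for _ in range(n_samples)
    )
    # A single over-optimistic critique can no longer push a broken plan through.
    return votes > n_samples // 2
```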

Implications — What this really means

This paper quietly reframes the planning debate:

  • LLMs don’t need to be symbolic planners to plan well.
  • They need procedural self‑skepticism, not abstract reflection.
  • The bottleneck wasn’t reasoning capacity—it was verification discipline.

For agent designers, this matters more than benchmark scores. Intrinsic self‑critique:

  • removes dependence on external validators
  • scales across domains with minimal tuning
  • composes naturally with search, Tree‑of‑Thoughts, or MCTS

In business terms: it lowers orchestration cost while raising reliability.

Limitations — Where this breaks

This is not magic.

  • Larger, more capable models benefit more; smaller models plateau quickly.
  • Precision is still weaker than oracle verification.
  • Context length grows with each failed iteration—hard limits still exist.

But these are engineering constraints, not conceptual ones.

Conclusion — Planning by arguing with yourself

The most interesting result here is not that accuracy improves. It’s how it improves. The model doesn’t learn new facts. It doesn’t get smarter. It simply learns to slow down and check.

That’s a lesson human organizations could probably use as well.

Cognaptus: Automate the Present, Incubate the Future.