Opening — Why this matters now
For years, the received wisdom in AI planning was blunt: language models can’t really plan. Early benchmarks—especially Blocksworld—made that verdict look almost charitable. Models hallucinated invalid actions, violated preconditions, and confidently declared failed states to be successes. The field responded by bolting on external verifiers, symbolic planners, or human-in-the-loop corrections.
This paper asks a quieter, more uncomfortable question: what if the problem wasn’t that LLMs can’t plan—but that they weren’t taught to doubt themselves correctly?
Background — The long shadow of “LLMs can’t plan”
Classic planning lives in a world of explicit states, actions, and preconditions. Language models, by contrast, live in probability space. Prior work showed that when LLMs were asked to critique their own plans, they produced an alarming number of false positives—declaring broken plans correct and missing obvious violations. The conclusion was pessimistic: intrinsic self‑critique simply doesn’t work.
But that conclusion quietly bundled together capability and prompting. If you ask a model to vaguely “check its work,” you mostly get vibes. This paper separates the two.
Analysis — What the paper actually does
The core contribution is not a new model, dataset, or training regime. It’s a method: an iterative loop where the same LLM alternates between planning and formally critiquing its own output.
At each iteration:
- Plan Generation – The model proposes a full plan given the problem definition.
- Self‑Critique – The model re‑reads the plan using an explicit verification procedure:
  - enumerate each action
  - check its preconditions against the current state
  - apply its effects step by step
- Revision – If the model declares the plan wrong, both the plan and the critique are appended to the next prompt, and the model generates a new candidate plan.
No oracle. No external checker. No human hints.
Crucially, the critique prompt is not philosophical. It is procedural. The model is instructed to behave like a symbolic validator—one action at a time, no skipping, no summarizing.
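To make that concrete, here is a hypothetical sketch of what such a procedural critique prompt could look like. The wording, the placeholders, and the `VERDICT:` convention are illustrative assumptions, not the paper’s exact template.

```python
# Hypothetical critique prompt sketch (not the paper's exact wording).
# It forces the model to walk the plan action by action, like a symbolic
# validator, instead of offering a holistic judgement.
CRITIQUE_PROMPT = """You are verifying a plan, not rating it.

Problem:
{problem}

Candidate plan:
{plan}

For each action, in order:
1. Write out the current world state.
2. List the action's preconditions and check each one against that state.
3. If any precondition fails, report the first violation and stop.
4. Otherwise apply the action's effects and write the updated state.
After the final action, check whether every goal condition holds.
Finish with exactly one line: VERDICT: CORRECT or VERDICT: INCORRECT."""
```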
Algorithm sketch
| Component | Role |
|---|---|
| Plan prompt | Produces candidate plan |
| Critique prompt | Enforces precondition-by-precondition verification |
| Memory (τ) | Accumulates failures and critiques |
| Early stopping | Stops when the model itself declares correctness |
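Put together, the loop itself is short. The sketch below is a minimal reading of the method described above, assuming a generic `llm(prompt) -> str` completion call, the hypothetical `CRITIQUE_PROMPT` from the previous snippet, and an illustrative `PLAN_PROMPT`; the names, parsing, and iteration budget are assumptions, not the paper’s implementation.

```python
# Minimal sketch of the plan -> critique -> revise loop. Assumes the
# CRITIQUE_PROMPT defined above and a caller-supplied `llm(prompt) -> str`.
PLAN_PROMPT = """Solve the planning problem below. Output only the plan,
one action per line.

Problem:
{problem}

Previously rejected plans and their critiques (may be empty):
{history}"""

def plan_with_self_critique(problem: str, llm, max_iters: int = 8) -> str:
    memory = []  # τ: accumulated failed plans and critiques
    plan = ""
    for _ in range(max_iters):
        # Plan generation, conditioned on earlier failures
        history = "\n\n".join(memory)
        plan = llm(PLAN_PROMPT.format(problem=problem, history=history))

        # Procedural self-critique of the candidate plan
        critique = llm(CRITIQUE_PROMPT.format(problem=problem, plan=plan))

        # Early stopping: the model itself declares correctness
        if "VERDICT: CORRECT" in critique:
            return plan

        # Revision memory: keep the failed plan and its critique for the next round
        memory.append(f"Previous plan:\n{plan}\n\nCritique:\n{critique}")
    return plan  # best effort once the iteration budget is exhausted
```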
This is intrinsic self‑improvement, not self‑reflection theatre.
Findings — The numbers that matter
Across classical planning benchmarks, the gains are not marginal—they are structural.
Blocksworld (3–5 blocks)
| Method | Accuracy |
|---|---|
| Baseline (no critique) | ~50% |
| Intrinsic self‑critique | ~89% |
| Oracle verifier | ~92% |
That gap between intrinsic critique and a perfect oracle is surprisingly small.
Other domains
| Domain | No Critique | Self‑Critique |
|---|---|---|
| Logistics (easy) | ~61% | ~93% |
| Mini‑Grid | ~58% | ~75% |
| Blocksworld (3–7 blocks) | ~57% | ~80% |
Even on Mystery Blocksworld—where predicates are intentionally obfuscated—self‑critique nearly doubles accuracy.
Why self‑consistency matters
When multiple critique samples vote on correctness, false positives drop sharply. Precision remains the main failure mode, but recall becomes remarkably strong. In other words: models still sometimes believe bad plans, but they almost never miss good ones.
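A hedged sketch of that voting step, assuming a stochastic `llm_sample` completion call and the `CRITIQUE_PROMPT` from earlier; the sample count and the majority threshold are illustrative choices, not the paper’s settings.

```python
# Self-consistency over critiques: sample several independent critiques and
# accept the plan only if a majority declare it correct. A single
# overconfident critique can no longer wave a bad plan through.
def plan_is_accepted(problem: str, plan: str, llm_sample, n: int = 5) -> bool:
    votes = sum(
        "VERDICT: CORRECT" in llm_sample(
            CRITIQUE_PROMPT.format(problem=problem, plan=plan)
        )
        for _ in range(n)
    )
    return votes > n / 2
```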
Implications — What this really means
This paper quietly reframes the planning debate:
- LLMs don’t need to be planners to plan well.
- They need procedural self‑skepticism, not abstract reflection.
- The bottleneck wasn’t reasoning capacity—it was verification discipline.
For agent designers, this matters more than benchmark scores. Intrinsic self‑critique:
- removes dependence on external validators
- scales across domains with minimal tuning
- composes naturally with search, Tree‑of‑Thoughts, or MCTS
In business terms: it lowers orchestration cost while raising reliability.
Limitations — Where this breaks
This is not magic.
- Larger, more capable models benefit more; smaller models plateau quickly.
- Precision is still weaker than oracle verification.
- Context length grows with each failed iteration—hard limits still exist.
But these are engineering constraints, not conceptual ones.
Conclusion — Planning by arguing with yourself
The most interesting result here is not that accuracy improves. It’s how it improves. The model doesn’t learn new facts. It doesn’t get smarter. It simply learns to slow down and check.
That’s a lesson human organizations could probably use as well.
Cognaptus: Automate the Present, Incubate the Future.