“Please double-check your work” is one of the least expensive quality-control systems ever invented. It is also one of the least dependable.
A person who overlooked a constraint the first time may overlook it again. A language model is no different, except that it can produce a longer and more persuasive explanation of why the overlooked constraint was never important.
This is why previous research on intrinsic self-correction has often been disappointing. Ask a model to reconsider its answer, and it may preserve the original error, replace a correct answer with an incorrect one, or confidently approve a broken plan. Reflection, supplied as a vague instruction, is mostly an invitation to generate more language.
The paper “Enhancing LLM Planning Capabilities through Intrinsic Self-Critique” takes a more disciplined approach.[1] Instead of asking a model to contemplate whether its plan feels correct, the authors instruct it to verify every action against explicit rules, calculate the resulting state, and continue until the entire plan has been checked.
The result is a large improvement across several planning benchmarks. On one Blocksworld dataset, accuracy rises from 49.8% without critique to 85.5% with intrinsic self-critique, and to 89.3% when multiple critiques vote on whether the plan should be accepted.
The useful lesson is not that language models have finally developed an inner conscience. It is that self-critique becomes useful when it stops resembling introspection and starts resembling an audit.
Planning gives self-critique something concrete to inspect
The paper studies planning problems in which an agent must move from an initial state to a specified goal through a valid sequence of actions.
In Blocksworld, for example, the agent rearranges blocks using actions such as picking up, putting down, stacking, and unstacking. Each action has preconditions. A block cannot be picked up unless it is clear, on the table, and the agent’s hand is empty. Each valid action also changes the state: the hand may no longer be empty, one block may no longer be clear, and another relationship may become true.
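To make the idea of preconditions and effects concrete, here is a minimal sketch of the pick-up action. It assumes a state is a set of fact strings; the fact names (`clear`, `ontable`, `handempty`, `holding`) follow the classic Blocksworld formulation and are not quoted from the paper.

```python
def can_pick_up(state: frozenset, block: str) -> bool:
    """Preconditions: the block is clear, on the table, and the hand is empty."""
    required = {f"clear {block}", f"ontable {block}", "handempty"}
    return required <= state


def pick_up(state: frozenset, block: str) -> frozenset:
    """Return the state after pick-up, or raise if a precondition fails."""
    if not can_pick_up(state, block):
        raise ValueError(f"cannot pick up {block}: precondition violated")
    # Effects: these facts stop being true...
    removed = {f"clear {block}", f"ontable {block}", "handempty"}
    # ...and this one becomes true.
    added = {f"holding {block}"}
    return (state - removed) | added


# Hypothetical two-block starting state, used only for illustration.
start = frozenset({"clear a", "ontable a", "clear b", "ontable b", "handempty"})
after = pick_up(start, "a")
```

Once block `a` is held, `can_pick_up(after, "b")` is false because the hand is no longer empty. That is the kind of dependency between steps that a critic has to track.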
Logistics and Mini-Grid use different environments, but the basic structure is similar:
- a defined initial state;
- a set of available actions;
- explicit preconditions for each action;
- predictable effects after an action;
- a goal state that must eventually be reached.
This structure matters because it makes criticism operational.
For an open-ended question such as “Is this strategy sensible?”, there may be no agreed sequence of checks and no definitive final state. A model asked to critique the strategy must decide what standards to apply while simultaneously applying them.
A formal planning problem is less forgiving but more inspectable. The critic can ask three specific questions at every step:
- What are the preconditions of this action?
- Are those preconditions true in the current state?
- What state results after the action is applied?
The critic is therefore not being asked to develop better judgment in the abstract. It is being asked to execute a defined verification procedure.
That distinction is the foundation of the paper.
The loop turns the model into a temporary plan simulator
The method alternates between two roles performed by the same underlying LLM.
First, the model generates a complete candidate plan. Then it receives the problem, the plan, the action definitions, and a highly structured critique instruction.
The critique prompt tells the model to inspect every action sequentially. For each action, it must identify the relevant preconditions, determine whether they hold, apply the action’s effects, and provide the resulting state. It must not skip steps. At the end, it must issue one of three literal assessments:
- the plan is correct;
- the plan is wrong;
- the goal was not reached.
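For comparison, the procedure the critic is asked to carry out in text can be written as a small deterministic checker. This is a sketch under stated assumptions, not the paper's prompt or code: the `Action` structure, the add/delete encoding, and the verdict strings are all illustrative stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset
    add: frozenset      # facts that become true
    delete: frozenset   # facts that stop being true


def audit(initial, goal, plan):
    """Check every action in order and return (verdict, trace)."""
    state = frozenset(initial)
    trace = []
    for step, action in enumerate(plan, start=1):
        missing = action.preconditions - state
        if missing:
            trace.append(f"step {step} {action.name}: missing {sorted(missing)}")
            return "plan is wrong", trace
        # Apply effects so the next action is checked against the new state.
        state = (state - action.delete) | action.add
        trace.append(f"step {step} {action.name}: ok")
    if not frozenset(goal) <= state:
        return "goal not reached", trace
    return "plan is correct", trace


# Hypothetical two-block example.
pick_up_a = Action(
    "pick-up a",
    frozenset({"clear a", "ontable a", "handempty"}),
    frozenset({"holding a"}),
    frozenset({"clear a", "ontable a", "handempty"}),
)
stack_a_b = Action(
    "stack a b",
    frozenset({"holding a", "clear b"}),
    frozenset({"on a b", "clear a", "handempty"}),
    frozenset({"holding a", "clear b"}),
)
initial = {"clear a", "ontable a", "clear b", "ontable b", "handempty"}
goal = {"on a b"}

good_verdict, _ = audit(initial, goal, [pick_up_a, stack_a_b])
bad_verdict, bad_trace = audit(initial, goal, [stack_a_b])
short_verdict, _ = audit(initial, goal, [pick_up_a])
```

The three verdicts correspond to the three assessments above. In the intrinsic setting the model has to produce this trace itself, with no such checker to consult.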
If the critic approves the plan, the process stops. If it rejects the plan, the failed plan and its critique are added to the context, and the model generates another candidate. The process continues until the model approves a plan or reaches the maximum number of iterations, set to ten in the experiments.
The loop can be summarized as:
Generate candidate plan
↓
Check every action against the current state
↓
Apply effects and construct the next state
↓
Approve plan?
├── Yes → return plan
└── No → retain failure and generate a revision
This is intrinsic self-critique because no external validator supplies feedback during the correction loop. The critic’s judgment comes from the model itself.
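The control flow can be sketched in a few lines. The `generate` and `critique` callables below are hypothetical stand-ins for two prompts sent to the same model. Returning the last plan when the iteration budget runs out is an assumption of this sketch, not a detail confirmed by the article.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (rejected plan, critique text)


def self_critique_loop(
    generate: Callable[[str, History], str],
    critique: Callable[[str, str], Tuple[bool, str]],
    problem: str,
    max_iterations: int = 10,  # the cap used in the experiments
) -> Tuple[str, bool, int]:
    """Return (plan, approved, iterations_used)."""
    history: History = []
    plan = ""
    for iteration in range(1, max_iterations + 1):
        plan = generate(problem, history)
        approved, critique_text = critique(problem, plan)
        if approved:
            return plan, True, iteration
        # Keep the failed plan and its critique in context for the next attempt.
        history.append((plan, critique_text))
    return plan, False, max_iterations


# Stub model calls, used only to exercise the loop.
def fake_generate(problem: str, history: History) -> str:
    return f"plan-v{len(history) + 1}"


def fake_critique(problem: str, plan: str) -> Tuple[bool, str]:
    ok = plan == "plan-v3"
    return ok, "all steps valid" if ok else "step 1 violates a precondition"


result = self_critique_loop(fake_generate, fake_critique, "demo problem")
capped = self_critique_loop(fake_generate, fake_critique, "demo problem", max_iterations=2)
```

Nothing in the loop consults a validator: the approval flag comes entirely from the critic call.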
There is an important qualification. The researchers still use a PDDL validator to measure whether final plans are actually correct. They also use validator-consistent examples when constructing the few-shot critic prompt. The contribution is therefore not the abolition of external verification from experimental evaluation. It is the demonstration that an LLM can generate useful corrective feedback during inference without receiving the validator’s answer.
That is narrower than “the model can reliably judge itself.” It is also considerably more useful.
The prompt does not request reflection; it specifies an audit procedure
The method contains several elements that may initially look like ordinary prompt engineering:
| Element | Role in the loop | What it prevents |
|---|---|---|
| Domain definition | Provides actions, preconditions, and effects | Criticism based on improvised rules |
| Sequential action checking | Forces the critic to inspect the entire plan | Approval based on a plausible-looking prefix |
| Explicit state updates | Carries consequences from one action to the next | Evaluating each action in isolation |
| Required final classification | Converts the critique into an actionable decision | Ambiguous prose that cannot control the loop |
| Retained failures and critiques | Gives later generations evidence about previous mistakes | Repeating an unchanged failed plan |
| Optional critique voting | Requires several critics to support approval | Premature acceptance after one mistaken judgment |
The model is not merely asked to explain an answer after producing it. It is asked to simulate the execution of its proposed plan.
That difference becomes visible in the ablation study.
On the Blocksworld 3–7 validation set, a zero-shot critic using the full verification procedure improves accuracy from 55.7% at the first generation to 79.5% after the iterative process. Removing the domain definition lowers final accuracy to 74.4%. Removing the numbered three-step verification instruction lowers it further to 64.0%.
The most revealing result comes from removing the instruction to verify each action. Accuracy begins at 56.1% and ends at 57.5%, a gain of 1.4 percentage points against 23.8 with the full procedure.
In other words, almost the entire benefit disappears when the model is allowed to critique the plan without performing the tedious part.
This is not a decorative ablation. It identifies the mechanism.
The useful capability is not generic self-doubt. It is exhaustive procedural checking.
Examples help less than a clearly specified checking routine
The paper also compares an eight-shot critic, which receives worked verification examples, with a zero-shot critic that receives only the domain definition and checking instructions.
Without self-consistency voting, the eight-shot critic reaches 79.7% accuracy after the iterative process. The zero-shot critic reaches 79.5%.
The difference is negligible relative to the confidence intervals.
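A rough calculation shows why a 0.2-point gap carries little weight. The sketch below uses a normal approximation and assumes both figures come from the 1,000-problem Blocksworld 3–7 validation set. It is not the interval method used in the paper.

```python
import math


def normal_interval(accuracy: float, n: int, z: float = 1.96):
    """Approximate 95% interval for a proportion (normal approximation)."""
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width


zero_shot = normal_interval(0.795, 1000)   # assumed n = 1000
eight_shot = normal_interval(0.797, 1000)  # assumed n = 1000
half_width = (zero_shot[1] - zero_shot[0]) / 2
overlap = zero_shot[1] > eight_shot[0] and eight_shot[1] > zero_shot[0]
```

Under these assumptions each interval spans about 2.5 points on either side of the estimate, so the 0.2-point difference sits well inside both.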
That result changes the practical interpretation of the method. The main source of improvement is not a large collection of demonstrations teaching the model how to criticize. It is the explicit decomposition of criticism into actions, preconditions, and state transitions.
Few-shot examples may still help in unfamiliar domains or with inconsistent output formats. The paper does not show that demonstrations are universally unnecessary. It shows that, within this planning setup, procedural specification carries much more weight than additional critique examples.
For businesses, this is encouraging and inconvenient in roughly equal measure.
It is encouraging because organizations may not need a large labelled dataset of good and bad critiques. It is inconvenient because they still need to describe the workflow precisely enough for each proposed action to be checked. A vague process documented in thirty slides remains vague after being pasted into an agent prompt.
The main results show large gains, but difficulty still wins
The strongest result comes from Blocksworld with three to five blocks.
| Method | Accuracy |
|---|---|
| No critique | 49.8% |
| Intrinsic self-critique | 85.5% |
| Self-critique with five-vote consistency | 89.3% |
| Oracle feedback | 91.5% |
The oracle condition uses a real validator to guide the iterative correction process. It represents what the plan generator can achieve when the accept-or-reject signal is dependable.
The gap between the five-vote critic and the oracle is only 2.2 percentage points. On this benchmark, most of the available improvement can be captured without giving the correction loop an external answer key.
The result is not confined to one Blocksworld dataset.
| Dataset | No critique | Self-critique | Oracle |
|---|---|---|---|
| Logistics | 60.7% | 93.2% | 95.0% |
| Logistics Hard | 18.9% | 32.8% | 38.8% |
| Mini-Grid | 57.7% | 75.2% | 79.8% |
| Mini-Grid Hard | 39.7% | 43.5% | 52.3% |
| Blocksworld 3–7 | 57.2% | 79.5% | 92.7% |
These results support two conclusions at once.
First, the method generalizes beyond the easiest Blocksworld setting. It produces substantial gains in logistics, navigation, and more complex block configurations.
Second, self-critique does not make problem difficulty disappear. Easy Logistics rises to 93.2%, while Logistics Hard reaches only 32.8%. Mini-Grid improves strongly, while Mini-Grid Hard gains only 3.8 percentage points.
The loop is better at recovering plans that the model is already capable of producing than at creating planning ability from nothing. A critic can redirect a capable generator. It cannot reliably rescue a generator that lacks the capacity to find a valid route through the problem.
The oracle results reinforce this interpretation. On Blocksworld 3–7, the oracle reaches 92.7%, well above intrinsic self-critique at 79.5%. The model can often generate a correct revision when told accurately that the previous plan failed. The remaining bottleneck is the critic’s ability to recognize failure.
On Mini-Grid Hard, even the oracle reaches only 52.3%. There, better feedback helps, but the generation problem itself remains difficult.
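One way to read the table is to ask how much of the oracle's improvement the intrinsic critic captures. The ratio below is an illustrative metric introduced here, not one reported in the paper. The numbers are copied from the table above.

```python
# (no critique, self-critique, oracle) accuracy in percent
RESULTS = {
    "Logistics": (60.7, 93.2, 95.0),
    "Logistics Hard": (18.9, 32.8, 38.8),
    "Mini-Grid": (57.7, 75.2, 79.8),
    "Mini-Grid Hard": (39.7, 43.5, 52.3),
    "Blocksworld 3-7": (57.2, 79.5, 92.7),
}


def share_of_oracle_gain(baseline: float, self_critique: float, oracle: float) -> float:
    """Fraction of the oracle's gain over baseline that self-critique recovers."""
    return (self_critique - baseline) / (oracle - baseline)


shares = {name: round(share_of_oracle_gain(*row), 2) for name, row in RESULTS.items()}
oracle_ceiling = {name: row[2] for name, row in RESULTS.items()}
```

The share is roughly 0.95 on Logistics and roughly 0.30 on Mini-Grid Hard. The ratio alone says nothing about generation difficulty, so it should be read with the oracle ceiling, which is only 52.3% on Mini-Grid Hard.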
Most gains arrive early; later rounds purchase smaller improvements
Across Blocksworld, Mini-Grid, and Logistics, a large share of the improvement arrives after the first critique-and-revision round. Further iterations continue to raise accuracy, but the gains gradually flatten.
This pattern has an obvious operational implication: ten iterations should not become a default product setting merely because the experiment permits ten.
An efficient deployment would treat critique depth as a decision variable. A low-risk workflow might use one critique pass. A high-consequence workflow might permit several revisions or escalate uncertain cases. Continuing until the context window is crowded with previous failures is technically possible, but so is scheduling every internal meeting for four hours.
The authors cap the process at ten iterations partly to control runtime and context length. If the prompt exceeds the model’s context limit, the loop terminates and returns the previous plan, potentially reducing accuracy.
The appendix examines this issue on Mini-Grid. With a 32,000-token context limit, performance plateaus earlier than with a one-million-token context. Longer context allows the process to retain more examples, failed plans, and critiques across later iterations.
This is best understood as a resource-sensitivity test, not a second thesis. It shows that iterative self-critique competes for context and that richer memory can extend its gains. It does not establish that purchasing the largest available context window is always economical.
Voting makes approval more conservative
The critic’s main failure mode is a false positive: accepting an invalid plan as correct.
False negatives, in which the critic rejects a correct plan, occur much less frequently. That asymmetry matters because approval stops the process. A false negative wastes another iteration. A false positive returns a broken plan.
The paper addresses this with self-consistency. Rather than relying on a single critique, it samples multiple independent critiques and aggregates their judgments. A tie is treated as rejection, causing the model to revise the plan again.
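A minimal sketch of the aggregation rule follows. It assumes a strict-majority vote, which is one reading of "aggregates their judgments"; the tie-as-rejection behaviour is the part stated in the text.

```python
from typing import Sequence


def approve_by_vote(votes: Sequence[bool]) -> bool:
    """Approve only when approvals strictly outnumber rejections.

    A tie (or an empty vote list) counts as rejection, which sends the
    plan back for another revision.
    """
    approvals = sum(1 for vote in votes if vote)
    rejections = len(votes) - approvals
    return approvals > rejections


two_way_tie = approve_by_vote([True, False])
five_vote_majority = approve_by_vote([True, True, False, False, True])
```

Treating ties as rejection biases the loop toward extra iterations rather than early approval. That matches the error asymmetry described above.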
On the Blocksworld 3–7 validation set:
| Critique setup | Final accuracy | LLM calls across 1,000 problems |
|---|---|---|
| Zero-shot critic, no voting | 79.5% | 6.1k |
| Eight-shot critic, no voting | 79.7% | 6.3k |
| Eight-shot critic, two votes | 84.5% | 12.2k |
| Eight-shot critic, five votes | 84.6% | 14.0k |
Voting adds about five percentage points over a single critique. Increasing from two votes to five produces almost no additional accuracy in this experiment (84.5% to 84.6%), while raising the call count from 12.2k to 14.0k.
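The diminishing return can be put in terms of extra calls per accuracy point. The figures below come from the table above. The metric itself is an illustration rather than something the paper reports, and the call counts are rounded.

```python
PROBLEMS = 1000

# (label, final accuracy in percent, total LLM calls)
SETUPS = {
    "eight_shot_single": ("Eight-shot critic, no voting", 79.7, 6300),
    "eight_shot_two_votes": ("Eight-shot critic, two votes", 84.5, 12200),
    "eight_shot_five_votes": ("Eight-shot critic, five votes", 84.6, 14000),
}


def extra_calls_per_point(cheaper: str, costlier: str) -> float:
    """Additional calls per problem spent for each accuracy point gained."""
    _, acc_a, calls_a = SETUPS[cheaper]
    _, acc_b, calls_b = SETUPS[costlier]
    extra_calls = (calls_b - calls_a) / PROBLEMS
    extra_points = acc_b - acc_a
    return extra_calls / extra_points


one_to_two = extra_calls_per_point("eight_shot_single", "eight_shot_two_votes")
two_to_five = extra_calls_per_point("eight_shot_two_votes", "eight_shot_five_votes")
```

Going from one critic to two votes costs a little over one extra call per problem for each point gained. Going from two votes to five costs roughly eighteen.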
That is a useful result for system design. Self-consistency is not a ceremonial feature to be maximized. It is a reliability lever with diminishing returns.
The paper notes that parallel critiques need not increase latency because they can run simultaneously. They do, however, increase inference volume, infrastructure demand, and cost. In practice, latency may also depend on the slowest parallel response and the capacity available to serve all requests.
A sensible deployment pattern would therefore use voting selectively. Routine cases can receive one critic. Expensive voting can be reserved for decisions where erroneous approval is materially worse than another model call.
The method transfers across strong models, not equally across all models
The authors test the procedure with multiple foundation models rather than treating the result as a Gemini-specific curiosity.
On Blocksworld 3–5:
| Model | No critique | Self-critique |
|---|---|---|
| GPT-4o | 42.8% | 64.2% |
| Claude 3.5 Sonnet | 68.0% | 89.5% |
These experiments serve as a transfer test. They indicate that the method is not dependent on one provider’s model or one peculiar response style.
The appendix provides a less flattering but equally important result. Gemma 2 27B shows little improvement on Blocksworld and only modest improvement on the easier Logistics setting. Even with feedback from an external validator, a large performance gap remains.
A self-critique loop therefore has a capability threshold. The model must be able to understand the rules, track state transitions, identify violations, and generate a better plan after receiving criticism. A weaker model may fail at both generation and verification, producing two unreliable opinions for the price of two.
This is not a reason to discard smaller models. It is a reason to test the entire loop rather than assuming that a prompt pattern transfers unchanged across model sizes.
Natural-language planning improves, but formal structure carries much of the result
Most of the paper’s strongest experiments use PDDL, a formal language for representing planning domains, actions, states, and goals.
That choice is not incidental. PDDL gives the critic a clean rule system to apply.
The appendix compares PDDL with natural-language Blocksworld prompts using one-shot setups:
| Representation and prompt | No critique | Self-critique |
|---|---|---|
| Natural language, formatted | 18.5% | 19.2% |
| Natural language, chain-of-thought and formatted | 20.3% | 29.7% |
| PDDL, formatted | 40.3% | 47.3% |
| PDDL, chain-of-thought and formatted | 39.2% | 65.0% |
Intrinsic self-critique still improves the stronger natural-language setup, from 20.3% to 29.7%. The method is therefore not restricted entirely to formal syntax.
However, the PDDL result is much stronger. With chain-of-thought and formatting, self-critique lifts accuracy from 39.2% to 65.0%.
The business implication is easy to miss. Companies may hope to add a “critic agent” on top of existing natural-language procedures and receive the benchmark gains reported for formal planning. The paper does not support that expectation.
A critic performs best when the operating rules have already been converted into something close to an executable specification: defined actions, explicit prerequisites, predictable effects, and a clear objective. The more ambiguity left inside prose, the more freedom the critic has to approve whatever the generator proposed.
The appendix tests boundaries rather than adding a second claim
Several supplementary experiments clarify what the main result does and does not establish.
| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Other foundation models | Transfer test | The method benefits GPT-4o and Claude 3.5 Sonnet | Every model can self-correct effectively |
| Gemma 2 27B | Capability-boundary exploration | Weaker models may gain little | Small models can never benefit |
| Natural language versus PDDL | Representation robustness test | The method can help outside formal PDDL | Unstructured business prose is sufficient |
| Longer Mini-Grid context | Resource-sensitivity test | Context limits can constrain later iterations | Maximum context is always cost-effective |
| AutoPlanBench domains | Broader-domain evidence | Improvements appear across additional planning tasks | That the estimates are precise (each task has only 21 instances) |
| Error analysis | Failure-mode diagnosis | False-positive approval is the dominant critic error | The critic is safe enough for unsupervised high-stakes use |
AutoPlanBench is particularly useful as breadth evidence. The self-critique method improves performance across all reported domains, sometimes dramatically. But each task contains only 21 instances, producing wide uncertainty intervals. The table demonstrates reach, not a precise ranking of methods.
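To see how wide the uncertainty is with 21 instances, the sketch below computes a Wilson score interval. The success count of 15 is hypothetical and chosen only for illustration. The paper's own interval method is not described in this article.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (about 95% for z = 1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


low, high = wilson_interval(15, 21)             # hypothetical: 15 of 21 solved
big_low, big_high = wilson_interval(750, 1000)  # hypothetical larger benchmark
```

With 21 instances the interval spans well over thirty percentage points, while a 1,000-problem benchmark at a similar accuracy narrows it to a few points. That is why the AutoPlanBench table is better read as evidence of reach than as a ranking.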
The paper also notes that the comparison with AutoPlanBench’s “Act” method is not direct because Act receives golden validator feedback at each step. A method that is told where it failed is solving a different problem from a critic that must detect failure itself.
Business value begins where rules are clear but validators are inconvenient
The paper directly shows that intrinsic self-critique improves accuracy on formal planning benchmarks. It does not directly test procurement, scheduling, compliance reviews, customer onboarding, or operational incident response.
The bridge to business use is nevertheless plausible when a workflow shares the same structure:
- the current state can be represented;
- available actions are defined;
- each action has prerequisites;
- actions produce predictable state changes;
- success has an observable definition.
Potential applications include scheduling under explicit constraints, approval routing, logistics coordination, structured troubleshooting, fulfilment workflows, configuration changes, and compliance procedures with well-defined requirements.
A practical architecture might look like this:
| Paper mechanism | Operational analogue |
|---|---|
| Domain definition | Policy and workflow rules |
| Initial planning state | Current case, order, schedule, or system condition |
| Candidate plan | Proposed sequence of operational actions |
| Stepwise self-critique | Rule-by-rule procedural audit |
| Resulting-state calculation | Updated workflow or system state after each action |
| Failure history | Retained rejected plans and identified violations |
| Self-consistency voting | Multiple independent approvals for higher-risk cases |
| External validator or human review | Escalation for cases with executable checks or high consequences |
The first implementation task is therefore not choosing the critic’s personality. It is formalizing the workflow.
Organizations frequently possess rules without possessing a usable rule system. Requirements are distributed across policy documents, emails, employee memory, spreadsheets, and exceptions introduced after somebody important complained. An LLM can read these materials, but reading them is not the same as receiving an unambiguous domain definition.
Before an intrinsic critic can behave like a validator, the organization must decide what valid means.
The economic trade-off is extra inference for fewer invalid plans
The method does not reduce computation. It adds a critique call after each candidate plan, may repeat the generation-and-critique cycle several times, and can multiply critique calls through voting.
Its business value must therefore be evaluated against the cost of failure.
Suppose a single-pass agent produces an invalid operational plan often enough to require human review. A self-critique loop may reduce that review burden or prevent costly execution errors. In that setting, several additional calls may be inexpensive.
For a low-value, high-volume task, the same loop may be uneconomical. A five-vote critic that adds only a small reliability gain could cost more than the errors it prevents.
The appropriate comparison is not:
Is intrinsic self-critique cheaper than one model call?
It clearly is not.
The useful comparison is:
Is intrinsic self-critique cheaper than the combination of invalid-plan costs, manual review, specialist-validator development, and downstream recovery?
The paper provides evidence about accuracy and call volume. It does not calculate business return on investment. Any deployment still requires organization-specific measurement.
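A back-of-envelope version of that comparison follows. The cost per call and cost per failure are hypothetical placeholders. The accuracy and call figures are the Blocksworld 3–7 validation numbers quoted earlier, and treating benchmark accuracy as a stand-in for workflow failure rates is an assumption of this sketch.

```python
def expected_cost(calls_per_task: float, cost_per_call: float,
                  failure_rate: float, cost_per_failure: float) -> float:
    """Inference spend plus the expected cost of returning an invalid plan."""
    return calls_per_task * cost_per_call + failure_rate * cost_per_failure


def break_even_failure_cost(extra_calls: float, cost_per_call: float,
                            failure_reduction: float) -> float:
    """Failure cost above which the extra calls pay for themselves."""
    return extra_calls * cost_per_call / failure_reduction


COST_PER_CALL = 0.01  # hypothetical price per model call

single_calls, single_failure = 1.0, 1 - 0.557  # one generation, no critique
loop_calls, loop_failure = 6.1, 1 - 0.795      # 6.1k calls over 1,000 problems

threshold = break_even_failure_cost(
    loop_calls - single_calls, COST_PER_CALL, single_failure - loop_failure
)

# Compare both designs at a cheap and an expensive failure cost.
cheap = [expected_cost(c, COST_PER_CALL, f, 0.10)
         for c, f in ((single_calls, single_failure), (loop_calls, loop_failure))]
costly = [expected_cost(c, COST_PER_CALL, f, 1.00)
          for c, f in ((single_calls, single_failure), (loop_calls, loop_failure))]
```

Under these placeholder prices the loop pays off once an invalid plan costs more than about twenty-one model calls. Below that, the single pass is cheaper.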
A cost-conscious design could use staged verification:
- Generate a candidate plan.
- Run one procedural critique.
- Accept low-risk plans that pass clearly.
- Use multiple critics when approval is uncertain or consequential.
- Escalate cases with ambiguous rules, repeated failure, or high potential loss.
- Use deterministic validators whenever they are readily available.
That last point deserves emphasis. When a reliable executable validator already exists, replacing it with an LLM critic would be an unusually creative way to reduce certainty. Intrinsic self-critique is most valuable where the rules can be articulated but a complete external verifier is unavailable, costly to build, or difficult to integrate.
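The staged policy can be expressed as a small routing function. Everything here is a hypothetical sketch: the risk labels, the `max_failed_rounds` threshold, and the ordering of checks are design assumptions, not recommendations from the paper.

```python
from enum import Enum


class Route(Enum):
    VALIDATE = "run deterministic validator"
    ACCEPT = "accept plan"
    VOTE = "sample additional critics"
    REVISE = "generate a revision"
    ESCALATE = "escalate to human review"


def next_step(*, validator_available: bool, rules_ambiguous: bool, risk: str,
              critic_approved: bool, failed_rounds: int,
              max_failed_rounds: int = 3) -> Route:
    """Pick the next action for a candidate plan. `risk` is 'low', 'medium' or 'high'."""
    if validator_available:
        # A reliable executable check always beats an LLM opinion.
        return Route.VALIDATE
    if rules_ambiguous or risk == "high" or failed_rounds >= max_failed_rounds:
        return Route.ESCALATE
    if not critic_approved:
        return Route.REVISE
    if risk == "medium":
        # Consequential approvals get a second opinion before acceptance.
        return Route.VOTE
    return Route.ACCEPT


routine = next_step(validator_available=False, rules_ambiguous=False, risk="low",
                    critic_approved=True, failed_rounds=0)
stuck = next_step(validator_available=False, rules_ambiguous=False, risk="low",
                  critic_approved=False, failed_rounds=3)
```

The validator check comes first on purpose, so the LLM critic is consulted only when no deterministic check is available.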
The method’s boundaries are structural, not decorative
The paper establishes a strong result within a particular class of problems. Several boundaries materially affect how far it should be generalized.
First, the method depends on formalizable constraints. It is well suited to checking whether a proposed action is allowed in a known state. It is less suited to deciding whether an ambiguous objective, disputed policy, or subjective strategy is correct.
Second, false-positive approvals remain the principal error. Voting reduces them but does not eliminate them. High-stakes actions still require deterministic checks, constrained execution, or human authorization.
Third, stronger models benefit more. Self-critique is not a substitute for baseline capability.
Fourth, difficult planning tasks remain difficult. The loop improves the use of existing capability more reliably than it creates missing capability.
Fifth, each revision adds model calls and expands the context with earlier plans and critiques. More iterations may improve accuracy, but the marginal gain declines while cost and memory pressure rise.
Finally, the experiments use model checkpoints from October 2024 and primarily evaluate symbolic planning benchmarks. The paper is evidence for a mechanism, not a universal certification for autonomous agents.
These limitations do not reduce the result to a prompt trick. They identify the conditions under which the trick becomes an engineering method.
Useful self-talk is closer to accounting than therapy
The paper’s most important contribution is not the discovery that LLMs can improve after receiving their own feedback. Models have been producing feedback about themselves for years, usually with the confidence of an employee reviewing a report five minutes before sending it.
The contribution is showing what makes that feedback useful.
The critic receives the rules. It checks every action. It tracks the changing state. It records failures. It treats approval as a decision that may require several independent votes.
Under those conditions, intrinsic self-critique can move planning accuracy much closer to an oracle-guided process without receiving oracle feedback during correction.
That does not mean an LLM can reflect its way out of every mistake. It means that, for sufficiently structured tasks, the model can be made to perform both sides of a disciplined review process: propose the plan, then audit the plan against an explicit operating model.
The model is still talking to itself. The difference is that someone finally gave the conversation an agenda.
Cognaptus: Automate the Present, Incubate the Future.
[1] Bernd Bohnet et al., “Enhancing LLM Planning Capabilities through Intrinsic Self-Critique,” arXiv:2512.24103, https://arxiv.org/abs/2512.24103.