Stepwise Think-Critique: Teaching LLMs to Doubt Themselves (Productively)

The useful part of doubt is timing

Doubt is not useful after the invoice is paid, the client report is sent, or the model has already produced a confident wrong answer with twelve decorative paragraphs of reasoning. At that point, “let us verify” becomes less like quality control and more like archaeology.

That is the business problem hiding inside Stepwise Think-Critique: Unified Reasoning and Self-Critique in LLMs.¹ The paper is not merely asking whether a language model can reason. We already have enough models that can produce long mathematical derivations with the emotional texture of a graduate student trapped in an exam hall. The sharper question is whether the model can criticize its own intermediate reasoning while it is still reasoning.

The proposed method, Stepwise Think-Critique, or STC, trains a single model to alternate between a reasoning step and a critique of that step. Each critique contains a short natural-language justification and a binary correctness score. In full mode, the output looks like a worked solution with an internal audit trail attached to every step. In compact mode, the critique is turned off and the model behaves more like a normal reasoning model.
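To make the shape of a full-mode trace concrete, here is a minimal sketch of one way to represent it once it has been parsed. The class name, field names, and the toy algebra trace are illustrative assumptions; the paper's actual tag format is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CritiquedStep:
    """One reasoning step plus the model's own critique of it (hypothetical schema)."""
    reasoning: str       # the reasoning step itself
    justification: str   # short natural-language critique
    score: int           # binary correctness score: 1 = judged correct, 0 = judged incorrect


def first_flagged_step(trace: List[CritiquedStep]) -> int:
    """Return the index of the first step the model scored 0, or -1 if none."""
    for index, step in enumerate(trace):
        if step.score == 0:
            return index
    return -1


# Invented toy trace, for illustration only.
example_trace = [
    CritiquedStep("2x + 6 = 10, so 2x = 4.", "Subtracting 6 from both sides is valid.", 1),
    CritiquedStep("Therefore x = 4.", "Dividing 4 by 2 gives 2, not 4.", 0),
]
```

A reviewer, or a downstream router, can then jump straight to `first_flagged_step(example_trace)` instead of rereading the whole derivation.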

This sounds small. It is not. It changes where verification lives.

Traditional reasoning pipelines usually fall into two camps. One camp lets the model reason and hopes the final answer survives contact with reality. The other camp adds a separate verifier or process reward model after generation. STC tries a third route: teach the same model to reason and inspect itself in one interleaved process. Not a separate judge in a black robe. More like a model with a tiny compliance officer sitting beside every algebraic move. Terrifying, yes. Also useful.

The important business interpretation is not “self-critique solves trust.” It does not. The paper shows an architecture pattern for cheaper, more inspectable reasoning under mathematical benchmarks. It does not prove enterprise-grade reliability, legal auditability, medical safety, or cross-domain correctness. If someone tries to sell it that way, check whether their critique module is also hallucinating.

Three ways to handle reasoning errors

The easiest way to understand STC is by comparing it with the two obvious alternatives.

Approach	What happens	Operational appeal	Operational weakness
Pure reasoning	One model produces reasoning and an answer	Simple, cheap, easy to deploy	Errors may appear persuasive and remain unlabeled
Post-hoc verification	A separate verifier reviews the completed trajectory	Can catch mistakes and rank candidate answers	Adds another model, context passing, scheduling, memory, latency, and failure points
Stepwise Think-Critique	One model alternates reasoning and critique during generation	Produces localized self-assessment and can guide selection among samples	The critic is still the model’s learned behavior, not independent truth

The paper’s Figure 1 frames this distinction clearly: pure reasoning produces a trajectory without assessment; post-hoc verification evaluates the trajectory afterward; STC interleaves the two inside one model. That comparison matters because it changes the engineering surface.

A separate verifier can be powerful, but it is another component in production. It must receive the right trace, parse the right format, fit inside context limits, return usable labels, and remain aligned with the generator’s evolving behavior. Anyone who has deployed multi-model workflows knows the pain: the architecture diagram looks elegant until one model times out, the verifier rejects a format variation, or the memory budget begins quietly eating the ROI.

STC’s pitch is not that one model is philosophically superior to two. It is that co-locating reasoning and critique can reduce pipeline complexity while giving users more interpretable traces. The model learns to produce a reasoning step, immediately judge it, and then continue. The critique can also be used at inference time to select among multiple sampled solutions.

That is the comparison-based heart of the paper. STC competes less with “models that think” than with “systems that think first and audit later.”

What STC actually trains the model to do

STC has two training stages.

First, the authors create supervised fine-tuning data. They sample 10,000 problems, generate one reasoning trajectory per problem using the base model, and then use GPT-5 to generate stepwise critiques for each reasoning step. They filter the data for structure and answer-level consistency, ending with 5,056 retained samples, about 51% of the original pool. This is a cold-start stage: teach the model the pattern of interleaving reasoning and critique before asking reinforcement learning to improve it.

Second, they train with reinforcement learning, using Group Relative Policy Optimization (GRPO). The reward design has three pieces:

Reward component	Likely purpose	What it encourages
Reasoning reward	Main evidence mechanism	Correct final answers
Critique-consistency reward	Core contribution	The final critique label should match whether the final answer is actually correct
Format reward	Implementation stabilizer	Reasoning and critique blocks should remain well-formed and extractable

The critique-consistency reward deserves special attention. The authors do not have human-labeled correctness labels for every intermediate reasoning step during training. Instead, they use the final answer’s verifiable correctness as supervision for the final critique. If the answer is wrong, the model should not confidently label the final answer as correct. That pressure then helps shape the critic behavior.
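The three reward components can be sketched as a single scoring function. This is a simplified illustration, not the paper's implementation: the 0/1 reward values, the function name, and the argument names are assumptions, and the paper's actual weighting is not described here.

```python
from typing import Dict


def stc_reward_components(answer_is_correct: bool,
                          final_critique_label: int,
                          is_well_formed: bool) -> Dict[str, float]:
    """Score one sampled output on the three reward components (assumed 0/1 values)."""
    # The critique is consistent when "labelled correct" matches "actually correct".
    critique_is_consistent = (final_critique_label == 1) == answer_is_correct
    return {
        "reasoning": 1.0 if answer_is_correct else 0.0,
        "critique_consistency": 1.0 if critique_is_consistent else 0.0,
        "format": 1.0 if is_well_formed else 0.0,
    }
```

Note the case that matters most: a wrong answer that the model labels as correct earns nothing on `critique_consistency`, while a wrong answer that the model correctly rejects still earns that component.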

This is clever, but it is also the source of a boundary. The model is not receiving perfect step-level truth for every intermediate operation. It is learning critique behavior under a reward structure that can be checked at the answer level and evaluated with pseudo labels at the process level. Useful, yes. Omniscient, not even close.

The authors also separate optimization so critique rewards are back-propagated only to critique tokens, while reasoning and format rewards affect the full output. That detail is not just engineering ornament. It reflects the paper’s real mechanism: reasoning and critique are related but not identical capabilities. If the gradients fight each other, the model can become a better talker without becoming a better checker, or vice versa. The appendix notes a two-stage schedule designed to balance reasoning ability and critique ability, including alternating objectives to mitigate this conflict.
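The routing of credit described above can be pictured as a per-token mask. The sketch below is a deliberate simplification: it adds raw rewards together rather than computing group-normalized advantages, and the function and parameter names are hypothetical.

```python
from typing import List


def per_token_signal(is_critique_token: List[bool],
                     reasoning_reward: float,
                     format_reward: float,
                     critique_reward: float) -> List[float]:
    """Assign a training signal to each token of one sampled output.

    Reasoning and format rewards reach every token; the critique reward
    reaches only tokens that belong to critique blocks.
    """
    shared = reasoning_reward + format_reward
    return [
        shared + (critique_reward if in_critique else 0.0)
        for in_critique in is_critique_token
    ]
```

With three tokens where only the middle one sits inside a critique block, `per_token_signal([False, True, False], 1.0, 1.0, 1.0)` gives the critique token a larger signal than its neighbours, which is the separation the paragraph describes.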

In plain business language: STC is trying to train one employee to both do the work and annotate the work, without letting the annotation job ruin the work itself.

The main result is not SFT; it is RL fixing the trade-off

The baseline model is DeepSeek-R1-Distill-Qwen-1.5B, abbreviated in the paper as DS-Qwen-1.5B. The authors evaluate on AIME24, AMC23, MATH-500, Minerva, and OlympiadBench. For reasoning quality, they report Pass@1 and Pass@8.
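For readers who want the metric pinned down, Pass@k is the probability that at least one of k sampled solutions is correct. The sketch below uses the widely used unbiased estimator computed from n samples of which c are correct; whether the paper uses exactly this estimator is an assumption, since that detail is not covered here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k for one problem from n samples, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Too few incorrect samples to fill k draws, so every draw contains a correct one.
        return 1.0
    # 1 minus the probability that all k draws come from the incorrect samples.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Benchmark-level numbers such as the averages quoted below are then means of this per-problem quantity over the problems in each dataset.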

The headline result is that STC improves average Pass@1 from 41.2% for the base model to 48.5% in compact mode and 48.4% in full mode. Average Pass@8 rises from 60.7% for the base model to 68.9% in compact mode and 67.6% in full mode.

That improvement is meaningful, but the path matters more than the number.

SFT alone hurts reasoning. STC-SFT falls to 39.1% average Pass@1 and 59.1% average Pass@8. The paper attributes this to the output distribution shift caused by adding critique behavior. That is plausible: teaching a model to emit a new structured reasoning-plus-critique format can disrupt its original problem-solving behavior. In human terms, asking someone to solve the math problem while narrating every self-check can slow them down before it makes them better.

RL is what recovers and surpasses the baseline. After reinforcement learning, the model gains both reasoning performance and critique behavior. That is the paper’s first important evidence point: the contribution is not simply “format the output with critique tags.” The contribution is using reinforcement learning to make the reasoning and critique capabilities co-evolve.

Model	Average P@1	Average P@8	Interpretation
DS-Qwen-1.5B	41.2	60.7	Base reasoning model
STC-SFT	39.1	59.1	Learns critique format but temporarily weakens reasoning
STC compact	48.5	68.9	RL-trained model, critique disabled at inference
STC full	48.4	67.6	RL-trained model with interleaved critique enabled

The compact-versus-full result is operationally interesting. Full mode provides interpretability but consumes more tokens. Compact mode keeps reasoning performance roughly comparable while omitting critique. That means STC can be treated as a dual-mode system: run compact mode for ordinary throughput, turn on full mode when diagnosis, auditability, or review is worth the extra token budget.

That is a practical product design point, not just a benchmark detail.

The critic learns to reject more wrong answers, which is harder than praising correct ones

The most revealing results are not the reasoning scores. They are the critique scores.

The paper evaluates critique quality using Correct Accuracy, Error Accuracy, and F1, defined here as the harmonic mean of the two. Correct Accuracy measures how often the critic accepts items that are actually correct. Error Accuracy measures how often it rejects items that are actually incorrect. Because a harmonic mean is dragged down by its weaker term, F1 prevents a degenerate model from looking good by saying “correct” to nearly everything.

This matters because an agreeable critic is nearly useless. A model that labels almost every solution as correct will post near-perfect Correct Accuracy, and if most examples happen to be correct, its overall accuracy will look respectable too. It will also politely let errors walk through the front door wearing a fake badge. Very enterprise.

At the answer level, STC-SFT shows exactly that positive bias. It reaches 98.8% Correct Accuracy but only 11.4% Error Accuracy, with an F1 of 20.4%. In other words, it is excellent at approving correct answers and terrible at rejecting wrong ones.

Adding RL without the critique reward helps only modestly: Error Accuracy rises to 21.2%, and F1 reaches 34.7%. The full STC model with critique-consistency reward raises Error Accuracy to 42.6% and F1 to 57.8%, while Correct Accuracy drops to 89.8%.
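The metric definitions are simple enough to check by hand. The sketch below computes them from critic labels and ground truth, with F1 as the harmonic mean of Correct Accuracy and Error Accuracy as described above; the function names are illustrative.

```python
from typing import List, Tuple


def harmonic_f1(correct_acc: float, error_acc: float) -> float:
    """Harmonic mean of Correct Accuracy and Error Accuracy."""
    total = correct_acc + error_acc
    return 0.0 if total == 0 else 2 * correct_acc * error_acc / total


def critique_metrics(labels: List[int], truths: List[bool]) -> Tuple[float, float, float]:
    """Return (Correct Accuracy, Error Accuracy, F1) for binary critic labels."""
    on_correct = [lab for lab, ok in zip(labels, truths) if ok]
    on_errors = [lab for lab, ok in zip(labels, truths) if not ok]
    if not on_correct or not on_errors:
        raise ValueError("need at least one correct and one incorrect item")
    correct_acc = on_correct.count(1) / len(on_correct)  # accepted when actually right
    error_acc = on_errors.count(0) / len(on_errors)      # rejected when actually wrong
    return correct_acc, error_acc, harmonic_f1(correct_acc, error_acc)
```

Plugging in the reported answer-level figures reproduces the F1 values above: `harmonic_f1(98.8, 11.4)` rounds to 20.4 and `harmonic_f1(89.8, 42.6)` rounds to 57.8. A critic that approves everything gets a Correct Accuracy of 1.0, an Error Accuracy of 0.0, and therefore an F1 of 0.0.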

That trade-off is the point. STC becomes less blindly approving. It sacrifices some willingness to accept correct answers in exchange for much better detection of wrong answers. In operational settings, that is often exactly what you want from an audit signal. A critic that never says no is not a critic. It is a decorative stamp.

At the process level, the same pattern appears. Step-level evaluation uses GPT-5-mini as a judge because human step labels are unavailable. On the first 240 samples per dataset, STC-SFT shows 94.3% Correct Accuracy but only 31.1% Error Accuracy. STC improves Error Accuracy to 57.3% and F1 to 68.5%.

The authors also check the reliability of the automatic judge by comparing GPT-5-mini judgments with verifiable answer-level ground truth on a subset, reporting about 90% agreement. That supports using GPT-5-mini as a practical pseudo-label source, but it does not turn the evaluation into human-labeled ground truth. The distinction matters. The result is useful evidence, not a notarized certificate from the Ministry of Reasoning.

Test-time scaling: critique beats majority voting when the crowd is confidently wrong

The paper’s second business-relevant result is critique-guided test-time scaling.

The setup is familiar: generate multiple candidate solutions and choose one. The standard method is majority voting. If most samples produce the same answer, select that answer. This works when the correct answer tends to dominate. It fails when the model repeatedly falls into the same wrong pattern. More samples then amplify the wrong answer instead of rescuing the result. Democracy has known issues.

STC offers a different selection rule. Generate multiple solutions, keep those whose final answer critique score is 1, and then apply majority voting among the remaining candidates if needed. The authors compare this Best-of-K via Critique against ordinary majority voting and Pass@N, where Pass@N is treated as a practical upper bound because it assumes a perfect oracle can pick any correct solution among the samples.
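The selection rule can be sketched in a few lines. Each candidate is a pair of final answer and final-answer critique score. The fallback to plain majority voting when no candidate is accepted is an assumption made for this sketch, as are the function names.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer (ties go to the answer seen first)."""
    return Counter(answers).most_common(1)[0][0]


def best_of_k_via_critique(candidates: List[Tuple[str, int]]) -> str:
    """Keep answers whose final critique score is 1, then majority-vote among them."""
    accepted = [answer for answer, score in candidates if score == 1]
    # Assumed fallback: if the critic rejected everything, vote over all candidates.
    pool = accepted if accepted else [answer for answer, _ in candidates]
    return majority_vote(pool)


# Invented example: the crowd is confidently wrong, the critic is not.
samples = [("12", 0), ("12", 0), ("12", 0), ("15", 1), ("15", 1)]
```

Here `majority_vote` over the raw answers picks "12", because three of five samples agree on it, while `best_of_k_via_critique(samples)` picks "15", because the critic rejected every "12" trajectory.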

On AIME24, critique-guided selection outperforms majority voting by 2.2% to 32.0% at the same sampling budget. On AMC23, the gains range from 0% to 12.8%. The precise gain depends on the number of sampled completions and the dataset.

The business interpretation is straightforward: the value of critique is not only explanation. It can also be routing. A model that can flag its own likely wrong answers can help choose among multiple outputs, reducing the chance that repeated but flawed reasoning wins by volume.

This is especially relevant for enterprise workflows that already sample multiple outputs: analytical reports, code generation, quantitative problem solving, structured extraction, and decision-support drafts. In such settings, critique-guided selection could sit between raw generation and final review. It is not a replacement for review. It is a triage layer that may make review less wasteful.

The appendix tests robustness, not a second thesis

The appendix contains details that are easy to overread. They should be treated as supporting evidence and boundary clarification.

Paper component	Likely purpose	What it supports	What it does not prove
Two-stage RL schedule	Implementation detail	Reasoning and critique objectives may conflict and need balancing	The schedule is optimal or general across model sizes
Confusion statistics and valid ratios	Robustness/context for critique metrics	STC predicts more negative labels, closer to observed negative ratios	All critiques are reliable in arbitrary domains
Dense reward experiment	Exploratory extension / ablation	Stepwise critique can provide small gains when critic quality is stronger	Dense rewards are already a mature training recipe
Compact-mode example	Implementation detail	The trained model can follow instruction to omit critiques	Compact mode always preserves behavior under production prompts
Reward hacking discussion	Sanity check	No observed collapse to all-correct or all-incorrect labels	Reward hacking is impossible

The dense reward experiment is particularly useful because it lowers the temperature of the paper’s own claim. The authors test using stepwise critique judgments as dense rewards for reasoning optimization. Average Pass@1 improves only slightly, from 48.4% to 49.6%, while average Pass@8 drops from 67.6% to 67.1%. Gains appear on AIME24, AMC23, and OlympiadBench, but not on MATH-500 or Minerva.

The authors connect this to critic quality. Dense rewards seem more helpful where the step-level critic has stronger error accuracy. That is exactly what one should expect: bad intermediate reward signals do not become good because they are frequent. They just become frequently bad.

For business readers, this is an important lesson. Adding more internal scoring does not automatically improve an AI workflow. The scoring signal must be accurate enough to shape behavior. Otherwise, the system gets more elaborate without getting more reliable. Congratulations, you have built bureaucracy in tensor form.

What enterprises can infer, and what they cannot

The paper directly shows three things.

First, a 1.5B reasoning model can be trained to interleave mathematical reasoning and self-critique in a structured format.

Second, RL with a critique-consistency reward can improve both reasoning performance and error detection compared with SFT-only and RL-without-critic variants.

Third, critique signals can improve test-time candidate selection compared with majority voting on selected math benchmarks.

Cognaptus can reasonably infer an architectural pathway from this: integrated self-critique may be useful in enterprise AI systems where the goal is not merely to generate an answer, but to expose local confidence, error locations, and candidate-selection signals. The practical value would be strongest in workflows where reasoning traces are already useful: financial modeling, legal memo drafting, technical troubleshooting, coding assistance, engineering calculations, procurement analysis, and policy review.

But the uncertainty boundary is strict.

This is mainly mathematical reasoning evidence on a 1.5B model. The paper does not validate STC on large proprietary models, multimodal tasks, regulated compliance workflows, messy enterprise documents, real-time tool use, customer conversations, or adversarial users. It also does not prove that self-critique equals calibrated uncertainty. A model can learn to produce useful critique labels while still being overconfident, underconfident, domain-fragile, or format-sensitive.

There is also an independence issue. An internal critic is not the same as an external verifier. If the same model generates and judges the reasoning, shared blind spots may remain. In some business settings, that is acceptable because the purpose is triage and interpretability. In others, especially high-stakes compliance or safety, independent verification is still necessary.

A practical enterprise design would therefore treat STC-like behavior as one layer, not the whole assurance stack.

Use case	STC-like self-critique may help	External review still needed when
Internal analytical drafts	Identify weak steps before human review	Decisions affect capital allocation or legal exposure
Code generation	Flag suspicious logic branches or failed assumptions	Code touches production, security, payments, or data privacy
Financial calculations	Expose arithmetic or modeling inconsistencies	Outputs support investment recommendations or audit filings
Customer support reasoning	Show why a recommendation was made	Advice affects contracts, refunds, health, finance, or identity
Research summarization	Highlight uncertain inference chains	Source interpretation requires expert domain judgment

The cleanest framing is this: STC is not “trust the model because it doubts itself.” It is “make the model’s doubt observable, structured, and available for routing.”

That is much more useful. Also less likely to embarrass everyone later.

The product design lesson: compact by default, full when accountability is expensive

One of the paper’s underrated contributions is the dual-mode inference design. Full mode gives reasoning plus critique. Compact mode gives reasoning only. The paper reports comparable reasoning performance between the two modes after RL training.

For enterprise deployment, this suggests a practical pattern:

Use compact mode for low-risk, high-throughput tasks.
Use full mode when outputs are surprising, high-value, contested, or likely to be reviewed.
Use critique-guided Best-of-K when sampling multiple candidates.
Escalate to external verification when internal critique flags risk or when the domain requires independent assurance.
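The four rules above can be sketched as a small routing function. Everything in it is hypothetical: the field names, the inputs, and in particular the assumption that critique-guided selection requires full mode because it needs the critique scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    mode: str        # "compact" or "full"
    selection: str   # "single" or "critique_best_of_k"
    escalate: bool   # hand off to external verification


def plan_request(needs_audit_trail: bool,
                 num_samples: int,
                 requires_independent_assurance: bool,
                 critique_flagged_risk: bool = False) -> Plan:
    """Pick inference mode, selection rule, and escalation for one request."""
    use_best_of_k = num_samples > 1
    # Assumption: selecting by critique score means critiques must be generated.
    mode = "full" if (needs_audit_trail or use_best_of_k) else "compact"
    selection = "critique_best_of_k" if use_best_of_k else "single"
    escalate = requires_independent_assurance or critique_flagged_risk
    return Plan(mode, selection, escalate)
```

A routine single-sample query comes back as compact, single, no escalation. A contested output sampled eight times in a regulated domain comes back as full, critique-guided, escalated.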

This is where the ROI logic appears. STC-like systems may reduce the cost of diagnosis, not merely the cost of generation. If a human reviewer can see which step the model itself considered questionable, review becomes more targeted. If candidate selection improves, fewer bad outputs reach the reviewer. If compact mode preserves performance, the organization does not need to pay full audit-token cost for every trivial query.

The paper does not provide enterprise cost benchmarks, so we should not invent them. But the mechanism points to a plausible economic advantage: use audit depth selectively.

That is a better product story than “self-critiquing AI is more trustworthy.” The latter is a slogan. The former is an implementation strategy.

The boundary: self-critique is a signal, not a guarantee

The most tempting misconception is that self-critique means self-certification. It does not.

STC improves the model’s ability to identify errors under the paper’s training and evaluation conditions. It makes reasoning traces more interpretable. It improves candidate selection. It shifts the critic away from naive approval. These are real contributions.

But a learned internal critique remains a learned internal critique. It can be wrong. It can miss shared blind spots. It can inherit benchmark-specific behavior. It can be sensitive to prompts, domains, length, and distribution shift. The process-level labels rely on GPT-5-mini pseudo judgments, validated indirectly with about 90% agreement at answer level, not with a large human-labeled stepwise dataset. The authors also validate primarily on a 1.5B model, with limited hyperparameter tuning and without large-scale or multimodal experiments. Training cost is non-trivial: the appendix reports about five days on 16 A100 40GB GPUs for 1,200 RL steps.

Those limitations do not weaken the paper’s core idea. They keep it in the right box.

The right box is “promising architecture for integrated reasoning audit.” The wrong box is “solved trust.” The former deserves attention. The latter deserves procurement skepticism and possibly a stronger coffee.

Conclusion: the model should not just answer; it should leave usable fingerprints

STC is valuable because it moves critique closer to the moment of reasoning. Instead of generating a long chain and asking another model to inspect the corpse afterward, it trains a single model to annotate its own steps as the solution unfolds.

The results are strongest where they should matter: RL improves the reasoning-critique trade-off, error detection becomes less embarrassingly polite, and critique-guided candidate selection beats majority voting on tested benchmarks. The appendix adds a useful dose of humility: dense reward shaping gives only small gains, and the method still depends on critic quality.

For business AI, the lesson is not to replace verifiers with self-confidence wearing a lab coat. The lesson is to design systems where reasoning outputs carry structured local audit signals. Sometimes that signal will reduce review cost. Sometimes it will improve candidate selection. Sometimes it will tell the human exactly where to look before the model’s elegant answer becomes an expensive mistake.

Productive doubt is not hesitation. It is instrumentation.

Cognaptus: Automate the Present, Incubate the Future.

¹ Jiaqi Xu, Cuiling Lan, Xuejin Chen, and Yan Lu, “Stepwise Think-Critique: Unified Reasoning and Self-Critique in LLMs,” arXiv:2512.15662v3, 18 March 2026, https://arxiv.org/html/2512.15662.
