Opening — Why this matters now
Large Language Models have learned how to think out loud. What they still struggle with is knowing when that thinking is wrong — while it is happening. In high‑stakes domains like mathematics, finance, or policy automation, delayed error detection is not a feature; it is a liability.
Most modern reasoning pipelines still follow an awkward split: first generate reasoning, then verify it — often with a separate model. Humans do not work this way. We reason and judge simultaneously. This paper asks a simple but uncomfortable question: what if LLMs were trained to do the same?
The answer is Stepwise Think‑Critique (STC) — a framework that forces a model to justify each step of its own reasoning in real time, not after the fact.
Background — Context and prior art
Two major paradigms dominate LLM reasoning today:
- Pure reasoning approaches (e.g., Chain‑of‑Thought prompting, DeepSeek‑R1): strong at decomposition, weak at self‑verification.
- Post‑hoc verification models (process reward models, external critics): good at judging mistakes, but always late — and operationally messy.
The industry workaround has been model proliferation: one model reasons, another checks. This increases latency, cost, and fragility. More importantly, it decouples learning — the reasoning model does not truly internalize its own errors.
The STC paper positions this as a conceptual flaw, not an engineering one. Humans do not outsource self‑doubt.
Analysis — What the paper actually does
STC collapses the reasoning–verification divide by forcing a single model to alternate between:
- a reasoning step $r_t$
- a critique step $c_t$
The output structure is explicit:
$$ r_1 \rightarrow c_1 \rightarrow r_2 \rightarrow c_2 \rightarrow \dots \rightarrow r_T \rightarrow c_T $$
Each critique includes:
- a natural‑language justification
- a binary correctness label (1 = correct, 0 = incorrect)
This is not cosmetic. The critique tokens are trained, rewarded, and optimized separately.
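To make the structure concrete, here is a minimal parsing sketch of an interleaved trace. The tag names (`<reason>`, `<critique>`, `<label>`) are illustrative assumptions, not the paper's actual markup; the point is only that each step carries a justification plus a binary self‑label.

```python
import re
from dataclasses import dataclass

# Illustrative tag scheme -- the paper's exact output markup may differ.
STEP_PATTERN = re.compile(
    r"<reason>(?P<reason>.*?)</reason>\s*"
    r"<critique>(?P<critique>.*?)</critique>\s*"
    r"<label>(?P<label>[01])</label>",
    re.DOTALL,
)

@dataclass
class STCStep:
    reasoning: str   # r_t: one reasoning step
    critique: str    # c_t: natural-language justification of that step
    correct: bool    # binary self-judgment (1 = correct, 0 = incorrect)

def parse_trace(text: str) -> list[STCStep]:
    """Split an interleaved r_1 -> c_1 -> ... -> r_T -> c_T trace into steps."""
    return [
        STCStep(m["reason"].strip(), m["critique"].strip(), m["label"] == "1")
        for m in STEP_PATTERN.finditer(text)
    ]
```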
Training pipeline (condensed)
| Phase | Purpose |
|---|---|
| SFT (supervised fine‑tuning) | Teach the model the output format and basic critique behavior |
| RL with GRPO (Group Relative Policy Optimization) | Jointly optimize reasoning quality and critique reliability |
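For readers unfamiliar with GRPO: its core trick is to score each sampled rollout relative to the other rollouts of the same prompt, removing the need for a learned value network. A minimal sketch of that group‑normalized advantage, assuming the standard GRPO formulation rather than any STC‑specific variant:

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantages: z-score each rollout's reward within its
    sampling group (all rollouts drawn from the same prompt)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 rollouts of one problem, each assigned a scalar composite reward.
rollout_rewards = np.array([1.0, 0.0, 1.0, 0.2])
print(group_relative_advantages(rollout_rewards))
```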
Reward design (the real innovation)
The model is optimized with three orthogonal rewards:
| Reward | What it enforces |
|---|---|
| Reasoning reward | Final answer correctness |
| Critique‑consistency reward | Does the model correctly judge itself? |
| Format reward | Structural discipline (no hand‑wavy critiques) |
Crucially, critique signals are reused as dense intermediate rewards, shaping future reasoning steps instead of merely scoring them.
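A minimal sketch of how such a composite reward could be assembled per trajectory, reusing the `STCStep` triples from the parsing sketch above. The weights and the per‑step ground‑truth checker are illustrative assumptions, not the paper's coefficients:

```python
def stc_reward(
    steps,                           # list of STCStep triples for one trajectory
    final_answer_correct: bool,
    step_ground_truth: list[bool],   # reference correctness per step (assumed available)
    w_reason: float = 1.0,           # illustrative weights -- not the paper's values
    w_critique: float = 0.5,
    w_format: float = 0.2,
) -> float:
    """Combine the three reward terms described above into one scalar."""
    # 1) Reasoning reward: did the trajectory reach the right final answer?
    r_reason = 1.0 if final_answer_correct else 0.0

    # 2) Critique-consistency reward (dense): fraction of steps where the
    #    model's self-judgment agrees with the reference label.
    agreements = [s.correct == gt for s, gt in zip(steps, step_ground_truth)]
    r_critique = sum(agreements) / max(len(agreements), 1)

    # 3) Format reward: every step must carry a non-empty critique and a label.
    r_format = 1.0 if steps and all(s.critique.strip() for s in steps) else 0.0

    return w_reason * r_reason + w_critique * r_critique + w_format * r_format
```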
This turns critique from a passive audit into an active control mechanism.
Findings — Results that matter
Reasoning performance
Across standard math benchmarks (AIME, AMC, MATH‑500, OlympiadBench), STC with RL outperforms both the base model and SFT‑only variants.
| Model | Avg Pass@1 | Avg Pass@8 |
|---|---|---|
| Base (DS‑Qwen‑1.5B) | 36.0 | 52.2 |
| STC‑SFT | 34.2 | 49.3 |
| STC‑GRPO (compact) | 42.2 | 56.3 |
| STC‑GRPO (full) | 43.6 | 59.0 |
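Pass@k here presumably follows the standard unbiased estimator over n samples per problem (the Codex-style estimator of Chen et al., 2021); a minimal sketch under that assumption:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: probability that at least one of k samples,
    drawn from n total of which c are correct, solves the problem."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples per problem, 6 correct -> estimated Pass@8
print(round(pass_at_k(16, 6, 8), 3))
```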
Notably, enabling critique at inference does not harm accuracy, a non‑trivial result given that critique tokens share the generation budget with the reasoning itself.
Critique quality (the harder problem)
The model learns to flag incorrect steps with increasing reliability:
- Process‑level specificity improves sharply with dense critique rewards
- Trade‑off observed: stronger reasoning slightly weakens final‑answer critique precision
This tension is honest — reasoning and judging are competing cognitive loads.
Test‑time scaling
Instead of majority voting, STC uses critique‑guided selection:
Choose the answer the model itself believes is correct.
This consistently outperforms majority voting and closes a large portion of the gap toward Pass@N oracle performance.
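A minimal sketch of what critique‑guided selection could look like, assuming each sampled trace exposes its per‑step self‑labels (e.g., via `parse_trace` above); the fall‑back to majority voting when no trace is fully self‑endorsed is an illustrative choice, not necessarily the paper's rule:

```python
from collections import Counter

def select_answer(candidates):
    """candidates: list of (final_answer, steps) pairs, where each step
    carries the model's own binary correctness label.

    Prefer answers from traces the model fully endorses (all steps judged
    correct); fall back to a plain majority vote if none exist."""
    self_endorsed = [ans for ans, steps in candidates
                     if steps and all(s.correct for s in steps)]
    pool = self_endorsed if self_endorsed else [ans for ans, _ in candidates]
    return Counter(pool).most_common(1)[0][0]
```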
Implications — Why businesses should care
STC is not just a research novelty. It hints at a structural shift in how AI systems may be built:
- Agentic systems with internal assurance loops
- Audit‑ready reasoning traces without external verifiers
- Lower operational complexity (one model instead of a reasoner plus separate verifier models)
For regulated or high‑cost domains — finance, compliance, autonomous decision engines — this matters more than another +2% benchmark gain.
The model learns where it is wrong, not just what is right.
Limitations — And why this is still early
The authors are explicit:
- Validated only on a 1.5B‑parameter model
- Critique quality remains imperfect
- Training cost is non‑trivial
But conceptually, the direction is clear: reasoning without self‑doubt does not scale.
Conclusion — Thinking is cheap. Judging is hard.
Stepwise Think‑Critique reframes LLM reasoning as a closed‑loop system — not a monologue, but a dialogue with itself. It trades architectural complexity for cognitive discipline.
This is not about making models verbose. It is about making them accountable to their own thoughts.
That is a quiet but profound shift.
Cognaptus: Automate the Present, Incubate the Future.