Opening — Why this matters now
Large Language Models have learned how to think out loud. What they still struggle with is knowing when that thinking is wrong — while it is happening. In high‑stakes domains like mathematics, finance, or policy automation, delayed error detection is not a feature; it is a liability.
Most modern reasoning pipelines still follow an awkward split: first generate reasoning, then verify it — often with a separate model. Humans do not work this way. We reason and judge simultaneously. This paper asks a simple but uncomfortable question: what if LLMs were trained to do the same?
The answer is Stepwise Think‑Critique (STC) — a framework that forces a model to justify each step of its own reasoning in real time, not after the fact.
Background — Context and prior art
Two major paradigms dominate LLM reasoning today:
- Pure reasoning approaches (e.g., Chain‑of‑Thought prompting, DeepSeek‑R1): strong at decomposition, weak at self‑verification.
- Post‑hoc verification models (process reward models, external critics): good at judging mistakes, but always late — and operationally messy.
The industry workaround has been model proliferation: one model reasons, another checks. This increases latency, cost, and fragility. More importantly, it decouples learning — the reasoning model does not truly internalize its own errors.
The STC paper positions this as a conceptual flaw, not an engineering one. Humans do not outsource self‑doubt.
Analysis — What the paper actually does
STC collapses the reasoning–verification divide by forcing a single model to alternate between:
- a reasoning step $r_t$
- a critique step $c_t$
The output structure is explicit:
$$ r_1 \rightarrow c_1 \rightarrow r_2 \rightarrow c_2 \rightarrow \dots \rightarrow r_T \rightarrow c_T $$
Each critique includes:
- a natural‑language justification
- a binary correctness label (1 = correct, 0 = incorrect)
This is not cosmetic. The critique tokens are trained, rewarded, and optimized separately.
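To make the structure concrete, here is a minimal parsing sketch of an interleaved trace. The tag names (`<reason>`, `<critique>`, `<label>`) are illustrative assumptions, not the paper's actual markup; the point is only that each step carries a justification plus a binary self‑label.

```python
import re
from dataclasses import dataclass

# Illustrative tag scheme -- the paper's exact output markup may differ.
STEP_PATTERN = re.compile(
    r"<reason>(?P<reason>.*?)</reason>\s*"
    r"<critique>(?P<critique>.*?)</critique>\s*"
    r"<label>(?P<label>[01])</label>",
    re.DOTALL,
)

@dataclass
class STCStep:
    reasoning: str   # r_t: one reasoning step
    critique: str    # c_t: natural-language justification of that step
    correct: bool    # binary self-judgment (1 = correct, 0 = incorrect)

def parse_trace(text: str) -> list[STCStep]:
    """Split an interleaved r_1 -> c_1 -> ... -> r_T -> c_T trace into steps."""
    return [
        STCStep(m["reason"].strip(), m["critique"].strip(), m["label"] == "1")
        for m in STEP_PATTERN.finditer(text)
    ]
```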
Training pipeline (condensed)
| Phase | Purpose |
|---|---|
| SFT (supervised fine‑tuning) | Teach the model the output format and basic critique behavior |
| RL with GRPO (Group Relative Policy Optimization) | Jointly optimize reasoning quality and critique reliability |
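For readers unfamiliar with GRPO: its core trick is to score each sampled rollout relative to the other rollouts of the same prompt, removing the need for a learned value network. A minimal sketch of that group‑normalized advantage, assuming the standard GRPO formulation rather than any STC‑specific variant:

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantages: z-score each rollout's reward within its
    sampling group (all rollouts drawn from the same prompt)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 rollouts of one problem, each assigned a scalar composite reward.
rollout_rewards = np.array([1.0, 0.0, 1.0, 0.2])
print(group_relative_advantages(rollout_rewards))
```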
Reward design (the real innovation)
The model is optimized with three orthogonal rewards:
| Reward | What it enforces |
|---|---|
| Reasoning reward | Final answer correctness |
| Critique‑consistency reward | Does the model correctly judge itself? |
| Format reward | Structural discipline (no hand‑wavy critiques) |
Crucially, critique signals are reused as dense intermediate rewards, shaping future reasoning steps instead of merely scoring them.
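A minimal sketch of how such a composite reward could be assembled per trajectory, reusing the `STCStep` triples from the parsing sketch above. The weights and the per‑step ground‑truth checker are illustrative assumptions, not the paper's coefficients:

```python
def stc_reward(
    steps,                           # list of STCStep triples for one trajectory
    final_answer_correct: bool,
    step_ground_truth: list[bool],   # reference correctness per step (assumed available)
    w_reason: float = 1.0,           # illustrative weights -- not the paper's values
    w_critique: float = 0.5,
    w_format: float = 0.2,
) -> float:
    """Combine the three reward terms described above into one scalar."""
    # 1) Reasoning reward: did the trajectory reach the right final answer?
    r_reason = 1.0 if final_answer_correct else 0.0

    # 2) Critique-consistency reward (dense): fraction of steps where the
    #    model's self-judgment agrees with the reference label.
    agreements = [s.correct == gt for s, gt in zip(steps, step_ground_truth)]
    r_critique = sum(agreements) / max(len(agreements), 1)

    # 3) Format reward: every step must carry a non-empty critique and a label.
    r_format = 1.0 if steps and all(s.critique.strip() for s in steps) else 0.0

    return w_reason * r_reason + w_critique * r_critique + w_format * r_format
```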
This turns critique from a passive audit into an active control mechanism.
Findings — Results that matter
Reasoning performance
Across standard math benchmarks (AIME, AMC, MATH‑500, OlympiadBench), STC with RL outperforms both the base model and SFT‑only variants.
| Model | Avg Pass@1 | Avg Pass@8 |
|---|---|---|
| Base (DS‑Qwen‑1.5B) | 36.0 | 52.2 |
| STC‑SFT | 34.2 | 49.3 |
| STC‑GRPO (compact) | 42.2 | 56.3 |
| STC‑GRPO (full) | 43.6 | 59.0 |
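Pass@k here presumably follows the standard unbiased estimator over n samples per problem (the Codex-style estimator of Chen et al., 2021); a minimal sketch under that assumption:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: probability that at least one of k samples,
    drawn from n total of which c are correct, solves the problem."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples per problem, 6 correct -> estimated Pass@8
print(round(pass_at_k(16, 6, 8), 3))
```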
Notably, enabling critique at inference does not harm accuracy, a non‑trivial result given that critique tokens share the generation budget with the reasoning itself.
Critique quality (the harder problem)
The model learns to flag incorrect steps with increasing reliability:
- Process‑level specificity improves sharply with dense critique rewards
- Trade‑off observed: stronger reasoning slightly weakens final‑answer critique precision
This tension is honest — reasoning and judging are competing cognitive loads.
Test‑time scaling
Instead of majority voting, STC uses critique‑guided selection:
Choose the answer the model itself believes is correct.
This consistently outperforms majority voting and closes a large portion of the gap toward Pass@N oracle performance.
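A minimal sketch of what critique‑guided selection could look like, assuming each sampled trace exposes its per‑step self‑labels (e.g., via `parse_trace` above); the fall‑back to majority voting when no trace is fully self‑endorsed is an illustrative choice, not necessarily the paper's rule:

```python
from collections import Counter

def select_answer(candidates):
    """candidates: list of (final_answer, steps) pairs, where each step
    carries the model's own binary correctness label.

    Prefer answers from traces the model fully endorses (all steps judged
    correct); fall back to a plain majority vote if none exist."""
    self_endorsed = [ans for ans, steps in candidates
                     if steps and all(s.correct for s in steps)]
    pool = self_endorsed if self_endorsed else [ans for ans, _ in candidates]
    return Counter(pool).most_common(1)[0][0]
```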
Implications — Why businesses should care
STC is not just a research novelty. It hints at a structural shift in how AI systems may be built:
- Agentic systems with internal assurance loops
- Audit‑ready reasoning traces without external verifiers
- Lower operational complexity (one model instead of a reasoner plus separate verifier models)
For regulated or high‑cost domains — finance, compliance, autonomous decision engines — this matters more than another +2% benchmark gain.
The model learns where it is wrong, not just what is right.
Limitations — And why this is still early
The authors are explicit:
- Validated only on a 1.5B‑parameter model
- Critique quality remains imperfect
- Training cost is non‑trivial
But conceptually, the direction is clear: reasoning without self‑doubt does not scale.
Conclusion — Thinking is cheap. Judging is hard.
Stepwise Think‑Critique reframes LLM reasoning as a closed‑loop system — not a monologue, but a dialogue with itself. It trades architectural complexity for cognitive discipline.
This is not about making models verbose. It is about making them accountable to their own thoughts.
That is a quiet but profound shift.
Cognaptus: Automate the Present, Incubate the Future.