Opening — Why this matters now

Large language models have learned to sound confident. Unfortunately, confidence is not correctness—especially in long-horizon reasoning tasks like competition math or multi-step logic. Reinforcement learning has helped, but most RL pipelines still assume a one-shot world: generate once, score once, update once.

Humans don’t work that way. We draft, reread, cringe, fix, and try again.

The paper iGRPO: Self-Feedback–Driven LLM Reasoning asks a deceptively simple question: what if the model could train on its own best prior attempt, instead of pretending every generation is independent?

The result is iGRPO—Iterative Group Relative Policy Optimization—a method that quietly shifts RL for reasoning from single-pass optimization to bootstrapped refinement.

Background — From GRPO to the limits of one-shot RL

Group Relative Policy Optimization (GRPO) emerged as a pragmatic alternative to PPO for large language models. By normalizing rewards within small groups of sampled completions, GRPO avoids training an explicit value function while remaining stable at scale.
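
To make the mechanics concrete, here is a minimal sketch of the group-relative advantage computation GRPO builds on; the function name and grouping are illustrative, not taken from any particular implementation.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize scalar rewards within one group of completions
    sampled for the same prompt. Each advantage is the reward's
    offset from the group mean, scaled by the group's standard
    deviation -- no learned value function required."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four completions for one prompt: two correct (reward 1), two wrong (reward 0)
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# -> [ 1. -1. -1.  1.] (up to the eps term)
```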

It works—but it has a blind spot.

GRPO, like most outcome-based RL methods, treats each completion as disposable. The model never explicitly looks at its own prior reasoning attempts. Near-miss solutions are discarded rather than reused. A solution that is 90% correct and one that is completely wrong are each reduced to nothing more than a scalar reward.

That design choice is computationally convenient—and cognitively unrealistic.

Analysis — What iGRPO actually changes

iGRPO introduces a two-stage loop inside each RL update:

Stage 1: Exploratory Draft Generation

For a given prompt, the model samples multiple drafts and evaluates them using the same reward function already used by GRPO. No gradients are applied here. The only goal is selection.

The highest-reward draft is chosen as the model’s best prior attempt.
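
A rough sketch of Stage 1 under stated assumptions: `model.sample` stands in for whatever sampling call your stack exposes, and `reward_fn` is the same scalar reward already used for GRPO.

```python
def select_best_draft(model, prompt, reward_fn, num_drafts=4):
    """Stage 1: sample drafts, score them, keep the best one.
    No gradients flow here; the chosen draft is used only as
    context for the refinement stage."""
    drafts = [model.sample(prompt) for _ in range(num_drafts)]
    scores = [reward_fn(prompt, draft) for draft in drafts]
    best_idx = max(range(num_drafts), key=scores.__getitem__)
    return drafts[best_idx], scores[best_idx]
```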

Stage 2: Conditioned Refinement

That best draft is appended to the original prompt and fed back into the model. From this augmented context, the model generates new completions. These refinements—not the drafts—are used for the GRPO update.
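
Putting the two stages together, one iGRPO-style training step might look like the sketch below. The prompt template, `grpo_update`, and the reuse of `select_best_draft` from the Stage 1 sketch are assumptions for illustration, not the paper's code.

```python
def igrpo_step(model, prompt, reward_fn, grpo_update, group_size=4):
    """One iterative update: pick the strongest prior attempt,
    then run the usual GRPO update on refinements of it."""
    # Stage 1: select the best draft (no gradients).
    best_draft, _ = select_best_draft(model, prompt, reward_fn,
                                      num_drafts=group_size)

    # Stage 2: condition on the best draft and sample refinements.
    augmented = (
        f"{prompt}\n\nYour best attempt so far:\n{best_draft}\n\n"
        "Improve on this attempt and give a final answer."
    )
    refinements = [model.sample(augmented) for _ in range(group_size)]
    rewards = [reward_fn(prompt, r) for r in refinements]

    # Only the refinements -- not the drafts -- receive gradients.
    grpo_update(model, augmented, refinements, rewards)
```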

In effect, the model is trained to answer:

“Given my strongest attempt so far, can I do better?”

This is not critique generation. There is no self-verification text. No auxiliary objectives. Just a stronger context and the same scalar reward.

Why this matters

The conditioning signal evolves with the policy itself.

As the model improves, the drafts improve. As the drafts improve, the refinement problem becomes more informative. The training distribution quietly shifts toward harder, higher-quality reasoning trajectories—without changing the reward function or increasing the rollout budget.

That is the core insight.

Findings — Performance without extra compute

Under matched rollout budgets, iGRPO consistently outperforms standard GRPO and critique-style variants across multiple model families.

Snapshot of results (Pass@1 accuracy)

Model                    | GRPO Avg | iGRPO Avg | Δ
Nemotron-H-8B            |    41.08 |     45.04 | +3.96
DeepSeek-R1-Distill-7B   |    68.29 |     69.87 | +1.58
OpenMath-Nemotron-7B     |    75.02 |     76.07 | +1.05
OpenMath-Nemotron-14B    |    76.73 |     78.00 | +1.27

The gains concentrate where reasoning usually fails:

  • AIME-style competition problems
  • Long-horizon logic with late-stage errors
  • Benchmarks sensitive to near-miss solutions

Notably, iGRPO also delays entropy collapse during training, maintaining exploration longer before convergence. This suggests the benefit is not just better samples—but better learning dynamics.
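
The paper reports this as a training-dynamics observation; one simple way to watch for it in your own runs (not something the paper prescribes) is to log the policy's mean per-token entropy alongside reward.

```python
import torch

def mean_token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Mean entropy of the next-token distribution over a batch.
    A steady slide toward zero is the usual sign of entropy
    collapse: the policy concentrating on a few continuations."""
    log_probs = torch.log_softmax(logits, dim=-1)          # (batch, seq, vocab)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)   # (batch, seq)
    return entropy.mean()
```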

Implications — A reusable refinement interface

iGRPO is not married to GRPO.

The authors show that the same two-stage self-feedback wrapper improves other group-based policy optimizers such as DAPO and GSPO with minimal changes. Swap the optimizer, keep the refinement loop, as in the sketch below.
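
In code terms, the loop only changes what the optimizer sees: an augmented prompt and a group of refinements. A hypothetical wrapper, reusing `select_best_draft` from the earlier sketch and treating `dapo_update` / `gspo_update` as stand-ins for whichever update rule you already have:

```python
from typing import Callable, List

# Any update rule with this shape can sit behind the refinement loop.
UpdateFn = Callable[[object, str, List[str], List[float]], None]

def with_self_feedback(update: UpdateFn, reward_fn, group_size: int = 4):
    """Wrap a base policy-optimization update with draft selection
    and refinement; the optimizer itself is untouched."""
    def step(model, prompt):
        best_draft, _ = select_best_draft(model, prompt, reward_fn,
                                          num_drafts=group_size)
        augmented = (
            f"{prompt}\n\nYour best attempt so far:\n{best_draft}\n\n"
            "Improve on this attempt."
        )
        refinements = [model.sample(augmented) for _ in range(group_size)]
        rewards = [reward_fn(prompt, r) for r in refinements]
        update(model, augmented, refinements, rewards)
    return step

# Swap the optimizer, keep the loop:
# dapo_step = with_self_feedback(dapo_update, reward_fn)
# gspo_step = with_self_feedback(gspo_update, reward_fn)
```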

Even the reward function is flexible. Replacing a binary checker with a generative judge yields further gains, especially on problems where partial credit matters.

From a systems perspective, this is the most important takeaway:

Self-feedback does not need to be linguistic or interpretive to be useful.

A single scalar reward, reused intelligently, is enough.

Conclusion — Teaching models to iterate, not introspect

iGRPO doesn’t make language models more self-aware. It makes them more workmanlike.

Draft. Keep the best version. Improve it. Repeat.

That small procedural change closes a long-standing gap between how humans reason and how LLMs are trained. And it does so without new labels, new objectives, or new compute budgets—just better use of what was already there.

If the next generation of reasoning models feels more reliable, less brittle, and strangely more “patient,” iGRPO will likely be one of the quiet reasons why.

Cognaptus: Automate the Present, Incubate the Future.