Drafts, Then Do Better: Teaching LLMs to Outgrow Their Own Reasoning

Most office work has a draft problem.

A junior analyst writes a first version of a financial memo. A lawyer marks up an argument. A consultant turns messy meeting notes into a client-ready recommendation. The first attempt is rarely useless. It is usually half-right, locally clever, and globally flawed. The expensive part is not starting from zero. The expensive part is learning how to improve a decent draft without being hypnotized by it.

That is also the interesting part of iGRPO: Self-Feedback–Driven LLM Reasoning.¹ The paper is easy to misread as another entry in the familiar self-improvement genre: make the model produce several answers, pick the best one, maybe ask it to critique itself, and hope the second answer behaves like a wiser intern. Pleasant story. Slightly too neat. The actual contribution is more precise.

iGRPO does not mainly ask the model to judge itself at inference time. Nor does it simply spend more samples at test time and call the bill “reasoning.” Instead, it changes the training loop. During reinforcement learning, the model first generates several drafts, an external reward signal selects the highest-scoring draft, and then the model is trained to produce refinements conditioned on that draft. After training, inference remains ordinary single-shot generation. No extra draft generation. No voting ceremony. No little committee of language models whispering in a trench coat.

That distinction matters for business users because inference latency is where many AI projects quietly die. A method that improves reasoning by multiplying test-time samples can be useful, but it changes operating cost. iGRPO’s promise is different: use more structured learning during training so the deployed model behaves as if it has internalized a refinement habit.

The mechanism is not “self-critique”; it is draft-conditioned training

Standard GRPO, or Group Relative Policy Optimization, trains a model by sampling a group of completions for a prompt, scoring them, normalizing rewards within the group, and updating the policy toward higher-reward completions. The attraction is operational: no separate value model is needed. For large language model post-training, avoiding an additional critic is not a philosophical victory; it is a line item.
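The normalization step can be sketched in a few lines. This is a minimal illustration, assuming the common formulation in which each reward is centered on the group mean and scaled by the group standard deviation; the function name and the `eps` stabilizer are illustrative choices, not details taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Center each reward on the group mean and scale by the group std.

    `eps` is an assumed stabilizer that avoids division by zero when
    every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Eight completions for one prompt, scored 1.0 (correct) or 0.0 (incorrect).
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f}  advantage={a:+.3f}")
```

Correct completions end up with positive advantages and incorrect ones with negative advantages, and no learned critic is involved: the group itself serves as the baseline.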

iGRPO keeps that group-relative logic but inserts a two-stage structure inside each optimization step.

First, the model samples candidate drafts for the same prompt. These are scored by the same scalar reward used for optimization. In the paper’s main math setting, the reward is rule-based: it checks whether the final answer matches the reference answer, with formatting constraints also applied during training. The best draft is selected:

$$ d^\ast = \arg\max_{d_i} R(x, d_i) $$

where $x$ is the original prompt, $d_i$ is a sampled draft, and $R$ is the reward function.

Second, the original prompt is augmented with that best draft. The model then generates refinements conditioned on both the problem and the selected draft. Only these second-stage refinements receive gradient updates. The first-stage draft shapes the context, but it is treated as fixed input rather than optimized directly.

That last sentence is the mechanical center of the paper. Stage 1 creates the learning situation. Stage 2 receives the learning signal.
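The two stages can be sketched as a rollout-collection routine. This is a toy illustration under stated assumptions, not the authors' implementation: `sample` and `reward` are hypothetical callables standing in for the policy and the reward function, the prompt template is invented, and the 4 + 4 split is just one way to divide a budget of eight completions. No gradient step is performed here.

```python
import random
from statistics import mean, pstdev

def igrpo_rollouts(prompt, sample, reward, n_drafts=4, n_refinements=4):
    """Collect two-stage rollouts for one prompt (sketch; no policy update)."""
    # Stage 1: sample drafts and keep the highest-reward one.
    drafts = [sample(prompt) for _ in range(n_drafts)]
    best_draft = max(drafts, key=lambda d: reward(prompt, d))

    # Stage 2: condition on the prompt plus the best draft (hypothetical template).
    augmented = f"{prompt}\n\nDraft attempt:\n{best_draft}\n\nRefined answer:"
    refinements = [sample(augmented) for _ in range(n_refinements)]
    rewards = [reward(prompt, r) for r in refinements]

    # Group-relative advantages are computed over Stage 2 completions only;
    # the drafts act as context and would receive no gradient.
    mu, sigma = mean(rewards), pstdev(rewards)
    advantages = [(r - mu) / (sigma + 1e-6) for r in rewards]
    return best_draft, augmented, refinements, advantages

# Toy stand-ins: a "policy" that guesses a number and an exact-match reward.
rng = random.Random(0)

def toy_sample(context):
    return str(rng.choice([40, 41, 42, 43]))

def toy_reward(prompt, completion):
    return 1.0 if completion.strip() == "42" else 0.0

best, context, refinements, advantages = igrpo_rollouts(
    "What is 6 * 7?", toy_sample, toy_reward
)
print(best, refinements, advantages)
```

In a real training loop the advantages would weight a policy-gradient update on the refinement tokens; the sketch stops at the point where that update would begin.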

| Component | What happens | What it is easy to misunderstand |
| --- | --- | --- |
| Stage 1 draft generation | The model samples several candidate drafts and the highest-reward one is selected | This is not inference-time best-of-N deployment |
| Draft conditioning | The selected draft is appended to the original prompt as guidance | The draft is not treated as guaranteed truth |
| Stage 2 refinement | The model generates new completions conditioned on the draft and is updated with GRPO-style advantages | The model is trained to improve beyond the draft, not merely copy it |
| Deployment | The trained model is used normally, without draft conditioning | The training loop is iterative; inference does not have to be |

The authors describe this as dynamic self-conditioning. That phrase is worth keeping, because it separates iGRPO from static in-context learning. A static demonstration remains fixed while the model changes. In iGRPO, the conditioning signal is produced by the current policy snapshot. As training improves the policy, the selected drafts should also improve, which gives later iterations better scaffolds.

The paper gives a simple binary-reward argument for this bootstrapping effect. If the probability that one draft is correct is $p$, and the method samples $G$ independent drafts, then the chance that at least one draft is correct is:

$$ 1 - (1-p)^G $$

As $p$ rises, the expected quality of the best selected draft rises as well. This is not a grand theory of intelligence. It is a useful engineering observation: if your reward signal can identify better attempts, the model can be trained in an environment where its own best recent attempt becomes part of the next learning problem.
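A quick numerical check makes the bootstrapping effect concrete. The group size of four is an illustrative assumption, and the independence of drafts is the same simplifying assumption the formula itself makes.

```python
def prob_any_correct(p, G):
    """Probability that at least one of G independent drafts is correct."""
    return 1.0 - (1.0 - p) ** G

# Illustrative group size; the formula holds for any G >= 1.
for p in (0.1, 0.3, 0.5):
    print(f"p={p:.1f}  P(at least one correct)={prob_any_correct(p, 4):.4f}")
# p=0.1  P(at least one correct)=0.3439
# p=0.3  P(at least one correct)=0.7599
# p=0.5  P(at least one correct)=0.9375
```

Even a weak policy that is right one time in ten hands Stage 2 a correct scaffold about a third of the time, and that share climbs quickly as the policy improves.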

The first business lesson is about where feedback enters the system

The obvious story is “models need feedback.” True, but unhelpful. Everything in AI now claims to use feedback, usually with the same level of specificity as a restaurant saying it uses ingredients.

The sharper question is: where does the feedback enter?

In self-verification methods, the model may generate solutions and then score, reweight, or aggregate them. In critique-style methods, the model may generate critiques or critique-conditioned refinements. These can work, but they ask the model to perform auxiliary behaviors: judging, explaining, criticizing, or verifying. iGRPO uses the best draft itself as the feedback object. The reward signal chooses the draft; the model then learns a refinement policy around it.

This creates a cleaner division of labor. The reward system says, “This draft was the best among your attempts.” The model says, “Given that starting point, I will produce something better.”

That design matters for enterprise settings where the reward signal is external and concrete. A code unit test does not need to write a critique. A spreadsheet reconciliation check does not need to philosophize. A compliance rule engine does not need to produce a warm paragraph about its concerns. It can simply assign a score or pass/fail label. The learning loop can then use that score to select the best attempt and train the model to refine from it.
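As a rough sketch of what such a reward can look like, the snippet below scores hypothetical ledger summaries against a handful of pass/fail checks and picks the best attempt. The field names, checks, and data are invented for illustration; nothing here comes from the paper.

```python
def check_reward(candidate, checks):
    """Fraction of checks passed; 1.0 means every check passed."""
    passed = sum(1 for check in checks if check(candidate))
    return passed / len(checks)

# Hypothetical reconciliation checks on a generated ledger summary.
checks = [
    lambda row: abs(row["debits"] - row["credits"]) < 0.01,  # books balance
    lambda row: row["currency"] == "USD",                    # expected currency
    lambda row: row["entries"] > 0,                          # ledger is not empty
]

# Three hypothetical model attempts at the same task.
attempts = [
    {"debits": 100.0, "credits": 90.0, "currency": "USD", "entries": 3},
    {"debits": 100.0, "credits": 100.0, "currency": "USD", "entries": 3},
    {"debits": 100.0, "credits": 100.0, "currency": "EUR", "entries": 0},
]

scores = [check_reward(a, checks) for a in attempts]
best_index = max(range(len(attempts)), key=scores.__getitem__)
print(scores, best_index)
```

The checks never explain themselves. They only produce a number, and that number is enough to rank attempts and select a scaffold.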

For Cognaptus-style automation, this points to a practical category of model improvement: train on tasks where success can be checked, but where the path to success still requires reasoning. Think of code repair, calculation-heavy reporting, financial classification with audit checks, structured document extraction, workflow QA, or policy-rule interpretation. These are not all “math problems,” but they share the paper’s important property: partial attempts can be evaluated, and better attempts can be distinguished from worse ones.

That is the business pathway. It is not magic. It is plumbing with taste.

The evidence says the gains are real, but the size depends on model headroom

The paper’s main experiments evaluate Pass@1 accuracy on mathematical reasoning benchmarks including AIME24, AIME25, MATH500, AMC, GSM8K, and Minerva Math. The authors compare iGRPO with base models, GRPO, and where available, Self-Verification and Critique-GRPO. The key control is that the total rollout budget is matched: standard GRPO uses eight completions per prompt, while iGRPO splits the same total budget across the two stages.

This is important. Without that control, the experiment would be much less interesting. More samples often buy better performance. The claim here is that rearranging the same sampling budget into draft selection plus refinement training improves the learning signal.

The headline pattern is consistent. iGRPO beats GRPO across the reported 7B, 8B, and 14B settings. But the magnitude varies, and the variation is useful.

On the Nemotron-H-8B-Base-8K model, a more general base model, the macro-average rises from 41.08 with GRPO to 45.04 with iGRPO. That is a 3.96-point gain. It also beats the reported Self-Verification and Critique-GRPO baselines in that setting.

On DeepSeek-R1-Distill-Qwen-7B, a stronger distilled reasoner, the gain is smaller: GRPO reaches 68.29 average, while iGRPO reaches 69.87. The gain is still consistent, but the model has less room to improve. On OpenMath-Nemotron-7B, already math-specialized, GRPO barely changes the base average, while iGRPO raises it from 75.02 to 76.07. Again: not fireworks, but not noise either.

The 14B results follow the same logic. DeepSeek-R1-Distill-Qwen-14B improves from 71.29 with GRPO to 73.02 with iGRPO. OpenMath-Nemotron-14B improves from 76.73 to 78.00.
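The headroom pattern is easier to see with the gains lined up side by side. The snippet below only subtracts the averages quoted above; OpenMath-Nemotron-7B is left out because the text does not give its GRPO average as a separate figure.

```python
# (GRPO average, iGRPO average) as quoted in the text above.
reported = {
    "Nemotron-H-8B-Base-8K": (41.08, 45.04),
    "DeepSeek-R1-Distill-Qwen-7B": (68.29, 69.87),
    "DeepSeek-R1-Distill-Qwen-14B": (71.29, 73.02),
    "OpenMath-Nemotron-14B": (76.73, 78.00),
}

gains = {
    name: round(igrpo - grpo, 2) for name, (grpo, igrpo) in reported.items()
}

for name, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{name:32s} +{gain:.2f}")
```

The general base model gains almost four points, while the distilled and math-specialized models gain between one and two.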

| Evidence block | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Main benchmark table | Main evidence | iGRPO improves Pass@1 over GRPO under matched rollout budgets across several model families and scales | It does not prove the method generalizes to every non-math domain |
| Comparisons with Self-Verification and Critique-GRPO | Comparison with prior work | Draft-conditioned training can outperform critique or verification-style baselines in reported settings | It does not prove critique is generally inferior |
| Stronger base and AceReason-Math setting | Robustness / generalization test | iGRPO still helps when initialization and training data are stronger | It remains mostly within reasoning-heavy benchmark culture |
| DAPO and GSPO wrapper test | Ablation | The benefit appears to come from the refinement interface, not GRPO-specific details | It does not show compatibility with all RL algorithms |
| Entropy analysis | Learning-dynamics probe | iGRPO delays entropy collapse during training | It does not by itself explain all accuracy gains |
| Memory and throughput comparison | Implementation-cost test | Peak memory is nearly identical; training time rises about 13% | It does not mean training is cheap in absolute terms |

The phrase “consistent gains” is appropriate here. The phrase “revolutionary breakthrough” should be left outside, where it can get some air and calm down.

The results are strongest where the model has meaningful headroom and where the benchmark punishes long-horizon reasoning errors. This fits the mechanism. If a model’s first draft is often bad, selecting the best among several gives Stage 2 a better scaffold than a random attempt. If the model’s first draft is often almost correct, refinement training can reinforce the habit of fixing late-stage errors. If the model is already very strong and the benchmark is saturated, the extra signal has less room to express itself.

This is also how businesses should read the paper. iGRPO is not a universal button labeled “better reasoning.” It is more like a training pattern for tasks where models produce useful but imperfect first attempts and where a reward system can reliably rank those attempts.

The ablations are doing more than decorating the paper

Some ablation sections exist because papers need ablation sections. This one is more informative. The authors test whether the improvement is really tied to GRPO, whether richer reward signals help, and whether the learning dynamics differ from ordinary GRPO.

The DAPO and GSPO tests matter because they ask whether iGRPO is a specific algorithm or a reusable wrapper. The reported averages improve from 69.74 to 70.93 for DAPO and from 69.20 to 70.31 for GSPO when the self-feedback refinement layer is added. These are not enormous gains, but they point in the right direction: the refinement interface seems to carry value beyond the exact GRPO objective.

The generative judge study is also business-relevant. The main experiments use rule-based rewards for verifiable math answers. That is clean but narrow. The paper then replaces the rule-based reward with a GPT-5 judge in one setting, reporting improvement from 69.87 to 70.81 average on DeepSeek-R1-Distill-Qwen-7B. The authors interpret the gains as evidence that partial credit can help near-miss reasoning traces survive Stage 1 selection and become useful scaffolds for Stage 2.
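The partial-credit argument can be illustrated with a toy selection step. The drafts and the graded reward below are hand-written stand-ins, not the paper's judge or its prompt; the point is only to show how a binary reward cannot separate near-misses from poor attempts, while a graded one can.

```python
def select_best(drafts, reward):
    """Return the highest-reward draft; ties go to the earliest draft."""
    return max(drafts, key=reward)

# Hypothetical drafts: none reaches the correct final answer,
# but they differ in how many reasoning steps are sound.
drafts = [
    {"id": "a", "final_correct": False, "sound_steps": 1, "total_steps": 5},
    {"id": "b", "final_correct": False, "sound_steps": 4, "total_steps": 5},
    {"id": "c", "final_correct": False, "sound_steps": 2, "total_steps": 5},
]

def binary_reward(draft):
    # Rule-based reward: all or nothing on the final answer.
    return 1.0 if draft["final_correct"] else 0.0

def graded_reward(draft):
    # Assumed stand-in for a judge score: partial credit for sound steps,
    # capped below 1.0 so a correct final answer always ranks highest.
    if draft["final_correct"]:
        return 1.0
    return 0.8 * draft["sound_steps"] / draft["total_steps"]

print(select_best(drafts, binary_reward)["id"])  # a (all drafts tie at 0.0)
print(select_best(drafts, graded_reward)["id"])  # b (the near-miss)
```

With the binary reward, the selected scaffold is effectively arbitrary whenever every draft fails. With partial credit, the draft that is four-fifths right gets through to Stage 2.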

For enterprise use, this is the bridge from hard verification to softer evaluation. Many business tasks are not binary. A generated compliance memo can be legally incomplete without being entirely wrong. A customer-support answer can be mostly correct but poorly grounded. A data-cleaning script can pass some checks and fail others. If a judge, rubric, or evaluator can assign a useful scalar score, iGRPO-like training becomes more plausible.

But “plausible” is not “solved.” A generative judge introduces its own failure modes: bias, inconsistency, reward hacking, overfitting to the judge’s preferences, and cheerful nonsense with a confident score. The paper shows compatibility with richer scalar rewards. It does not remove the ancient curse of evaluation design. Sadly, someone still has to do the work.

The entropy analysis adds another clue. The authors report that GRPO’s per-token policy entropy collapses faster, while iGRPO maintains higher mid-training entropy before converging near GRPO later. The interpretation is that draft-conditioned refinement sustains exploration longer. That is mechanically credible: if the model is conditioned on a strong but imperfect scaffold, it can explore ways to repair and extend it rather than prematurely locking into one completion pattern.
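For readers who want the quantity pinned down, per-token policy entropy is the Shannon entropy of the next-token distribution, averaged over positions. The sketch below uses tiny four-token distributions as hypothetical inputs; how the paper averages across sequences and batches is not specified here, so treat the averaging as an assumption.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_token_entropy(distributions):
    """Average entropy over a sequence of next-token distributions."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)

# Hypothetical distributions over a four-token vocabulary.
peaked = [[0.97, 0.01, 0.01, 0.01]] * 3   # policy has nearly committed
spread = [[0.25, 0.25, 0.25, 0.25]] * 3   # policy is still exploring

print(f"peaked: {mean_token_entropy(peaked):.3f} nats")
print(f"spread: {mean_token_entropy(spread):.3f} nats")
```

A curve of this quantity falling toward zero early in training is what "entropy collapse" refers to: the policy stops spreading probability across alternatives.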

This matters because reinforcement learning for reasoning often faces a mode-collapse problem. Once the model finds a rewarded style, it may overproduce that style even when the problem needs a different route. iGRPO appears to slow that collapse, at least in the reported setting. The business translation is simple: better training dynamics may reduce brittle reasoning habits. The boundary is equally simple: entropy curves are diagnostic evidence, not a guarantee of robust behavior in production.

The cost story is favorable, not free

The paper’s cost claim deserves careful wording.

Under matched rollout budgets, iGRPO does not require more total sampled completions per prompt than GRPO. It redistributes the sampling budget: some completions become first-stage drafts, and the rest become second-stage refinements. This supports the authors’ argument that the dominant generation budget can be comparable.

The appendix then reports a more concrete resource comparison. Peak memory is essentially unchanged: 54.9286 GB for GRPO versus 54.9349 GB for iGRPO in the measured setup. Throughput drops from 0.41 samples per second to 0.34. Total GPU hours rise from 83.3 to 94.1, or about 13% more training time.
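The relative changes follow directly from those figures. The snippet below is plain arithmetic on the numbers quoted above; the dictionary keys are illustrative labels.

```python
# Figures quoted above for the measured setup.
grpo = {"peak_mem_gb": 54.9286, "samples_per_s": 0.41, "gpu_hours": 83.3}
igrpo = {"peak_mem_gb": 54.9349, "samples_per_s": 0.34, "gpu_hours": 94.1}

def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before

changes = {key: pct_change(grpo[key], igrpo[key]) for key in grpo}

for key, change in changes.items():
    print(f"{key:14s} {change:+.2f}%")
```

Run as written, this reports roughly +0.01% peak memory, about -17% throughput, and about +13% GPU hours.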

That is a good trade only if the task value justifies the accuracy gain. For competition math, better AIME scores are the point. For a business system, the calculation is different.

If a workflow is high-volume and latency-sensitive, avoiding inference-time multi-sampling can be valuable. Paying 13% more during training may be attractive if it prevents permanent inference-time bloat. If the use case is low-volume, high-stakes, and already uses multiple model calls for review, the economics may look different. If the company cannot build reliable reward functions, the entire method becomes an elegant diagram taped to a locked door.
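A back-of-envelope version of that trade-off is sketched below. Only the extra GPU hours come from the paper; the per-sample inference cost and the best-of-8 alternative are hypothetical placeholders, and the comparison assumes, purely for illustration, that both options reach acceptable accuracy, which the paper does not claim.

```python
def break_even_requests(extra_training_gpu_hours,
                        gpu_seconds_per_sample,
                        extra_samples_per_request):
    """Requests after which one-time extra training costs less GPU time
    than paying for extra samples on every request."""
    extra_training_seconds = extra_training_gpu_hours * 3600.0
    extra_seconds_per_request = gpu_seconds_per_sample * extra_samples_per_request
    return extra_training_seconds / extra_seconds_per_request

# Extra training time from the reported figures: 94.1 - 83.3 GPU hours.
extra_hours = 94.1 - 83.3

# Hypothetical deployment numbers: 2 GPU-seconds per sampled completion,
# and best-of-8 inference (7 extra samples) as the alternative.
n_requests = break_even_requests(
    extra_hours, gpu_seconds_per_sample=2.0, extra_samples_per_request=7
)
print(f"break-even after about {n_requests:,.0f} requests")
```

Under these made-up deployment numbers the extra training pays for itself after a few thousand requests. The real answer depends entirely on actual inference costs and traffic.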

| Operational question | iGRPO implication |
| --- | --- |
| Does it increase inference latency? | Not necessarily; the trained model is used single-shot in the paper’s setup |
| Does it require extra training complexity? | Yes; two-stage rollout orchestration and reward-based draft selection are needed |
| Does it require more peak memory? | The reported measurement shows near-identical peak memory in the tested 7B setup |
| Does it increase training time? | Yes; the paper reports about 13% more total GPU hours in the measured setup |
| Does it need a reward model or verifier? | Yes; the method depends on scalar rewards useful enough to rank drafts and train refinements |

The phrase “minimal overhead” is therefore defensible only in relative terms. It does not mean lightweight. The reported setup uses serious infrastructure: A100 GPUs, vLLM generation, large batches, long completions, and reinforcement learning engineering. For small firms, this is not a weekend notebook exercise unless the weekend has a procurement department.

What Cognaptus would infer for applied AI systems

The paper directly shows that iGRPO improves mathematical reasoning benchmarks under controlled training conditions. The business inference is broader but should stay disciplined.

The most promising use cases share four features.

First, the task has a checkable outcome. This may be a unit test, a final numeric answer, a rule-compliance score, a database consistency check, or a structured human rubric.

Second, the model’s first attempts contain useful intermediate structure. iGRPO is less compelling if wrong answers are random garbage. It is more compelling when drafts often contain partial reasoning that can be repaired.

Third, latency matters at deployment. If the business can afford best-of-32 inference every time, training-time refinement is less urgent. If the business needs a fast production model, internalizing refinement during training becomes more attractive.

Fourth, the organization has enough repeated task volume to justify post-training. A boutique workflow with twenty examples does not need iGRPO. It needs process design and maybe a spreadsheet. Not every nail deserves a reinforcement-learning hammer, even if the hammer has a very impressive arXiv abstract.

Here is the practical mapping:

| Paper result | Business interpretation | Boundary |
| --- | --- | --- |
| Best draft becomes training context | Use model-generated partial solutions as scaffolds for learning better final answers | Only works if the best draft can be identified reliably |
| Gains appear under matched rollout budgets | Better training structure can matter, not just more sampling | The comparison is still within controlled benchmark conditions |
| Stronger models still improve, but less dramatically | Refinement training may help residual reasoning errors after standard fine-tuning | ROI shrinks when base performance is already saturated |
| Generative judge improves one setting | Softer rubrics may extend the method beyond exact-answer math | Judge reliability becomes the central risk |
| Inference remains single-shot | Training-time cost may substitute for deployment-time latency | The paper does not prove equal behavior across all production distributions |

For Cognaptus clients, the near-term implication is not “train iGRPO tomorrow.” It is: design AI workflows so that draft quality can be measured. Once a company can score partial outputs consistently, it owns a learning asset. Without that evaluator, every refinement loop becomes vibes with logging.

The strategic asset is not the model alone. It is the combination of task data, scoring rules, failure taxonomies, and post-training loops. iGRPO is one technical expression of that broader pattern.

The boundaries are clear enough to be useful

The paper is strong because its claim is narrow. It does not say that LLMs have discovered introspection. It does not say the model becomes a wise reflective agent. It says that, in verifiable reasoning settings, training on refinements conditioned on the model’s own best rewarded drafts improves Pass@1 performance across several model families.

That leaves several boundaries.

The evidence is concentrated in mathematical reasoning. The paper includes transfer to GPQA and MMLU-Pro in the stronger-base setting, but the training and core evaluation culture remain reasoning-benchmark heavy. Business users should not assume equal gains in open-ended writing, negotiation, legal drafting, or market analysis.

The reward signal is central. In the cleanest setting, rule-based rewards make selection and optimization straightforward. In messier domains, reward quality becomes the bottleneck. A poor evaluator will select polished errors and train the model to refine in the wrong direction. That is not self-improvement. That is self-confident deterioration, the corporate classic.

The infrastructure requirement is also nontrivial. Even if peak memory stays nearly unchanged relative to GRPO in the reported setup, the system still needs RL post-training machinery, rollout management, reward computation, and careful evaluation. The cost comparison is favorable against a similar GRPO setup, not against doing nothing.

Finally, the method improves the trained policy but does not eliminate the need for production controls. A single-shot model that has learned refinement behavior can still fail. In regulated or high-stakes workflows, iGRPO-like training would complement retrieval, validation, monitoring, human review, and fallback logic. It would not replace them.

The real lesson is to teach the model the editing habit

The value of iGRPO is not that it makes models “think twice” at inference time. That would be the simple story, and simple stories are usually where implementation details go to disappear.

The more interesting lesson is that refinement can be moved into training. The model is repeatedly exposed to its own best available attempt, then rewarded for producing something better. Over time, the draft is no longer just an output. It becomes a scaffold in the learning environment.

That is a useful design principle for applied AI. Many business tasks are not solved by asking a model to be brilliant on the first pass. They are solved by building systems where attempts are evaluated, errors are classified, and improvements are reinforced. iGRPO gives that principle a concrete RL form.

For enterprises, the question is not whether every company should train iGRPO models. Most should not, at least not directly. The better question is whether their AI workflows create measurable drafts. If they do, those drafts can become training material. If they do not, the company is merely collecting outputs and calling the folder “knowledge.”

A model that learns from its own best drafts is not human. It is not reflective in the rich cognitive sense. But it may be operationally better at a very useful behavior: start with a decent attempt, notice where the reward signal points, and do better.

In business, that already puts it ahead of many meetings.

Cognaptus: Automate the Present, Incubate the Future.

¹ Ali Hatamizadeh, Shrimai Prabhumoye, Igor Gitman, Ximing Lu, Seungju Han, Wei Ping, Yejin Choi, and Jan Kautz, “iGRPO: Self-Feedback–Driven LLM Reasoning,” arXiv:2602.09000, 2026, https://arxiv.org/html/2602.09000.
