Acceptance is a reward, even when nobody writes reward = 1.
Imagine an enterprise deploys an AI agent to generate code, reconcile invoices, or prepare operational plans. Some outputs pass automated checks and enter production. Others fail, disappear into logs, and are never seen again. Months later, the accepted outputs are collected and used to fine-tune the next model.
The company may describe this as supervised learning on production data. The training team may insist that no reinforcement-learning pipeline was involved. Mathematically, that distinction may be less comforting than it sounds.
In Iterative Deployment Improves Planning Skills in LLMs, Augusto B. Corrêa and colleagues examine what happens when a model is repeatedly deployed, its successful outputs are selected, and later generations are fine-tuned on the accumulated successes.1 Their central result is not merely that this loop improves planning performance. It is that supervised fine-tuning on validated traces points in the same update direction as REINFORCE with a binary reward.
The reward function did not vanish. It moved into the validator, the curation rules, and the decision about which outputs survive.
The Deployment Loop Quietly Defines a Training Objective
The mechanism studied in the paper contains four recurring stages:
- Deploy the current model on a collection of tasks.
- Validate the generated plans and reasoning traces.
- Retain valid outputs, selecting the highest-quality trace when several solutions exist.
- Fine-tune the next generation on the accumulated curated traces.
The simplified loop is:
Deploy → Generate → Validate → Curate → Fine-tune → Deploy again
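The loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `iterative_deployment` and every `toy_*` helper are hypothetical names, the "model" is reduced to a single integer skill level, and fine-tuning is assumed to restart from the base model on the accumulated curated traces.

```python
from typing import Callable, Dict, List, TypeVar

Model = TypeVar("Model")
Trace = TypeVar("Trace")


def iterative_deployment(
    base_model: Model,
    tasks: List[str],
    generate: Callable[[Model, str], Trace],
    is_valid: Callable[[str, Trace], bool],
    better: Callable[[Trace, Trace], bool],
    fine_tune: Callable[[Model, Dict[str, Trace]], Model],
    generations: int,
) -> Model:
    """Deploy, validate, curate, and fine-tune for a fixed number of generations."""
    curated: Dict[str, Trace] = {}  # best valid trace per task, accumulated over generations
    model = base_model
    for _ in range(generations):
        for task in tasks:  # deploy the current model and generate a trace
            trace = generate(model, task)
            if not is_valid(task, trace):  # invalid traces never reach training
                continue
            best = curated.get(task)
            if best is None or better(trace, best):  # curate: keep one trace per task
                curated[task] = trace
        # Assumption: each generation is trained from the base model on all curated traces.
        model = fine_tune(base_model, curated)
    return model


# Toy stand-ins (all hypothetical): a task's name is its difficulty, a plan is a list of steps.
def toy_generate(skill: int, task: str) -> List[str]:
    difficulty = int(task)
    # The toy model only solves tasks at most one step beyond its current skill.
    return ["step"] * difficulty if difficulty <= skill + 1 else []


def toy_is_valid(task: str, plan: List[str]) -> bool:
    return len(plan) == int(task)


def toy_better(a: List[str], b: List[str]) -> bool:
    return len(a) < len(b)


def toy_fine_tune(base_skill: int, curated: Dict[str, List[str]]) -> int:
    return base_skill + len(curated)  # more curated successes, higher toy skill


final_skill = iterative_deployment(
    0, ["1", "2", "3", "4", "5"],
    toy_generate, toy_is_valid, toy_better, toy_fine_tune,
    generations=3,
)
print(final_skill)
```

In this toy setup the skill level rises by one task per generation, because each round of training makes exactly one new task solvable. Real models will not behave this neatly; the point is only that the validator and the curation rule sit inside the training loop.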
No stronger teacher model generates solutions. No external planner supplies optimal demonstrations. No deliberately designed reward model scores each response. The model learns from the subset of its own outputs that the validator permits to continue existing.
The paper tests this mechanism using classical planning tasks expressed in the Planning Domain Definition Language, or PDDL. These tasks provide something most real-world AI workflows lack: a deterministic external validator that can state whether a proposed plan actually reaches the required goal.
The model therefore receives a clean selection signal. A valid plan survives. An invalid plan does not.
But validity is only the first filter. When several generations solve the same task, the researchers keep one training example: the trace producing the shortest plan, with fewer reasoning tokens used as the tie-breaker. The empirical pipeline therefore expresses two preferences:
- The plan must work.
- Among working plans, shorter plans are preferred, with fewer reasoning tokens breaking ties.
That second preference looks like an innocent data-cleaning rule. It is also an objective.
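The selection rule described above can be written as a lexicographic comparison. The sketch below is illustrative only: the `Trace` fields and the `curate` function are hypothetical names, and the validity flag is assumed to come from an external validator.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Trace:
    task_id: str
    plan_length: int        # number of actions in the proposed plan
    reasoning_tokens: int   # length of the reasoning trace
    valid: bool             # verdict of the external validator


def curate(traces: Iterable[Trace]) -> Dict[str, Trace]:
    """Keep one valid trace per task: shortest plan, then fewest reasoning tokens."""
    best: Dict[str, Trace] = {}
    for trace in traces:
        if not trace.valid:
            continue  # first filter: the plan must work
        current = best.get(trace.task_id)
        key = (trace.plan_length, trace.reasoning_tokens)
        if current is None or key < (current.plan_length, current.reasoning_tokens):
            best[trace.task_id] = trace  # second filter: prefer the shorter option
    return best


candidates = [
    Trace("task-a", plan_length=12, reasoning_tokens=9000, valid=True),
    Trace("task-a", plan_length=10, reasoning_tokens=14000, valid=True),
    Trace("task-a", plan_length=10, reasoning_tokens=11000, valid=True),
    Trace("task-a", plan_length=6, reasoning_tokens=3000, valid=False),
    Trace("task-b", plan_length=4, reasoning_tokens=2000, valid=False),
]
selected = curate(candidates)
print(selected["task-a"])
```

Here the invalid six-step plan is discarded despite being the shortest, and `task-b` contributes nothing to the training set. Changing the sort key changes what the next generation is pushed toward, which is why the rule behaves like an objective.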
The Model and Validator Build Their Own Curriculum
The loop begins with tasks of varying difficulty. Early generations solve mostly simpler problems. Their valid traces enter the training set. Later generations learn from those traces and begin solving tasks that were previously beyond the model’s successful frontier.
The researchers do not manually arrange a curriculum from easy tasks to hard ones. The curriculum emerges from the interaction between the model’s current capabilities and the validator:
- Tasks below the model’s capability frontier produce usable training traces.
- Tasks beyond the frontier produce invalid traces and disappear.
- Fine-tuning on accumulated successes moves the frontier.
- Newly solvable tasks then become training data for later generations.
This is self-improvement, but not self-improvement in isolation. The validator decides which experiences count as lessons.
An important implementation detail makes the mechanism cleaner than simple serial fine-tuning. For each generation, the researchers fine-tune the original Qwen3 4B base model using the accumulated traces from previous generations rather than stacking new LoRA adapters on top of earlier adapters. The improvement therefore comes from the expanding curated dataset, not merely from repeatedly modifying an already modified model.
Why Fine-Tuning on Successes Resembles REINFORCE
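The paper’s theoretical contribution begins with the standard REINFORCE objective. Suppose a model with parameters $\theta$ generates trace $y$ for task $x$ by sampling from its policy $\pi_\theta(y \mid x)$, and an external validator assigns a binary reward. In standard notation:
$$R(x, y) = \begin{cases} 1 & \text{if } y \text{ is a valid solution for } x \\ 0 & \text{otherwise} \end{cases}$$
The REINFORCE gradient is:
$$\nabla_\theta J(\theta) = \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)} \left[ R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \right]$$
Because invalid traces receive a reward of zero, they contribute nothing to the gradient. Only valid traces remain.
Now consider supervised fine-tuning on the set of valid traces, written here as $\mathcal{D}_{\text{valid}}$:
$$\mathcal{L}_{\text{SFT}}(\theta) = -\frac{1}{|\mathcal{D}_{\text{valid}}|} \sum_{(x, y) \in \mathcal{D}_{\text{valid}}} \log \pi_\theta(y \mid x)$$
Gradient descent on this loss increases the probability of the successful traces. Its negative gradient averages $\nabla_\theta \log \pi_\theta(y \mid x)$ over exactly the traces with $R(x, y) = 1$, so, up to a positive scaling factor, its update direction is the same as gradient ascent under REINFORCE with the binary reward.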
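A small numerical check makes the claim concrete. The sketch below is a toy under stated assumptions, not the paper's setup: the "policy" is a softmax over four discrete actions, the validator is a hypothetical set `valid_actions`, and gradients are taken with respect to the logits, where the gradient of the log-probability of action `y` is the one-hot vector for `y` minus the probability vector.

```python
import math
import random


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(probs, y):
    # Gradient of log softmax(logits)[y] with respect to the logits: onehot(y) - probs.
    return [(1.0 if k == y else 0.0) - p for k, p in enumerate(probs)]


random.seed(0)
logits = [0.2, -0.5, 1.0, 0.1]   # hypothetical policy parameters
valid_actions = {1, 3}           # hypothetical validator: these actions earn reward 1
probs = softmax(logits)
samples = random.choices(range(len(logits)), weights=probs, k=1000)

# REINFORCE estimate: average of reward * grad log-prob over ALL sampled actions.
reinforce = [0.0] * len(logits)
for y in samples:
    reward = 1.0 if y in valid_actions else 0.0
    g = grad_log_prob(probs, y)
    reinforce = [acc + reward * gi / len(samples) for acc, gi in zip(reinforce, g)]

# SFT: gradient of the mean negative log-likelihood over the VALID samples only.
kept = [y for y in samples if y in valid_actions]
sft_grad = [0.0] * len(logits)
for y in kept:
    g = grad_log_prob(probs, y)
    sft_grad = [acc - gi / len(kept) for acc, gi in zip(sft_grad, g)]

# A descent step moves along -sft_grad; the two directions differ only by this factor.
scale = len(kept) / len(samples)
same_direction = all(
    math.isclose(r, -scale * s, abs_tol=1e-9) for r, s in zip(reinforce, sft_grad)
)
print(same_direction)
```

The scaling factor is simply the fraction of samples that passed validation. It changes the step size, not the direction, which is the sense in which filtering plus fine-tuning acts like a binary reward.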
The paper formalizes this result in three steps:
| Proposition | What it establishes | Practical interpretation |
|---|---|---|
| SFT on valid on-policy traces | Its gradient direction matches REINFORCE with binary rewards | Keeping only successful current outputs acts like rewarding them |
| SFT mixing current and earlier valid traces | Earlier-generation traces can be represented through importance-weighted contributions | Historical successful traces remain part of the effective learning signal |
| Iterative deployment | The complete repeated loop is a special case of REINFORCE with implicit binary rewards | A production-data pipeline can behave like RL without being labelled as RL |
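One standard way to read the second row is through the usual importance-sampling identity. The notation here is an assumption made for illustration and may differ from the paper's exact statement: $\pi_\theta$ is the current policy, $\pi_{\text{old}}$ is the earlier generation that produced a stored trace, and $R(x, y) \in \{0, 1\}$ is the validator's verdict. Up to a positive normalizing constant, fine-tuning on valid traces from the earlier generation follows the left-hand side, which can be rewritten as an expectation under the current policy:

$$
\mathbb{E}_{y \sim \pi_{\text{old}}(\cdot \mid x)} \left[ R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \right]
= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \left[ \frac{\pi_{\text{old}}(y \mid x)}{\pi_\theta(y \mid x)}\, R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \right]
$$

The identity holds whenever $\pi_\theta(y \mid x) > 0$ for every valid trace that $\pi_{\text{old}}$ can produce. The ratio $\pi_{\text{old}} / \pi_\theta$ is the importance weight: traces that the earlier generation favoured more than the current one count for more in the effective on-policy signal.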
This equivalence should not be exaggerated. The paper does not claim that the pipeline reproduces every property of modern RL training. It does not introduce advantage estimation, explicit exploration incentives, preference comparisons, or richly graded rewards. It shows that the direction of the successful-trace update corresponds to a particular binary-reward policy-gradient case.
That narrower result is already consequential.
The loop learns from what succeeds, but receives no direct explanation of why failed traces failed. It also does not learn from outcomes that the validator cannot observe. If the validator checks only whether a workflow completes, the effective reward says nothing about whether it completed safely, fairly, cheaply, or for the right reason.
The absence of an explicit reward model is therefore not the absence of a reward. It is the absence of a clearly documented one.
The Experiment Is Designed to Isolate the Loop
The authors use Qwen3 4B Thinking 2507 as the base model and test it on three classical planning domains. Each domain contains 1,000 tasks of varying difficulty, and a separate model is trained for each domain.
| Domain | Planning challenge | What makes it useful for the experiment |
|---|---|---|
| Blocksworld | Rearrange stacked blocks into a target configuration | Plans have understandable dependencies and a known simple upper-bound strategy |
| Rovers | Coordinate rovers, cameras, samples, waypoints, and communication | Tests sequencing across several interacting requirements |
| Sokoban | Push boxes through a grid without creating irreversible dead ends | Requires longer-horizon planning and can become computationally difficult |
The model receives a prompt containing the domain definition, the target task, and two example plans from unrelated Gripper and Logistics domains. It must then produce a reasoning trace and a proposed plan. The VAL planning validator checks whether the plan is valid.
Each new generation is trained using LoRA on the accumulated curated traces. The main experiments run for five generations and are repeated three times. The paper reports:
- Average tasks solved across the three runs.
- unanimous@3, the number of tasks solved by all three runs.
- The lengths of successfully generated plans.
- The average number of reasoning tokens.
- An ablation comparing curated training with training on almost all generated traces.
These measurements serve different purposes. The solved-task counts are the main evidence. unanimous@3 checks consistency under sampling variation. Plan-length analysis tests whether gains extend beyond formatting and basic instruction following. Reasoning-token length is a diagnostic. The no-curation comparison is an ablation testing whether selection is actually necessary.
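The two headline metrics are simple set operations. The sketch below is illustrative: `average_solved` and `unanimous` are hypothetical helper names, and the task identifiers are made-up toy data rather than results from the paper.

```python
from statistics import mean
from typing import List, Set


def average_solved(runs: List[Set[str]]) -> float:
    """Mean number of tasks solved per run."""
    return mean(len(solved) for solved in runs)


def unanimous(runs: List[Set[str]]) -> int:
    """Number of tasks solved in every run (unanimous@3 when there are three runs)."""
    return len(set.intersection(*runs))


# Toy data: the set of task IDs each of three independent runs solved.
runs = [
    {"t1", "t2", "t3"},
    {"t2", "t3"},
    {"t2", "t3", "t4", "t5"},
]
print(average_solved(runs), unanimous(runs))
```

In the toy data the runs solve three tasks on average but agree on only two. A gap between the two numbers indicates that some successes depend on sampling luck.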
Treating every figure as a separate breakthrough would be enthusiastic, but not analytical.
Solved-Task Counts Rise Sharply, Then Begin to Plateau
After five generations, the average number of solved tasks rises substantially in all three domains.
| Domain | Base model | Generation 5 | Relative increase | Generation-5 solve rate |
|---|---|---|---|---|
| Blocksworld | 52.0 | 154.0 | 196% | 15.4% |
| Rovers | 41.0 | 205.6 | 401% | 20.6% |
| Sokoban | 32.6 | 96.6 | 196% | 9.7% |
The relative improvements are large. The absolute results deserve equal attention.
Rovers performance grows roughly fivefold, but generation 5 still solves only about one-fifth of the 1,000 tasks. Sokoban nearly triples, yet more than nine out of ten tasks remain unsolved. The mechanism expands capability meaningfully; it does not produce a reliable general-purpose planner after five rounds of fine-tuning.
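The relative increases and solve rates in the table follow directly from the reported averages. The short check below reproduces them; the function names are illustrative, and the only inputs are the figures quoted in the table and the 1,000-task pool size.

```python
TOTAL_TASKS = 1000  # tasks per domain, as reported

# (base model average, generation-5 average) of solved tasks, from the table above
results = {
    "Blocksworld": (52.0, 154.0),
    "Rovers": (41.0, 205.6),
    "Sokoban": (32.6, 96.6),
}


def relative_increase(base: float, final: float) -> float:
    """Percentage growth from base to final."""
    return (final - base) / base * 100


def solve_rate(solved: float, total: int = TOTAL_TASKS) -> float:
    """Percentage of the task pool that is solved."""
    return solved / total * 100


for domain, (base, gen5) in results.items():
    print(f"{domain}: +{relative_increase(base, gen5):.0f}%, {solve_rate(gen5):.1f}% solved")
```

Reporting both columns side by side is the design choice worth copying: a 401% relative gain and a 20.6% solve rate describe the same Rovers result.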
The trajectory is also not monotonic. Most improvement occurs during the first three generations. Blocksworld falls from 148.6 solved tasks in generation 3 to 142.0 in generation 4 before recovering to 154.0. Rovers peaks at 206.6 in generation 3, dips in generation 4, and recovers to 205.6 in generation 5, just below that peak.
Some variation is expected because inference is stochastic. More importantly, the pattern suggests diminishing returns once the readily learnable successful traces have been absorbed. Repeating the loop does not guarantee the same marginal improvement forever. Eventually, the model may need a broader task distribution, better validation, different exploration, or new external knowledge.
Consistency Improves Alongside Average Performance
The unanimous@3 results strengthen the main finding. This metric counts only tasks solved in all three independent runs.
| Domain | Base unanimous@3 | Generation-5 unanimous@3 |
|---|---|---|
| Blocksworld | 5 | 38 |
| Rovers | 1 | 105 |
| Sokoban | 9 | 60 |
The gains therefore are not explained solely by one unusually fortunate sampling run. Later generations solve more tasks consistently.
Still, unanimous@3 measures repeatability within this experimental setup. It does not establish robustness to new domains, altered prompts, noisy validators, or operational distribution shifts. Consistency is improving, but inside a carefully controlled box.
Longer Plans Answer the Formatting-Trick Objection—Partly
A common objection to iterative fine-tuning is that apparent improvement may reflect superficial learning. Perhaps the model merely becomes better at following the required output format, writing valid PDDL syntax, or avoiding minor procedural errors.
The paper addresses this objection by examining the lengths of solved plans.
In Blocksworld, the base model mostly finds plans of up to approximately 20 steps, while generation 5 finds many plans reaching roughly 35 steps. Later Sokoban generations also solve more tasks requiring longer plans. In Rovers, later generations produce more solutions in the 10-to-30-step range and include several longer outliers.
This analysis supports a stronger interpretation than simple formatting improvement: later generations reach tasks requiring longer action sequences.
The exact boundary matters. The researchers use the same fixed collection of tasks for deployment, trace collection, and evaluation. Later generations are solving instances from that pool that earlier generations failed to solve. Their solutions were not previously available as training examples, so the result indicates transfer from solved tasks to previously unsolved tasks. But this is not a conventional held-out test on entirely unseen instances.
It is evidence of capability expansion across the difficulty range of a fixed benchmark pool. Calling it unrestricted out-of-distribution generalization would be doing the conclusion an unnecessary favour.
The plan-length evidence also does not show that later generations produce increasingly long solutions for the same already-solvable task. The paper reports no clear trend when comparing plan lengths produced by different generations for the same task; they are often similar. The main change is that later models reach harder tasks, not that they become fond of unnecessarily elaborate plans. A welcome restraint.
More Reasoning Tokens Are Not Driving the Improvement
Reinforcement-trained reasoning models often increase their inference-time reasoning length as training progresses. The authors therefore examine whether later generations improve simply by generating longer traces.
They do not find a consistent increase.
Average reasoning-token counts remain broadly similar across generations. Blocksworld and Sokoban show slight declines, while Rovers shows a slight increase. The base-to-generation-5 differences are around 2,000 tokens per domain, small relative to total traces that average roughly 15,000 to 20,000 tokens.
This result is best interpreted as a diagnostic:
- It weakens the explanation that performance gains come merely from spending more tokens.
- It does not prove that the model has learned a particular superior internal reasoning strategy.
- It does not show that reasoning traces have become more faithful or interpretable.
A looser telling of this story might be tempted to say that the model learns to “think better, not longer.” The experiment supports only “not systematically longer.” The “better” part is inferred from successful plans, not observed directly through an inspection of internal reasoning quality.
Curation Is the Engine, Not a Cleanup Step
The most operationally useful experiment is the curation ablation.
For Blocksworld, the researchers repeat the iterative process without filtering for valid plans. Almost all generated traces from previous generations are used for fine-tuning, except traces that reach the maximum generation limit.
| Generation | With curation | Without curation |
|---|---|---|
| Base | 52.0 | 52.0 |
| Generation 1 | 109.6 | 71.0 |
| Generation 2 | 132.0 | 82.0 |
| Generation 3 | 148.6 | 84.3 |
| Generation 4 | 142.0 | 84.0 |
| Generation 5 | 154.0 | 79.3 |
The uncurated version still improves over the base model. Exposure to more traces may teach syntax, formatting, and instruction-following patterns even when many plans are invalid.
But the gains flatten quickly. By generation 5, the curated model solves 154.0 tasks on average, compared with 79.3 for the uncurated model: approximately 94% more.
The data requirement moves in the opposite direction. The curated generation-5 model uses 356 traces, while the uncurated run uses 4,017. The curated pipeline performs better using less than one-tenth as many traces.
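Both ratios can be reproduced from the reported generation-5 figures. The snippet below is a plain arithmetic check; the variable names are illustrative.

```python
# Generation-5 Blocksworld figures, as reported in the ablation
curated_solved, uncurated_solved = 154.0, 79.3
curated_traces, uncurated_traces = 356, 4017

# How many more tasks the curated model solves, as a percentage of the uncurated result
extra_solved_pct = (curated_solved / uncurated_solved - 1) * 100

# Curated training-set size as a fraction of the uncurated one
trace_fraction = curated_traces / uncurated_traces

print(f"{extra_solved_pct:.0f}% more tasks solved")
print(f"{trace_fraction:.1%} of the traces")
```

The first ratio comes out at about 94% and the second at about 8.9%, consistent with "approximately 94% more" and "less than one-tenth as many traces".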
The paper’s sentence describing these figures repeats “with curation” for both trace counts, but the experiment and surrounding comparison make clear that 4,017 refers to the without-curation run.
This is more than a data-efficiency result. It shows that the filter shapes what the model learns more strongly than raw data volume does.
The empirical curation rule is also quite selective. When several valid outputs exist for one task, the pipeline keeps only the shortest plan, breaking ties using reasoning-token count. The researchers report that alternative variants—such as keeping several traces for one task or breaking ties randomly—reduced performance.
The lesson is not merely “use synthetic data carefully.” It is that the selection policy is part of the learning algorithm.
For Businesses, the Validator Becomes the Product
The paper directly demonstrates iterative improvement in deterministic planning tasks. Translating that result into business practice requires separating evidence from inference.
| Level | Statement |
|---|---|
| What the paper directly shows | A small language model can improve on fixed-domain planning tasks by repeatedly training on its own validated and selectively curated successes |
| What Cognaptus infers | Similar loops may be valuable in enterprise workflows where outcomes can be checked reliably and successful traces contain reusable procedural knowledge |
| What remains uncertain | Whether the same gains appear in open-ended, subjective, noisy, changing, or safety-critical workflows |
The strongest candidates for iterative deployment are tasks with validators that are cheaper and more reliable than producing expert demonstrations.
Possible examples include:
- Code generation checked by compilation, unit tests, static analysis, and security tests.
- SQL generation checked against schemas, execution constraints, and reconciled outputs.
- Data-transformation workflows validated through type rules, totals, and consistency checks.
- Tool-using agents checked against permitted state transitions and completed API actions.
- Mathematical or symbolic tasks with deterministic verification.
- Document-processing pipelines whose extracted values can be reconciled against trusted records.
In these settings, organizations may not need experts to write every ideal trace. They can let the deployed system generate candidate traces, validate the results, and retrain on selected successes.
The resulting operating model is less like conventional data labelling and more like managing a production learning loop.
| Component | Operational responsibility | Typical hidden risk |
|---|---|---|
| Task distribution | Decide which situations the model encounters | Frequent tasks crowd out rare but important cases |
| Validator | Determine which outputs count as successful | Measurable completion substitutes for genuine quality |
| Curation policy | Choose among successful traces | Efficiency preferences erase necessary caution or diversity |
| Historical replay | Decide which prior traces remain influential | Outdated behaviours persist after requirements change |
| Fine-tuning process | Convert selected traces into a new model | Narrow success causes broader capability drift |
| Release gate | Decide whether the new generation is deployable | Average gains conceal regressions in critical subgroups |
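The release-gate risk in the last row can be made concrete with a per-subgroup check. This is a minimal sketch with hypothetical names and made-up scores, not a recommended threshold or a description of any specific production system.

```python
from typing import Dict, List


def regressed_subgroups(
    old_scores: Dict[str, float],
    new_scores: Dict[str, float],
    max_drop: float = 0.0,
) -> List[str]:
    """Return subgroups whose score fell by more than max_drop between generations."""
    return [
        group
        for group, old in old_scores.items()
        # A subgroup missing from the new evaluation is treated as a regression.
        if new_scores.get(group, 0.0) < old - max_drop
    ]


# Hypothetical success rates per task subgroup for two model generations.
old_scores = {"frequent_cases": 0.80, "rare_critical_cases": 0.70}
new_scores = {"frequent_cases": 0.95, "rare_critical_cases": 0.60}

old_avg = sum(old_scores.values()) / len(old_scores)
new_avg = sum(new_scores.values()) / len(new_scores)
blocked_by = regressed_subgroups(old_scores, new_scores)

print(new_avg > old_avg, blocked_by)
```

In this toy example the average improves while the rare, critical subgroup gets worse. A gate that looks only at the average would ship the new generation; a gate that checks each subgroup would hold it back.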
The expensive asset may therefore cease to be the training corpus itself. It may become the validator architecture: the collection of tests, reconciliation rules, policy checks, and exception detectors that determine what the model is allowed to learn from.
A weak validator produces cheap training data and expensive mistakes. Very efficient, in the least helpful sense of the word.
An Implicit Reward Can Conflict With the Official Policy
The same mechanism that creates business value also creates the paper’s central safety concern.
In explicit reinforcement learning, designers at least attempt to specify the objective. They can inspect the reward function, test it, add constraints, and argue about its failures in meetings long enough to create several additional meetings.
Under iterative deployment, the effective reward may arise indirectly from:
- Which outputs users accept or publish.
- Which responses are retained in codebases or operational records.
- Which failures are visible to automated tools.
- Which successful-looking traces enter future training data.
- Which candidate outputs are judged simpler, cheaper, or more engaging.
These signals express revealed preferences, not necessarily declared policies.
Suppose a customer-service agent is rewarded implicitly whenever a case is closed. If faster closure is easier to observe than customer understanding, the curated dataset may favour abrupt resolutions. Suppose a coding agent is selected mainly on passing functional tests. It may learn patterns that satisfy the tests while accumulating security or maintainability problems outside the validator’s coverage.
The validator does not need malicious intent to create a harmful objective. It only needs blind spots.
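A toy example shows how a blind spot turns into a training signal. Everything here is hypothetical: two candidate implementations of a discount function, a functional validator that only checks ordinary inputs, and a stricter constraint check that the functional validator lacks.

```python
def candidate_careful(price: float, pct: float) -> float:
    """Applies a discount and rejects out-of-range percentages."""
    if not 0 <= pct <= 100:
        raise ValueError("discount must be between 0 and 100")
    return price * (100 - pct) / 100


def candidate_careless(price: float, pct: float) -> float:
    """Applies a discount with no input checks."""
    return price * (100 - pct) / 100


def functional_validator(fn) -> bool:
    """Checks ordinary cases only. This is the validator's whole view of 'success'."""
    return fn(200, 50) == 100 and fn(80, 0) == 80


def constraint_validator(fn) -> bool:
    """Checks that an out-of-range discount is rejected, not silently applied."""
    try:
        fn(100, 150)
    except ValueError:
        return True
    return False


for candidate in (candidate_careful, candidate_careless):
    print(candidate.__name__, functional_validator(candidate), constraint_validator(candidate))
```

Both candidates pass the functional validator, so both would enter the curated dataset with the same implicit reward. If the careless version is also shorter, a "prefer the simpler output" curation rule would favour it.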
The paper also warns that accumulated implicit gradients may conflict with explicit safety training. A model may be aligned before deployment toward one set of behaviours, then repeatedly fine-tuned on accepted production traces that favour another. If the selection process is opaque, teams may observe a behavioural shift without recognizing the objective that produced it.
This safety argument is theoretical rather than experimentally demonstrated in the paper. No experiment measures an actual conflict between deployment curation and alignment training. Nevertheless, the theoretical connection makes the governance question concrete: post-deployment data selection is capable of exerting reward-like pressure, so it should be governed as such.
A Safer Iterative-Deployment Programme Needs Counter-Rewards
Organizations adopting this approach should not treat “valid output” as a complete definition of success. At minimum, the learning loop needs several distinct evaluation layers:
- Functional validity: Did the output complete the requested task?
- Quality and efficiency: Was the result accurate, concise, and operationally useful?
- Constraint compliance: Did it respect security, legal, safety, and policy requirements?
- Coverage: Does the validator detect failures across rare and difficult cases?
- Held-out performance: Does the new generation improve on tasks that never entered the training loop?
- Regression monitoring: Which previously reliable capabilities deteriorated?
- Reward audit: What observable behaviour is the curation process actually favouring?
The last question is the least convenient and the most important.
A production team may believe it is selecting “good answers.” The actual data pipeline may be selecting answers that are easy to verify, quickly accepted, frequently shared, or compatible with existing systems. Those are not equivalent objectives.
Any iterative-deployment system should therefore maintain trace lineage, validator-version records, curated-data snapshots, held-out safety evaluations, and rollback capability. Once the curation mechanism is understood as an implicit reward, these controls stop looking like optional data hygiene.
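As an illustration of what trace lineage and data snapshots might look like, the sketch below records which model generation, validator version, and curation rule produced each training example, and fingerprints a curated dataset so it can be audited or rolled back. The field names and the `snapshot_digest` helper are assumptions made for this example, not an established schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Iterable


@dataclass(frozen=True)
class TraceRecord:
    trace_id: str
    task_id: str
    model_generation: int    # which generation produced the trace
    validator_version: str   # which validator accepted it
    curation_rule: str       # which selection rule kept it
    accepted: bool


def snapshot_digest(records: Iterable[TraceRecord]) -> str:
    """Order-independent SHA-256 fingerprint of a curated dataset snapshot."""
    ordered = sorted(records, key=lambda r: r.trace_id)
    payload = json.dumps([asdict(r) for r in ordered], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


records = [
    TraceRecord("tr-002", "task-b", 2, "validator-1.1", "shortest-plan", True),
    TraceRecord("tr-001", "task-a", 1, "validator-1.0", "shortest-plan", True),
]
digest = snapshot_digest(records)

# Traces accepted by an older validator can be found and re-checked after an upgrade.
needs_recheck = [r.trace_id for r in records if r.validator_version == "validator-1.0"]
print(digest[:12], needs_recheck)
```

Storing the validator version next to each trace is what makes a reward audit possible later: when a blind spot is discovered, the affected training examples can be identified instead of guessed at.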
What the Paper Does Not Yet Establish
The study provides a clean mechanism and compelling controlled evidence. Its boundaries materially affect how the results should be used.
The Validator Is Much Cleaner Than Most Business Feedback
Classical planning offers deterministic actions, full observability, discrete states, and an exact validator. Many enterprise tasks involve ambiguous goals, delayed outcomes, incomplete information, and disagreements over what success means.
A binary validator works beautifully when validity is genuinely binary. Reality has a habit of submitting incomplete specifications.
The Main Results Come From One Model and Domain-Specific Fine-Tuning
The experiments use Qwen3 4B Thinking 2507, with a separate fine-tuned model for each of the three main domains. The evidence does not establish that the same process improves larger models, non-reasoning models, multimodal systems, or a single agent operating across diverse enterprise workflows.
The Task Pool Is Reused
The same set of tasks is used for trace collection and evaluation. Later generations solve tasks whose successful traces were not previously available, so the result cannot be dismissed as simply memorizing their solutions. But the experiment does not provide a conventional held-out evaluation on a separate unseen task set.
The longer-plan result should therefore be read as transfer toward previously unsolved tasks within the benchmark pool.
The Study Does Not Compare Total Costs
The curated pipeline uses far fewer traces than the uncurated version, but the paper does not compare total cost against explicit RL, expert demonstrations, stronger teacher models, or alternative synthetic-data methods. Validator development, repeated inference, fine-tuning, governance, and regression testing all consume resources.
Fewer training traces do not automatically mean lower total cost.
Curation Does Not Yet Defeat Model Collapse
The authors report that experiments extended to ten generations did not show imminent signs of collapse in the planning setting, although improvements slowed after generation five. This is an exploratory observation, not evidence that curation prevents long-term model collapse.
Curation may delay collapse, prevent it under certain conditions, or introduce different forms of narrowing. The paper leaves that question open.
The Safety Risk Is Plausible but Not Empirically Tested
The theoretical equivalence shows why curation behaves like a reward signal. The paper does not experimentally demonstrate biased validators causing accumulated harmful behaviour or overriding prior safety training. That remains an important research and governance hypothesis rather than an observed result here.
Deployment Is Training Whenever Its Outputs Return
The paper begins with a simple loop: generate outputs, keep the successful ones, and train again.
Its significance comes from recognizing what that loop really does.
Selective reuse allows a model to expand its planning capability without benchmark solutions from an expert planner or demonstrations from a stronger teacher. Within five generations, Qwen3 4B solves substantially more tasks, reaches longer-horizon problems, and becomes more consistent. Careful curation produces much stronger results than indiscriminate reuse while requiring far fewer traces.
But the mechanism works because selection defines an objective.
Every validator rewards what it can see. Every curation rule prefers some successful behaviours over others. Every accepted trace increases the probability of similar traces appearing again. Once deployment outputs return to the training pipeline, operational choices become training choices.
The useful business question is therefore not simply whether production traces can improve the next model. They can, under the right conditions.
The harder question is whether the organization understands what its production system is rewarding before the next generation learns the answer for it.
Cognaptus: Automate the Present, Incubate the Future.