Opening — Why this matters now

The AI industry likes to pretend that training happens in neat, well-funded labs and deployment is merely the victory lap. Reality, as usual, is less tidy. Large language models are increasingly learning after release—absorbing their own successful outputs through user curation, web sharing, and subsequent fine‑tuning. This paper puts a sharp analytical frame around that uncomfortable truth: deployment itself is becoming a training regime.

What looks like harmless post‑deployment iteration turns out to be something more consequential—an implicit form of reinforcement learning, with rewards defined not by researchers, but by users, validators, and social filtering mechanisms.

Background — Context and prior art

Improving LLM reasoning has followed two dominant paths: prompting tricks (Chain‑of‑Thought, Tree‑of‑Thoughts) and explicit reinforcement learning (PPO, GRPO, reward models). Both require deliberate design. Both assume the trainer controls the objective.

But modern LLMs live in a feedback‑rich ecosystem. Users keep answers that work. They discard those that don’t. Correct solutions get shared, scraped, and quietly recycled into future training runs. This phenomenon has been acknowledged anecdotally—especially in the GPT‑3 → GPT‑3.5 → GPT‑4 lineage—but rarely formalized.

The paper steps into that gap by studying iterative deployment under controlled conditions, asking a simple but unsettling question: Can models bootstrap real reasoning improvements just by being repeatedly deployed and selectively copied?

Analysis — What the paper actually does

The authors simulate iterative deployment using classical planning tasks—Blocksworld, Rovers, and Sokoban—domains where success is unambiguous and externally verifiable.

The process is deliberately minimal:

  1. Deploy a base LLM on a fixed set of planning problems.
  2. Validate outputs using an external checker (binary: valid or invalid).
  3. Keep only successful traces.
  4. Aggregate them across generations.
  5. Supervised fine‑tune the next model generation on the aggregated traces.

No reward model. No planner oracle. No curriculum engineering. Just selective survival of correct outputs.

Crucially, when multiple valid solutions exist, only the best is retained: the shortest plan, with the fewest reasoning tokens, as sketched below. This small design choice turns out to matter a lot.
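
Here is a minimal Python sketch of one generation of that loop. The helper names `generate`, `is_valid_plan`, and `finetune`, and the per‑problem sampling budget, are illustrative assumptions of mine, not names or settings from the paper.

```python
def deployment_generation(model, problems, prior_traces, attempts_per_problem=8):
    """One generation: deploy, validate, curate, aggregate, fine-tune.

    `generate`, `is_valid_plan`, and `finetune` are assumed helpers,
    standing in for the model call, the external plan checker, and
    supervised fine-tuning; they are not part of the paper's code.
    """
    new_traces = {}
    for problem in problems:
        valid = []
        for _ in range(attempts_per_problem):      # sampling budget is an assumption
            plan, trace = generate(model, problem)  # deploy the model on the problem
            if is_valid_plan(problem, plan):        # external checker: valid / invalid
                valid.append((plan, trace))
        if valid:
            # Curation: among valid solutions, keep only the best one
            # (shortest plan first, then fewest reasoning tokens).
            _, best_trace = min(valid, key=lambda pt: (len(pt[0]), len(pt[1])))
            new_traces[problem] = best_trace
    # Aggregate successful traces across generations, then supervised fine-tune.
    all_traces = {**prior_traces, **new_traces}
    return finetune(model, all_traces), all_traces
```

Nothing in the loop ever consults a reward model or a planner; the only signal is whether the checker accepted the plan.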

Findings — Results that matter

Across all three domains, performance improves sharply within five generations.

| Domain      | Improvement (Base → Gen‑5) |
|-------------|----------------------------|
| Blocksworld | +196% solved tasks         |
| Rovers      | +401% solved tasks         |
| Sokoban     | +196% solved tasks         |

Later generations don’t just solve more tasks—they solve longer‑horizon problems, producing plans far outside the base model’s distribution. This is genuine generalization, not formatting polish.

Even more interesting: reasoning length does not systematically increase. Unlike many RL‑fine‑tuned models, these systems don’t learn by “thinking louder.” They learn by composing better internal structures from previously successful fragments.

Curation proves decisive. When all traces—valid and invalid—are retained, gains flatten quickly. With curation, performance nearly doubles again while using an order of magnitude less data.

Theoretical core — Deployment is implicit RL

The paper’s most important contribution is theoretical: it proves that supervised fine‑tuning on valid traces is mathematically equivalent to REINFORCE with binary rewards.

In plain terms:

  • A valid output functions as reward = 1
  • An invalid output is reward = 0
  • Keeping only valid traces produces the same gradient direction as policy‑gradient RL

When traces from previous generations are mixed in, the process becomes off‑policy RL with importance weighting—whether the trainer realizes it or not.

This means iterative deployment is not a soft heuristic. It is reinforcement learning in disguise.
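
For readers who want the equivalence spelled out, here is a compact restatement in my own notation; it is a sketch that omits normalization over the set of valid traces, not the paper's exact derivation. Treat the model as a policy π_θ, let x be a problem, y a sampled reasoning trace, and R(x, y) ∈ {0, 1} the validator's verdict.

```latex
% SFT on the current model's own valid-only traces: the gradient of the
% log-likelihood objective is the REINFORCE gradient with a binary reward.
\[
\nabla_\theta J_{\mathrm{SFT}}(\theta)
  = \mathbb{E}_{x}\, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \!\left[ R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \right]
  = \nabla_\theta\, \mathbb{E}_{x}\, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \!\left[ R(x, y) \right],
\qquad R(x, y) \in \{0, 1\}.
\]

% Reusing curated traces sampled by an earlier generation turns the same
% update into off-policy policy gradient with importance weights.
\[
\nabla_\theta J_{\mathrm{off}}(\theta)
  = \mathbb{E}_{x}\, \mathbb{E}_{y \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid x)}
    \!\left[ \frac{\pi_\theta(y \mid x)}{\pi_{\theta_{\mathrm{old}}}(y \mid x)}\,
      R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \right].
\]
```

The only thing separating this from a textbook policy‑gradient setup is that nobody wrote the reward function down; it emerges from the validator.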

Implications — Power, risk, and uncomfortable incentives

This reframing has two consequences the industry cannot ignore.

First: alignment risk. The reward function is implicit, emergent, and opaque. It reflects what users choose to share—not what designers intended. Safety training and post‑deployment reinforcement may silently push in opposite directions.

Second: governance blind spots. Validation mechanisms—human or automated—become de facto value encoders. Bias, laziness, or adversarial behavior in validation doesn’t just affect outputs. It compounds across generations.

Model collapse is not ruled out either. Curation delays it, but does not guarantee immunity.

Conclusion — The quiet loop that trains the future

Iterative deployment works. That’s the uncomfortable takeaway. Models can self‑improve meaningfully, discover longer plans, and generalize—without explicit rewards or expert teachers.

But this also means that every deployment is a policy decision, whether acknowledged or not. The real question is no longer whether models learn after release. It’s whether we understand what they are learning from.

Cognaptus: Automate the Present, Incubate the Future.