A failed automation run usually tells you more than a successful one.

A coding agent compiles the wrong program and receives a concrete error. A web-navigation agent clicks into the wrong product page and sees that the attributes do not match. A task agent tries an invalid action and the environment complains, patiently, like a machine that has seen too much. In each case, the system does not merely say “failed.” It gives clues.

The odd thing is that much of today’s agent training still treats those clues as background noise. It rewards the successful trajectory, penalizes or ignores the failed one, and then asks the model to try again. This is not learning from mistakes. It is learning from survival bias, with a GPU bill attached.

The paper “Internalizing Agency from Reflective Experience” introduces LEAFE, short for Learning Feedback-Grounded Agency from Reflective Experience [1]. Its argument is simple enough to be uncomfortable: if autonomous agents are expected to recover from mistakes, then training should not merely reinforce trajectories that happen to end well. It should teach the model where a failing trajectory went wrong, how to roll back, and which corrected action should have been taken.

That distinction—retry versus recovery—is the useful lens for reading the paper. LEAFE is not just another agent benchmark entry with a slightly taller bar chart. It is a comparison between two training philosophies:

| Training philosophy | What it rewards | What it tends to learn | Deployment behavior |
|---|---|---|---|
| Outcome-driven RLVR / GRPO | Final success | More probability on already-successful paths | Better first guesses, but still retry-heavy |
| Reflective-experience training | Diagnosed correction points | How to revise a failing trajectory | More recoverable behavior under feedback |

The business interpretation follows from that table. If the production cost of agents is dominated by retries, search, voting, and manual recovery, then the valuable capability is not merely “getting more answers right.” It is reducing the number of times the system needs luck, brute force, or human cleanup.

The real enemy is not failure; it is uninformative failure

Reinforcement Learning with Verifiable Rewards, or RLVR, has become a favored recipe for post-training reasoning and agentic models. The appeal is obvious. If a task has a verifier—unit tests, environment success, exact answers, game completion—then the system can sample trajectories, score them, and reinforce the successful ones. No human annotator needs to decide whether every intermediate step was sensible. Elegant, scalable, and just vague enough to sound inevitable.

For single-turn problems, that can work well. For long-horizon agents, the paper argues that it leaves a structural gap. A terminal reward says whether the whole trajectory succeeded. It usually does not say which decision made the trajectory unrecoverable, which observation should have changed the model’s plan, or what action should replace the bad one.

This is where the paper’s phrase “distribution sharpening” matters. GRPO-style outcome training can increase the probability of trajectories that the base model already had somewhere in its repertoire. That improves Pass@1: the chance that a single rollout succeeds. But it may do much less for Pass@k at larger sampling budgets, where the question becomes: does the model’s policy distribution contain a wider range of workable behaviors, or has training merely made it more confident in a narrow set of familiar moves?

This distinction is not academic bookkeeping. In deployed agents, Pass@1 approximates the “one clean run” experience. Pass@128, in the paper’s evaluation, is closer to a capability-boundary probe: if we sample many independent rollouts from the trained policy, does the model contain more ways to solve the task?

A model that improves Pass@1 without improving large-budget Pass@k may look better in a demo but remain brittle in operation. A model that improves large-budget Pass@k is more likely to have expanded the set of recoverable strategies. The former is polish. The latter is competence. Naturally, the former is easier to sell.
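The polish-versus-competence gap is easy to see with toy numbers. The sketch below is illustrative only: the per-task success probabilities are invented, rollouts are assumed to be independent, and the names `at_least_one_success` and `mean_pass_at_k` are hypothetical helpers, not anything from the paper.

```python
def at_least_one_success(p: float, k: int) -> float:
    """Chance that at least one of k independent rollouts succeeds,
    given a per-rollout success probability p."""
    return 1.0 - (1.0 - p) ** k


def mean_pass_at_k(per_task_p: list[float], k: int) -> float:
    """Average Pass@k over a set of tasks."""
    return sum(at_least_one_success(p, k) for p in per_task_p) / len(per_task_p)


# Invented per-task success probabilities over four tasks.
sharpened = [0.95, 0.95, 0.0, 0.0]   # very confident on two tasks, no route on the rest
broad = [0.60, 0.60, 0.05, 0.05]     # less confident, but some route on every task

for name, policy in [("sharpened", sharpened), ("broad", broad)]:
    print(name,
          "Pass@1:", round(mean_pass_at_k(policy, 1), 3),
          "Pass@128:", round(mean_pass_at_k(policy, 128), 3))
```

Under these made-up numbers the sharpened policy wins at Pass@1 and the broad policy wins at Pass@128, because no amount of sampling rescues a task whose success probability is zero.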

LEAFE turns environment feedback into corrective supervision

LEAFE has two stages. The first stage creates reflective experience. The second stage distills that experience into the model so it no longer needs the explicit reflection prompt at test time.

Stage 1 starts with failed or periodically reviewed trajectories. Instead of throwing them away, the agent is prompted to inspect the interaction history and identify a rollback point: the earlier decision where the trajectory became unproductive. It then writes an experience summary, a compact diagnosis-and-fix instruction grounded in the observed feedback.

The system then resets the environment to that earlier point, replays the previous history up to the rollback point, injects the experience summary, and asks the model to choose a revised action. The result is a branch: not a fresh random retry from the beginning, and not a linear “try again after the last error,” but a counterfactual continuation from the point where the mistake likely entered.

In procedural terms, LEAFE builds an implicit rollback tree. The implementation uses a queue-based breadth-first strategy, expanding branches until the attempt budget or tree depth is exhausted. In the appendix’s Sokoban example, the agent reflects on a failed path, identifies an earlier suboptimal step, resets to that step, and explores a corrected continuation. The point is not that Sokoban is commercially important. The point is that it makes causal recovery visible: an early move can poison the rest of a trajectory, and useful feedback should be attached to that earlier move, not just to the final failure.
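A minimal sketch of queue-based rollback branching, under loud assumptions: the environment is a toy in which a run fails at the first wrong action, the `reflect` function is a stand-in for the LLM reflection step, and revised actions are simply enumerated from a fixed candidate list rather than generated by a model. None of the names below come from the paper's code.

```python
from collections import deque

TARGET = ["open", "read", "edit", "save"]            # hidden solution of a toy task
CANDIDATES = ["open", "read", "edit", "save", "quit"]


def run_env(actions):
    """Toy environment: replay actions, return (success, index of first rejected step)."""
    for i, action in enumerate(actions):
        if action != TARGET[i]:
            return False, i          # diagnostic feedback: where the run went wrong
    return True, None


def reflect(actions, bad_step):
    """Stand-in for reflection: choose a rollback point and write a short summary."""
    summary = f"'{actions[bad_step]}' was rejected at step {bad_step}; try another action"
    return bad_step, summary


def explore(initial, max_attempts=20, max_depth=4, width=2):
    """Breadth-first expansion of rollback branches under an attempt and depth budget."""
    queue = deque([(initial, 0)])    # (action sequence, number of rollbacks so far)
    seen = {tuple(initial)}
    attempts = 0
    while queue and attempts < max_attempts:
        actions, depth = queue.popleft()
        attempts += 1
        ok, bad_step = run_env(actions)
        if ok:
            return actions, attempts
        if depth >= max_depth:
            continue
        # A real system would inject the summary into the prompt; the toy ignores it.
        rollback, _summary = reflect(actions, bad_step)
        children = 0
        for alt in CANDIDATES:
            # Keep the history before the rollback point, revise the action at that point.
            child = actions[:rollback] + [alt] + actions[rollback + 1:]
            if tuple(child) in seen:
                continue
            seen.add(tuple(child))
            queue.append((child, depth + 1))
            children += 1
            if children == width:
                break
    return None, attempts


solution, attempts = explore(["open", "edit", "read", "save"])
print(solution, attempts)
```

The design point the toy preserves is that each child branch shares the history before the rollback step and differs only from that step onward, which is what separates branching from a fresh retry.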

Stage 2 is where the paper becomes more interesting for product teams. LEAFE does not merely store these reflections in an external playbook. It converts them into supervised fine-tuning data.

The training data has two parts:

| Component | Source | Purpose | What it protects against |
|---|---|---|---|
| Behavior rehearsal | Successful rollouts | Preserve competent actions | Forgetting or destabilizing the base agent |
| Experience-to-policy distillation | Rollback branches with corrected actions | Teach the model to choose the corrected action without explicit experience text | Dependence on external reflection at inference time |

The second component is the core move. If experience text at step $t$ leads the model to choose an improved action $a'_t$, LEAFE trains the model to choose $a'_t$ from the original context alone. In other words, the system first uses reflection as scaffolding, then removes the scaffolding and trains the model to act as if it had internalized the correction.
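One way to picture the two data sources is as prompt/target pairs. The sketch below is an assumption about shape, not the paper's data format: `RollbackBranch`, `distillation_example`, `rehearsal_examples`, and the WebShop-like strings are all hypothetical. The only property it is meant to show is that the experience text is used during exploration and then left out of the training prompt.

```python
from dataclasses import dataclass


@dataclass
class RollbackBranch:
    history: list[str]       # interaction history up to step t (the original context)
    experience: str          # reflection text injected during exploration
    corrected_action: str    # a'_t, chosen with the experience text in context


def distillation_example(branch: RollbackBranch) -> dict[str, str]:
    """Prompt is the original context only; the experience text is deliberately omitted."""
    return {"prompt": "\n".join(branch.history), "target": branch.corrected_action}


def rehearsal_examples(successful_rollout: list[tuple[str, str]]) -> list[dict[str, str]]:
    """One example per step of a successful rollout given as (observation, action) pairs."""
    examples, history = [], []
    for observation, action in successful_rollout:
        history.append(observation)
        examples.append({"prompt": "\n".join(history), "target": action})
        history.append(action)
    return examples


# Hypothetical branch from a shopping task.
branch = RollbackBranch(
    history=["task: buy a blue 16oz mug", "obs: search box"],
    experience="The query 'mug' returned irrelevant items; include color and size.",
    corrected_action="search[blue mug 16oz]",
)
example = distillation_example(branch)
print(example)
```

Mixing `rehearsal_examples` output with `distillation_example` output gives the two-part training set described in the table above.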

That is why the paper’s title uses “internalizing.” The authors are not proposing a bigger prompt, a longer memory, or a more elaborate runtime agent loop. They are trying to push recovery behavior into the policy itself.

For businesses, this distinction is not cosmetic. External reflection loops add latency, orchestration complexity, context length, and more failure surfaces. Internalized recovery is not free either—it requires rollout infrastructure and training—but it moves part of the cost from inference-time improvisation to training-time learning. That is often the difference between an impressive prototype and a system that can be operated without everyone pretending latency is a philosophical problem.

The main evidence is about coverage, not just single-shot accuracy

The paper evaluates LEAFE on interactive agent tasks and coding tasks: WebShop, ALFWorld, ScienceWorld, Sokoban, and CodeContests. These are not identical environments, but they share a useful property: the agent acts over multiple steps and receives structured feedback from the environment.

The main result is not that LEAFE always dominates every metric. It does not. The more precise result is better: LEAFE is strongest where the paper’s mechanism predicts it should be strongest—large-budget Pass@k and recovery-oriented exploration.

On the four interactive benchmarks, LEAFE consistently reports the best Pass@128 among the compared methods in Table 1. For example, on WebShop with Qwen2.5-7B, GRPO-RLVR has higher Pass@1 than LEAFE: 67.45 versus 66.50. But LEAFE has higher Pass@128: 87.80 versus 85.40. The same pattern appears in other settings: Pass@1 is not always the headline victory, but the higher-budget capability estimate improves.

That matters because the paper’s thesis is not “LEAFE makes the model greedier.” It is “LEAFE expands the set of effective behaviors the model can reach.” If the result had only improved Pass@1 while flattening Pass@128, the method would look suspiciously like another distribution-sharpening recipe wearing a reflective hat. The large-budget gains are what keep the claim alive.

The CodeContests results make the contrast sharper. With Qwen2.5-72B, the base model reaches 33.94 Pass@128, GRPO-RLVR reaches 36.97, and LEAFE reaches 47.88. That is a +13.94 point gain over the base model and a +10.91 point gain over GRPO-RLVR. With Llama3-70B, LEAFE improves Pass@128 from 27.88 under GRPO-RLVR to 33.94.

The Pass@1 story is more mixed. On Qwen2.5-72B CodeContests, GRPO-RLVR achieves 20.45 Pass@1, while LEAFE reports 17.12. That is not a detail to hide under a tasteful rug. It is the trade-off the paper is trying to explain. GRPO can make the first sampled trajectory more likely to be one of the known good ones. LEAFE is trying to broaden recoverable solution coverage, which shows up more clearly when the evaluation asks whether the policy distribution contains more successful routes.

A practical reader should therefore avoid the lazy conclusion: “LEAFE is better than GRPO.” Better for what? If the application rewards one-shot fluency under a narrow distribution, outcome-driven training may be enough. If the application involves feedback, correction, and long-horizon decision-making, the LEAFE-style signal is more aligned with the operational problem.

The ablations explain why rollback is not just a fancy retry button

The paper includes several supporting experiments. They should not all be treated as equal-weight claims. Some are main evidence; others are ablations, robustness checks, or implementation validations.

| Test or result | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Main benchmark results on WebShop, ALFWorld, ScienceWorld, Sokoban | Main evidence | LEAFE improves large-budget Pass@128 across interactive tasks | That it dominates every Pass@1 setting |
| CodeContests main results | Main evidence | Reflective correction helps where execution feedback is concrete | That all coding-agent environments will behave similarly |
| Independent Sampling vs Iterative Refinement vs Stage 1 | Ablation on exploration strategy | Rollback branching finds more successful programs under fixed budget | That rollback is always feasible in real systems |
| Rehearsal-only vs rehearsal plus counterfactual distillation | Ablation on training objective | Experience-to-policy distillation is the key source of Pass@128 gains | That rehearsal is unimportant |
| MBPP out-of-distribution test | Robustness / generalization check | LEAFE is less damaging under distribution shift than GRPO in this coding transfer setup | That it generalizes universally outside the benchmark family |
| Auxiliary target study with EarlyExp or RL | Sensitivity / trade-off test | Extra optimization can raise Pass@1 but may reduce exploration ceiling | That auxiliary methods are always harmful |

The Stage 1 comparison is especially useful. On CodeContests, the authors compare three ways of spending the same execution budget: Independent Sampling, Iterative Refinement, and LEAFE’s rollback-based Stage 1. For Qwen2.5-32B, Pass@128 rises from 48.92 under Independent Sampling to 51.48 under Iterative Refinement and 55.52 under LEAFE Stage 1. For Qwen2.5-72B, the same comparison is 48.65, 49.52, and 54.30. For Llama3-70B, it is 30.20, 38.10, and 42.50.

This experiment is doing a specific job. It isolates the exploration mechanism before the final distillation story. The takeaway is not merely “branching is good.” It is that targeted branching from diagnosed earlier mistakes uses feedback more efficiently than either starting over or refining linearly from the last failed output.

That difference is intuitive if you think about debugging. A compiler error after the fifth revision may be downstream of a wrong assumption made in the first design. Iterative refinement can patch symptoms. Rollback asks where the bad branch began. This is also why LEAFE is interesting outside coding: many workflow failures are not caused by the final step. They are caused by an earlier interpretation that quietly narrowed the future.

The Stage 2 ablation then asks whether the corrective branches matter after training. On ScienceWorld, adding experience-to-policy distillation improves Pass@128 while leaving Pass@1 roughly comparable. For Qwen2.5-7B, Pass@128 moves from 59.33 to 62.00. For Llama3.1-8B, it moves from 57.33 to 59.33. For Qwen2.5-14B, it moves from 67.33 to 72.00. The gain is not theatrical, but it is directionally aligned with the argument: rehearsal preserves competence; counterfactual distillation expands corrective behavior.

The out-of-distribution MBPP test is a useful caution against overreading GRPO gains. Models trained on CodeContests are evaluated on MBPP. GRPO shows drops relative to the base model: from 85.45 to 81.22 for Qwen2.5-32B, from 83.33 to 81.22 for Qwen2.5-72B, and from 78.31 to 74.07 for Llama3-70B. LEAFE preserves or slightly improves performance: 85.45, 85.13, and 79.63 respectively.

This does not prove LEAFE has universal out-of-domain generalization. It does suggest that outcome-only reinforcement can over-specialize to the training distribution, while feedback-grounded correction may preserve more transferable behavior. That is a measured claim. Measured claims are unfashionable, but occasionally useful.

The business value is cheaper diagnosis, not magical autonomy

The paper’s practical message is not “use LEAFE and your agents will become self-healing employees.” Please do not put that sentence in a pitch deck. The more grounded message is that agent systems should treat interaction logs as training assets, not just observability artifacts.

A typical business agent pipeline already records failures: API errors, rejected tool calls, user corrections, validation mismatches, wrong retrieved documents, failed browser actions, incomplete forms, bad SQL queries, and so on. The question is whether those failures are merely monitored, or whether they are converted into corrective supervision.

LEAFE suggests a useful operating model:

  1. Capture the full trajectory, not just the final outcome.
  2. Identify the earliest recoverable decision point, not merely the final error.
  3. Convert feedback into a compact diagnosis-and-fix instruction.
  4. Re-run from the rollback point under that instruction.
  5. Distill the corrected action back into the model or policy layer.
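Steps 1 and 2 of the list above depend on what the log actually records. As a rough sketch, a trajectory record might look like the following. The field names, the `replayable` flag, and the `earliest_rollback_point` helper are assumptions made for illustration; the set of suspect steps is assumed to come from a reflection step that is not shown here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    index: int
    observation: str          # what the agent saw before acting
    action: str               # what the agent did
    feedback: str = ""        # environment or validator response, if any
    replayable: bool = True   # can the state before this step be restored?


@dataclass
class Trajectory:
    task_id: str
    steps: list[Step] = field(default_factory=list)
    succeeded: bool = False


def earliest_rollback_point(traj: Trajectory, suspects: set[int]) -> Optional[int]:
    """Earliest suspect step whose pre-step state can be restored, else None."""
    for step in sorted(traj.steps, key=lambda s: s.index):
        if step.index in suspects and step.replayable:
            return step.index
    return None


# Hypothetical workflow run: a misclassified request leads to a wrong email.
traj = Trajectory(
    task_id="refund-request",
    steps=[
        Step(0, "user asks for a refund", "classify: billing_question"),
        Step(1, "billing FAQ loaded", "call faq_api"),
        Step(2, "faq result", "send_email",
             feedback="user: that is not what I asked", replayable=False),
    ],
)
print(earliest_rollback_point(traj, suspects={0, 2}))
```

The point of the `replayable` flag is that the final error (the email) is visible but cannot be undone, while the earlier misclassification is both the likelier cause and the step that can actually be replayed.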

For a coding agent, this might mean mapping unit-test failures to the design decision that caused them, not simply asking the model to regenerate the entire file. For a browser automation agent, it might mean rolling back to the search query that led to irrelevant product results, rather than clicking through another dozen pages with quiet desperation. For an internal workflow agent, it might mean correcting the step where a user request was misclassified before the wrong API chain began.

The ROI path is therefore not mystical. It is operational:

| Operational pain | LEAFE-inspired response | Expected business effect | Boundary |
|---|---|---|---|
| High retry cost | Train on corrected rollback branches | Lower inference-time sampling and latency | Requires repeatable environments |
| Brittle agents after small mistakes | Teach recovery from diagnosed decision points | Fewer unrecoverable workflow failures | Feedback must be attributable |
| Manual cleanup of failed runs | Convert human or environment corrections into SFT data | Better future behavior from the same failure class | Needs data governance and review |
| Overfitting to success examples | Include counterfactual corrections, not only successful trajectories | Broader policy coverage | Not a substitute for safety validation |

The strongest near-term use cases are structured environments: software engineering, data analysis workflows, API orchestration, form-processing systems, browser tasks with replayable states, and simulation-backed operations. These environments can provide diagnostic feedback and can often be reset or replayed.

The weaker use cases are messy, irreversible, or socially ambiguous environments. A customer-support conversation cannot always be rolled back in the user’s experience. A logistics decision may change real-world state. A legal or financial workflow may expose the system to compliance risk before the environment politely returns a parseable error. LEAFE’s mechanism assumes that mistakes can be identified, replayed, and corrected with enough fidelity to become training signal. In many business settings, that assumption is a design goal, not a given.

The implementation burden moves upstream

A tempting but wrong reading of the paper is that LEAFE reduces engineering. It does not. It changes where the engineering has to happen.

A retry-heavy system can be crude at training time and expensive at inference time. Sample more, vote more, reflect more, branch more, and let the runtime scaffold compensate for weak internal recovery. This is simple to prototype because the burden is pushed into orchestration.

LEAFE pushes more work upstream. You need environments that can reset, logs that preserve decision histories, prompts that produce useful rollback points, filters that prevent low-quality reflections from poisoning training data, and training infrastructure that can distill corrected actions without erasing core competence.

That is not a small requirement. But it is also why the paper matters. Production AI systems often become expensive not because the model cannot ever solve the task, but because the system must repeatedly wrap the model in external mechanisms to recover from the same classes of mistakes. If those recoveries can be converted into policy learning, the economics begin to change.

The appendix makes this concrete. The authors describe different environment interfaces for ALFWorld, ScienceWorld, WebShop, Sokoban, and CodeContests. Each environment returns a different kind of observation and action space. CodeContests returns compiler and test feedback; WebShop returns page content and clickable elements; Sokoban returns symbolic grid state; ScienceWorld requires valid action syntax using available objects. LEAFE is not learning from a generic “failure vibe.” It depends on structured interaction traces.

This is a useful warning for business adoption. Before asking whether a company should “train reflective agents,” ask whether its workflow environment is instrumented enough to support reflection. If the system cannot reconstruct what happened, locate the decision that mattered, or replay the state, then the training recipe has nothing clean to eat. And models, like interns, become less helpful when fed garbage and confidence.

Pass@1 and Pass@128 should not be forced into one story

One of the more valuable aspects of the paper is its insistence on separating single-attempt performance from broader capability. Many AI product discussions collapse both into “accuracy,” which is a convenient way to misunderstand everything at once.

Pass@1 asks: how often does one rollout work?

Pass@128 asks: if we sample 128 independent rollouts, does at least one work?

Neither metric is sufficient alone. Pass@1 matters because users do not want to wait for 128 attempts. Pass@128 matters because it reveals whether the policy distribution contains latent successful behaviors that may be made more accessible through better inference, training, or task design.
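For readers who want to compute both numbers from the same batch of rollouts, a widely used unbiased estimator of Pass@k takes `n` sampled rollouts per task, of which `c` succeeded, and returns `1 - C(n - c, k) / C(n, k)`. The sketch below assumes that estimator; the text above does not say whether the paper computes Pass@k this way, and the sample counts are invented.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n rollouts with c successes (requires k <= n)."""
    if k > n:
        raise ValueError("k cannot exceed the number of sampled rollouts")
    if n - c < k:
        return 1.0   # every size-k subset must contain at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)


# Invented example: 256 rollouts on one task, only 2 of them succeed.
print(round(pass_at_k(256, 2, 1), 4))     # one-shot success is rare
print(round(pass_at_k(256, 2, 128), 4))   # a large budget usually finds a success
```

With two successes in 256 rollouts, the single-attempt estimate is under one percent while the 128-attempt estimate is about 75 percent, which is the kind of gap the two metrics are meant to expose.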

GRPO’s appeal is strongest in the Pass@1 frame. It can raise the probability of already-successful behaviors. LEAFE’s appeal is stronger in the large-budget Pass@k frame. It tries to expand the model’s recoverable behavior set by teaching corrections derived from feedback.

For a business reader, the choice is not philosophical. It depends on the bottleneck.

If your agent fails because it rarely chooses the obvious correct action on a familiar task, distribution sharpening may help. If your agent fails because it goes down the wrong branch and cannot recover, LEAFE’s framing is more relevant. If your agent fails because the environment gives no useful feedback, neither method will rescue you from bad system design. At some point, the model is not the bottleneck; the workflow is.

Where the paper is strongest, and where the boundary is clear

The strongest part of the paper is the alignment between mechanism and evidence. The method says it should improve recovery coverage. The results show stronger large-budget Pass@k. The Stage 1 ablation shows rollback branching beats independent sampling and linear refinement under fixed budgets. The Stage 2 ablation shows counterfactual distillation improves Pass@128 beyond rehearsal alone. The OOD test suggests less degradation than GRPO in a coding transfer setting.

That is a coherent evidence chain.

The boundary is also clear. LEAFE is most credible where three conditions hold:

| Condition | Why it matters |
|---|---|
| Feedback is diagnostic | The model needs enough information to infer what went wrong |
| State can be reset or replayed | Rollback branching depends on reconstructing earlier decision points |
| Corrected actions are reusable | Distillation only helps if similar future contexts benefit from the correction |

When these conditions fail, the method weakens. Delayed feedback makes credit assignment harder. Non-deterministic environments make rollback less reliable. Irreversible actions make exploration expensive or unsafe. Ambiguous human feedback can produce reflections that sound wise but encode the wrong causal story, which is a very AI way to make things worse while using better vocabulary.

The paper also does not remove the need for runtime monitoring. Internalized recovery can reduce dependence on external scaffolding, but it does not prove the agent will always detect dangerous failures, respect business constraints, or generalize to every workflow. Recovery training should be paired with permission controls, validators, audit logs, and escalation paths. Boring infrastructure remains undefeated.

From retry loops to learning loops

The usual agent story is that autonomy comes from giving a model tools, memory, and a larger action space. LEAFE points to a less glamorous requirement: autonomy also requires learning how to recover when those tools are used badly.

That is the useful shift in this paper. A failed trajectory should not be treated as waste. It should be treated as a partially labeled lesson: here is the observation, here is the bad branch, here is where recovery could have started, and here is the action that would have made the next attempt better.

Outcome-driven training asks, “Which trajectories succeeded?” Reflective-experience training asks, “Which decisions made failure recoverable, and how do we make those corrections native to the model?”

For companies building automation agents, that question is more practical than it sounds. The future of agent reliability will not come only from larger models or more elaborate retry trees. It will come from systems that turn operational failures into structured training signals. The agents that improve will not be the ones that merely try more. They will be the ones that learn where trying went wrong.

And yes, that is a slightly higher bar than “run it again.” Progress can be cruel like that.

Cognaptus: Automate the Present, Incubate the Future.


  1. Rui Ge, Yichao Fu, Yu-Yang Qian, Junda Su, Yiming Zhao, Peng Zhao, and Hao Zhang, “Internalizing Agency from Reflective Experience,” arXiv:2603.16843, version 1, March 17, 2026. https://arxiv.org/abs/2603.16843