Opening — Why this matters now

Everyone wants autonomous agents. Few seem willing to admit that most of them are still glorified retry machines.

In production systems—from coding copilots to web automation agents—the dominant strategy is embarrassingly simple: try, fail, try again, and hope that one trajectory sticks. This works, but only if you can afford the latency, compute cost, and engineering complexity of massive sampling.

The paper “Internalizing Agency from Reflective Experience” offers a subtle but important shift: instead of relying on brute-force retries, teach the model how to recover from mistakes internally.

That distinction—retry vs. recovery—is where the economics of agentic AI quietly changes.

Background — The limits of outcome-driven learning

Most modern agent training pipelines rely on Reinforcement Learning with Verifiable Rewards (RLVR). In practice, this means:

  • Sample multiple trajectories
  • Assign a final reward (success/failure)
  • Reinforce the successful ones
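
The loop above can be sketched on a toy 3-action bandit. Everything here is an illustrative stand-in (the verifier, the update rule, the learning rate), not the paper's training code; the point is only to show how reinforcing verified successes concentrates probability mass — the "distribution sharpening" discussed below.

```python
# Toy outcome-driven RLVR on a 3-action policy. A weak verifier only ever
# rewards action 0, so repeated updates pile probability mass onto it.

def reinforce_successes(probs, rewards, lr=0.5):
    """Scale up rewarded actions, then renormalize the policy."""
    new = [p * (1.0 + lr * r) for p, r in zip(probs, rewards)]
    z = sum(new)
    return [p / z for p in new]

policy = [1 / 3, 1 / 3, 1 / 3]   # uniform start
rewards = [1.0, 0.0, 0.0]        # verifier accepts only action 0
for _ in range(10):
    policy = reinforce_successes(policy, rewards)

# policy[0] now dominates: Pass@1 on the known solution improves, while
# the unrewarded actions become nearly unreachable (weak Pass@k growth).
```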

On paper, this sounds reasonable. In reality, it creates a structural blind spot.

The core problem: distribution sharpening

As described in the paper, RLVR tends to concentrate probability mass on already-successful behaviors, a phenomenon referred to as distribution sharpening (p.1–2).

This leads to a predictable pattern:

| Metric | What improves | What stagnates |
|---|---|---|
| Pass@1 | Single-guess accuracy | — |
| Pass@k (large k) | Only weakly | Exploration capacity |

In other words, the model becomes better at repeating what it already knows, but worse at discovering new solutions.

For long-horizon tasks—coding, planning, tool use—this is fatal. Success depends less on getting the first step right, and more on recovering when things go wrong.

Yet RLVR largely ignores the richest signal available: environment feedback during failure.

Analysis — What LEAFE actually does differently

The proposed framework, LEAFE (Learning Feedback-Grounded Agency from Reflective Experience), introduces a two-stage shift from outcome-based learning to experience-based correction.

Stage 1: Reflect, rollback, and branch

Instead of treating failures as useless trajectories, LEAFE extracts structured experience from them:

  1. The agent reflects on a failed trajectory
  2. Identifies a critical mistake point (τ)
  3. Generates an experience summary (diagnosis + fix)
  4. Rolls back and explores an alternative path
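
The four steps above can be sketched as a small control loop. The `reflect` heuristic and the string actions below are toy stand-ins for the paper's LLM-driven reflection and environment; they exist only to show the rewind-and-branch flow, not the actual method.

```python
# Minimal sketch of LEAFE Stage 1: reflect on a failure, locate the critical
# mistake point tau, roll back to the prefix, and branch with a fix.

def reflect(trajectory, correct):
    """Return the critical mistake index tau and a corrected action, or None."""
    for tau, (a, c) in enumerate(zip(trajectory, correct)):
        if a != c:
            return tau, c  # diagnosis: first divergence; fix: the right action
    return None

def rollback_and_branch(failed, correct, propose):
    """Grow a tree of counterfactual branches until one succeeds."""
    branches = []
    traj = failed
    while (r := reflect(traj, correct)) is not None:
        tau, fix = r
        traj = traj[:tau] + [fix] + propose(traj, tau)  # keep prefix, correct at tau
        branches.append(traj)
    return branches  # the last branch is the successful trajectory

correct = ["up", "left", "push"]
failed = ["down", "right", "push"]
tree = rollback_and_branch(failed, correct, propose=lambda t, tau: t[tau + 1:])
```

Each branch keeps the validated prefix and changes only the diagnosed step, which is what makes these guided counterfactuals rather than random retries.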

Conceptually, this creates a tree of trajectories—not random retries, but guided counterfactuals.

| Approach | Exploration style | Signal quality |
|---|---|---|
| Independent sampling | Random retries | Low |
| Iterative refinement | Linear correction | Medium |
| LEAFE rollback tree | Branching + reflection | High |

As illustrated in the Sokoban example (Figure 5, p.14), the model repeatedly rewinds to a key mistake and tries a corrected action, eventually reaching a successful trajectory.

This is not exploration—it’s structured hindsight.

Stage 2: Distill experience into the model

The more interesting move happens next.

Instead of keeping these experiences as external memory (like prompt-based agents), LEAFE distills them into model weights.

Two training signals are combined:

| Component | Purpose |
|---|---|
| Behavior rehearsal | Preserve successful behaviors |
| Counterfactual distillation | Learn corrected actions without explicit guidance |

The second component is the key. It teaches the model:

“Given the same situation, choose the better action—even without the explanation.”

This is what the authors call agency internalization.
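
One way to picture the two-term objective is as a weighted sum of likelihood terms. The `nll` helper, the toy categorical policy, and the weight `lam` below are hypothetical simplifications of the paper's actual loss; the detail to notice is that the counterfactual term conditions on the bare state at τ, with no experience summary in the input.

```python
import math

def nll(policy, context, target):
    """Negative log-likelihood of `target` under a toy categorical policy."""
    return -math.log(policy(context)[target])

def leafe_loss(policy, successes, counterfactuals, lam=1.0):
    # Behavior rehearsal: keep successful trajectories likely.
    rehearsal = sum(nll(policy, ctx, act) for ctx, act in successes)
    # Counterfactual distillation: the corrected action at tau is trained
    # on the bare state, without the reflection text, so the fix is
    # internalized rather than prompted.
    distill = sum(nll(policy, state, act) for state, act in counterfactuals)
    return rehearsal + lam * distill

toy_policy = lambda ctx: {"left": 0.7, "right": 0.3}
loss = leafe_loss(toy_policy,
                  successes=[("s0", "left")],
                  counterfactuals=[("s_tau", "left")])
```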

Findings — What actually improves (and what doesn’t)

The results are unusually consistent across benchmarks.

1. Pass@k improves significantly

From Table 2 (p.6):

| Model | GRPO Pass@128 | LEAFE Pass@128 | Improvement |
|---|---|---|---|
| Qwen2.5-72B | 36.97 | 47.88 | +10.9 |
| Llama3-70B | 27.88 | 33.94 | +6.1 |
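
For readers unfamiliar with the metric, Pass@k is usually computed with the standard unbiased estimator (the Chen et al. 2021 convention): given n samples of which c pass, the chance that at least one of k draws passes. The numbers in the usage line are illustrative, not the paper's raw counts.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: 1 - C(n-c, k) / C(n, k) over n samples with c passes."""
    if n - c < k:
        return 1.0  # not enough failures to fill k draws: guaranteed hit
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 4 samples, 1 success, 2 draws -> 1 - C(3,2)/C(4,2) = 0.5
example = pass_at_k(4, 1, 2)
```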

The headline claim—up to +14% improvement—is not marketing fluff. It reflects genuine expansion of the model’s capability boundary.

2. Sample efficiency improves

Figure 3 (p.8) shows that LEAFE reaches the same success rate with fewer samples.

Translation: less compute, lower latency, simpler systems.

3. Pass@1 gains are modest (and that’s fine)

Interestingly, LEAFE does not always outperform RLVR at Pass@1.

This is not a weakness—it’s a design choice.

| Strategy | Optimization bias |
|---|---|
| RLVR (GRPO) | Exploitation (best single guess) |
| LEAFE | Exploration + recovery |

If your KPI is demo performance, RLVR still looks attractive. If your KPI is robust deployment, LEAFE wins.

4. Better generalization under distribution shift

From Table 5 (p.7):

  • RLVR shows performance degradation on new tasks
  • LEAFE maintains or improves performance

This suggests a deeper claim: LEAFE is not just memorizing trajectories—it is learning transferable recovery strategies.

Implications — Why this matters for real systems

This paper quietly challenges a dominant assumption in agent design:

That more sampling is the solution to uncertainty.

It isn’t. It’s just the most convenient workaround.

1. Compute vs. intelligence trade-off

Most current systems rely on:

  • Tree search
  • Self-consistency
  • Multi-agent voting

These are all external scaffolding to compensate for weak internal agency.
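
Self-consistency, the middle item above, is a good example of that scaffolding: sample k independent answers and majority-vote. The `sampler` callable below is a hypothetical stand-in for an LLM call; the sketch shows why the technique multiplies inference cost by k.

```python
from collections import Counter

def self_consistency(sampler, prompt, k=5):
    """Return the most common answer among k independent samples."""
    votes = Counter(sampler(prompt) for _ in range(k))
    return votes.most_common(1)[0][0]

answers = iter(["42", "41", "42", "42", "17"])
result = self_consistency(lambda p: next(answers), "q", k=5)  # -> "42"
```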

LEAFE shifts capability into the model itself, reducing reliance on expensive inference-time tricks.

2. Toward self-correcting agents

The real innovation is not the rollback itself, but learning how to roll back.

This moves agents closer to:

  • Detecting failure early
  • Identifying causal mistakes
  • Executing targeted corrections

In business terms: fewer retries, faster convergence, more predictable behavior.

3. Practical constraints

The paper is refreshingly honest about limitations:

  • Requires high-quality feedback signals
  • Assumes environment reset capability
  • Less effective in noisy or ambiguous environments

In other words, this works best in structured systems (coding, simulations, APIs)—not messy real-world workflows. Yet.

Conclusion — From sampling to agency

The industry narrative around AI agents has been dominated by scale: more tokens, more samples, more retries.

This paper suggests a different direction.

Not bigger search trees—better decision-making inside the model.

LEAFE doesn’t make agents smarter in the conventional sense. It makes them less dependent on luck.

And in production systems, that’s usually the difference between a demo and a product.


Cognaptus: Automate the Present, Incubate the Future.