Opening — Why this matters now

Most people assume large language models improve by trying more.

More samples. More rollouts. More compute.

The industry calls it exploration. In practice, it often looks like guessing with confidence.

The paper “Experience is the Best Teacher” questions this quietly. Not by making models smarter—but by asking a more uncomfortable question:

What if the model already knows what good looks like… but doesn’t know how to get there?

Background — Context and prior art

Reinforcement learning for LLMs has followed a familiar path.

First came supervised fine-tuning (SFT), where models imitate examples. Then preference learning (e.g., DPO), where they learn what humans prefer. Then reinforcement learning with verifiable rewards (RLVR), where correctness is scored explicitly.

Each step tries to push models closer to “good behavior.”

But there’s a structural issue.

RL assumes that if you explore enough, you will eventually discover better actions. That assumption works in small action spaces. It breaks in language.

The action space of tokens is effectively infinite. Most of it is irrelevant. Worse, bad samples actively push the model in the wrong direction—because negative gradients spread across many tokens.

So exploration becomes noisy rather than informative.

Not just inefficient. Misaligned.

Analysis — What the paper actually does

The paper introduces HeRL (Hindsight Experience Reinforcement Learning), which does something deceptively simple:

It tells the model why it failed.

More precisely, it converts failed outputs + unmet rubric criteria into structured feedback, and feeds that back into the model as context for improvement.

Instead of exploring blindly, the model explores with direction.

Core mechanism

The workflow has three steps:

  1. Generate candidate responses
  2. Evaluate them using rubric-based rewards
  3. Use failed responses + unmet criteria as “hindsight experience” to guide revision
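The three steps can be sketched as a minimal loop. Everything below is illustrative: the model interface, the rubric format, and the revision prompt template are assumptions for the sketch, not the paper's actual implementation.

```python
# Minimal sketch of the HeRL generate -> evaluate -> revise loop.
# `model.generate`, the rubric format, and the prompt template are
# hypothetical stand-ins, not the paper's API.

def rubric_feedback(response, rubric):
    """Return the rubric criteria the response fails to meet."""
    return [c for c in rubric if not c["check"](response)]

def herl_step(model, prompt, rubric, n_candidates=4):
    # 1. Generate candidate responses
    candidates = [model.generate(prompt) for _ in range(n_candidates)]
    revised = []
    for cand in candidates:
        # 2. Evaluate with rubric-based rewards
        unmet = rubric_feedback(cand, rubric)
        if not unmet:
            revised.append(cand)  # already satisfies the rubric
            continue
        # 3. Failed output + unmet criteria become hindsight context
        feedback = "; ".join(c["name"] for c in unmet)
        revision_prompt = (
            f"{prompt}\n\nPrevious attempt:\n{cand}\n"
            f"Unmet criteria: {feedback}\nRevise the answer."
        )
        revised.append(model.generate(revision_prompt))
    return candidates, revised
```

Both trajectories—the raw candidates and their guided revisions—can then feed the policy update, which is what turns failure into training signal.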

This creates a second trajectory—not from randomness, but from reflection.

Component         | Traditional RLVR        | HeRL
------------------|-------------------------|---------------------------
Exploration       | Random / entropy-driven | Guided by feedback
Learning signal   | Scalar reward           | Reward + language guidance
Sample efficiency | Low                     | Higher
Failure usage     | Discarded               | Reused as training signal

The shift is subtle.

Failure is no longer waste.

It becomes instruction.

Why this works (quietly, but decisively)

The key insight is not algorithmic. It’s informational.

In standard RL, the model only sees how good a response is.

In HeRL, it sees why it is not good enough.

This changes the geometry of learning.

The paper shows that aligning exploration with high-reward regions reduces gradient noise. In simple terms, fewer useless updates, more directional ones.

Or more bluntly:

Stop wandering. Start adjusting.

Bonus reward — rewarding potential, not just outcome

A second addition is the “improvement potential” bonus.

If a bad answer can be easily improved, it gets rewarded slightly more than a bad answer that leads nowhere.

This introduces a new bias:

Not all failures are equal.

Some are stepping stones. Others are dead ends.

HeRL learns to tell the difference.
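One way to read the improvement-potential bonus is as simple reward shaping: a failed answer whose revision scores higher earns slightly more than a dead-end failure. The linear form and the `bonus_weight` value below are assumptions for illustration, not the paper's exact formula.

```python
def shaped_reward(base_reward, revised_reward, bonus_weight=0.1):
    """Add a small 'improvement potential' bonus to a response's reward.

    base_reward: rubric score of the original answer.
    revised_reward: rubric score of its hindsight-guided revision.
    Stepping stones (revision improves) earn a small bonus;
    dead ends (no improvement) keep the base reward unchanged.
    """
    improvement = max(0.0, revised_reward - base_reward)
    return base_reward + bonus_weight * improvement
```

Under this shaping, two equally bad answers are no longer equally rewarded: the one that leads somewhere is preferred.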

Findings — Results with visualization

The empirical results are consistent across models and domains.

Performance improvement (selected benchmarks)

Model      | Method | Instruction Tasks | Writing       | Medical
-----------|--------|-------------------|---------------|------------------
Qwen2.5-7B | RLVR   | Moderate gain     | Slight drop   | Moderate gain
Qwen2.5-7B | HeRL   | Strong gain       | Improvement   | Strong gain
Llama-3B   | RLVR   | Moderate gain     | Moderate gain | Small gain
Llama-3B   | HeRL   | Larger gain       | Larger gain   | Significant gain

Two patterns stand out:

  • HeRL consistently outperforms RLVR across domains
  • It improves even on tasks it was not trained on (writing)

That second point matters.

It suggests the model is not memorizing better answers.

It is learning how to improve answers.

Exploration efficiency

The paper also shows that guided sampling outperforms both stochastic sampling and entropy-based search under equal budgets.

In other words:

More attempts do not equal better learning.

Better attempts do.

Test-time self-improvement

Perhaps the most interesting result is not during training.

It’s after.

The model can iteratively refine its own outputs using the same hindsight mechanism—improving performance without additional training.

That’s not just better learning.

That’s a primitive form of reflection.
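The test-time loop can be sketched roughly as follows; no gradients are involved, only repeated rubric checks and revision prompts. The helper names, rubric format, and stopping rule are hypothetical.

```python
def refine_at_test_time(model, prompt, rubric, max_rounds=3):
    """Iteratively improve an answer with the hindsight mechanism alone.

    No training occurs: the model revises its own output until the
    rubric is satisfied or the round budget runs out.
    """
    answer = model.generate(prompt)
    for _ in range(max_rounds):
        unmet = [c["name"] for c in rubric if not c["check"](answer)]
        if not unmet:
            break  # rubric satisfied; stop refining
        answer = model.generate(
            f"{prompt}\nPrevious answer:\n{answer}\n"
            f"Fix these issues: {', '.join(unmet)}"
        )
    return answer
```

The same feedback channel used during training thus doubles as an inference-time self-correction loop.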

Implications — What this means in practice

There’s a broader shift here, and it’s easy to miss.

1. RL is becoming less about reward, more about feedback

Scalar rewards are too compressed.

Language is richer.

HeRL effectively turns reward signals into instructions—bringing RL closer to supervised learning, but without labeled data.

2. Exploration is no longer a compute problem

For years, the solution to weak exploration was simple: increase sampling.

This paper suggests something different.

Exploration is a knowledge problem.

If the model knows what to fix, it doesn’t need to try as much.

3. Failure becomes an asset

Most pipelines discard bad outputs.

HeRL recycles them into training signal.

In business terms, this is operational leverage.

The same data produces more learning.

4. This aligns with agentic workflows

The structure—generate → evaluate → revise—is not new.

It already exists in agent systems.

What HeRL does is formalize it inside training.

This is where things get interesting.

Because once reflection is internalized, agents stop depending on external orchestration.

They begin to self-correct.

Conclusion — Quiet shifts, lasting impact

Most improvements in AI look like scaling.

Bigger models. Larger datasets. More GPUs.

This one doesn’t.

It’s about using what the model already has—its ability to understand language—to guide its own learning.

Not faster.

Just more deliberate.

Over time, systems that learn from their own mistakes tend to outlast those that don’t.

Not because they are smarter.

But because they waste less.

Cognaptus: Automate the Present, Incubate the Future.