Opening — Why this matters now

Most people assume large language models improve by trying more.

More samples. More rollouts. More compute.

The industry calls it exploration. In practice, it often looks like guessing with confidence.

The paper “Experience is the Best Teacher” questions this quietly. Not by making models smarter—but by asking a more uncomfortable question:

What if the model already knows what good looks like… but doesn’t know how to get there?

Background — Context and prior art

Reinforcement learning for LLMs has followed a familiar path.

First came supervised fine-tuning (SFT), where models imitate examples. Then preference learning (e.g., DPO), where they learn what humans prefer. Then reinforcement learning with verifiable rewards (RLVR), where correctness is scored explicitly.

Each step tries to push models closer to “good behavior.”

But there’s a structural issue.

RL assumes that if you explore enough, you will eventually discover better actions. That assumption works in small action spaces. It breaks in language.

The action space of tokens is effectively infinite. Most of it is irrelevant. Worse, bad samples actively push the model in the wrong direction—because negative gradients spread across many tokens.

So exploration becomes noisy rather than informative.

Not just inefficient. Misaligned.

Analysis — What the paper actually does

The paper introduces HeRL (Hindsight Experience Reinforcement Learning), which does something deceptively simple:

It tells the model why it failed.

More precisely, it converts failed outputs + unmet rubric criteria into structured feedback, and feeds that back into the model as context for improvement.

Instead of exploring blindly, the model explores with direction.

Core mechanism

The workflow has three steps:

  1. Generate candidate responses
  2. Evaluate them using rubric-based rewards
  3. Use failed responses + unmet criteria as “hindsight experience” to guide revision
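The three steps can be sketched as a minimal loop. Everything below is illustrative: the model interface, the rubric format, and the revision prompt template are assumptions for the sketch, not the paper's actual implementation.

```python
# Minimal sketch of the HeRL generate -> evaluate -> revise loop.
# `model.generate`, the rubric format, and the prompt template are
# hypothetical stand-ins, not the paper's API.

def rubric_feedback(response, rubric):
    """Return the rubric criteria the response fails to meet."""
    return [c for c in rubric if not c["check"](response)]

def herl_step(model, prompt, rubric, n_candidates=4):
    # 1. Generate candidate responses
    candidates = [model.generate(prompt) for _ in range(n_candidates)]
    revised = []
    for cand in candidates:
        # 2. Evaluate with rubric-based rewards
        unmet = rubric_feedback(cand, rubric)
        if not unmet:
            revised.append(cand)  # already satisfies the rubric
            continue
        # 3. Failed output + unmet criteria become hindsight context
        feedback = "; ".join(c["name"] for c in unmet)
        revision_prompt = (
            f"{prompt}\n\nPrevious attempt:\n{cand}\n"
            f"Unmet criteria: {feedback}\nRevise the answer."
        )
        revised.append(model.generate(revision_prompt))
    return candidates, revised
```

Both trajectories—the raw candidates and their guided revisions—can then feed the policy update, which is what turns failure into training signal.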

This creates a second trajectory—not from randomness, but from reflection.

Component         | Traditional RLVR        | HeRL
------------------|-------------------------|---------------------------
Exploration       | Random / entropy-driven | Guided by feedback
Learning signal   | Scalar reward           | Reward + language guidance
Sample efficiency | Low                     | Higher
Failure usage     | Discarded               | Reused as training signal

The shift is subtle.

Failure is no longer waste.

It becomes instruction.

Why this works (quietly, but decisively)

The key insight is not algorithmic. It’s informational.

In standard RL, the model only sees how good a response is.

In HeRL, it sees why it is not good enough.

This changes the geometry of learning.

The paper shows that aligning exploration with high-reward regions reduces gradient noise. In simple terms, fewer useless updates, more directional ones.

Or more bluntly:

Stop wandering. Start adjusting.

Bonus reward — rewarding potential, not just outcome

A second addition is the “improvement potential” bonus.

If a bad answer can be easily improved, it gets rewarded slightly more than a bad answer that leads nowhere.

This introduces a new bias:

Not all failures are equal.

Some are stepping stones. Others are dead ends.

HeRL learns to tell the difference.
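One way to read the improvement-potential bonus is as simple reward shaping: a failed answer whose revision scores higher earns slightly more than a dead-end failure. The linear form and the `bonus_weight` value below are assumptions for illustration, not the paper's exact formula.

```python
def shaped_reward(base_reward, revised_reward, bonus_weight=0.1):
    """Add a small 'improvement potential' bonus to a response's reward.

    base_reward: rubric score of the original answer.
    revised_reward: rubric score of its hindsight-guided revision.
    Stepping stones (revision improves) earn a small bonus;
    dead ends (no improvement) keep the base reward unchanged.
    """
    improvement = max(0.0, revised_reward - base_reward)
    return base_reward + bonus_weight * improvement
```

Under this shaping, two equally bad answers are no longer equally rewarded: the one that leads somewhere is preferred.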

Findings — Results with visualization

The empirical results are consistent across models and domains.

Performance improvement (selected benchmarks)

Model      | Method | Instruction Tasks | Writing       | Medical
-----------|--------|-------------------|---------------|------------------
Qwen2.5-7B | RLVR   | Moderate gain     | Slight drop   | Moderate gain
Qwen2.5-7B | HeRL   | Strong gain       | Improvement   | Strong gain
Llama-3B   | RLVR   | Moderate gain     | Moderate gain | Small gain
Llama-3B   | HeRL   | Larger gain       | Larger gain   | Significant gain

Two patterns stand out:

  • HeRL consistently outperforms RLVR across domains
  • It improves even on tasks it was not trained on (writing)

That second point matters.

It suggests the model is not memorizing better answers.

It is learning how to improve answers.

Exploration efficiency

The paper also shows that guided sampling outperforms both stochastic sampling and entropy-based search under equal budgets.

In other words:

More attempts do not equal better learning.

Better attempts do.

Test-time self-improvement

Perhaps the most interesting result is not during training.

It’s after.

The model can iteratively refine its own outputs using the same hindsight mechanism—improving performance without additional training.

That’s not just better learning.

That’s a primitive form of reflection.
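The test-time loop can be sketched roughly as follows; no gradients are involved, only repeated rubric checks and revision prompts. The helper names, rubric format, and stopping rule are hypothetical.

```python
def refine_at_test_time(model, prompt, rubric, max_rounds=3):
    """Iteratively improve an answer with the hindsight mechanism alone.

    No training occurs: the model revises its own output until the
    rubric is satisfied or the round budget runs out.
    """
    answer = model.generate(prompt)
    for _ in range(max_rounds):
        unmet = [c["name"] for c in rubric if not c["check"](answer)]
        if not unmet:
            break  # rubric satisfied; stop refining
        answer = model.generate(
            f"{prompt}\nPrevious answer:\n{answer}\n"
            f"Fix these issues: {', '.join(unmet)}"
        )
    return answer
```

The same feedback channel used during training thus doubles as an inference-time self-correction loop.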

Implications — What this means in practice

There’s a broader shift here, and it’s easy to miss.

1. RL is becoming less about reward, more about feedback

Scalar rewards are too compressed.

Language is richer.

HeRL effectively turns reward signals into instructions—bringing RL closer to supervised learning, but without labeled data.

2. Exploration is no longer a compute problem

For years, the solution to weak exploration was simple: increase sampling.

This paper suggests something different.

Exploration is a knowledge problem.

If the model knows what to fix, it doesn’t need to try as much.

3. Failure becomes an asset

Most pipelines discard bad outputs.

HeRL recycles them into training signal.

In business terms, this is operational leverage.

The same data produces more learning.

4. This aligns with agentic workflows

The structure—generate → evaluate → revise—is not new.

It already exists in agent systems.

What HeRL does is formalize it inside training.

This is where things get interesting.

Because once reflection is internalized, agents stop depending on external orchestration.

They begin to self-correct.

Conclusion — Quiet shifts, lasting impact

Most improvements in AI look like scaling.

Bigger models. Larger datasets. More GPUs.

This one doesn’t.

It’s about using what the model already has—its ability to understand language—to guide its own learning.

Not faster.

Just more deliberate.

Over time, systems that learn from their own mistakes tend to outlast those that don’t.

Not because they are smarter.

But because they waste less.

Cognaptus: Automate the Present, Incubate the Future.