Failure is usually where an LLM training pipeline becomes wasteful.

A model generates a weak answer. A judge gives it a low score. The trainer nudges the policy away from that behavior and asks the model to try again. Repeat the ritual with more samples, more rollouts, more compute, and more optimism than the situation strictly deserves.

This is the familiar story of exploration in reinforcement learning for language models: if the model searches enough, it may eventually discover better behavior. The new paper “Experience is the Best Teacher: Motivating Effective Exploration in Reinforcement Learning for LLMs” argues that this is the wrong place to be heroic.[1] The problem is not only that the model explores too little. The problem is that it often explores without knowing what kind of failure it just produced.

HeRL, the paper’s proposed framework, changes the role of failure. Instead of treating a failed rollout as a bad sample to punish or discard, it pairs that rollout with the rubric items it failed to satisfy. The model then uses this “hindsight experience” as context to revise the answer. In plainer business language: the system does not merely say “wrong.” It says “wrong in these specific ways; now fix those parts without breaking the parts already working.”

That sounds modest. It is not. In open-ended tasks, where correctness is not a clean yes/no label, the difference between a scalar score and a diagnostic checklist is the difference between being told your presentation got 62/100 and being told that the market sizing is unsupported, the compliance section is vague, and the recommendation contradicts slide 4. One is a number. The other can actually improve the next draft.

The misconception: exploration is not just more randomness

The easy interpretation of reinforcement learning exploration is “try more things.” In language models, that quickly becomes “sample more responses,” “increase diversity,” or “branch at high-entropy tokens.” This is not entirely foolish. Diversity matters. A model that collapses too early into a narrow policy will stop discovering better answers.

But language is a very large action space. Most possible continuations are not interesting alternatives. They are merely different ways to be unhelpful. The paper’s theoretical section makes this point through the token-level gradient of the RL objective: high-quality samples with positive advantage raise the probability of sampled tokens and suppress alternatives, while low-quality samples with negative advantage push probability mass away from the sampled path and toward many unsampled tokens. In a small action space, that may be tolerable. In language, the unsampled space is enormous, and most of it is not the answer hiding in the bushes.
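The asymmetry can be made concrete with the standard softmax identity. This is a sketch under a simplifying assumption, not the paper's full derivation: a single token position with logits $z$, policy $\pi = \mathrm{softmax}(z)$, sampled token $a$, and advantage $A$ (all symbols here are illustrative).

$$
\frac{\partial}{\partial z_v}\Bigl[A \log \pi(a)\Bigr] = A\,\bigl(\mathbb{1}[v = a] - \pi(v)\bigr)
$$

With $A > 0$, gradient ascent raises the sampled token's logit by an amount proportional to $A\,(1 - \pi(a))$ and lowers every other logit in proportion to $A\,\pi(v)$. With $A < 0$, the signs flip: the sampled token is pushed down and every unsampled token is pushed up in proportion to its current probability $\pi(v)$. Nothing in that update says which of the unsampled tokens is actually better.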

So the real exploration problem is not “How do we make the model wander more widely?” It is “How do we make exploration point toward the behaviors the reward function actually wants?”

This is where rubrics become more valuable than their usual role. In standard rubric-based RL, checklist items are aggregated into a reward. The rubric is compressed into a number. HeRL refuses to throw away the text. It uses unmet rubric items as language guidance, turning the reward system from a scoreboard into a coach. A slightly annoying coach, perhaps, but still more useful than a scoreboard.

The mechanism: a failed answer becomes a guided second attempt

HeRL works in three broad steps.

First, for each instruction, the policy samples several candidate responses. In the experiments, standard RLVR samples eight responses, while HeRL samples seven ordinary rollouts plus one hindsight-guided trajectory, keeping the rollout budget comparable. Each response is judged against checklist-style rubrics, producing both a scalar reward and a list of satisfied and unsatisfied rubric items.
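A minimal sketch of what "both a scalar reward and a list of satisfied and unsatisfied rubric items" can look like in code. The names `RubricItem`, `Evaluation`, and `evaluate` are hypothetical, and the weighted-fraction aggregation is an assumption for illustration, not the paper's stated reward formula.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class RubricItem:
    text: str
    weight: float = 1.0


@dataclass
class Evaluation:
    reward: float               # scalar used by the RL update
    satisfied: List[RubricItem]
    unmet: List[RubricItem]     # kept as text, not compressed away


def evaluate(verdicts: Sequence[bool], rubric: Sequence[RubricItem]) -> Evaluation:
    """Combine per-item judge verdicts into a reward plus the item lists.

    verdicts[i] is True when the judge marks rubric[i] as satisfied.
    Assumption: the reward is the weighted fraction of satisfied items.
    """
    if len(verdicts) != len(rubric):
        raise ValueError("need exactly one verdict per rubric item")
    satisfied = [item for item, ok in zip(rubric, verdicts) if ok]
    unmet = [item for item, ok in zip(rubric, verdicts) if not ok]
    total = sum(item.weight for item in rubric)
    reward = sum(item.weight for item in satisfied) / total if total else 0.0
    return Evaluation(reward=reward, satisfied=satisfied, unmet=unmet)
```

The point of the structure is that `unmet` survives the aggregation step, so it is still available later as guidance.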

Second, HeRL selects the failed trajectory with the highest reward for revision. This detail matters. The framework does not try to repair every bad answer. It looks for near-misses: answers that already satisfy many criteria but still miss some important items. The authors connect this to the zone of proximal development: learning is more efficient when feedback targets a gap the model can plausibly cross, rather than a canyon decorated with good intentions.
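The selection rule is simple enough to sketch. The function name `select_near_miss` and the `pass_threshold` parameter are hypothetical; the sketch assumes a rollout counts as failed whenever its reward is below a full score of 1.0, which is one plausible reading rather than the paper's exact definition.

```python
from typing import Optional, Sequence


def select_near_miss(rewards: Sequence[float], pass_threshold: float = 1.0) -> Optional[int]:
    """Return the index of the highest-reward rollout that still fails.

    Returns None when every rollout already passes, in which case
    there is nothing to revise for this instruction.
    """
    failed = [i for i, r in enumerate(rewards) if r < pass_threshold]
    if not failed:
        return None
    # max() returns the first index on ties, so selection is deterministic.
    return max(failed, key=lambda i: rewards[i])
```

For rewards `[0.2, 0.9, 1.0, 0.6]` this picks index 1: the answer that is closest to passing without having passed.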

Third, the model is prompted to revise the selected trajectory using the original answer plus the unmet rubrics. The best revised response is added to the training group, and the model is trained on both original rollouts and the guided improvement. The prompt templates in the appendix are revealing. They explicitly ask the model to keep satisfied rubrics intact, make minimal necessary edits, avoid unsupported facts, preserve required formatting, and output only the revised answer. This is not generic “self-reflection.” It is controlled repair.
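To make "controlled repair" tangible, here is an illustrative prompt builder. The wording is written for this example and is not the template from the paper's appendix; `build_revision_prompt` and its parameters are hypothetical names. It only encodes the constraints listed above.

```python
from typing import Sequence


def build_revision_prompt(
    instruction: str,
    draft: str,
    satisfied: Sequence[str],
    unmet: Sequence[str],
) -> str:
    """Assemble a revision prompt from a draft and its rubric verdicts."""
    if not unmet:
        raise ValueError("nothing to revise: no unmet rubric items")
    kept = "\n".join(f"- {item}" for item in satisfied) or "- (none)"
    missing = "\n".join(f"- {item}" for item in unmet)
    return (
        f"Instruction:\n{instruction}\n\n"
        f"Previous answer:\n{draft}\n\n"
        f"Rubric items already satisfied (keep these intact):\n{kept}\n\n"
        f"Rubric items not yet satisfied (fix these):\n{missing}\n\n"
        "Revise the previous answer with the minimal edits needed. "
        "Do not add unsupported facts, preserve the required formatting, "
        "and output only the revised answer."
    )
```

Listing the satisfied items explicitly is a design choice: it gives the model something concrete to protect while it edits.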

A simplified view looks like this:

| Stage | Standard rubric RLVR | HeRL |
| --- | --- | --- |
| Candidate generation | Sample responses from the current policy | Sample ordinary responses, then generate one guided revision |
| Evaluation | Aggregate rubric satisfaction into a scalar reward | Keep both scalar reward and unmet rubric language |
| Failure handling | Penalize weak samples through RL updates | Reuse near-miss failures as diagnostic context |
| Exploration target | Whatever the current policy can stumble into | Behaviors described by the missing rubric items |
| Training signal | Reward only | Reward plus revised trajectory conditioned on hindsight experience |

The critical shift is not that HeRL adds more text to the prompt. The shift is that the training process learns from a trajectory that would not have appeared under blind sampling alone: an answer produced after the model is shown what was missing.

That is why the authors also add two stabilizers.

The first is a bonus reward. If a failed response can be improved substantially under guidance, its original reward is slightly increased by an improvement-potential term. In the paper, this bonus is controlled by $\alpha = 0.05$. The intuition is useful: not all failures are equal. A near-miss that can be repaired teaches more than a hopeless answer that would need to be rewritten from the ground up.
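A rough sketch of the idea follows. The exact form of the improvement-potential term is not given here, so the formula below is an assumption chosen for illustration: the bonus is $\alpha$ times the reward gained by the guided revision, floored at zero. The function name `reward_with_bonus` is hypothetical.

```python
def reward_with_bonus(original: float, revised: float, alpha: float = 0.05) -> float:
    """Add a small improvement-potential bonus to a failed rollout's reward.

    Assumption (not the paper's stated formula): the bonus equals
    alpha * max(0, revised - original), so a rollout whose guided
    revision did not improve gets no bonus at all.
    """
    return original + alpha * max(0.0, revised - original)
```

Under this assumed form, a rollout scoring 0.6 whose revision reaches 1.0 is credited with about 0.62. The adjustment is deliberately small: it reorders near-misses relative to hopeless answers without letting a failed response outrank a genuinely good one.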

The second is policy shaping for revised trajectories. Since guided revisions may contain useful but low-probability tokens, HeRL uses a regularized importance sampling function, $f(x)=x/(x+\gamma)$ with $\gamma=1$ in the experiments, to encourage learning from desirable improvements without letting off-policy behavior destabilize training. The method also masks the loss on the hindsight-experience text itself, so the model learns to produce the revised answer rather than to imitate the feedback block. Good. We do not need a model that memorizes the teacher’s comments and forgets the homework.
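Both pieces are small enough to sketch in plain Python. The shaping function is the one quoted above; what $x$ stands for is an assumption here (a non-negative probability ratio for a revised token), and `shaped_ratio` and `masked_mean` are hypothetical helper names, not the paper's code.

```python
from typing import Sequence


def shaped_ratio(ratio: float, gamma: float = 1.0) -> float:
    """f(x) = x / (x + gamma): a bounded stand-in for the raw ratio x."""
    if ratio < 0 or gamma <= 0:
        raise ValueError("ratio must be >= 0 and gamma must be > 0")
    return ratio / (ratio + gamma)


def masked_mean(per_token_loss: Sequence[float], mask: Sequence[int]) -> float:
    """Average the loss over tokens whose mask is 1.

    Tokens from the hindsight-experience text get mask 0, so they
    contribute nothing to the update; only response tokens count.
    """
    kept = [loss for loss, m in zip(per_token_loss, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0
```

With $\gamma = 1$, `shaped_ratio` maps a ratio of 1 to 0.5 and stays below 1 however large the ratio grows, which is the sense in which it keeps off-policy tokens from dominating the update.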

The theoretical claim is about better gradient estimates, not magic reflection

The paper’s formal argument is compact but important. It says that the gap between the ideal gradient and the estimated gradient is bounded by the unsatisfied rubric weights. In practical terms: when a response satisfies more of the rubric, the gradient is a better proxy for the direction the reward function actually wants.

That does not prove that HeRL will always work. It does not mean rubrics are objective truth, and it does not make LLM judges immune to bias. The theorem supports a narrower claim: if rubrics define the target behavior, then reducing unmet rubric mass should reduce the mismatch between the training signal and the ideal reward direction.

This is why the mechanism-first reading matters. The paper is not merely saying “we tried a new RL recipe and the numbers improved.” It is saying that scalar rewards are too compressed for open-ended reasoning tasks, and that rubric language can guide the model toward better samples before the RL update happens.

The main results: HeRL wins because it improves the search, not because it sees more attempts

The paper evaluates HeRL across instruction-following, writing, and medical benchmarks, using Qwen2.5-7B-Instruct, Llama-3.2-3B-Instruct, and Qwen3-4B-Instruct-2507. The baselines are SFT, DPO, and standard rubric-level RLVR. WritingBench is especially useful because writing tasks are not included in the training data, making it a close-domain generalization test rather than a direct training match.

The headline result is straightforward: HeRL posts the best score on all six reported in-domain or close-domain benchmarks for all three model families.

| Model | Method | IFEval | IFBench | MulDimIF | WritingBench | LLMEval-Med | HealthBench-500 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-7B | Initial | 72.6 | 26.2 | 51.4 | 57.0 | 56.0 | 24.4 |
| Qwen2.5-7B | RLVR | 77.3 | 31.6 | 73.5 | 54.8 | 60.5 | 30.5 |
| Qwen2.5-7B | HeRL | 82.4 | 39.7 | 83.4 | 59.1 | 65.0 | 34.3 |
| Llama-3.2-3B | Initial | 71.2 | 23.8 | 35.8 | 30.5 | 16.1 | 14.5 |
| Llama-3.2-3B | RLVR | 79.1 | 26.6 | 77.6 | 39.7 | 18.5 | 17.8 |
| Llama-3.2-3B | HeRL | 82.4 | 30.6 | 84.7 | 45.4 | 18.7 | 26.6 |
| Qwen3-4B | Initial | 83.4 | 29.9 | 57.3 | 84.3 | 74.5 | 42.0 |
| Qwen3-4B | RLVR | 85.8 | 36.9 | 79.0 | 83.9 | 78.1 | 41.7 |
| Qwen3-4B | HeRL | 86.1 | 39.7 | 82.5 | 85.7 | 79.3 | 43.6 |

The magnitude varies. On Qwen2.5-7B, HeRL improves IFBench from 31.6 under RLVR to 39.7, and MulDimIF from 73.5 to 83.4. On Llama-3.2-3B, HealthBench-500 rises from 17.8 under RLVR to 26.6. On Qwen3-4B, the gains over RLVR are smaller but positive on all six benchmarks, which is what one would expect from a stronger starting model with less headroom.

The WritingBench result deserves attention. Qwen2.5-7B drops under SFT, DPO, and RLVR (from 57.0 to 54.8 in the RLVR case), but rises to 59.1 under HeRL. On Llama-3.2-3B, RLVR already improves writing (30.5 to 39.7), and HeRL extends that to 45.4, the strongest writing gain among the post-training methods. This suggests that HeRL is not merely overfitting to the exact training format. More cautiously, it suggests the model may be learning a transferable habit: preserve what works, diagnose what fails, and revise toward explicit criteria.

That is a useful habit. Some humans could try it as well.

The comparison tests show guidance beats blind diversity

Before the full training results, the paper runs a smaller empirical exploration test on 500 HealthBench questions using Qwen2.5-7B-Instruct. It compares stochastic sampling, entropy-based search, and guided sampling by hindsight experience under the same number of attempts.

The result is the paper’s cleanest evidence for its core mechanism. Guided sampling reaches 50.6% PassRate at four attempts. Stochastic sampling reaches 40.6%. Entropy-based search performs worse than stochastic sampling by the fourth attempt, reaching 38.0%.

This is not a full training result. It is a mechanism probe. Its likely purpose is to show that the guidance itself produces better candidate answers under equal sampling budgets. The finding supports the paper’s central claim that exploration quality matters more than raw diversity. Branching at high-entropy tokens may make outputs different, but different is not the same as closer to the rubric.

A practical translation:

| Test | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Guided sampling vs stochastic and entropy search | Mechanism evidence | Hindsight guidance produces better candidates under the same attempt budget | That every training pipeline should use this exact selection rule |
| Main benchmark table | Main evidence | HeRL outperforms SFT, DPO, and RLVR across the tested models and domains | That gains will transfer unchanged to all domains or judge models |
| OOD benchmarks | Robustness check | HeRL does not show obvious reasoning collapse on MATH-500, GPQA, and MMLU-Pro | That HeRL improves every unrelated capability |
| Ablation table | Component attribution | Preserving hindsight context and adding bonus reward matter | That the hyperparameters are optimal |
| Training dynamics | Diagnostic analysis | HeRL sustains entropy and validation reward better than RLVR | That entropy alone caused the final gains |
| Test-time iterative revision | Exploratory extension | The learned revision behavior can be useful during inference | That test-time self-improvement is safe or cost-effective in production by default |

This distinction is important because papers often contain many figures, and not every figure should be promoted into a second thesis. Here, the auxiliary tests mainly reinforce one mechanism: guided near-miss repair gives the policy better material to learn from.

The ablation table says naive repair is not enough

One of the most useful parts of the paper is the ablation study. It compares four settings: baseline RLVR, naive hindsight experience, hindsight experience with loss masking and context preservation, and full HeRL with bonus reward.

The key result is not simply that the full method wins. The interesting result is that naive hindsight can hurt.

For Qwen3-4B, naive hindsight drops WritingBench from 83.9 under RLVR to 66.2. For Qwen2.5-7B, naive hindsight improves IFBench slightly but lowers WritingBench and HealthBench-500. The authors interpret this as off-policy instability: directly reusing revised responses without preserving the hindsight context treats the revised answer as if it came from the ordinary policy distribution, when it actually came from a guided repair process.

That is the engineering lesson. The gain does not come from dumping better-looking revised answers into training data. It comes from preserving the causal structure: original attempt, failed rubric items, guided revision, masked feedback text, and RL update over the model-generated response.

This matters for enterprise teams because the tempting shortcut is obvious. Collect failed outputs, ask a stronger model to fix them, fine-tune on the fixes, and call it a day. Sometimes that may work. But this paper suggests that for rubric-guided RL, the context of repair is part of the signal. Remove it, and the “improvement data” may become noisy off-policy residue.

Very glamorous. Also very expensive to debug.

OOD results: no obvious collapse, not a universal upgrade

The paper also checks whether HeRL harms out-of-distribution reasoning on MATH-500, GPQA, and MMLU-Pro. The results are mixed but broadly stable.

Llama-3.2-3B improves on GPQA and MMLU-Pro but drops on MATH-500. Qwen2.5-7B improves on MATH-500 but drops slightly on GPQA and MMLU-Pro. Qwen3-4B improves on MATH-500 and MMLU-Pro but drops on GPQA.

The correct reading is modest: HeRL does not show obvious OOD collapse in these tests. It is not evidence that rubric-guided medical and instruction-following training magically upgrades all reasoning domains. The practical takeaway is defensive rather than triumphant. The method appears to improve target capabilities without severely damaging unrelated benchmarks in the reported setting.

That is already useful. In enterprise deployment, avoiding regression is often less exciting than achieving a leaderboard jump, but it is also what prevents a model update from turning into an internal incident report.

Test-time revision hints at a product pattern

The paper’s test-time self-improvement experiment is especially relevant for business systems. On HealthBench-500, the authors compare ordinary Pass@$k$ sampling against iterative guided revision. Starting from round two, the model uses the original trajectory and unmet rubrics as hindsight guidance for the next round.

For RLVR, iterative revision rises from 41.7 at one round to 68.2 by round four, while Pass@$k$ reaches 52.9. For HeRL, iterative revision rises from 43.5 to 72.3 by round four, while Pass@$k$ reaches 55.0. The larger gain under HeRL suggests that the model has internalized some ability to use the guidance, rather than merely benefiting from generic in-context correction.
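The comparison is easier to read as differences. Using only the figures quoted above, the gain from round one to round four under iterative revision is:

$$
\begin{aligned}
\text{RLVR:} \quad & 68.2 - 41.7 = 26.5 \\
\text{HeRL:} \quad & 72.3 - 43.5 = 28.8
\end{aligned}
$$

And the round-four lead of iterative revision over Pass@$k$ is:

$$
\begin{aligned}
\text{RLVR:} \quad & 68.2 - 52.9 = 15.3 \\
\text{HeRL:} \quad & 72.3 - 55.0 = 17.3
\end{aligned}
$$

Both differences are about two points larger for HeRL. That is a modest margin, consistent with the cautious reading that training with hindsight guidance makes the model somewhat better at using it.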

This is not just a training result. It points to an application design:

  1. Generate an answer.
  2. Evaluate it against explicit rubrics.
  3. Preserve the satisfied criteria.
  4. Revise only the failed criteria.
  5. Repeat within a bounded budget.
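The five steps above can be sketched as a bounded loop. This is a minimal illustration, not the paper's inference procedure: `revise_until_pass` is a hypothetical name, and `generate`, `judge`, and `revise` are placeholders the caller must supply (for example, a model call and a rubric judge).

```python
from typing import Callable, List, Tuple

Judge = Callable[[str], Tuple[List[str], List[str]]]  # returns (satisfied, unmet)


def revise_until_pass(
    generate: Callable[[], str],
    judge: Judge,
    revise: Callable[[str, List[str], List[str]], str],
    max_rounds: int = 4,
) -> Tuple[str, List[str], int]:
    """Generate once, then revise against unmet rubric items.

    Stops when nothing is unmet or when max_rounds answers have been
    produced. Returns (final answer, remaining unmet items, rounds used).
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    answer = generate()                      # step 1
    satisfied, unmet = judge(answer)         # step 2
    rounds = 1
    while unmet and rounds < max_rounds:     # step 5: bounded budget
        # steps 3 and 4: pass both lists so the reviser knows what to
        # keep and what to fix
        answer = revise(answer, satisfied, unmet)
        satisfied, unmet = judge(answer)
        rounds += 1
    return answer, unmet, rounds
```

Returning the remaining `unmet` list matters in practice: when the budget runs out, the caller can see exactly which criteria are still failing instead of receiving a silently imperfect answer.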

Many agentic systems already use this pattern externally. HeRL shows a way to train the model so that the pattern becomes more natural to the model itself. That does not remove the need for external evaluation. It may reduce the amount of orchestration needed to make correction work.

For Cognaptus-style automation work, this is the business-relevant bridge. The practical value is not “the model reflects.” Reflection is a metaphor, and metaphors tend to invoice poorly. The value is cheaper diagnosis, better reuse of failed outputs, and a more structured path from evaluation to revision.

What businesses can infer, and what they cannot

The paper directly shows that HeRL improves performance over the tested baselines across the selected instruction-following, writing, and medical benchmarks. It also shows that guided sampling finds better candidates under equal attempt budgets, that naive repair is insufficient, and that the trained model benefits from iterative guided revision at test time.

Cognaptus can reasonably infer three operational lessons.

First, rubric design becomes infrastructure. If unmet rubric items are used as guidance, the quality of the rubric determines the quality of exploration. A vague rubric produces vague repair. A narrow rubric may overtrain the model toward surface compliance. A well-designed rubric becomes a reusable diagnostic asset.

Second, failure logs should be kept in structured form. A bad answer alone is low-value data. A bad answer plus satisfied criteria, failed criteria, judge rationale, model revision, and final score is much more useful. The data model matters. The boring database schema is, once again, where the strategy quietly lives.
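One possible shape for such a record, as a sketch. The class name `EvaluationTrace` and its field names are hypothetical, chosen to mirror the items listed above; a real schema would depend on the team's storage and judge setup.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class EvaluationTrace:
    instruction: str
    response: str
    satisfied: List[str]          # rubric items the judge marked as met
    unmet: List[str]              # rubric items the judge marked as failed
    judge_rationale: str
    score: float
    revision: Optional[str] = None        # filled in only if a repair was attempted
    revised_score: Optional[float] = None

    def improved(self) -> bool:
        """True only when a revision exists and scored higher than the original."""
        return self.revised_score is not None and self.revised_score > self.score

    def to_json(self) -> str:
        """Serialize to one JSON line, suitable for an append-only log."""
        return json.dumps(asdict(self), ensure_ascii=False)
```

Keeping `satisfied` and `unmet` as separate fields, rather than a single score, is what lets the same log later feed near-miss selection or guided revision.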

Third, near-miss selection may be more profitable than blanket correction. HeRL focuses on high-reward failures because they are close enough to be repairable. For enterprise tuning, this suggests prioritizing borderline outputs: almost-correct customer service answers, partially compliant reports, draft analyses that meet most but not all requirements. These are the samples most likely to teach the model how to cross the final gap.

What remains uncertain is equally important. The experiments rely on rubric datasets with broad coverage, GPT-4o mini as the rubric judge, and specific model families. The paper does not prove that the method will work equally well where rubrics are politically contested, legally delicate, or hard to operationalize. It also does not settle the cost trade-off: generating guided revisions and judge evaluations may be cheaper than blind rollouts in some settings, but the economics depend on model costs, latency budgets, and acceptable failure rates.

The boundary: HeRL needs good rubrics, not just more clever training

The authors identify two core limitations. First, HeRL depends on high-quality, broad-coverage rubric datasets, which are scarce and expensive. Second, rubrics are predefined and static, while the model’s capability boundary changes during training. A rubric that was useful early may become uninformative later; a missing criterion may become the new bottleneck.

This is not a footnote-level concern. It defines where the method is likely to matter.

HeRL is strongest where tasks can be decomposed into meaningful criteria: medical QA checklists, instruction-following constraints, compliance review, structured writing, customer support quality, internal research memos, and analytical reports. It is weaker where success is ambiguous, taste-driven, adversarial, or dependent on facts the judge cannot verify.

The next obvious extension is adaptive rubrics: rubrics that evolve as the model improves, focusing evaluation on the criteria the model still misses. That would make the feedback loop more like a competent tutor and less like a laminated checklist from a training seminar. The paper hints at this direction, and it is probably the right one.

Conclusion: the useful failure is the one that explains itself

HeRL is interesting because it attacks a specific inefficiency in LLM reinforcement learning. It does not simply ask the model to sample more. It asks the model to reuse its own failed attempts, provided those failures come with a diagnosis.

That is the central business lesson. In AI systems, failure is not automatically valuable. Most failure is just noise wearing a lesson-shaped hat. Failure becomes valuable when it is structured: what worked, what failed, why it failed, how it was revised, and whether the revision actually improved the score.

For companies building LLM workflows, the message is clear enough. Do not only collect outputs. Collect the evaluation traces around them. Do not only score answers. Preserve the rubric-level reasons. Do not only ask the model to try again. Tell it what to keep, what to fix, and what not to invent along the way.

That is not glamorous AI. It is better process design.

Which, inconveniently for the hype cycle, is often where the real gains hide.

Cognaptus: Automate the Present, Incubate the Future.


  1. Wenjian Zhang, Kongcheng Zhang, Jiaxin Qi, Baisheng Lai, and Jianqiang Huang, “Experience is the Best Teacher: Motivating Effective Exploration in Reinforcement Learning for LLMs,” arXiv:2603.20046, 2026. https://arxiv.org/pdf/2603.20046