When Rewards Learn Back: Evolution, but With Gradients

Rewards are where many agent projects go to become expensive folklore.

A team wants an AI agent to complete long workflows: search, reason, call tools, check constraints, recover from mistakes, and produce a useful answer. The model can talk. The tools work. The benchmark demo is acceptable. Then reinforcement learning enters the room, and someone has to decide what “good” means at every step.

Give only a final success reward, and the signal is often too sparse. The agent may stumble through a long trajectory and learn almost nothing from the individual decisions that mattered. Add a few handcrafted intermediate rewards, and the system may start optimizing the wrong rituals: nice formatting, longer reasoning, harmless-looking subgoals, or whatever accidental proxy happens to pay. Train a large reward model, and the problem becomes a data-labeling and governance exercise with a research budget wearing a business hat.

The paper Differentiable Evolutionary Reinforcement Learning proposes a more interesting route: do not just train the agent with a reward function; train another model to design the reward function, using the agent’s validation performance as feedback.¹ That sounds dangerously close to “let the model invent its own objectives,” which is normally where one reaches for coffee and legal review. But the actual mechanism is narrower, more structured, and more useful than the slogan.

The point of DERL is not that reward design disappears. The point is that reward design becomes an optimization problem with a learnable direction.

The old reward problem is not scarcity; it is credit assignment

The paper starts from a familiar reinforcement learning bottleneck. In complex reasoning tasks, outcome rewards are clean but thin. A math answer is correct or not. An ALFWorld household task is completed or not. A ScienceWorld experiment succeeds or fails. These signals are verifiable, which is good, but they arrive late, which is inconvenient. A late reward is like a manager who says “bad quarter” and then refuses to discuss sales, pricing, product, or execution. Technically feedback, practically rude.

Dense rewards solve part of this. They give the policy more local guidance. But dense rewards are difficult to design because they smuggle assumptions into training. A format reward may help if the task requires structured output. It may hurt if the model learns to worship the box and neglect the answer inside it. A process reward may help if it captures real progress. It may become a decorative checklist if it does not.

This is the narrow technical space DERL enters. It does not ask an LLM to write arbitrary reward code from vibes. It starts with simple atomic primitives: small executable reward components such as final correctness, formatting, temporal progress signals in an agent trajectory, step indicators, or soft answer matching. The system then learns how to compose those primitives into a Meta-Reward.

That distinction matters. DERL does not eliminate human judgment from the reward universe. Humans still define the primitive vocabulary. But the paper’s claim is that humans do not need to hand-design the final reward composition. The search over structure and weights can be learned.
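To make "primitive vocabulary" and "composition" concrete, here is a minimal sketch for the math setting. Everything in it is an illustrative assumption rather than the paper's implementation: the function names, the regexes, and the restriction to a plain weighted sum (the paper's Meta-Optimizer can produce richer symbolic structures).

```python
import re
from typing import Callable, Dict, Optional


def boxed_answer(output: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the output, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", output)
    return matches[-1].strip() if matches else None


# Hypothetical primitives: each maps (model output, gold answer) to a score.
def correctness(output: str, gold: str) -> float:
    return 1.0 if boxed_answer(output) == gold else 0.0


def formatting(output: str, gold: str) -> float:
    return 1.0 if boxed_answer(output) is not None else 0.0


def step_indicator(output: str, gold: str) -> float:
    return 1.0 if re.search(r"\bstep\s*\d+", output, re.IGNORECASE) else 0.0


def soft_match(output: str, gold: str) -> float:
    return 1.0 if gold in output else 0.0


PRIMITIVES: Dict[str, Callable[[str, str], float]] = {
    "correctness": correctness,
    "formatting": formatting,
    "step_indicator": step_indicator,
    "soft_match": soft_match,
}


def make_meta_reward(weights: Dict[str, float]) -> Callable[[str, str], float]:
    """Compose named primitives into one reward (weighted sum, for simplicity)."""
    unknown = set(weights) - set(PRIMITIVES)
    if unknown:
        raise ValueError(f"unknown primitives: {sorted(unknown)}")

    def reward(output: str, gold: str) -> float:
        return sum(w * PRIMITIVES[name](output, gold) for name, w in weights.items())

    return reward


reward = make_meta_reward({"correctness": 1.0, "formatting": 0.2, "soft_match": 0.1})
print(reward("Step 1: 2 + 2 = 4. Final answer: \\boxed{4}", "4"))
```

The humans own `PRIMITIVES`; the learned part is the `weights` dictionary (and, in the paper, the structure that combines them).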

A simple way to read the method is:

Layer	What is being optimized	Feedback signal	Practical meaning
Inner loop	The policy agent	Meta-Reward generated from primitives	“Train the worker using this incentive scheme.”
Outer loop	The Meta-Optimizer	Validation performance of the trained policy	“Update the incentive designer based on whether the worker actually improved.”

This is the first important correction to a likely misconception. DERL is not merely prompt-based reward tweaking. It is also not random evolutionary search with a better press release. The Meta-Optimizer is parameterized and updated through reinforcement learning. Its generated reward structures are evaluated by training inner policies and measuring validation performance. The outer loop then updates the Meta-Optimizer so future reward proposals become better.

In other words, the system tries to learn not just which reward worked, but how reward changes tend to affect policy performance.

DERL turns reward evolution into a bi-level learning loop

The paper describes DERL as a bi-level framework. That phrase can sound heavier than it is. The basic loop is straightforward:

1. A Meta-Optimizer receives a task instruction.
2. It generates a symbolic reward configuration using atomic primitives.
3. An inner-loop policy is trained with that generated reward.
4. The trained policy is evaluated on a validation split.
5. That validation score becomes the outer-loop reward for updating the Meta-Optimizer.
6. The Meta-Optimizer generates better reward configurations in later rounds.

The mechanism can be summarized as:

Task instruction
      ↓
Meta-Optimizer generates reward structure
      ↓
Inner policy trains under that reward
      ↓
Validation performance measures whether the reward actually helped
      ↓
Meta-Optimizer updates toward reward structures that produce better policies

The paper calls this a way to capture an estimated meta-gradient. This does not mean DERL directly backpropagates through every step of inner-loop policy training in the usual differentiable programming sense. The authors are careful in the related-work discussion: DERL avoids differentiating through the full inner-loop optimizer. Instead, it treats reward generation as a higher-level RL problem. The outer model learns from validation-performance feedback over generated reward configurations.
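A toy version of that outer loop fits in a few lines. The sketch below is a stand-in under heavy assumptions: the Meta-Optimizer is reduced to a softmax over three hypothetical reward configurations, inner-loop training is replaced by a stub that returns a made-up noisy validation score, and the update is a plain REINFORCE step with a group-mean baseline chosen for simplicity. None of the names or numbers come from the paper.

```python
import math
import random

# Hypothetical candidate reward configurations (weights over named primitives).
CANDIDATES = [
    {"outcome": 1.0},
    {"outcome": 1.0, "format": 1.0},
    {"outcome": 1.0, "progress": 0.5},
]
# Made-up validation scores standing in for real inner-loop training results.
TRUE_VAL_SCORE = [0.30, 0.25, 0.60]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_and_validate(idx, rng):
    """Stub for: train a policy under CANDIDATES[idx], then score it on validation."""
    return TRUE_VAL_SCORE[idx] + rng.gauss(0.0, 0.05)


def run_outer_loop(iterations=200, rollouts=8, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(CANDIDATES)  # the "Meta-Optimizer" parameters
    for _ in range(iterations):
        probs = softmax(logits)
        picks = rng.choices(range(len(CANDIDATES)), weights=probs, k=rollouts)
        scores = [train_and_validate(i, rng) for i in picks]
        baseline = sum(scores) / len(scores)
        # REINFORCE: d log pi(i) / d logit_j = 1[i == j] - probs[j].
        for i, score in zip(picks, scores):
            advantage = score - baseline
            for j in range(len(logits)):
                indicator = 1.0 if i == j else 0.0
                logits[j] += lr * advantage * (indicator - probs[j]) / rollouts
    return softmax(logits)


print(run_outer_loop())
```

No gradient flows through `train_and_validate`; the outer parameters move only because sampled configurations with above-average validation scores become more likely. That is the sense in which the meta-gradient is estimated rather than backpropagated.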

That is why the “evolution” part and the “differentiable” part both matter. The system still explores possible reward structures. But unlike black-box mutation, the exploration is not supposed to remain blind: the Meta-Optimizer’s own parameters are updated from validation feedback, so later proposals are shaped by what earlier ones achieved.

This is the business-relevant piece hiding under the technical vocabulary. In an enterprise agent, the practical question is rarely “Can we write a reward?” It is “Can we maintain reward logic as tasks, tools, users, and failure modes change?” DERL suggests one answer: keep human-designed primitives relatively simple, then let a trained meta-level system discover how to combine them for the actual operating environment.

Not magic. More like replacing a spreadsheet of hand-tuned incentives with an incentive-design model that is itself trained against validation outcomes. Slightly less romantic, much more useful.

The reward search space is constrained on purpose

A weaker version of this idea would ask an LLM to produce arbitrary reward functions. That would be expressive, dramatic, and operationally cursed. Arbitrary generated code creates validity problems, security problems, interpretability problems, and a search space large enough to hide several bad decisions.

DERL avoids that by using structured primitives. For ALFWorld and ScienceWorld, the paper uses four primitives: final outcome plus trajectory-stage rewards computed over the first, middle, and final thirds of the interaction. For GSM8K and MATH, it uses correctness, boxed-answer formatting, step-by-step indicators, and a soft outcome match where the correct answer appears somewhere in the raw output.
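For the agent environments, the four-primitive vocabulary could look roughly like the sketch below. The per-step `step_scores` input is an assumption made for illustration (the summary above does not say what exactly is measured inside each third), as are the function and key names.

```python
from typing import Dict, List


def _mean(xs: List[float]) -> float:
    """Average of a list, with an empty segment scoring 0.0."""
    return sum(xs) / len(xs) if xs else 0.0


def agent_primitives(step_scores: List[float], success: bool) -> Dict[str, float]:
    """Final outcome plus stage scores over the first, middle, and final thirds.

    step_scores is a hypothetical per-step progress signal in [0, 1].
    """
    n = len(step_scores)
    a, b = n // 3, (2 * n) // 3  # boundaries between the three stages
    return {
        "outcome": 1.0 if success else 0.0,
        "first_third": _mean(step_scores[:a]),
        "middle_third": _mean(step_scores[a:b]),
        "final_third": _mean(step_scores[b:]),
    }


print(agent_primitives([1, 1, 0, 0, 0, 1], success=True))
```

A trajectory that starts well, stalls in the middle, and partly recovers would score high on `first_third`, zero on `middle_third`, and somewhere in between on `final_third`, which is exactly the kind of temporal detail a single outcome bit cannot express.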

These primitives are intentionally simple. The paper’s argument is not that the authors discovered a brilliant handcrafted reward vocabulary. It is almost the opposite. The primitives are ordinary enough that the contribution shifts to the composition mechanism.

This is important for interpretation. If DERL only worked because the authors quietly engineered perfect primitives, the framework would be less impressive. The appendix directly tests this concern by removing individual primitives and even reversing one primitive into a toxic signal. DERL still performs well, and in the reversed-signal case it appears to learn how to use the flipped signal with an appropriate negative coefficient. That is not a full proof of primitive-agnostic robustness, but it does make the paper’s claim more credible: the method is not simply a fragile wrapper around a hand-designed reward recipe.
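Why a negative coefficient can neutralize a flipped signal is simple arithmetic. The sketch assumes "reversed" means replacing a primitive `r` in [0, 1] with `1 - r`; the paper's exact reversal may differ, and the weights here are invented.

```python
def reward_original(r: float, w: float = 0.5) -> float:
    """Reward term using the original primitive with a positive weight."""
    return w * r


def reward_flipped(r_flipped: float, w: float = -0.5) -> float:
    """Reward term using the reversed primitive with a negative weight."""
    return w * r_flipped


for r in (0.0, 0.25, 1.0):
    gap = reward_original(r) - reward_flipped(1.0 - r)
    print(r, gap)  # the gap is the same constant for every r
```

Because `w * r` equals `-w * (1 - r) + w`, the two rewards differ only by a constant offset. Any training scheme that subtracts a baseline from the reward before computing advantages would treat them identically, so an optimizer that learns the sign has effectively repaired the primitive.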

The main evidence is strongest where reward design is hardest

The paper evaluates DERL across three domains: ALFWorld for embodied household-style agent tasks, ScienceWorld for interactive scientific simulation, and GSM8K/MATH for mathematical reasoning.

The most interesting results are not the average gains. They are the out-of-distribution gains in agent tasks.

In ALFWorld, the paper uses three generalization levels:

L0: trained and evaluated on seen task variants;
L1: trained on all task types but evaluated on unseen variants;
L2: trained on four task types and evaluated on two unseen task types.

The L2 setting is where brittle reward design gets exposed. On ALFWorld L2, GRPO with outcome reward reaches 29.7%, GRPO with average primitive reward reaches 30.5%, RLVMR reaches 56.3%, DERL reaches 65.0%, and DERL-pop reaches 76.4%. The average-reward baseline looks useful in-distribution, but barely improves over outcome reward in the hardest OOD setting. That is the exact failure mode business users should care about: a reward design that looks sensible during development and then becomes decorative under distribution shift.

ScienceWorld shows the same broad pattern, though with a different shape. On ScienceWorld L2, outcome reward reaches 10.9%, average reward reaches 18.0%, RLVMR reaches 26.5%, DERL reaches 30.1%, and DERL-pop reaches 31.3%. The spectacular DERL-pop numbers are actually in L0 and L1 (98.2% and 95.3%), while L2 improves only modestly over standard DERL. That distinction matters. DERL-pop seems very powerful when the policy can evolve through a curriculum-like sequence, but the hardest distribution shift remains hard. Reality, annoyingly, continues to exist.
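The L2 figures quoted above are easier to compare as percentage-point gaps. The snippet only re-uses the success rates already cited in this article; the dictionary layout and helper name are ours.

```python
# L2 success rates (%) as quoted in the text above.
L2_SUCCESS = {
    "ALFWorld": {"Outcome": 29.7, "Avg reward": 30.5, "RLVMR": 56.3,
                 "DERL": 65.0, "DERL-pop": 76.4},
    "ScienceWorld": {"Outcome": 10.9, "Avg reward": 18.0, "RLVMR": 26.5,
                     "DERL": 30.1, "DERL-pop": 31.3},
}


def gain_over(baseline: str, env: str) -> dict:
    """Percentage-point gain of every method over the named baseline."""
    base = L2_SUCCESS[env][baseline]
    return {m: round(v - base, 1) for m, v in L2_SUCCESS[env].items() if m != baseline}


for env in L2_SUCCESS:
    print(env, "vs Outcome:", gain_over("Outcome", env))
    print(env, "vs DERL:   ", gain_over("DERL", env))
```

Read this way, averaging the primitives buys under one point on ALFWorld L2, while the population variant adds 11.4 points over standard DERL on ALFWorld but only 1.2 on ScienceWorld.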

For mathematical reasoning, the gains are smaller but still informative. With MATH+GSM8K training data, outcome reward gets 82.6% on GSM8K and 58.8% on MATH. DERL reaches 87.0% and 60.2%. DERL-pop reaches 87.6% and 60.2%. With MATH-only training, DERL-pop reaches 84.1% on GSM8K and 60.9% on MATH.

The smaller math gains make sense. Math has stronger verifiable outcomes than long-horizon agent tasks. When correctness is already a clean signal, reward composition has less room to rescue the training process. Still, the paper reports a useful warning: adding obvious auxiliary rewards can hurt. Outcome+Format and Avg Reward improve GSM8K in some settings but degrade MATH relative to outcome-only. Apparently, even models can be distracted by paperwork.

The results table is not the full argument

The paper’s evidence works best when separated by purpose. Some tests provide main evidence. Others are ablations, robustness checks, comparisons with prior methods, or cost analysis. Treating them all as one pile of “more experiments” makes the paper harder to understand.

Evidence item	Likely purpose	What it supports	What it does not prove
ALFWorld and ScienceWorld main table	Main evidence and comparison with prior work	DERL improves success rates, especially OOD agent generalization	That the method will transfer unchanged to production workflows
GSM8K and MATH table	Main evidence in verifiable reasoning	DERL can help even when outcome reward is already strong	That reward evolution is equally valuable in all reasoning domains
Validation/test trajectory curves	Mechanism evidence	Outer-loop performance improves progressively rather than behaving like pure sampling	A formal guarantee of meta-gradient quality
Stable vs unstable reward structures	Exploratory mechanism analysis	Learned rewards increasingly favor bounded, stable structures	That all learned rewards are safe or aligned
RLAIF/GPT-4o comparison	Comparison with black-box LLM refinement	Parameter-updated small Meta-Optimizer beats prompt-only refinement in the tested setup	That DERL beats all possible LLM-agent reward-design systems
Compute-controlled random search	Robustness against “more compute” explanation	DERL’s gains are not just random sampling under equal search budget	That compute cost is negligible
Primitive removal/reversal tests	Ablation and robustness test	DERL is not overly dependent on one carefully chosen primitive set	That primitive design no longer matters
Rollout-count sensitivity	Sensitivity test	Similar performance appears with 4, 6, and 8 outer rollouts in the tested setup	That rollout count is universally unimportant

The compute-controlled random-search test is especially useful because it attacks the obvious skeptical explanation: perhaps DERL wins because it simply tries more reward functions. In the appendix, random search uses the same reward space and compute setting, with the outer-loop parameters frozen. It reaches 76.50% test performance on the tested GSM8K setup, while DERL reaches 79.22%. More importantly, the stepwise dynamics differ: DERL improves across iterations, while random search does not show the same progressive optimization.

The RLAIF comparison is also interesting because the baseline uses GPT-4o in the outer loop. The prompt-only LLM refinement baseline reaches 77.18% test performance, while DERL reaches 79.22% using a much smaller 0.5B Meta-Optimizer. This does not mean small open models generally beat GPT-4o. It means that in this reward-search setup, a parameter-updated optimizer beats a stronger but static prompted refiner. The lesson is not “smaller is better.” The lesson is “learning beats commenting.”

The mechanism evidence explains why DERL generalizes better

The paper’s strongest conceptual claim is that DERL learns an optimization direction. The authors support this with training dynamics: as the outer loop progresses, both validation and test performance rise on ALFWorld, GSM8K, and MATH. ScienceWorld is omitted from this figure because the Meta-Optimizer reportedly converges faster.

This is not just a decorative curve. It addresses the central concern: maybe the outer loop is overfitting to validation cases or sampling lucky reward functions. If validation rises while test does not, the method is just learning to please the tuning split. If both rise together, the case for a generalizable meta-signal becomes stronger.

There is still a boundary. Rising validation and test curves do not mathematically prove that DERL captures the true causal structure of the task. But they do show something more useful than a one-time leaderboard score: the optimization trajectory behaves like learning rather than lottery.

The structural analysis goes one level deeper. The authors categorize generated reward structures into stable, unstable, and invalid forms. Stable structures include linear combinations and normalization-like operations that bound outputs. Unstable structures rely on unbounded products, which can create harsh veto effects: if one primitive is near zero, the whole reward collapses. Invalid structures penalize desirable behaviors or otherwise lack optimization utility.
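The veto effect is easy to see numerically. This sketch compares a normalized weighted sum with a bare product on two made-up primitive vectors; the structure names and values are illustrative and are not the paper's taxonomy code.

```python
from math import prod
from typing import Sequence


def normalized_linear(p: Sequence[float], w: Sequence[float]) -> float:
    """Weighted sum divided by total absolute weight, so the output stays bounded."""
    return sum(wi * pi for wi, pi in zip(w, p)) / sum(abs(wi) for wi in w)


def product_reward(p: Sequence[float]) -> float:
    """Multiplicative structure: any near-zero primitive drags everything down."""
    return prod(p)


weights = [1.0, 1.0, 1.0, 1.0]
all_good = [1.0, 0.9, 0.8, 0.9]
one_weak = [1.0, 0.9, 0.8, 0.01]  # same trajectory, one primitive almost zero

for name, p in (("all_good", all_good), ("one_weak", one_weak)):
    print(name, round(normalized_linear(p, weights), 4), round(product_reward(p), 4))
```

Under the linear form, one weak primitive costs the trajectory about a quarter of its reward. Under the product, the same trajectory drops to almost nothing, so the policy receives nearly the same signal as if it had done everything wrong.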

Over outer-loop training on ALFWorld, stable structures increasingly dominate while unstable structures decline. This is a quietly important result. The authors did not directly tell the Meta-Optimizer, “Please prefer numerically stable reward compositions because variance is annoying.” The system appears to discover that bounded, robust reward structures produce better inner policies.

For business readers, this is the point where the paper becomes more than another benchmark improvement. DERL is not only selecting among reward components. It is also learning design regularities: stable reward math tends to train better agents. That is the sort of tacit engineering knowledge teams usually accumulate through painful trial and error, postmortems, and suspicious Slack threads.

DERL-pop is a curriculum mechanism, not just a variant name

The paper includes two versions: standard DERL and DERL-pop.

In standard DERL, the inner-loop policy resets to the base model for each outer-loop iteration. This makes attribution cleaner: performance changes are more directly tied to the current reward configuration. In DERL-pop, the next inner-loop policy starts from the best-performing checkpoint of the previous iteration. This creates a population-like training dynamic where rewards and policies evolve together.
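The difference between the two variants is one line of control flow: where the next inner loop starts. The toy below uses a single float as a stand-in for a policy checkpoint and a made-up update rule, so its numbers favor warm-starting by construction; it illustrates the bookkeeping, not the paper's results.

```python
from typing import List


def inner_train(start_skill: float, reward_quality: float) -> float:
    """Stub: training closes a fraction of the gap to 1.0 set by reward quality."""
    return start_skill + reward_quality * (1.0 - start_skill)


def run(qualities_per_iter: List[List[float]], base_skill: float = 0.2,
        pop: bool = False) -> List[float]:
    """Return the best inner-policy score at each outer iteration."""
    start = base_skill
    history = []
    for qualities in qualities_per_iter:
        trained = [inner_train(start, q) for q in qualities]
        best = max(trained)
        history.append(best)
        if pop:
            start = best  # DERL-pop: next round starts from the best checkpoint
        # Standard DERL: start stays at base_skill, i.e. reset to the base model.
    return history


rounds = [[0.1, 0.3], [0.2, 0.4], [0.3, 0.5]]  # hypothetical reward qualities
print("DERL:    ", run(rounds))
print("DERL-pop:", run(rounds, pop=True))
```

In the reset variant, each score reflects only the current reward configuration. In the population variant, each score also carries everything earlier rounds contributed, which is good for performance and worse for attribution.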

The authors describe this as curriculum-like. That interpretation is plausible. Early reward structures may help the policy reach a better region; later reward structures can then optimize from that region rather than restarting from scratch. In ScienceWorld L0 and L1, this seems extremely effective: DERL-pop jumps to 98.2% and 95.3%, far above standard DERL’s 47.7% and 43.0%.

But the L2 result is a useful corrective. DERL-pop reaches 31.3% on ScienceWorld L2, only slightly above DERL’s 30.1%. A curriculum can accelerate mastery of related tasks, but it does not automatically solve unseen task types. The population dynamic is operationally attractive, especially when compute can be reused, but it is not a universal antidote to distribution shift.

The more practical reading is this: DERL-pop may be valuable when the target environment has a stable progression of task difficulty or repeated workflow families. It may be less transformative when deployment demands transfer into genuinely new task categories. That difference matters for enterprise use. Customer-service workflow variants are not the same as a newly introduced regulatory procedure. One is variation; the other is sometimes a different animal with a similar name badge.

The business value is adaptive incentive design for agents

What does this paper directly show?

It shows that a learnable Meta-Optimizer can generate reward compositions from simple primitives, train inner policies using those rewards, and improve performance across several benchmark domains. It shows particularly strong OOD gains in ALFWorld and meaningful gains in ScienceWorld and mathematical reasoning. It also shows that the outer loop behaves progressively, that stable reward structures become more common, and that DERL beats prompt-only refinement and compute-controlled random search in the tested setups.

What can Cognaptus reasonably infer for business use?

The clearest inference is that enterprise agent reliability may depend less on writing one perfect reward and more on building a reward-optimization layer. In long-horizon workflows, teams can define primitive signals that reflect observable progress: task completion, constraint satisfaction, tool-call validity, intermediate-state quality, escalation correctness, document consistency, or verification results. A DERL-like system could then learn how those signals should be weighted and composed for different workflow families.

That has obvious relevance for:

tool-using enterprise agents;
scientific or engineering simulation assistants;
compliance workflows with verifiable intermediate checks;
coding and formal verification assistants;
operations agents that must complete multi-step procedures;
educational or training agents where final correctness is too delayed to guide learning efficiently.

The ROI story is not “training becomes free.” It does not. The business value is better diagnosis and adaptation. Instead of debating whether a failure came from the base model, prompt, tool API, task decomposition, or reward proxy, a DERL-style workflow gives teams a structured way to test reward designs against validation performance.

A useful enterprise version would probably look less like a research training loop and more like a reward governance system:

Enterprise layer	DERL-inspired role	Business benefit
Primitive library	Defines measurable signals from workflow logs	Keeps reward design auditable
Meta-reward optimizer	Learns compositions for task families	Reduces manual tuning
Validation suite	Measures policy performance under controlled splits	Prevents reward changes from becoming vibes
OOD task bank	Tests unseen workflow categories	Reveals brittle incentives before deployment
Reward registry	Stores selected reward structures and performance history	Supports governance and rollback
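As a rough illustration of the last row, a reward registry can be a small, boring data structure. The class and field names below are hypothetical, and a real system would add persistence, access control, and richer metadata.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class RewardVersion:
    version: int
    weights: Dict[str, float]
    val_score: float
    ood_score: float


@dataclass
class RewardRegistry:
    allowed_primitives: FrozenSet[str]
    versions: List[RewardVersion] = field(default_factory=list)
    promoted: List[int] = field(default_factory=list)  # promotion history

    @property
    def active(self) -> Optional[RewardVersion]:
        return self.versions[self.promoted[-1] - 1] if self.promoted else None

    def register(self, weights: Dict[str, float], val_score: float,
                 ood_score: float) -> RewardVersion:
        """Store a candidate reward, rejecting primitives outside the library."""
        unknown = set(weights) - self.allowed_primitives
        if unknown:
            raise ValueError(f"primitives not in library: {sorted(unknown)}")
        entry = RewardVersion(len(self.versions) + 1, dict(weights), val_score, ood_score)
        self.versions.append(entry)
        return entry

    def promote(self, version: int) -> None:
        if not 1 <= version <= len(self.versions):
            raise KeyError(version)
        self.promoted.append(version)

    def rollback(self) -> Optional[RewardVersion]:
        """Drop the latest promotion and return whatever was active before it."""
        if self.promoted:
            self.promoted.pop()
        return self.active


registry = RewardRegistry(frozenset({"outcome", "tool_valid", "constraint_ok"}))
registry.register({"outcome": 1.0}, val_score=0.61, ood_score=0.30)
registry.register({"outcome": 1.0, "tool_valid": 0.3}, val_score=0.70, ood_score=0.28)
registry.promote(1)
registry.promote(2)
print(registry.rollback())
```

Storing the OOD score next to the validation score is the point: version 2 above looks better in development and slightly worse under shift, which is a decision someone should make on purpose.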

This is where the paper’s mechanism-first framing matters. If we only summarize DERL as “better benchmark scores,” the business takeaway becomes shallow. The real idea is that reward engineering can move from artisanal tweaking to versioned, testable, learnable incentive design.

The deployment boundary is compute, primitives, and safety

DERL is not a plug-and-play recipe for production agent alignment. The paper is clear about several boundaries, and some are more important than the usual “future work” fog.

First, the framework still needs predefined atomic primitives. The authors show that DERL is not very sensitive to primitive removal or even a reversed primitive in their experiments, but that does not mean primitive design is irrelevant. A primitive library defines what the optimizer can see. If no primitive captures a key business constraint, the Meta-Optimizer cannot compose it into the reward. An ingredient list with blind spots does not become cuisine because the chef is trained.

Second, the inner loop is the computational bottleneck. Each outer-loop candidate reward requires training an inner policy and evaluating it. The paper uses parallel rollouts, vLLM for evaluation, and a lightweight 0.5B Meta-Optimizer, but it still acknowledges that the process is resource-intensive. DERL-pop reduces cost by reusing evolved policies, and the authors suggest lighter inner-loop methods or proxy tasks as future efficiency improvements. For most businesses, this matters more than the algorithmic elegance. Compute invoices are very effective at killing poetry.

Third, the evidence is benchmark evidence. ALFWorld, ScienceWorld, GSM8K, and MATH are useful because they allow controlled evaluation. They are not the same as live enterprise environments with messy data, changing procedures, ambiguous user intent, and legal accountability. The paper supports the direction; it does not certify deployment.

Fourth, autonomous reward evolution needs interpretability and safety checks. The paper’s stable-structure analysis is encouraging, but mathematical stability is not the same as value alignment. A reward can be bounded, smooth, and wrong. Production systems would need inspection of generated reward structures, adversarial testing, rollback mechanisms, and human governance around which primitives are allowed.
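One inexpensive inspection step is to probe a generated reward at the extremes of its inputs before it trains anything. The sketch below is an assumption-laden example, not a safety guarantee: it assumes primitives lie in [0, 1], uses arbitrary bounds, and checks only the corners of the input box, which is sufficient for sums and products of primitives but not for arbitrary functions.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence


def audit_reward(reward_fn: Callable[[Dict[str, float]], float],
                 names: Sequence[str], outcome: str = "outcome",
                 lo: float = -1.0, hi: float = 2.0) -> List[str]:
    """Return a list of issues found at the corners of the primitive space."""
    issues = []
    for corner in product((0.0, 1.0), repeat=len(names)):
        values = dict(zip(names, corner))
        r = reward_fn(values)
        if not lo <= r <= hi:
            issues.append(f"reward {r} outside [{lo}, {hi}] at {values}")
        if values[outcome] == 0.0:
            # Same situation, but the task succeeded: reward should not drop.
            succeeded = {**values, outcome: 1.0}
            if reward_fn(succeeded) < r:
                issues.append(f"failure pays more than success at {values}")
    return issues


names = ["outcome", "format"]
print(audit_reward(lambda v: v["outcome"] + 0.3 * v["format"], names))
print(audit_reward(lambda v: v["format"] - 0.5 * v["outcome"], names))
```

The second reward is bounded and smooth, and it still pays the agent to fail. Checks like this catch the crude cases; they do not replace adversarial testing or human review of which primitives are allowed.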

The correct practical stance is neither fear nor hype. DERL is a research framework that makes reward discovery more learnable. Turning that into enterprise infrastructure requires engineering around compute, validation design, primitive governance, and failure monitoring.

The strategic lesson: reward design becomes a product surface

For AI product builders, the paper points to a broader shift. As agents become more operational, prompts are no longer the only product surface. Memory policies, tool permissions, evaluation suites, orchestration rules, and reward functions become part of the product architecture. DERL adds one more layer: the system that improves the reward logic itself.

That changes how teams should think about agent development.

The early phase of an agent project asks, “Can the model complete the task?” The later phase asks, “Can the system keep improving without corrupting its own incentives?” DERL belongs to the second question. It is less about making the first demo look clever and more about giving the optimization process something better than sparse applause and handcrafted superstition.

The paper’s most useful contribution is therefore not the highest number in Table 1, although some of those numbers are impressive. It is the mechanism: reward design can be structured, validated, and learned through an outer loop. The agent learns under a reward; the reward designer learns from the agent’s validation performance.

That is why the title practically writes itself. The reward learns back.

And if that sounds like evolution with gradients, that is because this time, the evolutionary process is not just throwing mutations at the wall and calling the stains innovation. It is learning which walls matter.

Cognaptus: Automate the Present, Incubate the Future.

¹ Sitao Cheng, Tianle Li, Xuhan Huang, Xunjian Yin, and Difan Zou, “Differentiable Evolutionary Reinforcement Learning,” arXiv:2512.13399v2, 13 May 2026, https://arxiv.org/abs/2512.13399.
