Opening — Why this matters now
Reinforcement learning has always had an uncomfortable secret: most of the intelligence is smuggled in through the reward function. We talk about agents learning from experience, but in practice, someone—usually a tired engineer—decides what “good behavior” numerically means. As tasks grow longer-horizon, more compositional, and more brittle to specification errors, this arrangement stops scaling.
The paper Differentiable Evolutionary Reinforcement Learning (DERL) arrives at precisely this fault line. Its claim is not incremental. It argues that reward functions themselves should be learned, not handcrafted, and that evolutionary search does not need to be blind. Evolution, the authors suggest, can learn gradients too.
Background — From reward hacking to the Bitter Lesson
The reward-design problem is well known. Sparse outcome rewards are too weak to guide complex reasoning. Dense human-designed rewards invite reward hacking. RLHF scales poorly and embeds human priors that age badly, echoing Sutton's Bitter Lesson: approaches built on hand-encoded human knowledge are eventually overtaken by general methods that scale with computation.
Evolutionary approaches tried to escape this trap by mutating rewards, prompts, or agent configurations. But classical evolution treats reward design as a black box: mutate, evaluate, select, repeat. No causal structure is learned. No direction is remembered. The system explores, but it does not understand why something worked.
DERL positions itself as a bridge between evolutionary search and gradient-based learning—without pretending that the full system is end-to-end differentiable.
Analysis — What DERL actually does
At its core, DERL is a bi-level optimization framework:
- Inner loop: A policy model is trained under a candidate reward function.
- Outer loop: A Meta-Optimizer (an LLM) evolves the reward function itself, using the validation performance of the inner policy as its learning signal.
The key shift is this: the outer loop is itself trained with reinforcement learning. Validation accuracy is treated as the reward, and over time the Meta-Optimizer learns which reward structures reliably produce better agents.
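To make the loop concrete, here is a minimal sketch in Python. The helpers `propose_reward`, `train_policy`, and `evaluate` are hypothetical stand-ins for the Meta-Optimizer, the inner-loop RL training, and the validation rollouts; none of them come from the paper's code, and only the bi-level structure matters here.

```python
import random

def propose_reward(history):
    """Outer loop: the Meta-Optimizer (an LLM in the paper) emits a candidate
    reward program, collapsed here to two weights for readability."""
    return {"w_outcome": random.uniform(0.5, 1.0),
            "w_format": random.uniform(0.0, 0.5)}

def train_policy(reward_program):
    """Inner loop: train a policy under the candidate reward (stubbed out)."""
    return {"reward_program": reward_program}

def evaluate(policy):
    """Validation performance of the inner policy: the outer loop's only signal."""
    w = policy["reward_program"]
    return 0.6 * w["w_outcome"] + 0.1 * w["w_format"] + random.gauss(0, 0.02)

history = []
for generation in range(10):
    reward_program = propose_reward(history)   # outer loop proposes a reward
    policy = train_policy(reward_program)      # inner loop trains under it
    score = evaluate(policy)                   # validation accuracy
    history.append((reward_program, score))    # the signal the Meta-Optimizer learns from

best = max(history, key=lambda pair: pair[1])
print(best[0], round(best[1], 3))
```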
Structured rewards, not free-form text
DERL avoids an intractable search space by restricting rewards to symbolic compositions of atomic primitives:
- Outcome correctness
- Format validity
- Partial process signals
- Temporal segments of trajectories
Instead of emitting scalar scores, the Meta-Optimizer generates reward programs: weighted, compositional structures built from these primitives. This design forces learning at the level of structure, not surface text.
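As a rough illustration (not the paper's API), a reward program can be thought of as a small typed object: a weighted sum of primitive scoring functions applied to a trajectory. The `Trajectory` fields and primitive names below are assumptions chosen for readability.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Trajectory:
    # Illustrative fields; the paper's primitives operate on richer trajectories.
    answer_correct: bool
    format_valid: bool
    steps_verified: float  # fraction of intermediate steps that check out

def outcome(t: Trajectory) -> float:
    return 1.0 if t.answer_correct else 0.0

def format_ok(t: Trajectory) -> float:
    return 1.0 if t.format_valid else 0.0

def process(t: Trajectory) -> float:
    return t.steps_verified

@dataclass
class RewardProgram:
    # A weighted composition of primitives: (weight, primitive) pairs.
    terms: List[Tuple[float, Callable[[Trajectory], float]]]

    def __call__(self, t: Trajectory) -> float:
        return sum(w * f(t) for w, f in self.terms)

# One candidate the Meta-Optimizer might emit: mostly outcome, lightly shaped.
program = RewardProgram(terms=[(0.7, outcome), (0.1, format_ok), (0.2, process)])
print(program(Trajectory(answer_correct=True, format_valid=True, steps_verified=0.5)))  # 0.9
```

Restricting candidates to this space is what keeps the outer loop's search tractable: every proposal is a syntactically valid composition of known primitives rather than arbitrary text.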
Why this is “differentiable” evolution
Nothing here magically backpropagates through environments. The differentiability is conceptual, not literal.
Because the Meta-Optimizer is a parameterized policy trained via RL, changes in validation performance induce policy-gradient updates. Over many outer-loop iterations, the system approximates a meta-gradient: how changes in reward structure affect downstream task success.
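One way to picture this, under heavy simplification: treat the Meta-Optimizer as a softmax policy over a handful of reward templates and update it with REINFORCE, using validation accuracy as the return. The templates, the noise model, and the learning rate below are all invented for the toy example; the paper's Meta-Optimizer is an LLM, not a three-way categorical.

```python
import numpy as np

templates = ["outcome_only", "outcome+format", "outcome+process"]
theta = np.zeros(len(templates))            # meta-policy logits
true_value = np.array([0.55, 0.62, 0.70])   # hidden validation accuracy per template

rng = np.random.default_rng(0)
baseline = 0.0
for step in range(500):
    probs = np.exp(theta - theta.max()); probs /= probs.sum()
    k = rng.choice(len(templates), p=probs)          # sample a reward structure
    val_acc = true_value[k] + rng.normal(0, 0.05)    # noisy inner-loop result
    advantage = val_acc - baseline                   # validation accuracy as reward
    grad = -probs; grad[k] += 1.0                    # gradient of log pi(k)
    theta += 0.5 * advantage * grad                  # policy-gradient step
    baseline = 0.9 * baseline + 0.1 * val_acc        # running baseline

print(templates[int(np.argmax(theta))])  # typically converges to the best structure
```

The point is not this particular estimator; it is that the outer loop has parameters and a return signal, so each generation's result updates a persistent policy rather than vanishing after selection.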
Evolution stops being amnesiac.
Findings — Performance, robustness, and structure
1. State-of-the-art results where heuristics fail
DERL is evaluated on:
- ALFWorld (embodied agents)
- ScienceWorld (scientific reasoning)
- GSM8k / MATH (mathematical reasoning)
Across the board, DERL outperforms:
- Pure outcome rewards
- Averaged heuristic rewards
- Prior structured reward methods
The gap is most dramatic in out-of-distribution settings, where heuristic rewards collapse and DERL remains stable.
| Domain | Key Advantage | Observation |
|---|---|---|
| Embodied (ALFWorld) | OOD robustness | >2× success over outcome-only rewards |
| Science (ScienceWorld) | Curriculum emergence | Population-based rewards adapt over time |
| Math (GSM8k / MATH) | Signal fidelity | Improves accuracy without reward hacking |
2. Reward structures evolve toward stability
One of the paper’s most revealing analyses tracks the types of reward programs DERL generates over time.
Early on, the system explores unstable constructions—multiplicative chains that zero out easily or explode numerically. As training progresses, these disappear. Linear, normalized, bounded reward structures dominate.
No human constraint enforces this. The Meta-Optimizer discovers that numerically stable rewards are a prerequisite for reliable learning.
Evolution, left alone, becomes conservative.
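A tiny numeric illustration of why this happens (all values invented): a multiplicative chain collapses to zero the moment any single primitive does, while a normalized weighted sum keeps carrying usable signal.

```python
# Primitive scores for a trajectory that fails on outcome but is well-formed
# and mostly verified (values invented for the example).
signals = {"outcome": 0.0, "format": 1.0, "process": 0.8}

multiplicative = 1.0
for value in signals.values():
    multiplicative *= value        # one zero wipes out the whole reward -> 0.0

weights = {"outcome": 0.6, "format": 0.1, "process": 0.3}
linear = sum(weights[name] * signals[name] for name in signals)  # -> 0.34

print(multiplicative, linear)
```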
Implications — What DERL changes conceptually
DERL reframes reward design as:
- A learning problem, not an engineering chore
- A structural discovery process, not a scalar regression task
- A computational substitute for human priors, not an assistant to them
For agent builders, the message is blunt: if you are still hand-tuning reward weights, you are doing artisanal labor on an industrial-scale problem.
At the same time, DERL is honest about its costs. Bi-level optimization is expensive. The choice of atomic primitives still matters. Long-horizon credit assignment remains hard.
But the direction is clear.
Conclusion — Evolution grows a memory
Differentiable Evolutionary Reinforcement Learning does not just improve benchmarks. It alters the conceptual contract between agents and objectives.
Rewards no longer sit outside the learning system as static laws. They become adaptive artifacts—shaped, tested, and refined by experience. Evolution stops guessing. It starts learning why.
This is not the end of reward engineering. It is its quiet demotion.
Cognaptus: Automate the Present, Incubate the Future.