Opening — Why this matters now

Humanoid robots can now run, jump, and occasionally impress investors. What they still struggle with is something more mundane: noticing the stairs before falling down them.

For years, reinforcement learning (RL) has delivered impressive locomotion demos—mostly on flat floors. The uncomfortable truth is that many of these robots are, functionally speaking, blind. They walk well only because the ground behaves politely. Once the terrain becomes uneven, discontinuous, or adversarial, performance collapses.

The paper behind E‑SDS (Environment‑aware See it, Do it, Sorted) argues that this failure has less to do with policies and more to do with rewards. Specifically: reward functions have not been allowed to see the environment they are supposed to care about.

That turns out to be a rather expensive oversight.

Background — The two halves that never met

Modern humanoid locomotion research has been progressing along two largely separate tracks.

On one side, automated reward generation. Vision‑language models (VLMs) can now write reward code from natural language instructions or video demonstrations. This has dramatically reduced human effort. Unfortunately, these rewards are typically generated in isolation from the robot’s physical surroundings. The resulting policies are elegant, obedient—and oblivious.

On the other side, perceptive locomotion. Here, robots are given height maps, LiDAR, and other exteroceptive sensors so they can anticipate terrain changes. These systems can walk on stairs and gaps, but only after weeks of careful, manual reward tuning by experts who know exactly which sensor signal should be punished, encouraged, or softly nudged.

The gap is obvious in hindsight: automated rewards don’t perceive; perceptive controllers don’t automate. E‑SDS is an attempt to remove that mutual blindness.

Analysis — What E‑SDS actually does

E‑SDS reframes reward design as an environment‑conditioned code generation problem.

Instead of asking a language model to write a reward purely from a demonstration video, the system injects quantitative terrain statistics directly into the prompt. Before training even begins, hundreds of simulated robots briefly explore the target environment. From this, the system computes simple but informative descriptors (see the sketch after this list):

  • obstacle density
  • gap ratios
  • terrain roughness
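
As a rough illustration of what such a pre-training survey can produce, here is a minimal sketch, assuming the scans arrive as 2D height maps measured relative to the ground plane. The `terrain_stats` helper, the thresholds, and the aggregation step are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def terrain_stats(height_map: np.ndarray,
                  obstacle_thresh: float = 0.10,
                  gap_thresh: float = -0.10) -> dict:
    """Summarize an (H, W) height map, in meters, relative to the ground plane."""
    obstacle_density = float(np.mean(height_map > obstacle_thresh))  # fraction of raised cells
    gap_ratio = float(np.mean(height_map < gap_thresh))              # fraction of depressed cells
    dy, dx = np.gradient(height_map)                                 # local slopes between cells
    roughness = float(np.mean(np.hypot(dx, dy)))                     # mean slope magnitude
    return {"obstacle_density": obstacle_density,
            "gap_ratio": gap_ratio,
            "roughness": roughness}

# Aggregate scans collected by many robots during the brief exploration phase.
scans = [np.random.default_rng(i).normal(0.0, 0.05, size=(20, 20)) for i in range(4)]
summary = {key: float(np.mean([terrain_stats(s)[key] for s in scans]))
           for key in ("obstacle_density", "gap_ratio", "roughness")}
print(summary)
```

A summary like this is small enough to paste into a prompt verbatim, which is the whole point: the language model sees numbers, not pixels.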

These statistics are then fused with a structured analysis of the demonstration video (gait, foot contacts, task intent). The VLM is no longer guessing what matters—it is told, numerically, what kind of ground the robot will face.

The output is executable reward code that explicitly references environmental sensors such as height scanners and LiDAR. In short: the reward function becomes perceptive before the policy does.
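
To make that concrete, the reward term below is an illustrative guess at what environment-aware generated code can look like, not the paper's actual output. Every name here (`reward_step`, the contact proxy, the thresholds and weights) is an assumption; the point is simply that the reward reads height-scan values rather than proprioception alone.

```python
import numpy as np

# Illustrative only: a reward term that "sees" the terrain by combining
# velocity tracking with a penalty for planting feet over scanned gaps.
def reward_step(base_vel_xy, cmd_vel_xy, foot_heights, scan_under_feet,
                gap_thresh=-0.10, w_track=1.0, w_gap=2.0):
    # Velocity tracking: exponential of squared error, a common shaping choice.
    track_err = np.sum((np.asarray(cmd_vel_xy) - np.asarray(base_vel_xy)) ** 2)
    r_track = w_track * np.exp(-track_err / 0.25)
    # Terrain-aware penalty: feet in (approximate) contact above a scanned gap.
    in_contact = np.asarray(foot_heights) < 0.05          # crude contact proxy
    over_gap = np.asarray(scan_under_feet) < gap_thresh   # scan reports missing ground
    r_gap = -w_gap * float(np.sum(in_contact & over_gap))
    return float(r_track + r_gap)

# Example call with two feet, one of them hovering over a gap.
print(reward_step(base_vel_xy=[0.9, 0.0], cmd_vel_xy=[1.0, 0.0],
                  foot_heights=[0.02, 0.30], scan_under_feet=[0.0, -0.25]))
```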

Crucially, E‑SDS does not stop there. Training runs in a closed loop:

  1. Generate multiple candidate reward functions
  2. Train policies under each
  3. Evaluate performance and failure modes
  4. Feed structured feedback back into the next reward generation round

After three iterations, the system converges on a reward function that no human explicitly designed—but that reliably produces environment‑aware behavior.
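
A compact sketch of that loop follows. The helpers passed in (`generate_rewards`, `train_and_evaluate`, `summarize_failures`) are hypothetical stand-ins for the VLM call, RL training, and log analysis; nothing here is the paper's actual API.

```python
# Sketch of the closed-loop reward search described above.
def reward_search_loop(generate_rewards, train_and_evaluate, summarize_failures,
                       n_candidates=4, n_rounds=3):
    feedback = None
    best = None
    for _ in range(n_rounds):
        # 1. Generate several candidate reward functions, conditioned on prior feedback.
        candidates = generate_rewards(feedback, n_candidates)
        # 2-3. Train a policy under each candidate and score the result.
        results = [(cand, train_and_evaluate(cand)) for cand in candidates]
        best = max(results, key=lambda r: r[1]["score"])
        # 4. Convert evaluation outcomes into structured feedback for the next round.
        feedback = summarize_failures(results)
    return best

# Toy usage with dummy stand-ins, just to show the control flow.
if __name__ == "__main__":
    import random
    best = reward_search_loop(
        generate_rewards=lambda fb, n: [f"reward_v{random.randint(0, 99)}" for _ in range(n)],
        train_and_evaluate=lambda cand: {"score": random.random(), "falls": random.randint(0, 5)},
        summarize_failures=lambda results: {"worst": min(results, key=lambda r: r[1]["score"])[0]},
    )
    print("selected:", best[0], "score:", round(best[1]["score"], 3))
```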

Findings — Results that are hard to ignore

Across four terrains of increasing difficulty (simple ground, gaps, obstacles, stairs), E‑SDS was evaluated against two baselines: manually tuned perceptive rewards, and an automated but perception‑blind reward system.

The headline results are unambiguous.

Terrain   | Key Outcome
----------|------------
Simple    | Velocity tracking error reduced by 82.6% vs. the manual baseline
Gaps      | Exploration with far fewer falls
Obstacles | Active navigation instead of conservative freezing
Stairs    | Only E‑SDS successfully descended

The stairs case is the most telling. The manual baseline, despite full sensor access, refused to move. The automated, perception-blind policy walked confidently, straight into repeated falls. E‑SDS was the only approach that learned to descend safely.

Equally important is efficiency. What typically takes days of expert reward tuning was completed in under two hours per terrain.

Implications — Why this matters beyond robotics labs

E‑SDS quietly shifts where intelligence is located in an RL system.

Instead of pushing all complexity into larger policies or richer observations, it invests intelligence upstream—into reward construction itself. This has several broader implications:

  • Scalability: If rewards can adapt to environments automatically, deploying robots across varied sites becomes less artisanal and more industrial.
  • Reliability: Many catastrophic failures stem from rewards that accidentally incentivize pathological shortcuts. Environment‑aware rewards reduce that risk.
  • Transferability: The same idea applies beyond locomotion—to manipulation, navigation, and even non‑robotic domains where context matters but is poorly specified.

There are limits, of course. E‑SDS currently trains one policy per terrain, and everything happens in simulation. Real‑world deployment and multi‑task generalization remain open problems. But the direction is clear.

Conclusion — Rewards are policies, just written earlier

E‑SDS demonstrates a simple but underappreciated truth: reward functions are not bookkeeping tools. They are policy priors, written in code instead of weights.

Once rewards are allowed to perceive the world they judge, a surprising amount of downstream complexity evaporates. Robots stop freezing. They stop guessing. Occasionally, they even manage stairs.

That is not just a robotics result. It is a reminder that in many AI systems, the most important intelligence lives before training begins.

Cognaptus: Automate the Present, Incubate the Future.