When Rewards Learn to See: Teaching Humanoids What the Ground Looks Like

Robots do not fall because the word “walk” is ambiguous.

They fall because the ground has opinions.

A flat floor, a gap, a pile of blocks, and a staircase may all ask for “locomotion,” but they do not ask for the same behavior. One asks for velocity tracking. Another asks for foot placement. Another punishes careless exploration. A staircase, because it has a flair for drama, asks the robot to negotiate gravity one step at a time.

That distinction is the useful center of E-SDS (Environment-aware See it, Do it, Sorted), a paper on automated reward generation for humanoid locomotion.¹ The paper is not merely saying that humanoid policies should receive perceptive sensor inputs. Robotics people already know that. It is making a sharper claim: the reward-generation process itself must understand the terrain.

That sounds like a small architectural adjustment. It is not. It changes what automation is allowed to automate.

The mistake is treating reward generation as if the world were flat

Automated reward generation has become an attractive idea in reinforcement learning because manual reward engineering is slow, brittle, and surprisingly artisanal. A human expert writes a reward function, tunes weights, observes a policy doing something stupid, adjusts the reward, and repeats the ritual until the robot either learns the task or the researcher learns humility.

Vision-language models offer a tempting shortcut. Show the model a video demonstration. Ask it to identify the skill. Generate executable reward code. Train the policy. Refine the reward based on observed failure modes.

That is the “See it, Do it, Sorted” idea from prior work. E-SDS extends it.

The key problem is that a video of a gait does not fully define a locomotion task. A demonstration tells the system something about movement: contact sequence, gait rhythm, posture, target velocity. It does not by itself tell the system what terrain hazards the robot must respond to.

A policy trained from such a reward can be competent in a regular environment and useless in a discontinuous one. It has learned the dance, not the floor.

E-SDS adds a missing layer: before generating the reward, it analyzes the target environment quantitatively. The system deploys 1,000 simulated robots for a short data-collection period, gathers sensor readings, and summarizes the terrain using measures such as obstacle density, gap ratios, and terrain roughness. That environmental summary is then fed into the reward-generation prompt together with the video-derived behavioral analysis.
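
To make the environment-analysis step concrete, here is a minimal sketch of how terrain statistics of this kind could be computed from height-scan samples. The function name `summarize_terrain`, the two thresholds, and the exact definition of each statistic are illustrative assumptions, not the paper's implementation.

```python
from statistics import pstdev


def summarize_terrain(height_scans, gap_depth=-0.3, obstacle_height=0.15):
    """Summarize terrain from height-scan samples (hypothetical definitions).

    height_scans: list of scans; each scan is a list of terrain heights in
    metres, measured relative to the surface the robot stands on.
    """
    samples = [h for scan in height_scans for h in scan]
    if not samples:
        raise ValueError("no height samples collected")
    n = len(samples)
    return {
        # Fraction of samples well below the standing surface.
        "gap_ratio": sum(h < gap_depth for h in samples) / n,
        # Fraction of samples well above the standing surface.
        "obstacle_density": sum(h > obstacle_height for h in samples) / n,
        # Spread of heights, used here as a crude roughness proxy.
        "roughness": pstdev(samples),
    }


# Two made-up scans: mostly flat, two deep gaps, one raised block.
scans = [
    [0.0, 0.0, -0.5, 0.0],
    [0.0, 0.2, 0.0, -0.5],
]
stats = summarize_terrain(scans)
print(stats)
```

The point of the sketch is the shape of the output: a small dictionary of numbers that can be pasted into a prompt, rather than raw sensor data.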

The mechanism can be stated plainly:

Component	What it contributes	Why it matters
Video demonstration analysis	Extracts gait, contacts, posture, and task requirements	Defines the intended behavior
Environment analysis	Computes terrain statistics from sensor data	Defines what the behavior must survive
Reward code generation	Produces executable Python reward functions	Converts interpretation into trainable objectives
PPO policy training	Trains candidate policies in Isaac Lab	Tests whether the reward actually works
Automated feedback and refinement	Identifies failures such as freezing or falling	Updates the reward without manual tuning

The important move is not just adding perception to the policy. The manually designed baseline in the paper also has sensor access. The important move is adding environmental perception to the reward-design loop.

A robot can have eyes and still be rewarded for behaving as if the floor were a spreadsheet.

E-SDS makes the reward writer environment-aware, not just the robot

The paper formulates the task as perceptive humanoid locomotion under partial observability. The Unitree G1 humanoid receives a high-dimensional observation vector containing both proprioceptive state and exteroceptive data. The proprioceptive part covers joint positions, velocities, and base orientation. The exteroceptive part includes processed height-scanner data and LiDAR measurements.

That means the policy can, in principle, see the terrain.

But “in principle” does a lot of unpaid labor here. Sensor access does not automatically produce useful behavior. Reinforcement learning still needs a reward that makes the right use of those inputs. If the reward does not value safe gap negotiation, careful obstacle traversal, or controlled stair descent, the policy may ignore the information, exploit the reward, freeze, or fall forward with confidence. This is robotics, so all four are possible.

E-SDS therefore treats reward generation as conditional code generation. The model is not asked to write a generic locomotion reward. It is asked to write a reward for a behavior in a particular environment, with specific terrain statistics available.
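
One way to picture "conditional code generation" is as prompt assembly. The sketch below joins a behavior summary and a dictionary of terrain statistics into a single request. The prompt wording, the function name `build_reward_prompt`, and the example values are all hypothetical; the paper's actual prompts are not reproduced here.

```python
def build_reward_prompt(skill_summary, terrain_stats):
    """Combine behaviour and terrain context into one prompt (hypothetical format)."""
    stat_lines = "\n".join(
        f"- {name}: {value:.3f}" for name, value in sorted(terrain_stats.items())
    )
    return (
        "Write a Python reward function for humanoid locomotion.\n"
        f"Demonstrated behaviour:\n{skill_summary}\n"
        f"Terrain statistics:\n{stat_lines}\n"
        "Include reward terms that address the terrain hazards listed above."
    )


# Example values are invented for illustration.
prompt = build_reward_prompt(
    "forward walking gait with alternating foot contacts",
    {"gap_ratio": 0.25, "obstacle_density": 0.125, "roughness": 0.24},
)
print(prompt)
```

Remove the terrain block from that prompt and the model is back to writing a generic locomotion reward, which is the situation the Foundation-Only ablation probes.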

This is the mechanism-first reason the paper is more interesting than a simple “VLM improves robotics” story. The VLM is not magical fairy dust sprinkled over reinforcement learning. It is being used as an automated reward engineer whose prompt now includes facts about the world the policy must operate in.

The pipeline runs for three refinement iterations. In each iteration, the system generates candidate reward functions, trains policies with PPO in massively parallel simulation, evaluates the policies using quantitative metrics and rollout footage, and sends feedback into the next reward-generation round. The full process takes about 99 minutes per terrain.
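
The loop described above can be sketched as a small control structure. Everything here is a stand-in: `refine_rewards` and the three toy callables are hypothetical names, the number of candidates per round is an assumption, and the real steps would call a VLM, PPO training, and a simulator.

```python
def refine_rewards(generate, train, evaluate, iterations=3, candidates=2):
    """Generate, train, evaluate, and feed back for a fixed number of rounds."""
    feedback = None
    best = None  # (score, reward_code)
    for _ in range(iterations):
        round_results = []
        for reward_code in generate(feedback, candidates):
            policy = train(reward_code)
            score, notes = evaluate(policy)
            round_results.append((score, reward_code, notes))
        top = max(round_results, key=lambda r: r[0])
        if best is None or top[0] > best[0]:
            best = (top[0], top[1])
        # Notes from the best candidate steer the next generation round.
        feedback = top[2]
    return best


# Toy stand-ins so the sketch runs end to end.
def toy_generate(feedback, n):
    base = 0 if feedback is None else feedback["next_base"]
    return [f"reward_v{base + i}" for i in range(n)]


def toy_train(reward_code):
    return {"trained_with": reward_code}


def toy_evaluate(policy):
    version = int(policy["trained_with"].rsplit("v", 1)[1])
    return float(version), {"next_base": version + 1}


best_score, best_code = refine_rewards(toy_generate, toy_train, toy_evaluate)
print(best_score, best_code)
```

The design choice worth noticing is that evaluation returns notes as well as a score. A score alone says which reward won; the notes are what let the next round fix a specific failure.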

That number matters. The business relevance is not “robots are now solved.” Please. The relevance is that a costly expert workflow—manual reward design and tuning—may become partly automatable for terrain-specific simulation training.

The unit of value is not only better locomotion. It is faster diagnosis.

The experiments test three different claims, not one big victory lap

The paper evaluates E-SDS in Isaac Lab on four terrain types: simple bumps, gaps, obstacles, and stairs. It compares three policies:

Policy	Reward generation	Policy perception	What it tests
E-SDS	Automated and environment-aware	Height scanner and LiDAR	Main proposed system
Foundation-Only	Automated but not environment-aware	Proprioception only	Ablation of environment awareness
Manual Baseline	Human-designed reward terms	Height scanner and LiDAR	Comparison with perceptive manual reward engineering

This comparison structure is useful because it separates two ideas that readers may casually merge.

One idea is automated reward generation. Another is perceptive locomotion. E-SDS argues that useful automation for complex terrain needs both. The Foundation-Only policy tests what happens when reward automation lacks environmental awareness. The Manual Baseline tests whether sensor access plus hand-designed perceptive rewards is enough.

The answer is: sometimes stable, often conservative, and not necessarily task-effective.

Stairs expose the difference between seeing and using what is seen

The stair experiment is the clearest case because it produces a behavioral split rather than a mild metric difference.

On stair descent, E-SDS is the only policy that successfully learns to descend the stairs. It achieves zero torso contacts and a velocity tracking error of 0.663 m/s. The Manual Baseline also records zero torso contacts, but mainly by remaining stationary at the top of the stairs, with a low exploration score of 3.495 and a velocity tracking error of 2.278 m/s. The Foundation-Only policy moves forward but falls badly, with a torso contact rate of 333.466.

Metric on stair terrain	E-SDS	Foundation-Only	Manual Baseline
Locomotion quality	0.412	0.342	0.393
Exploration score	10.930	11.109	3.495
Torso contact rate	0.000	333.466	0.000
Velocity tracking error (m/s)	0.663	0.727	2.278

This table needs careful reading. Foundation-Only has a high exploration score, slightly higher than E-SDS. That does not mean it performs better. It means it covers area while falling. In locomotion evaluation, movement without safety is not courage. It is just gravity with extra steps.

The Manual Baseline is the opposite failure mode. It is safe because it is inactive. The policy avoids catastrophic contact, but it does not solve the task. This is a common pathology in reward design: if the penalty for failure is strong and the incentive for progress is weak or poorly shaped, the robot discovers that doing nothing is a respectable career.
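
A toy calculation shows why freezing can be the return-maximizing choice. The reward form, the weights, the 3 m distance, and the 30% fall probability are invented for illustration and do not come from the paper; the model also simplifies by crediting the full distance whether or not the robot falls.

```python
def expected_return(progress_m, fall_prob, progress_weight, fall_penalty):
    """Expected episode return for a reward of the form
    progress_weight * distance - fall_penalty * (1 if fell else 0)."""
    return progress_weight * progress_m - fall_penalty * fall_prob


# Weak progress incentive, strong fall penalty (hypothetical weights).
stand_still = expected_return(0.0, 0.0, progress_weight=1.0, fall_penalty=20.0)
descend = expected_return(3.0, 0.3, progress_weight=1.0, fall_penalty=20.0)

# Same penalty, stronger progress incentive: the ranking flips.
stand_reshaped = expected_return(0.0, 0.0, progress_weight=5.0, fall_penalty=20.0)
descend_reshaped = expected_return(3.0, 0.3, progress_weight=5.0, fall_penalty=20.0)

print("weak incentive:  ", stand_still, descend)
print("strong incentive:", stand_reshaped, descend_reshaped)
```

Under the first weighting, standing at the top of the stairs beats attempting the descent. Nothing about the robot changed between the two cases, only the reward.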

E-SDS avoids both traps in this setting. It moves and remains upright. That combination is the real result.

This experiment most likely serves as the paper’s main evidence. It supports the central claim that environment-aware reward generation can produce behaviors that neither perception-blind automation nor manually tuned perceptive rewards achieve in this setup. It does not prove real-world stair descent, multi-terrain generalization, or deployment readiness.

Gaps and obstacles show the cost of conservative rewards

The gap and obstacle terrains give a subtler lesson. E-SDS does not simply dominate every safety metric. Instead, it trades some risk for active navigation.

On gap terrain, E-SDS has a velocity tracking error of 0.660 m/s and an exploration score of 10.886. The Manual Baseline has a worse velocity tracking error of 1.373 m/s and a lower exploration score of 6.447, but also a lower torso contact rate: 1.492 compared with 5.136 for E-SDS. The Foundation-Only policy posts the lowest velocity tracking error of the three (0.577 m/s), yet it explores the least (5.170) and performs poorly on safety, with a torso contact rate of 140.476.

Metric on gap terrain	E-SDS	Foundation-Only	Manual Baseline
Velocity tracking error (m/s)	0.660	0.577	1.373
Exploration score	10.886	5.170	6.447
Torso contact rate	5.136	140.476	1.492

The paper interprets the Manual Baseline as adopting a conservative avoidance strategy, while E-SDS actively navigates the terrain. That interpretation is plausible because the baseline stays safer but explores less and tracks velocity worse.

For business readers, this is more important than the headline performance claim. An automated system can generate a policy that is more operationally useful even if one isolated safety metric is not always best. The right question is not “Which number is lower?” The right question is “Which policy solves the operational task under acceptable risk?”
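
That question can be written down as a selection rule: filter on risk and task coverage first, then rank on tracking error. The metrics below are copied from the gap-terrain table; the function `pick_policy` and both thresholds are hypothetical and would have to come from the operator, not the paper.

```python
# Gap-terrain metrics copied from the table above.
GAP_RESULTS = {
    "E-SDS": {"vel_err": 0.660, "exploration": 10.886, "torso_contact": 5.136},
    "Foundation-Only": {"vel_err": 0.577, "exploration": 5.170, "torso_contact": 140.476},
    "Manual Baseline": {"vel_err": 1.373, "exploration": 6.447, "torso_contact": 1.492},
}


def pick_policy(results, max_torso_contact, min_exploration):
    """Keep policies within the risk and coverage limits, then pick the
    one with the lowest velocity tracking error. Returns None if none qualify."""
    acceptable = {
        name: m
        for name, m in results.items()
        if m["torso_contact"] <= max_torso_contact
        and m["exploration"] >= min_exploration
    }
    if not acceptable:
        return None
    return min(acceptable, key=lambda name: acceptable[name]["vel_err"])


# Illustrative limits only.
choice = pick_policy(GAP_RESULTS, max_torso_contact=10.0, min_exploration=6.0)
strict_choice = pick_policy(GAP_RESULTS, max_torso_contact=2.0, min_exploration=6.0)
print(choice, strict_choice)
```

Tighten the contact limit enough and the conservative baseline becomes the only acceptable option. Which policy is "best" depends on where the risk line is drawn, which is exactly why a single metric is a poor headline.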

The obstacle terrain has a similar pattern, though E-SDS also improves torso contact relative to the Manual Baseline:

Metric on obstacle terrain	E-SDS	Foundation-Only	Manual Baseline
Velocity tracking error (m/s)	0.492	0.621	2.058
Exploration score	7.825	5.873	5.870
Torso contact rate	37.920	316.980	46.130

Here, E-SDS improves velocity tracking, exploration, and torso contact rate relative to the Manual Baseline. The Foundation-Only policy again shows the weakness of perception-blind automation on cluttered terrain.

These experiments provide the main comparative evidence and also support a behavioral interpretation. They show not only that E-SDS scores better on several metrics, but that the learned strategy differs: active navigation instead of freezing, falling, or avoiding the task.

The simple terrain result says the mechanism is not only for dramatic terrain

On the simple terrain, E-SDS achieves a velocity tracking error of 0.387 m/s, compared with 0.549 for Foundation-Only and 2.225 for the Manual Baseline. Exploration scores are almost identical for E-SDS and the Manual Baseline, and both have zero torso contacts.

Metric on simple terrain	E-SDS	Foundation-Only	Manual Baseline
Velocity tracking error (m/s)	0.387	0.549	2.225
Exploration score	6.898	6.541	6.895
Torso contact rate	0.000	1.584	0.000

The simple-terrain result is useful because it prevents a too-narrow reading of the paper. E-SDS is not only a hazard-avoidance add-on for exotic terrain. Even when the environment is comparatively easy, automated reward generation plus iterative refinement improves command following.

But this result should not be overread. Simple terrain is still simulated terrain, using a specific humanoid platform and task setup. It supports the claim that the pipeline can improve reward quality under controlled conditions. It does not support a universal claim that automated rewards will outperform human-designed rewards across all locomotion tasks.

That is the boring sentence. Unfortunately, boring sentences are often where correct interpretation lives.

The ablation is the paper’s hinge, not a decorative appendix

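The direct ablation compares E-SDS with Foundation-Only. Its likely purpose is to test whether environmental awareness is necessary, although it removes two things at once: the environment analysis in reward generation and the policy’s exteroceptive inputs.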

The Foundation-Only policy uses automated reward generation without environmental analysis and operates without exteroceptive perception. Across complex terrains, this policy has much higher torso contact rates: 140.476 on gaps, 316.980 on obstacles, and 333.466 on stairs. E-SDS records 5.136, 37.920, and 0.000 respectively.

Terrain	E-SDS torso contact rate	Foundation-Only torso contact rate	Interpretation
Gaps	5.136	140.476	Blind automation fails on discontinuous ground
Obstacles	37.920	316.980	Terrain hazards require perceptive behavior
Stairs	0.000	333.466	Stair descent collapses without environment awareness

This is the paper’s hinge because it addresses the likely misconception directly. A reader may think: “Fine, use VLMs to generate rewards, and give the policy sensors. Done.”

Not quite.

The paper’s claim is that the reward generator must know what the sensors are for. Environment statistics steer the generated reward code toward terrain-relevant terms. Without that conditioning, the automated system can generate a reward that looks reasonable but fails to produce robust behavior where the ground becomes discontinuous or cluttered.
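
As an illustration of what "terrain-relevant terms" might mean in generated code, here is a toy reward whose hazard weight scales with a measured gap ratio. It is a hypothetical sketch: the state keys, weights, and the 0.1 m clearance margin are invented, and this is not a reward produced by E-SDS.

```python
import math


def terrain_conditioned_reward(state, terrain_stats):
    """Toy reward whose gap-hazard weight scales with the measured gap ratio."""
    # Velocity tracking: 1.0 at zero error, decaying as the error grows.
    vel_error = state["commanded_vel"] - state["base_vel"]
    tracking = math.exp(-(vel_error ** 2))
    # Penalise stepping close to a gap edge more heavily when gaps are common.
    gap_weight = 5.0 * terrain_stats["gap_ratio"]
    edge_penalty = gap_weight * max(0.0, 0.1 - state["foot_edge_clearance"])
    # Penalise torso contact regardless of terrain.
    contact_penalty = 2.0 if state["torso_contact"] else 0.0
    return tracking - edge_penalty - contact_penalty


# Same robot state, two different terrains (all values hypothetical).
state = {
    "commanded_vel": 1.0,
    "base_vel": 1.0,
    "foot_edge_clearance": 0.02,
    "torso_contact": False,
}
flat = terrain_conditioned_reward(state, {"gap_ratio": 0.0})
gappy = terrain_conditioned_reward(state, {"gap_ratio": 0.25})
print(flat, gappy)
```

The same footstep is free on flat ground and costly near a gap. A reward writer that never sees the gap ratio has no reason to include the edge term at all.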

This is also why the Manual Baseline matters. It shows the inverse problem. A policy can have sensor access and manual perceptive rewards, yet still learn conservative or ineffective strategies. E-SDS is valuable because it closes the loop between environment analysis, reward code generation, training, evaluation, and refinement.

The automation is not one shot. It is corrective.

What businesses should actually take from this

The near-term business meaning is not “buy humanoid robots because VLMs can write rewards now.” That would be adorable, in the way a balance sheet made of confetti is adorable.

The practical implication is narrower and more useful: robotics development teams may be able to automate part of the reward-engineering workflow for terrain-specific simulated training.

That matters because reward engineering is a bottleneck. It consumes expert time, slows skill development, and makes each new terrain or behavior expensive to support. E-SDS suggests a workflow where the system can generate, test, and refine terrain-conditioned rewards in under two hours per terrain (about 99 minutes in the paper’s runs), instead of requiring days of manual tuning.

A responsible business interpretation looks like this:

Paper result	Direct meaning	Cognaptus business inference	Boundary
E-SDS conditions reward generation on terrain statistics	Rewards are generated with environment context	Reward design can become a terrain-aware automation task	Current policies are terrain-specialized
E-SDS reduces velocity tracking error by 51.9–82.6% relative to the Manual Baseline	Better command-following in tested simulation settings	Faster iteration may improve robotics R&D productivity	Not proven on hardware
Stair descent succeeds only under E-SDS	Environment-aware reward generation enables a difficult behavior in simulation	Automated reward loops may unlock tasks manual rewards fail to shape well	Stair descent result is specific to this setup
Foundation-Only falls frequently on complex terrains	Perception-blind automation is insufficient	VLM reward generation needs structured sensor/environment context	Does not isolate every possible perceptive reward variant
Pipeline takes about 99 minutes per terrain	Automated loop is operationally feasible in simulation	Useful for internal training pipelines and rapid prototyping	Still requires initial prompt setup and compute infrastructure
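
The 51.9–82.6% range in the table can be checked against the per-terrain velocity tracking errors reported earlier. The short calculation below uses only those published numbers; the helper name `percent_reduction` is just for this example.

```python
# Velocity tracking error (m/s) per terrain: (E-SDS, Manual Baseline).
VEL_ERR = {
    "simple": (0.387, 2.225),
    "gaps": (0.660, 1.373),
    "obstacles": (0.492, 2.058),
    "stairs": (0.663, 2.278),
}


def percent_reduction(new, baseline):
    """Relative reduction of `new` versus `baseline`, in percent."""
    return 100.0 * (baseline - new) / baseline


reductions = {
    terrain: round(percent_reduction(esds, manual), 1)
    for terrain, (esds, manual) in VEL_ERR.items()
}
print(reductions)
print(min(reductions.values()), max(reductions.values()))
```

The smallest reduction comes from the gap terrain and the largest from the simple terrain, so the headline range is bounded by the terrain where the baseline was most competitive and the one where it was least.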

The ROI pathway is therefore indirect but real. A company working on legged robots, warehouse mobility, inspection robots, or terrain-adaptive autonomy may not deploy E-SDS as-is. But it can learn from the architecture: make the reward-design loop inspect the operating environment before asking an AI system to write objectives.

This is also relevant beyond humanoids. Any embodied AI system that learns through reinforcement learning faces the same design issue: the task objective is incomplete without context. A navigation robot in a hospital, a quadruped on industrial stairs, or a manipulation system in cluttered storage all need reward functions that encode what the environment makes difficult.

The slogan version: do not automate the reward writer while keeping it blind.

The limitations are about transfer, specialization, and setup

The paper is careful about several boundaries, and they materially affect business use.

First, the evaluation is simulation-only. Isaac Lab is useful precisely because it allows massively parallel training and controlled comparison. But successful simulation policies do not automatically become deployable hardware policies. Closing the sim-to-real gap remains a major next step.

Second, E-SDS currently generates specialized policies for each terrain. That is acceptable for a research demonstration and potentially useful for controlled deployment domains, but it does not yet solve mixed real-world environments where the robot must generalize across terrain types without retraining per terrain.

Third, the pipeline is automated after setup, but not free of human involvement. The initial prompt setup still matters. That means the system reduces expert labor; it does not eliminate system design.

Fourth, the baseline comparisons are informative but not final. A manually designed reward baseline can always be challenged: was it tuned enough, was it the best possible manual reward, could a different expert produce a better one? That does not invalidate the paper. It simply means the correct claim is comparative within the paper’s setup, not universal superiority over all human reward engineering.

Finally, the results focus on a Unitree G1 humanoid across four simulated terrains. The mechanism may generalize, but the evidence is platform- and environment-specific.

These limitations do not make the paper weak. They make the paper useful. A result with boundaries can be used. A result without boundaries is usually a pitch deck trying to escape supervision.

The real contribution is teaching the automation where to look

E-SDS is best read as a mechanism paper about context-aware automation.

The system does not merely ask a VLM to imitate a video. It asks the VLM to generate reward code after seeing both the desired behavior and the terrain structure. It then trains policies, evaluates failures, and feeds those failures back into the next reward-generation iteration.

That is why the stair result matters, but it is not the whole story. The deeper point is that automated reward generation becomes more powerful when it is grounded in the environment that will punish the robot.

For robotics teams, the lesson is practical: the next productivity gain may not come from replacing every human reward engineer with a model. It may come from building better loops around the model—loops that collect environment statistics, expose failure modes, and force the reward generator to respond to the actual operating context.

A humanoid does not need a reward that says “walk nicely.”

It needs a reward that knows where the floor stops.

Cognaptus: Automate the Present, Incubate the Future.

¹ Enis Yalcin, Joshua O’Hara, Maria Stamatopoulou, Chengxu Zhou, and Dimitrios Kanoulas, “E-SDS – Environment-aware See it, Do it, Sorted: Automated Environment-Aware Reinforcement Learning for Humanoid Locomotion,” arXiv:2512.16446, 2025.
