Opening — Why this matters now

Reinforcement learning has spent the last decade mastering games, simulations, and neatly bounded optimization problems. Reality, inconveniently, is none of those things. In robotics, autonomous vehicles, industrial automation, and any domain where mistakes have real-world consequences, almost safe is simply unsafe.

Yet most “safe RL” methods quietly rely on a compromise: allow some violations, average them out, and hope the system behaves. This paper refuses that bargain. It treats safety as a hard constraint, not a tunable preference—and then asks an uncomfortable question: can we still learn anything useful?

Background — The false choice between safety and performance

Classic constrained RL methods fall into two camps:

  1. Lagrangian approaches that trade reward against penalties. These tend to oscillate, over-correct, or collapse into timid policies once penalties grow large.
  2. Projection and trust-region methods (notably CPO) that enforce feasibility at every step. They are safe—but often so conservative that learning effectively stalls.

Both approaches implicitly assume that safety and reward must be optimized separately, or at least sequentially. When constraints are violated, reward learning is paused. When reward is pursued, safety waits its turn.

The result is familiar to anyone who has trained agents in Safety Gymnasium: policies that either behave recklessly or hug the walls forever.

Analysis — What SB-TRPO actually changes

The core idea of Safety-Biased Trust Region Policy Optimisation (SB-TRPO) is deceptively simple: never stop learning reward, but never stop reducing risk either.

Instead of switching between “reward mode” and “recovery mode,” SB-TRPO performs every update inside a trust region using a convex combination of two natural gradients:

  • a reward-improving direction
  • a cost-reducing (safety-improving) direction

A single parameter, the safety bias β, determines how much of the update budget must be spent on reducing cost. Crucially, this requirement is defined relative to the best possible cost reduction within the trust region, not as an absolute threshold.

In other words: the algorithm always makes progress toward safety, but it only insists on a fraction of the maximum possible safety gain, leaving the rest of the trust-region budget for reward improvement even when the two gradient directions pull against each other.
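In symbols (our notation, reconstructing the mechanism from the description above rather than quoting the paper), the step is a blend of the two natural-gradient directions, and β sets a floor on how much cost reduction the blend must deliver relative to the best achievable inside the trust region:

```latex
% Sketch in our notation; the paper's exact formulation may differ.
% d_r, d_c    : natural-gradient directions for improving reward and reducing cost
% \Delta_C(d) : predicted cost reduction along step d
% \delta      : KL trust-region radius
\begin{aligned}
d(\alpha) &= (1-\alpha)\, d_r + \alpha\, d_c, \qquad \alpha \in [0,1],\\[4pt]
\text{required: } \quad
\Delta_C\bigl(d(\alpha)\bigr) &\;\ge\; \beta \cdot
  \max_{\bar D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\theta + d}) \,\le\, \delta} \Delta_C(d),
\qquad \beta \in [0,1].
\end{aligned}
```

Larger β commits more of each update to cost reduction; smaller β frees more of the trust-region budget for reward.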

Implementation — Hard constraints without hard resets

Mathematically, the method relaxes classical CPO by replacing “stay feasible at all times” with “reduce expected cost monotonically.” This subtle shift has large consequences:

  • No explicit recovery phase
  • No binary switching logic
  • No dependence on arbitrary cost thresholds

The final update is selected as the largest reward-improving step that still guarantees a minimum cost decrease. When the safety gradients vanish, the algorithm naturally reduces to standard TRPO.
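One way to write down the contrast, again in our own notation rather than the paper's: a CPO-style update keeps the expected cost J_C under a fixed budget at every iteration, whereas the requirement described here only asks each update to cut expected cost by a β-fraction of what the current trust region allows:

```latex
% Illustrative contrast in our notation; not quoted from the paper.
\underbrace{J_C(\pi_{k+1}) \;\le\; c_{\max}}_{\text{CPO-style: feasible at every step}}
\qquad \text{vs.} \qquad
\underbrace{J_C(\pi_{k+1}) \;\le\; J_C(\pi_k) - \beta\, \Delta_C^{\max}}_{\text{monotone cost decrease, as described above}}
```

Here Δ_C^max is the largest cost reduction the trust region permits. When the cost gradient is near zero, Δ_C^max is effectively zero and the condition is met by an ordinary TRPO step, matching the reduction to standard TRPO noted above.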

Practically, SB-TRPO avoids fragile second-order solvers by computing reward and cost steps separately via conjugate gradients, then blending them analytically. A standard TRPO-style line search enforces KL and cost conditions.
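A minimal sketch of what such an update might look like in code. Everything below is illustrative: the function names, the grid search over the mixing weight (the paper blends analytically), and the simplified acceptance test are our own choices, not the authors' implementation.

```python
import numpy as np

def conjugate_gradient(fvp, b, iters=10, tol=1e-10):
    """Approximately solve F x = b, where fvp(v) returns the Fisher-vector product F v."""
    x = np.zeros_like(b)
    r, p = b.copy(), b.copy()
    rs_old = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs_old / (p @ Fp + 1e-12)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def sb_trpo_style_step(theta, g_reward, g_cost, fvp, kl_radius=0.01, beta=0.5,
                       eval_kl=None, eval_cost_drop=None):
    """One safety-biased trust-region step (illustrative sketch, not the paper's code)."""
    # Natural-gradient directions: ascend on reward, descend on cost.
    d_r = conjugate_gradient(fvp, g_reward)
    d_c = conjugate_gradient(fvp, -g_cost)

    # Scale each direction to the trust-region boundary (KL approx: 0.5 * d^T F d <= kl_radius).
    def to_boundary(d):
        return d * np.sqrt(2.0 * kl_radius / (d @ fvp(d) + 1e-12))

    d_r, d_c = to_boundary(d_r), to_boundary(d_c)

    # First-order prediction of the best achievable cost drop inside the trust region,
    # and the beta-fraction of it that any accepted step must deliver.
    max_cost_drop = -g_cost @ d_c
    required_drop = beta * max_cost_drop

    # Blend the two directions: take the most reward-oriented mix (smallest alpha)
    # whose predicted cost drop still meets the requirement.  (The paper blends
    # analytically; a coarse grid keeps this sketch short.)
    direction = d_c
    for alpha in np.linspace(0.0, 1.0, 21):
        d = (1.0 - alpha) * d_r + alpha * d_c
        if -g_cost @ d >= required_drop:
            direction = d
            break

    # TRPO-style backtracking line search on the measured KL and cost conditions.
    step = 1.0
    for _ in range(10):
        theta_new = theta + step * direction
        kl_ok = eval_kl is None or eval_kl(theta_new) <= kl_radius
        cost_ok = eval_cost_drop is None or eval_cost_drop(theta_new) >= 0.0
        if kl_ok and cost_ok:
            return theta_new
        step *= 0.5
    return theta  # reject the update if no acceptable step is found
```

In this sketch the only extra work relative to plain TRPO is the second conjugate-gradient solve plus the cheap blending and acceptance checks.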

Findings — Safety and reward, quantified

The authors evaluate SB-TRPO on the hardest Safety Gymnasium benchmarks, using a zero-cost threshold throughout. Three metrics matter most:

  Metric                What it captures
  --------------------  -----------------------------------------
  Safety probability    Fraction of episodes with zero violations
  Safe reward           Return achieved only in safe episodes
  SCR                   Combined safety–cost–reward score
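To make the first two metrics concrete, here is a small self-contained helper (our own illustration; the exact SCR formula is defined in the paper and not reproduced here):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Episode:
    total_reward: float   # undiscounted return of the episode
    total_cost: float     # accumulated safety cost (0 means no violations)

def safety_probability(episodes: List[Episode]) -> float:
    """Fraction of evaluation episodes with zero accumulated cost."""
    safe = sum(1 for ep in episodes if ep.total_cost == 0)
    return safe / len(episodes)

def safe_reward(episodes: List[Episode]) -> float:
    """Average return over the safe (zero-cost) episodes only."""
    safe_eps = [ep for ep in episodes if ep.total_cost == 0]
    return sum(ep.total_reward for ep in safe_eps) / len(safe_eps) if safe_eps else 0.0

# Example: two safe episodes and one with a violation.
eps = [Episode(12.0, 0.0), Episode(10.0, 0.0), Episode(15.0, 2.0)]
print(safety_probability(eps))  # ~0.67
print(safe_reward(eps))         # 11.0
```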

Across navigation and velocity-control tasks, SB-TRPO consistently lands on the Pareto frontier: no competing method achieves higher reward without increasing violations, or lower violations without sacrificing reward.

Notably, many baselines achieve low cost by abandoning the task entirely. SB-TRPO does not.

Implications — Why this matters beyond benchmarks

This paper quietly reframes how hard constraints should be handled in learning systems:

  • Zero-cost constraints are not pathological—they are often the correct formulation.
  • Safety should shape every update, not just recovery phases.
  • Smooth trade-offs outperform binary enforcement.

For practitioners designing autonomous systems, the lesson is clear: safety mechanisms that freeze learning are not safe—they merely defer failure.

Limitations — No miracles promised

SB-TRPO does not guarantee absolute safety in all environments, especially under sparse or noisy cost signals. Its guarantees are local and asymptotic, and depend on gradient estimation quality.

It is also tailored to hard constraints; soft-budget CMDPs may prefer different tools. But as a blueprint for principled safety-aware learning, the contribution is substantial.

Conclusion — Safety as a first-class optimization objective

SB-TRPO shows that safety and performance do not need to be traded on alternating timesteps. By embedding safety directly into the geometry of policy updates, the algorithm achieves what many safe RL methods only promise: agents that remain cautious and capable.

Not flashy. Just correct.

Cognaptus: Automate the Present, Incubate the Future.