Safety First, Reward Second — But Not Last

The safest robot in a factory is the one that never moves.

It will not collide with a worker, damage a component, cross a restricted boundary, or exceed a speed limit. Its incident statistics will be immaculate. Its productivity statistics will be less impressive.

This absurdly safe robot captures a genuine problem in reinforcement learning. When an agent is trained under strict safety constraints, an algorithm can reduce violations by teaching the agent to avoid doing anything difficult. The resulting policy may satisfy the safety department, at least on paper, while quietly failing at the job it was deployed to do.

A paper by Dominik Wagner, Ankit Kanwar, and Luke Ong examines this failure mode in model-free safe reinforcement learning and proposes Safety-Biased Trust Region Policy Optimisation, or SB-TRPO.¹ Its central argument is easy to misread.

The authors do not argue that hard safety requirements should be weakened so agents can earn more reward. They retain a zero-cost objective: unsafe events remain unacceptable. What they relax is the demand that every training update pursue the maximum possible safety improvement or immediately restore full feasibility.

That distinction—between the final safety standard and the path used to reach it—is where the paper becomes useful.

Zero violations are a specification, not a tuning preference

Safe reinforcement learning usually represents undesirable events through a cost signal. A robot receives reward for accomplishing its task and cost for actions such as contacting a hazard, leaving a permitted area, or exceeding a speed limit.

Many constrained reinforcement-learning methods then solve a problem resembling:

$$ \max_{\pi} J_r(\pi) \quad \text{subject to} \quad J_c(\pi) \leq d, $$

where $J_r(\pi)$ is expected reward, $J_c(\pi)$ is expected cost, and $d$ is an allowed cost threshold.

This formulation is sensible when the constraint is genuinely soft. Battery consumption, component wear, latency, and energy use can often be managed through budgets. A delivery robot may consume more power on one route provided that its average consumption remains acceptable.

Catastrophic events are different. A collision is not made acceptable because the quarterly average remains below 0.7.

For non-negative safety costs, the paper therefore formulates the hard-safety objective as:

$$ \max_{\pi} J_r(\pi) \quad \text{subject to} \quad J_c(\pi) = 0. $$

This zero-cost threshold is not simply an unusually strict hyperparameter. It represents a different problem specification: policies should avoid unsafe states entirely.

The distinction matters because a positive threshold combines two questions that should remain separate:

What behaviour is acceptable?
How should the algorithm learn that behaviour?

When a positive cost limit is introduced merely to help training progress, an optimisation convenience quietly becomes an authorised level of failure. The spreadsheet remains tidy. The robot remains capable of hitting things.
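The gap between a budget and a zero-cost specification can be made concrete with a few lines of Python. This is a minimal sketch under stated assumptions: the per-episode costs are invented for illustration, the function names are hypothetical, and expected cost is approximated by a plain sample mean.

```python
from statistics import mean


def expected_cost(episode_costs):
    """Monte Carlo estimate of J_c: the mean total cost per episode."""
    return mean(episode_costs)


def satisfies_budget(episode_costs, d):
    """Soft constraint: the estimated J_c must not exceed the budget d."""
    return expected_cost(episode_costs) <= d


def satisfies_zero_cost(episode_costs):
    """Hard constraint: with non-negative costs, J_c = 0 needs every episode at zero."""
    return all(c == 0 for c in episode_costs)


# Hypothetical data: nine clean episodes and one collision with cost 5.
costs = [0, 0, 0, 0, 0, 0, 0, 0, 0, 5]

budget_ok = satisfies_budget(costs, d=1.0)  # the mean is 0.5, so the budget is met
zero_ok = satisfies_zero_cost(costs)        # the single collision fails the hard check
print(budget_ok, zero_ok)
```

The same data passes the budget and fails the zero-cost check, which is the sense in which a positive threshold authorises a level of failure.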

Three ways to teach an agent not to crash

The paper is easiest to understand by comparing three update philosophies.

Update philosophy	What happens when safety conflicts with reward	Typical failure in a hard-constraint setting
Penalty-based methods	Unsafe behaviour is discouraged by increasing its penalty	Penalties remain too weak and the agent stays unsafe, or become too strong and the policy collapses into conservatism
Feasibility-and-recovery methods	Reward is pursued while feasible; violations trigger cost-only recovery	Recovery produces a safe but unproductive policy that cannot easily escape
SB-TRPO	Every update must reduce cost by a required amount, while remaining capacity is used for reward	Better balance is possible, but temporary training violations and imperfect gradient estimates remain

The difference is not philosophical decoration. It changes which directions the policy is allowed to move during training.

Penalty methods price safety and hope the price is correct

Lagrangian approaches convert the constraint into a penalty. As costs rise, the algorithm increases the price attached to unsafe behaviour.

This works naturally when safety and reward can be traded continuously. It becomes awkward when the desired cost is exactly zero.

With a zero-cost target, the multiplier can keep increasing whenever any violation remains. If the penalty starts too low, the agent continues taking risks. If it becomes large, reward optimisation is overwhelmed and the agent may settle into a low-activity policy. Sparse cost signals make recovery harder because the agent receives little information about near-misses or safer alternatives.

The practical problem is not that penalty methods ignore safety. It is that they reduce a hard requirement to a pricing problem without any reliable market price.
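The ratchet effect described above can be seen in a textbook dual-ascent update for the multiplier. The sketch below is illustrative only: it is not the update rule of any specific baseline in the paper, and the learning rate and cost estimates are hypothetical.

```python
def dual_ascent_step(lmbda, estimated_cost, cost_limit, lr):
    """Projected gradient ascent on the multiplier, which is kept non-negative."""
    return max(0.0, lmbda + lr * (estimated_cost - cost_limit))


def run_multiplier(costs, cost_limit, lr=0.5, lmbda=0.0):
    """Apply the update once per training iteration and record the multiplier."""
    history = []
    for c in costs:
        lmbda = dual_ascent_step(lmbda, c, cost_limit, lr)
        history.append(lmbda)
    return history


# Hypothetical cost estimates from six successive training iterations.
costs = [4.0, 2.0, 1.0, 0.0, 1.0, 0.0]

with_budget = run_multiplier(costs, cost_limit=2.0)
zero_target = run_multiplier(costs, cost_limit=0.0)

print(with_budget)  # the multiplier relaxes once cost falls below the budget
print(zero_target)  # the multiplier never decreases when the limit is zero
```

With a positive limit the multiplier can fall back once the agent is under budget. With a zero limit and non-negative costs, the term `estimated_cost - cost_limit` is never negative, so under this update rule the price of safety can only rise or stall.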

CPO-style recovery can make safety the only job

Constrained Policy Optimization, or CPO, takes a more direct approach. Within a trust region—a limited neighbourhood of possible policy changes—it tries to improve reward while preserving feasibility.

If no feasible update exists, CPO enters a recovery phase and follows a cost-reducing direction. Reward improvement is temporarily abandoned.

This seems sensible. When a system is unsafe, fix safety first.

The difficulty appears after recovery succeeds. A policy that has learned to stay away from hazards may also have learned to stay away from the task. Improving reward from that position may require temporarily exploring actions that increase estimated cost. A strict feasibility-preserving update rejects those directions.

The agent becomes trapped near a safely unproductive policy.

The paper’s qualitative results give this failure mode a physical form. In several navigation tasks, CPO and related methods avoid interactive regions, remain near the centre of an area, move backwards, or simply fail to approach useful targets. Nothing catastrophic happens. Not much else happens either.

SB-TRPO keeps safety and reward inside the same update

SB-TRPO removes the separate recovery phase.

At each training step, it asks for a policy update that improves reward while achieving at least a required reduction in cost. That required reduction is not a fixed threshold. It is calculated as a fraction of the best cost reduction currently achievable within the trust region.

Let $c_k^\ast$ denote the lowest cost reachable from the current policy $\pi_k$ within the permitted trust region. SB-TRPO requires the next policy to reduce cost by at least:

$$ \epsilon_k = \beta \left(J_c(\pi_k) - c_k^\ast\right), $$

where the safety bias $\beta \in (0,1]$ controls how much of the locally optimal cost reduction must be achieved.

A higher $\beta$ demands more aggressive safety progress. A lower $\beta$ preserves more room for reward improvement.

The important point is what $\beta$ does not represent. It is not an acceptable amount of failure. The final target remains zero cost. The parameter governs the aggressiveness of each training update, not the safety requirement of the finished policy.

When $\beta = 1$, the algorithm insists on the maximum available cost reduction and recovers CPO-like behaviour. When $\beta < 1$, it still requires cost to fall, but it does not require every step to sacrifice all other progress in pursuit of the steepest possible reduction.

Safety remains first. Reward is merely permitted to remain in the meeting.
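A small numeric example may help fix the role of $\beta$. The function below simply evaluates the formula for $\epsilon_k$ given above. The function name and the input values are hypothetical, and in practice $c_k^\ast$ would itself be an estimate.

```python
def required_cost_reduction(current_cost, best_reachable_cost, beta):
    """epsilon_k = beta * (J_c(pi_k) - c_k^*), with beta in (0, 1]."""
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must lie in (0, 1]")
    if best_reachable_cost > current_cost:
        raise ValueError("best reachable cost cannot exceed the current cost")
    return beta * (current_cost - best_reachable_cost)


# Hypothetical values: current cost 10, lowest cost reachable in the trust region 6.
for beta in (0.5, 0.75, 1.0):
    eps = required_cost_reduction(10.0, 6.0, beta)
    print(f"beta={beta}: reduce cost by at least {eps}, to at most {10.0 - eps}")
```

In this example $\beta = 1$ demands the full reduction of 4, while $\beta = 0.5$ demands 2 and leaves the rest of the step free for reward. In every case the target for the finished policy is still zero cost.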

SB-TRPO relaxes the route, not the destination

The likely objection is immediate: if an intermediate policy is allowed to remain infeasible, has the algorithm not abandoned hard safety?

No. It has abandoned the assumption that every training-stage policy must already satisfy the deployment-stage requirement.

These are different standards.

A zero-cost deployment objective specifies where the optimisation process should end. Requiring every intermediate update to be fully feasible specifies how the process is allowed to travel. The second requirement is stronger, and it can eliminate paths that eventually produce better safe policies.

SB-TRPO instead requires controlled local progress:

$$ J_c(\pi_{k+1}) \leq J_c(\pi_k) - \epsilon_k. $$

In the paper’s idealised formulation, cost therefore never increases from one update to the next. When no further cost decrease is locally possible, the update improves reward. If neither cost nor reward can improve, the policy has reached a local optimum within the trust region.
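Substituting the definition of $\epsilon_k$ into this condition makes the role of $\beta$ explicit. The rearrangement below is a worked step added for illustration, using only the symbols already defined:

$$ J_c(\pi_{k+1}) \leq J_c(\pi_k) - \beta\left(J_c(\pi_k) - c_k^\ast\right) = (1-\beta)\,J_c(\pi_k) + \beta\,c_k^\ast. $$

The bound is a weighted average of the current cost and the best locally reachable cost. Because the current policy lies inside its own trust region, $c_k^\ast \leq J_c(\pi_k)$, so the right-hand side never exceeds $J_c(\pi_k)$. Setting $\beta = 1$ tightens the bound to $c_k^\ast$ itself.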

The practical algorithm approximates this idea by calculating two trust-region directions:

$\Delta_r$, the reward-improving direction;
$\Delta_c$, the cost-reducing direction.

It then chooses an adaptive convex combination:

$$ \Delta = (1-\mu)\Delta_r + \mu\Delta_c, $$

with the smallest cost weighting $\mu$ that still satisfies the required local cost reduction.

This is more precise than simply averaging two gradients. The mixture changes according to their alignment and the safety progress required at that step.

When the reward direction already reduces cost sufficiently, the algorithm can take it with little or no adjustment. When reward and safety conflict, more of the cost-reducing direction is added. When they strongly conflict, safety receives the larger share.
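The three cases above can be sketched under a simplifying assumption: that the predicted cost change is linear in the step, so the cost change of a mixture is the same mixture of the two individual cost changes. The paper's practical computation may differ. The function names and numbers below are hypothetical.

```python
def smallest_cost_weight(cost_change_reward_dir, cost_change_cost_dir, epsilon):
    """Smallest mu in [0, 1] with (1 - mu) * a + mu * b <= -epsilon.

    a: predicted cost change along the reward direction Delta_r
    b: predicted cost change along the cost direction Delta_c (negative)
    """
    a, b = cost_change_reward_dir, cost_change_cost_dir
    if b > -epsilon:
        raise ValueError("even the pure cost direction misses the required reduction")
    if a <= -epsilon:
        return 0.0  # the reward direction already reduces cost enough
    # Here a > -epsilon >= b, so the ratio lies in (0, 1].
    return (a + epsilon) / (a - b)


def mix(delta_r, delta_c, mu):
    """Convex combination (1 - mu) * Delta_r + mu * Delta_c, element by element."""
    return [(1.0 - mu) * r + mu * c for r, c in zip(delta_r, delta_c)]


# Hypothetical setting: the cost direction lowers cost by 4 and beta = 0.5,
# so the required reduction is epsilon = 2.
aligned = smallest_cost_weight(-3.0, -4.0, 2.0)      # reward step already safe enough
neutral = smallest_cost_weight(0.0, -4.0, 2.0)       # reward step leaves cost unchanged
conflicting = smallest_cost_weight(4.0, -4.0, 2.0)   # reward step raises cost by 4
print(aligned, neutral, conflicting)
```

As the reward direction becomes more costly, the weight on the cost direction grows from 0 to 0.5 to 0.75 in this toy setting, which mirrors the qualitative behaviour described above.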

The update remains inside a KL-divergence trust region so that the new policy does not move too far from the current one. A backtracking line search checks the trust-region condition and verifies empirical reduction in the surrogate cost objective before accepting the step.
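A generic backtracking line search of this kind might look like the sketch below. It is not the paper's implementation: the acceptance rule (any strict decrease in the cost surrogate), the shrink factor, and the toy `toy_kl` and `toy_cost` functions are all assumptions chosen to keep the example self-contained.

```python
def backtracking_line_search(theta, step, kl_fn, cost_fn, max_kl,
                             shrink=0.5, max_backtracks=10):
    """Shrink the step until it stays in the trust region and lowers the cost surrogate.

    Returns (new_theta, accepted_scale), or (theta, 0.0) if no scale passes.
    """
    base_cost = cost_fn(theta)
    scale = 1.0
    for _ in range(max_backtracks):
        candidate = [t + scale * s for t, s in zip(theta, step)]
        in_trust_region = kl_fn(theta, candidate) <= max_kl
        cost_reduced = cost_fn(candidate) < base_cost
        if in_trust_region and cost_reduced:
            return candidate, scale
        scale *= shrink
    return list(theta), 0.0


# Toy stand-ins, not the paper's quantities: a quadratic "KL" and a quadratic cost.
def toy_kl(old, new):
    return 0.5 * sum((n - o) ** 2 for o, n in zip(old, new))


def toy_cost(theta):
    return sum(t * t for t in theta)


new_theta, scale = backtracking_line_search(
    theta=[2.0, 0.0], step=[-4.0, 0.0], kl_fn=toy_kl, cost_fn=toy_cost, max_kl=1.0
)
print(new_theta, scale)
```

In this toy run the full step and the half step both leave the trust region, so the search settles on a quarter step. Rejecting every candidate leaves the policy unchanged, which is the conservative fallback.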

The result is not “reward despite safety.” It is reward improvement within a continuing process of safety improvement.

What the theorem promises—and what sampling takes back

The paper provides local guarantees for the approximate update.

For sufficiently small steps, SB-TRPO reduces cost whenever the cost gradient is non-zero. It also improves reward whenever the reward and cost gradients are suitably aligned—specifically, when they are not pointing in adversarially conflicting directions.

This conditional reward guarantee matters because CPO-style recovery updates deliberately ignore reward outside the feasible region. SB-TRPO can continue improving reward even while the current policy remains infeasible.

The guarantee is local, however. It depends on exact gradients, smoothness assumptions, sufficiently small steps, and the accuracy of trust-region approximations. Practical reinforcement learning provides finite samples, noisy gradient estimates, sparse safety signals, and the usual collection of inconvenient facts that the theorem politely leaves outside.

The paper does not conceal the gap. Its training curves show temporary increases in cost, including cases where a previously near-feasible policy returns to infeasible behaviour. Sparse navigation costs are particularly difficult because the agent receives a cost only after touching a hazard; near-misses provide little warning.

This makes the theorem useful as a design explanation, not as a deployment certificate.

The benchmark measures safe usefulness, not safety alone

The main empirical study evaluates SB-TRPO against eleven model-free constrained reinforcement-learning baselines across eight unmodified, level-2 Safety Gymnasium tasks.

The tasks include navigation problems for Point and Car robots, plus velocity-control tasks for Hopper and Swimmer agents. Violations include contacting hazards, moving hazardous objects, leaving safe regions, and exceeding velocity limits.

Each method is trained for 20 million time steps across five seeded initialisations. Baseline cost limits are set to zero to target the same hard-safety objective.

The paper introduces two metrics that are especially appropriate for this setting:

Safety probability: the fraction of episodes completed with zero violations.
Safe reward: average return across all episodes, with the return from any unsafe episode replaced by zero.

Safe reward is deliberately unforgiving. A policy cannot compensate for unsafe episodes by earning large rewards during them. It receives credit only for productive behaviour completed safely.

This metric exposes the weakness of evaluating safety and reward separately. A policy can have high raw reward by taking risks, or high safety by avoiding the task. Neither is operationally attractive.
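Both metrics are simple to compute from per-episode logs. The sketch below follows the definitions given above; the function names and the four example episodes are hypothetical.

```python
def safety_probability(episode_costs):
    """Fraction of episodes completed with zero violations."""
    return sum(1 for c in episode_costs if c == 0) / len(episode_costs)


def safe_reward(episode_returns, episode_costs):
    """Mean return over all episodes, with unsafe episodes credited as zero."""
    credited = [r if c == 0 else 0.0 for r, c in zip(episode_returns, episode_costs)]
    return sum(credited) / len(credited)


# Hypothetical evaluation: the third episode earns the most reward but is unsafe.
returns = [10.0, 12.0, 30.0, 8.0]
costs = [0, 0, 3, 0]

raw_mean = sum(returns) / len(returns)
print(raw_mean)                      # 15.0
print(safe_reward(returns, costs))   # 7.5
print(safety_probability(costs))     # 0.75
```

The risky episode doubles the raw average but contributes nothing to safe reward, so the metric cannot be inflated by productive violations.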

The main result is a consistently better compromise, not universal domination

SB-TRPO does not produce the highest safety probability or highest safe reward in every task. That is precisely why the result is worth reading carefully.

The paper’s claim is that SB-TRPO consistently avoids the two worst corners of the problem: productive but unsafe policies, and safe but useless ones.

Selected results illustrate the pattern:

Task and method	Safe reward	Safety probability	Interpretation
Point Push — SB-TRPO	0.33 ± 0.18	0.79 ± 0.08	Positive safe reward with comparatively high safety
Point Push — C3PO	-0.15 ± 0.18	0.85 ± 0.05	Slightly safer, but negative safe reward
Point Push — P3O	0.23 ± 0.12	0.78 ± 0.10	Similar safety, lower safe reward
Car Circle — SB-TRPO	7.5 ± 1.5	0.99 ± 0.01	Near-perfect safety with meaningful task performance
Car Circle — TRPO-Lagrangian	11 ± 4	0.81 ± 0.26	More reward, materially lower and more variable safety
Car Circle — CPO	0.96 ± 0.76	0.99 ± 0.03	Similar safety, far less useful behaviour
Swimmer Velocity — SB-TRPO	48 ± 26	0.98 ± 0.02	High safety with substantial safe reward
Swimmer Velocity — C3PO	78 ± 56	0.92 ± 0.08	Higher reward, lower safety and greater variability
Swimmer Velocity — CPO	-3.8 ± 9.3	0.99 ± 0.03	Very safe, but effectively unproductive

Car Circle is the cleanest example. TRPO-Lagrangian earns more safe reward, but its safety probability falls to 0.81 and varies substantially across seeds. CPO reaches approximately the same safety probability as SB-TRPO, but with only a small fraction of its safe reward.

SB-TRPO occupies the useful middle: almost all episodes are safe, and the agent still learns the intended circling behaviour.

Point Button shows the boundary more clearly. SB-TRPO achieves positive safe reward, but its safety probability is only $0.47 \pm 0.10$. CPO and C-TRPO achieve safety probabilities around 0.76, but with strongly negative safe rewards. None of these outcomes qualifies as solved hard safety.

The paper therefore demonstrates a better balance, not almost-sure safety across every difficult environment.

That is less dramatic than announcing that catastrophic-risk reinforcement learning has been solved. It is also more credible.

Gradient alignment explains why recovery methods stop working

Final benchmark numbers show what happened. The paper’s gradient-angle analysis helps explain why.

For Car Circle and Point Button, the authors measure the angle between each policy update and the reward gradient.

An angle below $90^\circ$ indicates that an update has a reward-improving component. An angle near $90^\circ$ is approximately indifferent to reward. An angle above $90^\circ$ points against reward improvement.
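The angle itself is a standard computation. The sketch below uses the ordinary Euclidean inner product on flattened parameter vectors, which is an assumption: the paper may measure the angle under a different metric. The vectors are toy examples.

```python
import math


def angle_degrees(update, reward_gradient):
    """Angle between a policy update and the reward gradient, in degrees."""
    dot = sum(u * g for u, g in zip(update, reward_gradient))
    norm_u = math.sqrt(sum(u * u for u in update))
    norm_g = math.sqrt(sum(g * g for g in reward_gradient))
    if norm_u == 0.0 or norm_g == 0.0:
        raise ValueError("the angle is undefined for a zero vector")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_g)))
    return math.degrees(math.acos(cosine))


reward_gradient = [1.0, 0.0]
improving = angle_degrees([1.0, 1.0], reward_gradient)     # about 45 degrees
indifferent = angle_degrees([0.0, 1.0], reward_gradient)   # about 90 degrees
opposing = angle_degrees([-1.0, 1.0], reward_gradient)     # about 135 degrees
print(improving, indifferent, opposing)
```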

CPO and C-TRPO updates are frequently near or above $90^\circ$ relative to the reward gradient. They spend much of training in recovery-like behaviour: reducing cost while making little progress on the task, or actively moving away from reward.

SB-TRPO updates generally remain below $90^\circ$. In Car Circle, they cluster mostly around $60^\circ$ to $70^\circ$. At the same time, the updates remain aligned with the cost-reducing direction, although less aggressively than CPO or C-TRPO.

This is the mechanism the final results imply:

CPO-style methods frequently select updates that are excellent for recovery and poor for learning the task.
SB-TRPO selects updates that still reduce cost but preserve a meaningful reward-improving component.
The resulting policy learns safer ways to perform the task rather than merely learning to avoid it.

The qualitative videos reinforce this interpretation. On Point Button, SB-TRPO learns to circle the interactive region and navigate around moving obstacles before approaching the target. On Car Goal, it steers around hazards while remaining productive. Several conservative baselines instead avoid the interactive region.

This behavioural evidence is not a separate claim. It is a visible consequence of the update geometry.

The appendix tests several different claims

The supplementary experiments are useful, but they do not all carry the same evidential weight. Some test the central mechanism; others test sensitivity or implementation choices.

Test	Likely purpose	What it supports	What it does not prove
Main eight-task comparison	Main evidence	SB-TRPO offers a consistently strong safety–reward balance under zero-cost baselines	Real-world safety or universal dominance
Positive cost-limit comparison	Sensitivity test and comparison with relaxed specifications	Allowing a positive threshold can sharply reduce safety without reliable reward improvement	That every positive-threshold CMDP is inappropriate
Safety-bias sweep	Ablation and robustness test	$\beta$ predictably moves policies along a reward–safety frontier	That $\beta$ never needs task-specific validation
MC versus GAE comparison	Implementation ablation	Critics change the position on the frontier but not the basic mechanism	That critic-free training is always superior
Longer Point Button training	Exploratory robustness check	SB-TRPO can continue improving after baselines plateau	Guaranteed convergence on difficult tasks
Approximation-error analysis	Implementation validation	Practical approximations appear conservative and cost predictions are reasonably accurate	Exact theoretical guarantees under noisy training

Relaxing the cost limit often buys surprisingly little

The authors rerun selected baselines with a positive cost limit of 15.

This is a sensitivity test of the paper’s zero-cost formulation. If positive limits simply made learning easier while preserving meaningful safety, the argument for a dedicated hard-constraint method would weaken.

Instead, several baselines use the extra allowance by becoming substantially less safe.

On Car Circle, CPO moves from a safe reward of $0.96$ and safety probability of $0.99$ under a zero limit to a safe reward of $8.3$ and safety probability of $0.54$ under the relaxed limit. The agent becomes much more productive, but roughly half the evaluated episodes are no longer violation-free.

Other cases are less flattering. TRPO-Lagrangian on Car Circle falls from a safe reward of $11$ and safety probability of $0.81$ to a safe reward of $6.8$ and safety probability of $0.44$. The relaxed allowance produces less safety without even improving safe reward.

The result does not mean positive limits are inherently wrong. They remain appropriate for genuinely budgetable costs. It shows that a positive threshold is a poor substitute for a zero-violation objective when the underlying event is supposed to be unacceptable.

The safety bias behaves like a control knob, not a hidden target

The safety-bias ablation varies $\beta$ from 0.6 to 0.9 on Point Button and Car Goal.

As $\beta$ increases, safety generally rises while reward falls. The resulting policies trace an approximately linear Pareto frontier. The tested baselines commonly sit below this frontier or produce negative reward.

This supports the paper’s mechanism: $\beta$ predictably controls update aggressiveness.

It also reveals an operational reality. SB-TRPO removes the arbitrary positive cost threshold, but it does not remove judgement from training. Teams must still choose how aggressively each update should pursue safety progress.

The improvement is conceptual cleanliness. $\beta$ governs the learning process; it does not redefine how many catastrophic events are acceptable.

The critic-free variant is cheaper, but the compute claim is narrow

The main experiments use Monte Carlo advantage estimates without learned critics. On a single, non-parallelised Point Goal training run, the critic-free SB-TRPO update takes approximately $0.26 \pm 0.08$ seconds per epoch.

The GAE variant takes $8.12 \pm 0.95$ seconds. The tested baselines range from about 3.32 seconds for C-TRPO to more than 88 seconds for CPPO-PID, so the Monte Carlo variant is at least an order of magnitude cheaper per update epoch than every baseline.

That is a meaningful implementation result. It is not evidence of a tenfold reduction in total project cost.

The comparison measures update time per epoch on one task under one implementation. It does not include the economic cost of simulation design, validation, failed experiments, real-world data collection, or additional safeguards. Nor does it establish superior sample efficiency.
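The arithmetic behind the per-epoch comparison is easy to check. The sketch below uses only the mean timings quoted above and ignores the reported spread. The CPPO-PID figure is a lower bound, since the text gives it as more than 88 seconds.

```python
# Mean update time per epoch in seconds, as quoted in the text.
sb_trpo_mc = 0.26
per_epoch_seconds = {
    "SB-TRPO (GAE)": 8.12,
    "C-TRPO": 3.32,
    "CPPO-PID": 88.0,  # lower bound: reported as more than 88 seconds
}

ratios = {name: secs / sb_trpo_mc for name, secs in per_epoch_seconds.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the critic-free update time")
```

The smallest ratio, for C-TRPO, is a little under 13, which is where the order-of-magnitude statement comes from. These are ratios of update time only and say nothing about sample efficiency.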

The critic ablation also shows that cheaper is not automatically better. At the same $\beta=0.75$, GAE often achieves higher reward but lower safety than the Monte Carlo variant. Increasing the GAE safety bias to 0.8 produces a more comparable trade-off.

The useful conclusion is that critic-free SB-TRPO offers an inexpensive implementation path, while critic choice shifts where the policy lands on the safety–reward frontier.

The business lesson is to separate four safety decisions

The paper directly studies a reinforcement-learning algorithm in simulated benchmarks. Its business value lies less in the name of the optimiser than in the distinctions it forces decision-makers to make.

1. Separate unacceptable outcomes from budgetable costs

Organisations frequently place all constraints into the same optimisation framework because doing so is convenient.

That is a governance mistake.

Energy consumption, latency, maintenance wear, and inventory holding costs may tolerate positive budgets. Worker collisions, prohibited transactions, irreversible equipment damage, and certain regulatory breaches may not.

The first category can be priced. The second should be specified as a hard constraint, even when the learning system cannot yet satisfy it perfectly.

Otherwise, an algorithmic tuning parameter becomes an undeclared risk-acceptance policy.

2. Separate deployment safety from training-stage exploration

SB-TRPO may temporarily use infeasible intermediate policies while progressing toward a safer final policy.

That is acceptable in a simulator. It is not automatically acceptable on a factory floor.

For genuinely catastrophic constraints, businesses should train in simulation, digital twins, restricted test environments, or behind independent safety shields. A learning algorithm that tends toward safety is not a substitute for a system that prevents unsafe actions during learning.

The paper’s contribution is to improve the optimisation path. It does not eliminate the need to control where that path is travelled.

3. Measure safe productivity, not incidents alone

A system that records zero incidents because it refuses difficult tasks is not necessarily successful.

Safety dashboards should therefore pair violation metrics with measures of useful work completed safely. The paper’s safe-reward metric provides a simple pattern: assign no performance credit to outcomes obtained through unsafe episodes.

An operational analogue might measure:

orders fulfilled without intervention;
kilometres travelled without safety overrides;
tasks completed without entering prohibited states;
production output achieved without violating process limits.

This prevents teams from celebrating either reckless productivity or decorative safety.

4. Diagnose update conflict before adjusting penalties

When an agent repeatedly alternates between unsafe productivity and safe inactivity, the problem may not be a missing penalty coefficient. Reward and safety gradients may be sending the policy in conflicting directions, while the optimiser handles that conflict poorly.

Gradient-alignment diagnostics can reveal whether training updates are:

helping both objectives;
improving safety while ignoring reward;
improving reward while increasing risk;
or moving against both.

This is more informative than watching a single average cost curve and repeatedly adjusting penalty weights until the graph becomes less embarrassing.
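A first-order version of this diagnostic needs only three vectors per update. The sketch below is an illustration, not the paper's procedure: it assumes estimates of the reward and cost gradients are available, treats the dot product with each gradient as the predicted change, and uses hypothetical labels for the four cases listed above.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def classify_update(update, reward_gradient, cost_gradient, tol=1e-12):
    """Label an update by its first-order effect on reward and cost."""
    improves_reward = dot(update, reward_gradient) > tol  # predicted reward rises
    reduces_cost = dot(update, cost_gradient) < -tol      # predicted cost falls
    if improves_reward and reduces_cost:
        return "helps both objectives"
    if reduces_cost:
        return "improves safety, not reward"
    if improves_reward:
        return "improves reward, not safety"
    return "helps neither objective"


# Toy gradients: reward increases along the first axis, cost along the second.
reward_gradient = [1.0, 0.0]
cost_gradient = [0.0, 1.0]

labels = [
    classify_update(u, reward_gradient, cost_gradient)
    for u in ([1.0, -1.0], [-1.0, -1.0], [1.0, 1.0], [-1.0, 1.0])
]
print(labels)
```

Counting these labels over a training run gives a rough picture of how the optimiser resolves conflict. A run dominated by the second label is a candidate for the safe-but-idle failure described earlier.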

What the paper directly shows, what businesses may infer, and what remains uncertain

Level	Conclusion
Directly shown by the paper	Across eight simulated Safety Gymnasium tasks, SB-TRPO consistently achieves a strong balance between zero-violation episode probability and useful reward, while avoiding a separate cost-only recovery phase
Reasonable business inference	Hard safety requirements should remain separate from training-control parameters, and safe productivity should be evaluated alongside incident rates
Still uncertain	Whether SB-TRPO maintains its advantage on real robots, under distribution shift, with multiple interacting constraints, or when rare catastrophic events are poorly represented in training

This separation matters because safe-AI research is unusually vulnerable to hopeful extrapolation. A simulator can show that one optimiser handles a benchmark more effectively. It cannot certify an autonomous system for an environment it has never encountered.

Hard safety remains unsolved

SB-TRPO improves how a model-free agent approaches a hard constraint. It does not guarantee hard safety in practice.

Several boundaries should remain visible.

First, the experiments use simulated Safety Gymnasium tasks. These are valuable standardised benchmarks, but they do not reproduce the full uncertainty, sensor failure, mechanical wear, human unpredictability, or distribution shift of deployed autonomous systems.

Second, the theoretical guarantees are local and rely on sufficiently small updates and accurate gradients. Practical training uses estimates. The paper documents temporary cost increases and returns to infeasibility, especially when cost signals are sparse.

Third, SB-TRPO targets zero-cost hard constraints. It is not designed as a general replacement for methods handling positive cost budgets.

Fourth, the algorithm does not achieve almost-sure safety on every challenging benchmark. Point Button remains an obvious example: positive safe reward is obtained, but fewer than half of evaluated episodes are violation-free.

Finally, a zero-cost objective cannot compensate for an incomplete definition of cost. If the system does not observe a near-miss, hidden hazard, or downstream consequence, the optimiser cannot reduce it. Precisely optimising the wrong safety signal remains a remarkably efficient way to be unsafe.

For real deployment, SB-TRPO would therefore belong inside a wider safety architecture: simulation, shielding, independent monitors, fail-safe controls, staged testing, and human governance.

Safety first should not mean reward never

The paper’s most useful contribution is not a claim that safety and performance are naturally compatible. Frequently, they are not.

Its contribution is a better rule for handling the conflict.

Penalty methods risk treating catastrophic failure as a priceable inconvenience. Feasibility-recovery methods can make the agent so cautious that it stops performing the task. SB-TRPO preserves the zero-cost destination while allowing every update to reduce risk without automatically discarding reward progress.

That is a narrower achievement than solving safe reinforcement learning. It is also a more practical one.

A robot that never moves may remain the safest robot in the factory. SB-TRPO asks whether the second-safest robot can finally do some work.

Cognaptus: Automate the Present, Incubate the Future.

¹ Dominik Wagner, Ankit Kanwar, and Luke Ong, “SB-TRPO: Towards Safe Reinforcement Learning with Hard Constraints,” arXiv:2512.23770, https://arxiv.org/abs/2512.23770.
