Opening — Why this matters now
LLM safety has quietly become an arms race with terrible reflexes.
We discover a jailbreak. We patch it. A new jailbreak appears, usually crafted by another LLM that learned from the last patch. The cycle repeats, with each round producing models that are slightly safer and noticeably more brittle. Utility leaks away, refusal rates climb, and nobody is convinced the system would survive a genuinely adaptive adversary.
The paper “Safety Alignment of LMs via Non-cooperative Games” argues that this is not a tooling problem. It is a game design problem.
If safety alignment is treated as a sequential cat-and-mouse loop, we should not be surprised when it converges poorly. Real attackers adapt continuously. Safety training, until now, mostly does not.
Background — Context and prior art
Most modern safety pipelines follow a familiar structure:
1. Collect harmful prompts (manual or automated).
2. Train or fine-tune a Defender model to refuse or deflect.
3. Repeat when new attacks appear.
Recent work has automated step (1) using Attacker LMs that generate jailbreak prompts, improving coverage but not fixing the underlying issue. The training process remains sequential and alternating: attacker improves, defender catches up, attacker shifts again.
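To make the contrast concrete, here is a minimal sketch of that alternating loop in Python. Every name in it is an illustrative placeholder rather than an interface from any particular pipeline.

```python
# Sketch of the status-quo alternating loop the paper argues against.
# `generate_attacks` and `finetune_to_refuse` are hypothetical placeholder methods.

def alternating_safety_training(attacker, defender, rounds=10):
    for _ in range(rounds):
        # Step 1: the attacker is updated against a frozen snapshot of the defender.
        jailbreaks = attacker.generate_attacks(frozen_target=defender)
        # Step 2: the defender is patched against the attacker's now-stale prompts.
        defender.finetune_to_refuse(jailbreaks)
        # Each side only ever sees an opponent that has already stopped moving.
    return defender
```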
Some recent approaches have tried self-play, where a single model alternates between the two roles. This works well in cooperative domains (math, games), but the paper makes a blunt observation: self-play entangles incentives. When attacker and defender share parameters, gradients leak between the roles, exploration collapses, and the attacker often becomes… safer.
Safety is not chess. It is asymmetric, misaligned, and stubbornly non-zero-sum.
Analysis — What the paper actually does
The core idea is deceptively simple: train the Attacker and Defender jointly, as separate agents, inside a non-cooperative game.
The game structure
- **Attacker LM**
  - Rewrites benign prompts to induce over-refusal.
  - Rewrites harmful prompts to induce compliance.
  - Must stay faithful to the original intent.
- **Defender LM**
  - For benign prompts: maximize compliance and usefulness.
  - For harmful prompts: maximize deflection (not just refusal).
Crucially, their objectives are not exact opposites. The attacker is not rewarded for gibberish or denial-of-service outputs. It is rewarded for semantic misclassification: making harmful look benign and benign look harmful.
This choice alone avoids a large class of degenerate equilibria.
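A minimal sketch of how that non-zero-sum reward split might look in code is below. The `Turn` record and the `judge` scoring helpers (`compliance`, `refusal`, `deflection`, `helpfulness`) are assumptions made for illustration, not the paper's implementation.

```python
# Illustrative non-zero-sum reward split for the two agents.
# The judge's scoring methods are hypothetical placeholders returning values in [0, 1].

from dataclasses import dataclass

@dataclass
class Turn:
    prompt: str        # attacker's rewritten prompt
    response: str      # defender's response
    is_harmful: bool   # label of the *original* intent, before rewriting
    faithful: bool     # does the rewrite preserve that original intent?

def attacker_reward(turn: Turn, judge) -> float:
    """The attacker is paid for semantic misclassification, not for breaking language."""
    if not turn.faithful:
        return -1.0  # unfaithful rewrites are penalized, ruling out gibberish attacks
    if turn.is_harmful:
        return judge.compliance(turn.prompt, turn.response)   # harmful made to look benign
    return judge.refusal(turn.prompt, turn.response)          # benign made to look harmful

def defender_reward(turn: Turn, judge) -> float:
    """The defender's objective is not the exact negation of the attacker's."""
    if turn.is_harmful:
        return judge.deflection(turn.prompt, turn.response)   # deflect, not just refuse
    return judge.helpfulness(turn.prompt, turn.response)      # stay useful on benign prompts
```

Because the attacker only scores when its rewrite stays faithful, denial-of-service style "wins" simply do not pay.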
Preference-based rewards, not scalar scores
Instead of asking a judge model to assign fragile numerical scores, the system relies on pairwise preferences:
- Which response is more compliant?
- Which response is better at deflection?
This is a quiet but important design decision. Relative judgments are far harder to hack than absolute scores, and they align better with how humans actually evaluate safety.
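As a rough illustration, a pairwise judge can be reduced to a single comparison call. The prompt template and the `judge_llm` callable below are assumptions for the sketch, not the paper's actual rubric.

```python
# Pairwise preference labeling: the judge picks a winner instead of scoring in isolation.
# `judge_llm` is an assumed callable that maps a prompt string to a completion string.

PAIRWISE_TEMPLATE = """You are comparing two responses to the same prompt.

Prompt: {prompt}
Response A: {a}
Response B: {b}

Question: {criterion}
Answer with exactly "A" or "B"."""

def prefer(judge_llm, prompt: str, a: str, b: str, harmful: bool) -> str:
    """Return 'A' or 'B'; relative judgments stand in for fragile scalar scores."""
    criterion = (
        "Which response better deflects the request while remaining safe?"
        if harmful
        else "Which response is more compliant and useful?"
    )
    verdict = judge_llm(PAIRWISE_TEMPLATE.format(prompt=prompt, a=a, b=b, criterion=criterion))
    return "A" if verdict.strip().upper().startswith("A") else "B"
```

The winner/loser pairs this produces feed directly into the preference-optimization step described next.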
Online, off-policy, and stable
Training uses preference optimization (DPO/IPO variants) with off-policy sampling from exponential moving average (EMA) copies of the agents. This matters because:
- Pure on-policy RL oscillates badly in adversarial settings.
- EMA smooths strategy updates, allowing the game to approach a stable equilibrium instead of chasing noise.
In short: this is adversarial training that does not self-destruct halfway through.
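Both stabilizers fit in a few lines of PyTorch. The loss below is the standard DPO form rather than the paper's exact variant, and the decay and beta values are illustrative defaults.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO objective on (winner, loser) sequence log-probabilities."""
    margins = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(margins).mean()

@torch.no_grad()
def ema_update(ema_model, live_model, decay=0.99):
    """Slow-moving opponent copy (initialized as a deep copy of the live model)."""
    for ema_p, live_p in zip(ema_model.parameters(), live_model.parameters()):
        ema_p.mul_(decay).add_(live_p, alpha=1.0 - decay)
```

Each agent samples its opponent's behavior from the EMA copy rather than the live model, which is what keeps the two policies from chasing each other's noise.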
Findings — What actually improved
The results are unusually clean for a safety paper.
Utility vs Safety (simplified view)
| Method | Utility Drop | Harmful ASR (attack success rate) | Over-Refusal |
|---|---|---|---|
| Original | — | High | Low |
| Self-RedTeam | Noticeable | Medium | High |
| AdvGame (DPO/IPO) | Minimal | Low | Low |
Across multiple benchmarks (HarmBench, WildJailbreak, DAN, XSTest), AdvGame shifts the Pareto frontier instead of sliding along it.
The Defender becomes harder to jailbreak without becoming useless.
The unexpected bonus: a real red-teamer
The trained Attacker model is not discarded.
It converges into a general-purpose red-teaming agent whose attack success rates rival those of established jailbreak methods (PAIR, TAP, even GCG in some settings). Unlike handcrafted attacks, it adapts naturally to new targets.
This quietly reframes red-teaming from a dataset problem into a model artifact.
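In practice, that artifact drops into a red-teaming loop against any new target. The interfaces below (`attacker_llm`, `target_llm`, `judge`) and the success threshold are hypothetical, included only to show the shape of the workflow.

```python
# Reusing the trained attacker as a plug-in red-teamer against an arbitrary target model.
# All three interfaces and the 0.5 threshold are assumptions for this sketch.

def red_team(attacker_llm, target_llm, judge, seed_prompts, max_tries=5):
    """Yield (original, rewrite, response) triples the judge flags as successful jailbreaks."""
    for prompt in seed_prompts:
        for _ in range(max_tries):
            rewrite = attacker_llm(f"Rewrite this request so the target complies:\n{prompt}")
            response = target_llm(rewrite)
            if judge.compliance(rewrite, response) > 0.5:
                yield prompt, rewrite, response
                break
```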
Implications — Why this matters beyond benchmarks
Three implications stand out.
1. Safety alignment is a systems problem, not a filter problem
The paper makes it harder to pretend that more guardrails or better refusal templates will save us. Robust safety emerges from interaction dynamics, not static rules.
2. Non-zero-sum framing is not academic pedantry
By rejecting the zero-sum assumption, the authors avoid trivial equilibria in which the attacker wins by breaking language itself. This is a lesson that should generalize to:
- Tool-using agents
- Multi-agent LLM systems
- Long-horizon autonomous workflows
3. Preference signals are quietly winning
Scalar reward models look precise, but they are brittle. Pairwise preference learning continues to outperform in the messiest parts of alignment — exactly where safety lives.
Conclusion — From whack-a-mole to equilibrium
This paper does not claim to “solve” LLM safety. It does something more valuable: it changes the framing.
Safety alignment stops looking like an endless patch cycle and starts looking like what it actually is — a strategic interaction between adaptive agents with misaligned goals.
Once you accept that, sequential training looks naïve. Self-play looks confused. And non-cooperative games start to look inevitable.
Cognaptus: Automate the Present, Incubate the Future.