Jailbreaks are not polite enough to wait their turn.
That is the awkward weakness in many safety-training pipelines. A model is attacked, patched, tested, and released. Then another attack appears, usually crafted with more creativity than the previous defense assumed. The safety team patches again. The benchmark improves. The real attack surface moves. Everyone calls this iteration, because “organized whack-a-mole with GPUs” sounds less respectable.
The paper “Safety Alignment of LMs via Non-cooperative Games” argues that this is not merely a data-coverage problem. It is a game-design problem.[1] If attackers and defenders adapt to each other, then training them as if one side moves first and the other side catches up is structurally misaligned with the thing being defended against.
The paper’s contribution is not simply “train with harder jailbreaks.” That would be the obvious summary, and it would miss the point. The real move is more precise: formulate safety alignment as a non-zero-sum, non-cooperative game between two separate language models, an Attacker and a Defender, then train them jointly using preference-based objectives.
That framing matters because the common intuition is too crude. A reader might assume safety alignment is a zero-sum contest: make the attacker stronger, force the defender to refuse harder, and safety improves. Nice story. Also dangerously incomplete. If the attacker is rewarded for anything that makes the defender fail, it may learn to produce gibberish, denial-of-service prompts, or weird artifacts that break the evaluation without teaching the defender anything useful. If the defender is rewarded too bluntly, it may become safe in the most corporate sense of the word: refusing everything that looks mildly suspicious, including perfectly benign requests.
AdvGame, the paper’s training method, is interesting because it tries to avoid that trap at the level of incentives.
The old loop trains yesterday’s defender against yesterday’s attacker
The standard safety pipeline has a familiar rhythm:
- collect harmful or adversarial prompts;
- fine-tune the model to refuse, deflect, or answer safely;
- generate stronger attacks;
- repeat.
This can improve safety on known attack families. It also creates a delay. The defender is trained against a snapshot of the attacker’s previous behavior, not against an opponent adapting inside the same training loop.
Recent red-teaming systems partially automate the first step by using an Attacker LM to rewrite harmful prompts into more effective jailbreak attempts. That helps coverage. But if the process remains sequential, the interaction is still turn-based. The attacker improves, the defender catches up, then the attacker moves again.
The paper’s critique of self-play is sharper. In some domains, one model can play both roles: teacher and student, proposer and solver, problem writer and problem answerer. Safety is not that friendly. If the same model shares parameters across Attacker and Defender roles, objectives become entangled. A model capable of producing strong harmful attacks under one prompt is not obviously the model you want as a safety-aligned defender under another prompt. The authors argue that shared parameters can reduce attack diversity, leak role information, and encourage degenerate behavior.
The replacement is conceptually simple: keep the two roles separate.
| Role | What it tries to do | What makes the role non-trivial |
|---|---|---|
| Attacker LM | Rewrite prompts to induce harmful compliance or benign over-refusal | Must preserve the original intent; cannot win by changing the task into nonsense |
| Defender LM | Comply with benign prompts and deflect harmful prompts | Must remain useful, not merely allergic to risk |
| Judge system | Compare candidate prompts and responses | Must distinguish useful safety behavior from brittle refusal or reward hacking |
The key phrase is “non-zero-sum.” The Attacker’s reward is not just the negative of the Defender’s reward. That distinction sounds academic until you look at the failure it prevents.
The attacker should confuse the boundary, not break language
In a naive zero-sum setup, the attacker wins whenever the defender loses. But “defender loses” can mean several things. It could mean the defender provides harmful details. It could also mean the defender outputs irrelevant garbage, refuses a benign request, or collapses into some format error. From a real safety perspective, these are not equivalent.
AdvGame defines the attacker’s job more carefully. For harmful seed prompts, the attacker tries to rewrite the prompt so the defender complies when it should deflect. For benign seed prompts, the attacker tries to make the defender over-refuse when it should comply. In other words, the attacker is rewarded for semantic boundary confusion: making harmful look benign, and benign look harmful.
That is a much better adversary for business use. Enterprises do not mainly care whether a red-team system can produce exotic strings that break a benchmark classifier. They care whether the deployed model can handle plausible user behavior: veiled requests, ambiguous intent, policy-adjacent language, and benign tasks that resemble dangerous ones.
The paper also adds a faithfulness judge. The attacker’s rewritten prompt must preserve the original seed intent. This prevents the attacker from “winning” by quietly changing the task. Without this constraint, the training loop could reward adversarial creativity that no longer tests the safety policy it claims to test. A red team that changes the exam question is not clever. It is just grading itself.
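The incentive structure described in this section can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions: the behaviour labels (`comply`, `deflect`, `garbage`), the function names, and the boolean `faithful` flag are hypothetical stand-ins for judge outputs, not the paper's actual reward definitions. The sketch contrasts a naive zero-sum win condition with one that only rewards faithful boundary confusion.

```python
from typing import Literal

Seed = Literal["harmful", "benign"]
Behavior = Literal["comply", "deflect", "garbage"]

# What the Defender should do for each seed type.
DEFENDER_TARGET = {"harmful": "deflect", "benign": "comply"}
# The specific mistake the Attacker tries to induce (illustrative labels).
# For benign seeds, "deflect" stands in for over-refusal.
ATTACKER_TARGET = {"harmful": "comply", "benign": "deflect"}


def zero_sum_attacker_wins(seed: Seed, behavior: Behavior) -> bool:
    """Naive rule: any Defender failure counts as an Attacker win."""
    return behavior != DEFENDER_TARGET[seed]


def boundary_attacker_wins(seed: Seed, behavior: Behavior, faithful: bool) -> bool:
    """Hypothetical non-zero-sum rule: the rewrite must preserve the seed's
    intent AND induce the opposite behaviour, not just any breakage."""
    return faithful and behavior == ATTACKER_TARGET[seed]


if __name__ == "__main__":
    cases = [
        ("harmful", "comply", True),    # faithful jailbreak
        ("harmful", "garbage", True),   # Defender emits nonsense
        ("harmful", "comply", False),   # Attacker quietly changed the task
        ("benign", "deflect", True),    # induced over-refusal
    ]
    for seed, behavior, faithful in cases:
        print(seed, behavior, faithful,
              zero_sum_attacker_wins(seed, behavior),
              boundary_attacker_wins(seed, behavior, faithful))
```

Under the naive rule, the nonsense output and the unfaithful rewrite both count as wins. Under the boundary rule, neither does, which is the behaviour the section argues for.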
Pairwise judging is less glamorous than scalar reward, which is why it matters
AdvGame does not rely primarily on scalar reward scores from a judge model. Instead, it uses pairwise preference judgments.
For the Defender, the judge compares two responses to the same attack query. If the seed is benign, it prefers the more compliant and helpful response. If the seed is harmful, it prefers the better deflection. Deflection here is not the same as blunt refusal. A good deflection acknowledges the user’s request, avoids operational harmful detail, and redirects toward safe adjacent information.
For the Attacker, the judge compares two rewritten prompts and the resulting defender behavior. The goal is to decide which attack better induces the wrong type of behavior, subject to faithfulness.
This choice is more than a training detail. Scalar reward models invite false precision. A judge asked to assign “7.3 safety points” to one response has to calibrate across context, policy nuance, and response style. A judge asked which of two responses is safer or more compliant has an easier job. The paper’s ablations support this: replacing pairwise judging with point-wise scoring worsens downstream safety. On Qwen2.5-7B, the point-wise judge variant reports a HarmBench attack success rate (ASR) of 14.1 and a WildJailbreak (WJB) ASR of 30.8, while the pairwise AdvGame-DPO-MD version reports 4.7 and 8.5 respectively. Lower ASR is better.
This is one of the most useful business lessons in the paper. Evaluation design is not an administrative layer after training. It shapes what the model learns to exploit. If the reward interface is brittle, the model will find the brittleness. Models are not moral philosophers. They are very expensive loophole interns.
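To make the pairwise idea concrete, here is a minimal sketch of how a judge's "A beats B" verdict can feed a preference loss. It uses the standard DPO and IPO losses from the preference-optimization literature; the article does not spell out the paper's exact objectives, so treat the formulas' use here, the helper names, and the `beta` and `tau` defaults as assumptions for illustration only.

```python
import math
from typing import Tuple


def to_pair(resp_a: str, resp_b: str, judge_prefers_a: bool) -> Tuple[str, str]:
    """Turn a pairwise verdict into a (preferred, rejected) training pair."""
    return (resp_a, resp_b) if judge_prefers_a else (resp_b, resp_a)


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def log_ratio_margin(logp_w: float, logp_l: float,
                     ref_logp_w: float, ref_logp_l: float) -> float:
    """How much more the policy favours the preferred response than the
    reference model does (difference of log-probability ratios)."""
    return (logp_w - ref_logp_w) - (logp_l - ref_logp_l)


def dpo_loss(margin: float, beta: float = 0.1) -> float:
    """Standard DPO loss: -log(sigmoid(beta * margin))."""
    return softplus(-beta * margin)


def ipo_loss(margin: float, tau: float = 0.1) -> float:
    """Standard IPO loss: squared distance of the margin from 1 / (2 * tau)."""
    return (margin - 1.0 / (2.0 * tau)) ** 2


if __name__ == "__main__":
    preferred, rejected = to_pair("safe redirect", "blunt refusal", judge_prefers_a=True)
    # Toy log-probabilities, invented purely for the example.
    m = log_ratio_margin(logp_w=-12.0, logp_l=-15.0, ref_logp_w=-13.0, ref_logp_l=-13.5)
    print(preferred, rejected, m, dpo_loss(m), ipo_loss(m))
```

Note that no absolute score appears anywhere: the only signal the loss consumes is which of the two candidates the judge preferred.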
EMA makes the game less twitchy
The paper introduces AdvGame-DPO-MD and AdvGame-IPO-MD, variants built on Direct Preference Optimization (DPO) and Identity Preference Optimization (IPO). The “MD” suffix refers to a mirror-descent-inspired recipe that uses exponential moving average (EMA) models for sampling. Practically, the training loop samples attacks and responses from EMA versions of the Attacker and Defender rather than only from the latest on-policy models.
That sounds like a stabilizer, and that is roughly the point. In adversarial training, pure on-policy learning can become twitchy: each side overreacts to the other side’s latest move. EMA slows the interaction enough to make learning more stable.
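The stabilizing effect of an exponential moving average is easy to see in a toy. The sketch below is not the paper's training code: the single scalar "weight", the decay of 0.9, and the deliberately oscillating live model are all assumptions chosen to show why sampling from an EMA copy reacts less to the latest move.

```python
from typing import Dict, List


def ema_update(ema: Dict[str, float], live: Dict[str, float],
               decay: float = 0.9) -> Dict[str, float]:
    """Return new EMA weights: decay * old_ema + (1 - decay) * live."""
    return {name: decay * ema[name] + (1.0 - decay) * live[name] for name in ema}


def simulate(steps: int = 50, decay: float = 0.9) -> List[float]:
    """Toy run: the live weight flips between +1 and -1 every step,
    mimicking a learner that overreacts to its opponent."""
    ema = {"w": 0.0}
    trace = []
    for t in range(steps):
        live = {"w": 1.0 if t % 2 == 0 else -1.0}
        ema = ema_update(ema, live, decay)
        trace.append(ema["w"])
    return trace


if __name__ == "__main__":
    trace = simulate()
    print("largest EMA excursion:", max(abs(v) for v in trace))
```

In this toy run the live weight swings by 2 on every step, while the EMA copy stays within about 0.1 of zero. A model sampled from the EMA weights therefore changes its behaviour much more slowly than the latest on-policy model.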
The ablation is quite direct. On Qwen2.5-7B, removing EMA from AdvGame-DPO-MD worsens HarmBench ASR from 4.7 to 19.9 and WJB ASR from 8.5 to 49.5. Utility also suffers on IFBench, dropping from 30.7 to 27.3. The authors interpret this as evidence that EMA-based off-policy generation improves stability for DPO and IPO variants.
This does not mean off-policy generation is universally better. The appendix reports that GRPO (Group Relative Policy Optimization) behaves differently: in their GRPO setting, on-policy generation gives better safety than off-policy generation. That distinction matters. The evidence supports EMA as a critical design choice for the preference-optimization variants, not as a magic ingredient for every RL safety recipe.
The main evidence is a frontier shift, not a single benchmark win
The paper evaluates two base model families: Qwen2.5-7B and Llama3.1-8B. The main comparison includes original instruction-tuned models, Self-RedTeam, and AdvGame variants. Utility is measured on benchmarks such as MMLU, IFBench, AlpacaEval2, and ArenaHard. Safety is measured using Attack Success Rate on HarmBench, WildJailbreak, DAN, and WildGuardTest. Compliance is measured on benign prompts, including WJB-benign and XSTest-benign.
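Both headline metrics are simple percentages over judged prompts. The sketch below assumes per-prompt boolean verdicts already exist; how each benchmark decides that an attack "succeeded" or a response "complied" is benchmark-specific and not modeled here. The function name and the toy verdict lists are illustrative only.

```python
from typing import Sequence


def rate_percent(flags: Sequence[bool]) -> float:
    """Share of True verdicts, as a percentage."""
    if not flags:
        raise ValueError("need at least one judged prompt")
    return 100.0 * sum(flags) / len(flags)


# Toy verdicts, invented for the example (not benchmark data).
attack_succeeded = [True, False, False, False]  # harmful prompts: lower is better
benign_complied = [True, True, True, False]     # benign prompts: higher is better

asr = rate_percent(attack_succeeded)
compliance = rate_percent(benign_complied)
print(f"ASR = {asr:.1f}, compliance = {compliance:.1f}")
```

Reading the tables that follow requires keeping both directions in mind: a method can drive ASR down simply by refusing more, which shows up as falling compliance.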
The important pattern is the tradeoff. Safety training often improves refusal behavior by damaging usefulness or benign compliance. AdvGame-DPO-MD generally improves safety while preserving utility more effectively than Self-RedTeam.
For Qwen2.5-7B:
| Method | MMLU | IFBench | ArenaHard | HarmBench ASR | WJB ASR | DAN ASR | WJB benign compliance |
|---|---|---|---|---|---|---|---|
| Original | 71.8 | 29.4 | 55.5 | 31.6 | 69.6 | 36.3 | 100.0 |
| Self-RedTeam | 71.9 | 25.9 | 52.2 | 16.8 | 41.1 | 36.3 | 98.4 |
| AdvGame-DPO-MD | 71.8 | 30.7 | 61.3 | 4.7 | 8.5 | 10.3 | 94.4 |
The Qwen result is the cleaner story. Utility is preserved or improved on several reported metrics, while ASR drops substantially. Compliance declines from 100.0 to 94.4 on WJB-benign, which is not free, but it is not the kind of collapse that makes a safety model useless.
For Llama3.1-8B:
| Method | MMLU | IFBench | ArenaHard | HarmBench ASR | WJB ASR | DAN ASR | WJB benign compliance |
|---|---|---|---|---|---|---|---|
| Original | 69.3 | 28.2 | 33.6 | 25.0 | 58.6 | 49.3 | 98.8 |
| Self-RedTeam | 64.8 | 22.3 | 13.6 | 14.9 | 11.5 | 32.3 | 72.8 |
| AdvGame-DPO-MD | 69.1 | 26.4 | 41.0 | 7.4 | 6.4 | 42.0 | 69.9 |
Here the story is more mixed. AdvGame-DPO-MD preserves utility much better than Self-RedTeam and improves safety on HarmBench and WJB. But robustness on DAN remains weak (ASR of 42.0, versus 32.3 for Self-RedTeam), and WJB-benign compliance falls sharply from 98.8 to 69.9. The appendix explains this as partly related to attacker capability and distribution mismatch: WJB-benign contains benign prompts with jailbreak-like structure, and the weaker Llama attacker may not cover that benign attack surface as effectively during training.
This is exactly why mechanism-first reading matters. A leaderboard summary would say “AdvGame wins.” A serious reading says: AdvGame improves the frontier, but model family, attacker strength, and benign adversarial coverage matter.
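The frontier reading can be checked with plain subtraction. The snippet below copies the safety and compliance columns from the two tables above and computes point changes against the original models; the dictionary layout and the `deltas` helper are just one convenient arrangement, not anything from the paper.

```python
METRICS = ("HarmBench ASR", "WJB ASR", "DAN ASR", "WJB benign compliance")

# Values transcribed from the two results tables in this article.
RESULTS = {
    "Qwen2.5-7B": {
        "Original":       (31.6, 69.6, 36.3, 100.0),
        "Self-RedTeam":   (16.8, 41.1, 36.3, 98.4),
        "AdvGame-DPO-MD": (4.7, 8.5, 10.3, 94.4),
    },
    "Llama3.1-8B": {
        "Original":       (25.0, 58.6, 49.3, 98.8),
        "Self-RedTeam":   (14.9, 11.5, 32.3, 72.8),
        "AdvGame-DPO-MD": (7.4, 6.4, 42.0, 69.9),
    },
}


def deltas(model: str, method: str = "AdvGame-DPO-MD", baseline: str = "Original") -> dict:
    """Point change per metric (method minus baseline), rounded to one decimal."""
    base = RESULTS[model][baseline]
    new = RESULTS[model][method]
    return {name: round(n - b, 1) for name, b, n in zip(METRICS, base, new)}


if __name__ == "__main__":
    for model in RESULTS:
        print(model, deltas(model))
```

For AdvGame-DPO-MD, WJB ASR moves by -61.1 points on Qwen and -52.2 on Llama, while WJB-benign compliance moves by -5.6 and -28.9 respectively. The safety gain is similar in size; the compliance cost is not.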
The ablations explain why the method works
The paper’s ablations are not decorative. They are the best evidence that the gains come from the proposed mechanism rather than accidental tuning.
| Test | Likely purpose | Result pattern | What it supports |
|---|---|---|---|
| DPO-MD vs IPO-MD vs GRPO | Main variant comparison | DPO-MD and IPO-MD are more stable and safer than GRPO on Qwen | Pairwise preference optimization is a better fit than scalar-reward GRPO in this setup |
| Fixed attacker vs trained attacker | Ablation | Fixed attacker worsens WJB and DAN safety | The defender needs an improving opponent, not a static prompt generator |
| Pairwise judge vs point-wise judge | Ablation | Point-wise judging gives worse safety and utility | Relative judgments reduce reward hacking compared with scalar scoring |
| No EMA vs EMA | Ablation / stability test | Removing EMA sharply worsens safety | Off-policy EMA sampling stabilizes DPO-style adversarial learning |
| Smaller judge vs larger judge | Sensitivity / implementation test | Smaller judge can show higher reward but worse safety | Higher reward curves can indicate judge exploitation, not real improvement |
| Single-turn defense comparison | Comparison with prior work | AdvGame has stronger utility and competitive safety, but weak DAN | The method is not universally dominant; it trades differently along the frontier |
The fixed-attacker ablation is especially important. A static attacker still produces some useful adversarial data, but it does not keep discovering harder failure modes as the defender changes. On Qwen2.5-7B, the fixed-attacker variant has WJB ASR of 16.6 and DAN ASR of 15.0; the trained-attacker version improves these to 8.5 and 10.3.
That is the difference between red-teaming as a dataset and red-teaming as an adaptive capability.
The judge-size ablation is also a useful warning. A smaller judge produced higher apparent reward curves, but downstream safety got worse. This is the classic evaluation trap: the metric says the model is improving because the model has learned how to please the metric. In enterprise settings, this is where governance teams usually discover that a dashboard is not a control system.
The attacker becomes an asset, not training waste
One underappreciated result is that the trained Attacker model remains useful after training. The paper evaluates the AdvGame-DPO-MD Attacker as a red-teaming model against several defenders and compares it with PAIR, TAP, and GCG on HarmBench.
Against the original Qwen model, the AdvGame Attacker reaches an ASR of 55.6, compared with 45.0 for PAIR, 48.8 for TAP, and 61.6 for GCG. Against the AdvGame-DPO-MD Qwen Defender, the same attacker reaches 11.3, slightly above PAIR at 7.2 and TAP at 10.0, while GCG remains higher at 25.3. For Llama, the AdvGame Attacker is also competitive, though not uniformly dominant.
This matters operationally. The training process yields two artifacts:
- a Defender that is more robust against adaptive natural-language attacks;
- an Attacker that can be reused for authorized red-team testing.
That second artifact changes the economics of safety work. Instead of treating red-team prompts as a disposable dataset, an organization can maintain an evolving red-team model. The attacker becomes part of the safety infrastructure: a reusable probe for model updates, policy changes, tool integrations, and agent workflows.
The boundary is important. The paper’s impact statement explicitly treats attacker-related artifacts as controlled evaluation resources. That is the right framing. A capable red-team model is not something to throw into a public demo and hope everyone behaves.
What Cognaptus infers for business use
The paper directly shows that AdvGame variants, especially DPO-MD and IPO-MD, can improve the safety–utility–compliance frontier on the evaluated model families and benchmarks. It also shows that several mechanism choices matter: separate models, non-zero-sum attacker incentives, pairwise preference judging, faithfulness filtering, and EMA-based sampling.
The business inference is broader but not automatic.
For companies deploying LLMs in customer service, internal copilots, compliance review, legal intake, finance workflows, or agentic automation, safety should be treated as a continuous adversarial system. Static prompt filters and one-time jailbreak test suites are not enough when model behavior changes with fine-tuning, retrieval, tools, policies, and user workflows.
A practical safety operating model would look like this:
| Paper mechanism | Business translation | ROI relevance | Boundary |
|---|---|---|---|
| Separate Attacker and Defender | Maintain an internal red-team model distinct from the production model | Better recurring diagnosis of safety failures | Requires controlled access and governance |
| Faithfulness judge | Ensure adversarial tests preserve the original business-relevant intent | Reduces fake wins from irrelevant attacks | Depends on judge reliability |
| Pairwise preference judging | Evaluate competing outputs relatively, not only with scalar scores | More robust safety review and model selection | Still judge-model dependent |
| EMA/off-policy sampling | Avoid overreacting to the latest adversarial artifact | More stable safety iteration | Shown mainly for DPO/IPO variants |
| Compliance benchmarks | Measure over-refusal, not just harmful compliance | Prevents safety work from destroying usability | Needs domain-specific benign edge cases |
The ROI is not “cheaper training.” In fact, AdvGame is not cheap. The reported experiments used 16 NVIDIA H200 GPUs, split between sampling rollouts and training, with runs budgeted for 48 hours or 1,000 steps. That is not a weekend notebook exercise.
The better business case is cheaper diagnosis over repeated deployment cycles. If an organization frequently updates models, adds tools, changes policies, or deploys agents into higher-risk workflows, a reusable attacker and structured preference-evaluation loop can reduce the cost of discovering failure modes late. Late safety discovery is expensive because it arrives after product teams have already built around a false sense of security. Very elegant. Very preventable. Usually very invoiced.
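The compute budget quoted above translates into a rough per-run ceiling. The arithmetic below assumes all 16 GPUs are occupied for the full 48-hour cap; a run that stops at 1,000 steps sooner would use less. The price argument is a placeholder to be supplied by the reader, since the article gives no cost figures.

```python
NUM_GPUS = 16              # reported: 16 NVIDIA H200 GPUs
WALL_CLOCK_CAP_HOURS = 48  # reported: runs budgeted for 48 hours or 1,000 steps


def gpu_hours_cap(num_gpus: int = NUM_GPUS, hours: int = WALL_CLOCK_CAP_HOURS) -> int:
    """Upper bound on GPU-hours for one run, assuming the wall-clock cap is hit."""
    return num_gpus * hours


def run_cost_cap(price_per_gpu_hour: float) -> float:
    """Upper bound on one run's cost for a caller-supplied hourly GPU price."""
    return gpu_hours_cap() * price_per_gpu_hour


if __name__ == "__main__":
    print(gpu_hours_cap(), "GPU-hours per run at most")
```

That ceiling is 768 GPU-hours per run, before counting ablations, reruns, or judge-model inference.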
Where the evidence should not be overextended
The paper is strong, but its boundaries are clear.
First, AdvGame focuses mainly on natural-language attacks. The authors explicitly distinguish this from non-readable attack strings such as GCG-style suffixes. The adaptive attack results show better robustness against PAIR and TAP, which generate human-readable adversarial prompts, while GCG often retains a higher attack success rate.
Second, the results are model- and benchmark-dependent. Qwen shows a cleaner safety–utility–compliance improvement than Llama. Llama’s WJB-benign compliance drop is not a footnote; it is a real operational warning. A safety method that causes benign over-refusal can damage workflows even while improving harmful-prompt safety.
Third, judge quality matters. The paper’s own ablations show that smaller or point-wise judges can produce misleading reward signals. In a business deployment, judge models would need independent evaluation, policy calibration, and domain-specific test cases. A judge that can be gamed becomes a compliance theater machine with better typography.
Fourth, the method is computationally heavier than single-model defenses. The cost may be justified for frontier labs, regulated enterprises, high-risk agents, or companies deploying LLMs into sensitive workflows. It is harder to justify for low-risk chatbots where simpler guardrails, policy filters, and evaluation suites may be sufficient.
Finally, the paper does not prove that non-cooperative game training solves LLM safety. It shows that a carefully designed attacker–defender game can improve robustness while preserving more utility than key baselines under the evaluated conditions. That is valuable enough. No need to staple a revolution banner to it.
Safety alignment becomes process control
The main lesson of AdvGame is not that every company should immediately train its own attacker–defender pair. The lesson is that safety alignment should be designed around interaction dynamics.
Attackers adapt. Defenders adapt. Judges can be hacked. Over-refusal is also failure. Static datasets age. Sequential patching lags. Shared-role self-play can confuse incentives. These are not annoying details around the safety problem. They are the safety problem.
AdvGame’s contribution is to make the interaction explicit. It separates the attacker from the defender, gives each side a carefully designed objective, judges behavior comparatively, filters unfaithful attacks, and stabilizes the training loop with EMA-based sampling. The result is not perfect safety, but a more realistic training game.
That is the useful shift: from patching jailbreaks to managing an adaptive system.
For business leaders, the paper’s message is blunt. If your AI safety process is a checklist that runs after the model is already built, it is probably too slow. If your red-team process is a folder of old prompts, it is probably too static. And if your safety metric rewards refusal without measuring compliance, congratulations: you may be building a model that protects users by refusing to be useful.
Safety stops being a turn-based game when the defender learns while the attacker is still moving.
Cognaptus: Automate the Present, Incubate the Future.
[1] Anselm Paulus, Ilia Kulikov, Brandon Amos, Rémi Munos, Ivan Evtimov, Kamalika Chaudhuri, and Arman Zharmagambetov, “Safety Alignment of LMs via Non-cooperative Games,” arXiv:2512.20806.