Don’t Self-Sabotage Me Now: Rational Policy Gradients for Sane Multi-Agent Learning

Kitchen work is not hard because chopping onions is metaphysically difficult. It is hard because two people must agree, implicitly and quickly, who gets the onion, who holds the plate, who waits by the pot, and who moves out of the corridor before everyone performs a small culinary traffic accident.

That is why Overcooked remains such a useful multi-agent benchmark. It turns coordination into something visible. Agents do not merely need to “perform a task”; they need to infer what another agent is about to do and avoid becoming a sentient obstacle.

The paper behind today’s article, Robust and Diverse Multi-Agent Learning via Rational Policy Gradient, studies what happens when adversarial training enters this kind of cooperative world.¹ The answer is not flattering. If a cooperative agent is trained to minimise another agent’s reward, it can discover a very efficient trick: stop cooperating. Block the plate dispenser. Refuse to collect coins. Stand somewhere inconvenient. Congratulations, the partner’s score is now low. The experiment has technically succeeded, which is exactly the problem.

This is the paper’s useful irritation. Low cross-play performance is often treated as evidence that agents learned diverse strategies. In cooperative settings, it may instead mean they learned to sabotage the task. The metric smiles. The behaviour is nonsense. A familiar situation, really.

The authors propose Rationality-Preserving Policy Optimization, or RPO, and implement it through Rational Policy Gradient, or RPG. The core idea is simple enough to be dangerous: adversarial agents may attack, but they must remain rational. They are allowed to expose real coordination failures. They are not allowed to win by acting against their own interests.

That distinction is the paper’s centre of gravity.

The failure is not adversarial pressure; it is irrational adversarial pressure

Adversarial optimisation works beautifully in zero-sum settings because the incentives are clean. If one Go policy beats another, the loser has been shown a genuine weakness. The opponent’s pressure is informative because hurting the other player is exactly what the game already rewards.

Cooperative and general-sum games are less polite.

In a shared-reward task, an “adversary” trained to minimise another player’s reward can often reduce both players’ rewards together. That is self-sabotage: behaviour that satisfies the external training objective while violating the agent’s own incentives inside the underlying game.

The paper formalises this with a small matrix-game example before moving to richer environments. The point of the toy game is not decorative. It isolates the mechanism. A conventional adversary can choose an action that makes the victim’s reward low regardless of what the victim does. But that action is not a useful stress test. It teaches the victim nothing actionable because no alternative victim policy would perform well against it.

RPO changes the admissible attack. A policy must be a best response to at least one possible co-player policy. Put less formally: the adversary’s behaviour must make sense in some plausible world. It may be inconvenient, incompatible, or strategically awkward. It may not be pure vandalism wearing a lab coat.
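A toy version of the argument can be checked in a few lines. The sketch below is not the paper’s matrix game: the payoffs, the action names, and both helper functions are invented for illustration, and the rationality test only scans a grid of mixed strategies in a two-column game. It compares an unconstrained attack on a fixed victim with one restricted to actions that are a best response to at least one victim strategy.

```python
# Shared-reward matrix game with hypothetical payoffs (not taken from the paper).
# Rows are the adversary's actions, columns are the victim's actions,
# and both players receive PAYOFF[row][col].
PAYOFF = {
    "meet_left":  [1.0, 0.0],
    "meet_right": [0.0, 1.0],
    "sabotage":   [-1.0, -1.0],
}
VICTIM_ACTIONS = ["left", "right"]


def is_best_response_to_some_partner(action, payoff, grid=101):
    """True if `action` is optimal against at least one mixed victim strategy.

    Scans a grid over the probability p that the victim plays column 1,
    so it only handles two-column games and is approximate in general.
    """
    for k in range(grid):
        p = k / (grid - 1)
        expected = {a: (1 - p) * r[0] + p * r[1] for a, r in payoff.items()}
        if expected[action] >= max(expected.values()) - 1e-12:
            return True
    return False


def attack(victim_col, payoff, rational_only):
    """Pick the adversary action that minimises reward against a fixed victim."""
    candidates = [
        a for a in payoff
        if not rational_only or is_best_response_to_some_partner(a, payoff)
    ]
    return min(candidates, key=lambda a: payoff[a][victim_col])


victim_col = VICTIM_ACTIONS.index("left")
naive = attack(victim_col, PAYOFF, rational_only=False)
rational = attack(victim_col, PAYOFF, rational_only=True)

# Best reward that ANY victim action could earn against each attack.
best_vs_naive = max(PAYOFF[naive])
best_vs_rational = max(PAYOFF[rational])
print(naive, best_vs_naive, rational, best_vs_rational)
```

Against a victim that always goes left, the unconstrained attack picks sabotage, and no victim action earns more than -1 against it. The constrained attack picks meet_right, against which a victim willing to go right would score 1. That is the gap between vandalism and an actionable weakness.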

This matters because adversarial training is supposed to generate information. It should reveal where the agent’s assumptions break. Irrational sabotage reveals only that shared tasks fail when one participant stops caring about the task. This is not exactly Nobel material.

RPG turns rationality into a training mechanism

RPO is the formal constraint. RPG is the machinery that makes it trainable.

The algorithm introduces two roles:

Role	What it optimises	Why it exists
Base agent	Its own reward in the underlying game	Keeps the learned policy rational
Manipulator agent	The adversarial objective, indirectly	Steers the base agent toward useful hard cases

The base agent trains against a corresponding manipulator, but it does so by maximising its own reward. This is the key enforcement mechanism. The base policy is not directly told to sabotage anyone. It is shaped into a policy that remains reward-seeking.

The manipulator then uses opponent-shaping: it takes gradients through the base agent’s learning update and tries to influence where that base agent will move next. The manipulation is indirect. It does not control the final policy like a puppet. It sets up learning conditions that cause the base policy to arrive at rational but adversarially useful behaviour.
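To make “gradients through the learning update” concrete, here is a deliberately tiny sketch. It assumes a two-action matching game with exact expected rewards, a hypothetical fixed victim, one-parameter policies, and finite differences in place of the sampled higher-order gradient estimators the paper relies on. The function names, learning rates, and victim policy are illustrative assumptions, not the paper’s.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def match_reward(p, q):
    """Shared expected reward in a 2-action game that pays 1 when actions match."""
    return p * q + (1.0 - p) * (1.0 - q)


def base_update(b, m, lr=1.0):
    """One gradient-ascent step on the base agent's OWN reward vs the manipulator."""
    p, q = sigmoid(b), sigmoid(m)
    grad_b = p * (1.0 - p) * (2.0 * q - 1.0)  # d match_reward / d b
    return b + lr * grad_b


def victim_reward_after_update(b, m, victim_p):
    """Adversarial objective: the victim's reward against the UPDATED base agent."""
    return match_reward(sigmoid(base_update(b, m)), victim_p)


def manipulator_grad(b, m, victim_p, eps=1e-4):
    """Finite-difference gradient w.r.t. m, taken through the base update."""
    hi = victim_reward_after_update(b, m + eps, victim_p)
    lo = victim_reward_after_update(b, m - eps, victim_p)
    return (hi - lo) / (2.0 * eps)


VICTIM_P = 0.1   # hypothetical fixed victim: plays action 1 only 10% of the time
b, m = 0.0, 0.0  # logits of playing action 1 (base agent, manipulator)
for _ in range(200):
    m -= 5.0 * manipulator_grad(b, m, VICTIM_P)  # manipulator lowers victim reward
    b = base_update(b, m)                        # base agent raises its own reward

base_p, manip_q = sigmoid(b), sigmoid(m)
victim_reward = match_reward(base_p, VICTIM_P)
base_training_reward = match_reward(base_p, manip_q)
print(round(base_p, 3), round(manip_q, 3), round(victim_reward, 3))
```

The manipulator never pays the base agent for hurting the victim. It shifts its own play toward action 1, and the base agent, still maximising its own reward, follows. The base agent ends up mismatched with the victim while remaining a sensible response to its training partner.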

That sounds like technical fussing until you notice what it buys. Existing adversarial algorithms can be rebuilt inside the RPG framework:

RPG variant	Original idea being repaired	Practical purpose
AP-RPG	Adversarial Policy	Find rational weaknesses in a fixed victim policy
AT-RPG	Adversarial Training	Train a victim while rational adversaries expose weaknesses
PAIRED-RPG	Regret-based environment or partner design	Build harder curricula without unplayable sabotage
PAIRED-A-RPG	Fixed-victim regret attack	Find cases where a victim underperforms a stronger alternative
AD-RPG	Adversarial Diversity	Learn genuinely different policies rather than fake diversity through obstruction

The paper also adds partner-play regularisation. This is an implementation detail with strategic importance. Base agents train with manipulators but are evaluated with other base agents. That creates distribution shift. Partner-play mixes in rollouts with actual base-agent partners so manipulators cannot exploit the gap by pushing base agents into strange training-only behaviour.
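One simple way to picture the mixing is a per-rollout coin flip. The sketch below assumes exactly that; the article does not say how the paper schedules partner-play, so the function, the agent labels, and the 0.25 mixing probability are placeholders.

```python
import random


def sample_training_partner(rng, manipulator, base_partners, partner_play_prob):
    """Pick who the base agent trains with for one rollout."""
    if base_partners and rng.random() < partner_play_prob:
        return rng.choice(base_partners)  # partner-play rollout
    return manipulator                    # ordinary manipulator rollout


rng = random.Random(0)
manipulator = "manipulator_0"          # hypothetical agent labels
base_partners = ["base_1", "base_2"]
schedule = [
    sample_training_partner(rng, manipulator, base_partners, partner_play_prob=0.25)
    for _ in range(1000)
]
partner_play_fraction = sum(p != manipulator for p in schedule) / len(schedule)
print(partner_play_fraction)
```

With the mixing probability at zero, the base agent only ever sees its manipulator, which is the distribution gap the regulariser is meant to close.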

The mechanism, then, is not “make agents nicer.” It is more precise: keep the agent’s policy inside the set of behaviours that could be optimal for some partner, while still searching for adversarially useful incompatibilities.
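In symbols, one way to write that constraint is shown below. The notation is ours and may differ from the paper’s: $U_i$ is player $i$’s expected return in the underlying game, $\pi_{-i}$ is a co-player policy, and the second expression is the fixed-victim attack read through that constraint.

```latex
\[
\Pi_i^{\mathrm{rational}}
  = \Bigl\{\, \pi_i \;:\; \exists\, \pi_{-i} \ \text{such that}\
      \pi_i \in \arg\max_{\pi_i'} U_i(\pi_i', \pi_{-i}) \,\Bigr\},
\qquad
\min_{\pi_{\mathrm{adv}} \,\in\, \Pi_{\mathrm{adv}}^{\mathrm{rational}}}
  U_{\mathrm{victim}}(\pi_{\mathrm{victim}}, \pi_{\mathrm{adv}}).
\]
```

Dropping the membership constraint from the minimisation recovers the conventional attack, which is free to pick policies that are optimal against no partner at all.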

The diversity result is really a metric hygiene result

The first empirical claim is that RPG can find meaningfully diverse policies. The more interesting claim is that it can tell the difference between diversity and fraud.

The paper tests adversarial diversity in STORM and Overcooked. In adversarial diversity, agents are encouraged to perform well in self-play but poorly in cross-play. At first glance, low cross-play is desirable. It suggests that different policies have discovered incompatible conventions.
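The two quantities are easy to compute from a pairwise return matrix. The sketch below uses a made-up three-policy matrix and a hypothetical helper; the numbers are not from the paper.

```python
def summarize_cross_play(returns):
    """Summarise a square matrix where returns[i][j] is the mean return
    when policy i is paired with policy j."""
    n = len(returns)
    self_play = [returns[i][i] for i in range(n)]
    cross_play = [returns[i][j] for i in range(n) for j in range(n) if i != j]
    return {
        "self_play_mean": sum(self_play) / len(self_play),
        "cross_play_mean": sum(cross_play) / len(cross_play),
        "cross_play_min": min(cross_play),
    }


# Hypothetical returns for three independently trained policies.
returns = [
    [10.0, 2.0, 0.0],
    [2.0, 9.0, 1.0],
    [0.0, 1.0, 11.0],
]
summary = summarize_cross_play(returns)
print(summary)
```

A wide gap between the self-play and cross-play means is what adversarial diversity optimises for. The summary cannot say whether the gap comes from incompatible conventions or from obstruction, so the rollouts still need to be inspected.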

But this metric is easy to cheat. In Overcooked, the paper reports that AD and CoMeDi achieve low cross-play in the cramped room layout, but the qualitative behaviour reveals the trick: agents can simply block the plate dispenser, preventing partners from finishing dishes. The score falls. The “diversity” is fake.

AD-RPG behaves differently. In the cramped room result, the paper reports self-play and cross-play rewards of 240 and 240 for AD-RPG, compared with 240 and 1.25 for AD, and 220 and 2 for CoMeDi. The interpretation is subtle. AD-RPG is not “less diverse” in the naive sense; it is refusing to count sabotage as diversity. In this particular Overcooked layout, high cross-play under AD-RPG implies that there may be little genuine strategic diversity available without intentionally obstructing the partner.

That is a useful correction for business readers. A low compatibility score among agents can mean several things:

Observation	Naive interpretation	Better interpretation
Low cross-play, high self-play	The population learned diverse conventions	Possibly true, but inspect whether agents remain rational
Low cross-play with obstruction	Strong adversarial discovery	Fake diversity through sabotage
High cross-play under rational adversarial diversity	No useful diversity was found	Or the environment has limited genuine diversity
Moderate cross-play with adaptation	Agents found incompatible defaults but can recover	This is often the useful case

STORM makes the point more clearly. In the coin-collection environment, AD-RPG drives agents toward different initial colour preferences, reducing cross-play because their conventions differ. But when paired incompatibly, the agents can still adapt and collect matching coins for partial reward. The cross-play score falls only as far as needed to express real incompatibility. It does not collapse into refusal.

The paper also modifies STORM to include an easy sabotage state: standing in a particular grid square gives both agents a negative reward. CoMeDi learns to exploit this sabotage route, while AD-RPG avoids it. This is best read as a comparison with prior work, not a second main thesis. It shows why observation-distribution tricks are insufficient: self-sabotage can arise even when the observation structure is not the central issue.

Robustness improves when training pressure stays inside plausible behaviour

The second empirical claim is that AD-RPG can train policies that are more robust to unfamiliar partners.

The paper evaluates cross-play across multiple Overcooked layouts and simplified Hanabi variants. The Overcooked results are the stronger part of this evidence. Self-play agents often coordinate well with themselves or with similar seeds but fail against agents trained under different conventions. That is the familiar brittleness of cooperative self-play: everyone learns the same private handshake, then looks surprised when the next partner does not know it.

AD-RPG performs well across partners in several Overcooked settings, especially in forced coordination, coordination ring, and counter circuit. The authors report that AD-RPG policies rarely collapse to near-zero reward when paired with other AD-RPG policies in cross-play, and that they maintain higher mean intra-population cross-play rewards than the baselines. In forced coordination, they also note an asymmetry: one player role must adapt to onions passed by the other, and AD-RPG in that role adapts well to multiple passing strategies.

This is the business-relevant result, but it should not be inflated. The direct evidence is still from simulated games. What the paper shows is that rational adversarial diversity can create training pressure that makes policies less brittle to partner variation. What Cognaptus infers is that similar machinery could improve partner-generalisation tests for agent fleets, robot teams, workflow agents, market bots, and game AI.

The Hanabi results are more cautious. High-entropy self-play performs surprisingly well in the simplified 3-colour and 4-colour versions. The authors speculate that this may be because entropy pushes policies toward similar strategies, or because the policies lack history-dependence, or because simplified Hanabi offers fewer opportunities for specialised conventions. This is not a failure of the paper. It is a useful boundary: RPG is not magic dust sprinkled over all coordination benchmarks. Sometimes the benchmark itself narrows the strategic space.

Rational adversarial examples are more useful than broken partners

The third claim is that RPG variants can find rational adversarial examples.

This is where the paper becomes operationally interesting. In robustness testing, the adversary should not merely make performance bad. It should make performance bad in a way that reveals a realistic weakness.

The unobserved STORM experiment makes this distinction visible. The authors compare victims trained by several algorithms against different attack types. A standard adversarial policy attack drives every victim to zero reward because it self-sabotages by collecting no coins. That is technically an attack. It is also useless. It says: “Your cooperative system fails when the partner refuses to cooperate.” Duly noted. We shall inform the board.

By contrast, AP-RPG and PAIRED-A-RPG search for weaknesses while avoiding sabotage. The table in the paper shows that PAIRED-RPG and AT-RPG victims retain higher reward under RPG-based attacks than their non-RPG counterparts. For example, PAIRED-RPG trains to 0.93 and scores 0.84 against PAIRED-A-RPG and 0.85 against AP-RPG; AT-RPG trains to 0.65 and scores 0.72 and 0.88 against the same two attacks, respectively. Non-RPG AT, AD, and standard AP collapse into zero-reward sabotage patterns.

The Overcooked adversarial example is even more intuitive. A self-play policy scores 240 with itself in cramped room. AP-RPG finds an adversarial partner with which it scores only 4.6. The discovered adversary does not simply block the task. It follows a different but rational movement convention: the victim assumes agents will move around each other clockwise, while the adversary moves counterclockwise. The result is a coordination failure.

That is the sort of failure a deployment team can use. It points to a brittle convention, not a tantrum.

How to read the paper’s evidence without over-reading it

The experiments serve different purposes. Treating them all as one undifferentiated “RPG wins” pile would miss the paper’s actual argument.

Evidence block	Likely purpose	What it supports	What it does not prove
Matrix-game examples	Mechanism isolation and robustness/sensitivity exploration	Naive adversarial objectives can create irrational policies; RPG can eliminate dominated sabotage behaviours	Performance in high-dimensional real-world systems
STORM diversity results	Main evidence for rational diversity	AD-RPG can create incompatible but still adaptive strategies	That all cooperative domains contain rich useful diversity
Modified STORM sabotage test	Comparison with prior anti-sabotage approaches	Observation-distribution fixes can fail when sabotage is incentive-driven	That CoMeDi fails in every possible environment
Overcooked cramped room diversity	Main evidence plus metric correction	Low cross-play in baselines can be fake diversity through obstruction	That high cross-play always means no diversity exists
Overcooked cross-play grids	Main evidence for robustness	AD-RPG can improve partner generalisation across layouts	Human-level coordination or real-world transfer
Simplified Hanabi	Boundary and benchmark check	Results depend on policy class and environment structure	Full Hanabi mastery or general zero-shot coordination
Appendix implementation details	Implementation detail	RPG is feasible with actor-critic, differentiable optimisers, Loaded DiCE, and partner-play	That it is cheap or stable by default
Appendix compute and limitations	Boundary condition	Higher-order gradients are expensive and high-variance	Production readiness at large scale

This reading also explains why the mechanism-first structure matters. The paper is not merely introducing a new MARL algorithm family. It is diagnosing a measurement failure: adversarial objectives can turn cooperative evaluation into a theatre of irrational behaviour. RPG is valuable because it restores interpretability to the stress test.

The business value is better diagnosis, not instant autonomy

For companies, the most practical lesson is not “use RPG tomorrow.” It is that robustness testing for multi-agent systems must distinguish destructive partners from rationally difficult partners.

That distinction matters across several emerging settings.

In robot fleets, a hard partner might take a different route, prefer a different handoff point, or reserve shared space differently. An irrational partner parks in the corridor and blocks everyone. Training against the second case may improve recovery behaviour at the margins, but it mostly teaches the obvious: blocked systems stop working.

In agentic workflow software, one agent may summarise a document differently, delay an intermediate output, or choose a different decomposition of the task. Those are rational incompatibilities. A self-sabotaging test agent simply refuses to call the required tool or emits unusable output. That may test guardrails, but it does not test coordination.

In market and negotiation systems, rational adversarial pressure can reveal brittle assumptions about timing, signalling, or strategic conventions. Irrational pressure merely creates pathological scenarios that no incentive-compatible counterparty would choose.

The useful business pathway looks like this:

1. Identify where autonomous agents must coordinate with other agents, humans, services, robots, or market participants.
2. Replace “does performance drop under adversarial pressure?” with “does performance drop under rational adversarial pressure?”
3. Use rational adversarial examples to diagnose brittle conventions.
4. Train against those examples only when they reveal actionable partner variation.
5. Keep irrational sabotage tests separate as resilience or abuse testing, not as evidence of strategic diversity.

This is a governance point as much as a modelling point. A robustness benchmark that rewards sabotage will quietly select for nonsense. The dashboard will look scientific. The learned behaviour will be the AI equivalent of lying on the floor during a fire drill.

Where the method is still expensive and unsettled

The paper is unusually clear about its own boundaries.

RPG relies on higher-order gradients through agents’ learning updates. That adds computational overhead. The authors report that adversarial diversity is already roughly twice as slow as simpler training because it doubles the number of agents, and AD-RPG adds another roughly threefold slowdown through manipulator agents. Experiments ran on single A4000 or A6000 GPUs with 32 CPU cores and 50 GB of RAM, with seeds taking from about a minute to 24 hours.

Variance is another issue. Estimating higher-order gradients from samples is noisy, and the authors report needing large batch sizes to stabilise learning. Their implementation uses actor-critic rather than PPO because clipping could interfere with higher-order gradients. The optimiser choice also matters because the base agent update must remain differentiable.

The most important conceptual limitation is that RPG encourages rationality by ensuring base agents train to maximise their utility, but it does not formally guarantee convergence to truly rational strategies. The paper frames this as future work. That is the right posture. The method is a substantial step toward rational adversarial optimisation, not a certificate of strategic sanity.

There is also a scale boundary. The evidence comes from matrix games, STORM, Overcooked, and simplified Hanabi. These are useful benchmarks because they make coordination failures observable. They are not equivalent to enterprise-scale multi-agent systems with long horizons, tool use, partial observability, human intervention, changing incentives, and delightful organisational mess.

So the business reading should be disciplined: RPG demonstrates a mechanism for making adversarial multi-agent training more meaningful. It does not prove that large deployed agent ecosystems can be robustified cheaply, automatically, or once-and-for-all.

The real lesson: stop rewarding fake failure

The paper’s deepest contribution is not that it adds another acronym to the already thriving acronym rainforest. It is that it gives multi-agent learning a cleaner adversarial question.

Not: can I make the other agent fail?

But: can I make the other agent fail while still behaving like an agent that wants something coherent?

That question matters because multi-agent systems will increasingly be evaluated under stress: unfamiliar partners, shifting conventions, competitive pressure, awkward handoffs, and adversarial counterparts. If the test adversary is allowed to self-sabotage, the evaluation becomes noisy. It confuses robustness failures with absurd behaviour. It rewards agents for finding failure modes that are dramatic but uninformative.

RPO and RPG offer a better frame. Keep adversaries rational. Let them be difficult, incompatible, strategically inconvenient, even unpleasant. Just do not let them burn down their own reward and call it insight.

For businesses building autonomous systems, that is the difference between a useful red team and a room full of agents proudly discovering that cooperation fails when everyone decides to be useless.

Progress, apparently, begins by asking the adversary not to be an idiot. A modest request. Historically ambitious.

Cognaptus: Automate the Present, Incubate the Future.

¹ Niklas Lauffer, Ameesh Shah, Micah Carroll, Sanjit A. Seshia, Stuart Russell, and Michael Dennis, “Robust and Diverse Multi-Agent Learning via Rational Policy Gradient,” arXiv:2511.09535, 2025, https://arxiv.org/abs/2511.09535.
