Opening — Why this matters now

Multi-agent systems are quietly becoming the backbone of modern automation: warehouse fleets, financial trading bots, supply-chain optimizers, and—if you believe the more excitable research labs—proto-agentic AI organizations. Yet there’s a peculiar, recurring problem: when you ask agents to improve by playing against each other, they sometimes discover that the fastest route to “winning” is to make sure nobody wins.

Self-sabotage. The AI version of pulling the fire alarm so you don’t have to take the exam.

This failure mode has haunted cooperative reinforcement learning for years. It has made adversarial training unreliable, diversity training brittle, and robustness checks reliant on simulated worlds that no sane agent would inhabit.

A recent NeurIPS 2025 paper introduces a surprisingly elegant fix: Rationality-Preserving Policy Optimization (RPO) and its practical implementation, Rational Policy Gradient (RPG). The idea is simple but powerful: if you force adversarial agents to remain rational, the whole system stops collapsing into dysfunctional equilibria.

For organizations depending on autonomous decision systems, this is not a niche research upgrade. It is the difference between deploying models that adapt constructively—and models that learn to torch the environment when pressured.

Background — Cooperative games aren’t zero-sum, but our algorithms still treat them that way

Zero-sum self-play (e.g., AlphaGo) works because incentives align with competition. When one policy improves by beating another, the “losing” policy learns to counterattack. The dance escalates; performance improves.

But cooperative or general-sum environments behave differently. If two agents share rewards but one is trained adversarially, the quickest way to minimize your partner’s score is to tank the task entirely. Existing adversarial optimization approaches made things worse:

  • Adversarial Training: agents learn to destroy shared reward.
  • Adversarial Diversity: populations learn superficially different but equally useless sabotage strategies.
  • PAIRED-like methods: adversaries design “hard environments” by making them unplayable.

The core pathology: agents stop acting as if rewards matter. In cooperative settings, adversarial incentives can push them into irrational behavior.
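To make that incentive problem concrete, consider a toy common-payoff matrix game (a hypothetical illustration, not one of the paper’s benchmarks): both players earn the shared reward only when both cooperate, so an unconstrained adversary “minimizes the partner’s score” simply by refusing to play.

```python
import numpy as np

# Toy 2x2 common-payoff game (hypothetical, for illustration only).
# Actions: 0 = cooperate, 1 = idle. Both players receive the same reward.
shared_reward = np.array([[1.0, 0.0],
                          [0.0, 0.0]])  # shared_reward[partner_action, co_player_action]

partner_action = 0  # the partner earnestly cooperates

# A rational co-player maximizes the shared reward...
rational_action = int(np.argmax(shared_reward[partner_action]))     # -> 0 (cooperate)

# ...but an unconstrained adversary minimizes it, which here just means tanking the task.
adversarial_action = int(np.argmin(shared_reward[partner_action]))  # -> 1 (idle/sabotage)

print(rational_action, adversarial_action)  # 0 1: the adversary "wins" by making everyone lose
```

No realistic teammate plays like this adversary, which is exactly why such adversaries produce no useful robustness signal.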

RPO reframes the optimization: adversarial agents are allowed to optimize their objective only if their resulting policy could be a best response to some plausible partner. No imaginary psychopaths allowed.

Analysis — What the paper contributes

The authors introduce two key pieces:

1. Rationality-Preserving Policy Optimization (RPO)

A constraint added to any multi-agent optimization objective:

An agent’s policy must be optimal for some possible co-player policy.

This prevents irrational best responses whose only purpose is sabotage.
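In standard multi-agent notation, the constraint can be sketched roughly as follows (a schematic of its shape, not the paper’s exact statement; J_adv, R_i, and Pi_{-i} are assumed symbols for the adversarial objective, agent i’s own reward, and the set of co-player policies):

```latex
% Schematic of an RPO-style constraint (assumed notation, not the paper's exact formulation):
% agent i may optimize any auxiliary objective J_adv, provided its policy remains a best
% response to *some* co-player policy pi_{-i}.
\max_{\pi_i} \; J_{\mathrm{adv}}(\pi_i)
\quad \text{s.t.} \quad
\exists \, \pi_{-i} \in \Pi_{-i} : \;
\pi_i \in \arg\max_{\pi} \; \mathbb{E}\!\left[ R_i(\pi, \pi_{-i}) \right]
```

Read literally: the adversary is free to be inconvenient, but never free to be irrational.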

2. Rational Policy Gradient (RPG)

A practical algorithm that operationalizes the constraint. RPG trains two kinds of agents:

  • Base agents: learn normally, maximizing their own reward.
  • Manipulator agents: guide base agents toward adversarial or diverse outcomes without pushing them off the cliff of irrationality.

The manipulators use opponent-shaping techniques to influence future learning steps rather than immediate actions.
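Here is a heavily simplified sketch of that two-role structure, assuming PyTorch and toy one-parameter policies (the names, objectives, and update rules are illustrative assumptions, not the paper’s implementation): the base agent takes an ordinary gradient step on its own reward, and the manipulator backpropagates through that step to shape where the learning lands, never forcing the base agent to act against its own interest.

```python
import torch

# Illustrative sketch of the base-agent / manipulator split (assumed names and
# objectives, not the paper's implementation).

def base_pg_loss(base_param, manip_param):
    """Stand-in for the base agent's own policy-gradient loss.

    In a real system the returns come from rollouts in which the manipulator also
    acts, which is how manip_param enters the base agent's computation graph.
    """
    shared_return = -(base_param - manip_param) ** 2  # toy shared-payoff term
    return -shared_return.mean()                      # base agent maximizes its own return

base_param = torch.tensor([0.0], requires_grad=True)
manip_param = torch.tensor([1.0], requires_grad=True)
lr = 0.1

# 1) The base agent's usual self-interested gradient step, kept differentiable
#    so the manipulator can later differentiate through it.
loss = base_pg_loss(base_param, manip_param)
(grad_base,) = torch.autograd.grad(loss, base_param, create_graph=True)
base_after_update = base_param - lr * grad_base

# 2) The manipulator scores its own (adversarial or diversity) objective on the
#    *post-update* base agent and backpropagates through that learning step
#    (opponent shaping), rather than overriding the base agent's actions.
manip_objective = (base_after_update - 0.5) ** 2  # toy manipulator objective
manip_objective.sum().backward()
manip_param.data -= lr * manip_param.grad         # manipulator's shaping update
```

The key design point the sketch preserves: the base agent only ever descends its own loss, so whatever the manipulator achieves, it achieves through a partner that stays rational.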

In essence, RPG trains adversarial agents that still behave like they care about the game. This unlocks a suite of previously unstable objectives:

  • finding rational adversarial examples (no sabotage)
  • learning robust cooperative strategies
  • generating meaningfully diverse policies

Findings — What actually improves

Across Overcooked, STORM, matrix games, and simplified Hanabi variants, RPG consistently outperforms its unconstrained counterparts.

Below is a distilled comparison.

Table 1 — What happens without rationality constraints

| Method | Behavior in Cooperative Settings | Outcome |
| --- | --- | --- |
| Self-Play | Learns brittle conventions | Poor cross-play robustness |
| Adversarial Training | Learns to sabotage | No useful robustness signals |
| Adversarial Diversity | Learns fake diversity via sabotage | Collapse in cross-play |
| PAIRED | Generates impossible environments | No usable adversarial curriculum |

Table 2 — RPG-powered variants (RPO applied)

| RPG Variant | What It Achieves | Why It Matters |
| --- | --- | --- |
| AP-RPG | Finds rational adversarial examples | Models can be stress-tested without pathological agents |
| AT-RPG | Robustifies cooperative agents | Safer multi-agent deployment |
| PAIRED-RPG | Builds useful regret-maximizing curricula | Enables training for edge cases |
| AD-RPG | True strategy diversity without sabotage | Generates meaningful behavioral variety |

Visualization — Self-play vs Cross-play behavior

A typical pattern emerges:

  • Baselines: cross-play scores collapse to zero, indicating dysfunction.
  • RPG methods: cross-play stays healthy while still producing strategic variety.

This is not just technical cleanliness. It signals that RPG-trained systems generalize across unseen teammates—critical for real-world multi-agent deployments.
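Cross-play is also straightforward to measure: train several policies independently, then score every pairing of a policy from one run with a policy from another. Below is a minimal sketch of that evaluation matrix, assuming a hypothetical evaluate_pair rollout hook.

```python
import itertools
import numpy as np

def evaluate_pair(policy_a, policy_b, episodes=32):
    """Assumed hook: roll the two policies out together in the cooperative
    environment and return their mean shared reward. Placeholder value here;
    plug in real rollouts for an actual study."""
    return 0.0

def cross_play_matrix(policies):
    """Score every ordered pairing of independently trained policies.

    Diagonal entries are self-play scores; off-diagonal entries show whether
    learned conventions transfer to unfamiliar partners (the entries that
    collapse toward zero for the unconstrained baselines)."""
    n = len(policies)
    scores = np.zeros((n, n))
    for i, j in itertools.product(range(n), repeat=2):
        scores[i, j] = evaluate_pair(policies[i], policies[j])
    return scores

# Usage sketch:
# scores = cross_play_matrix(trained_policies)
# off_diag = scores[~np.eye(len(scores), dtype=bool)]
# print("self-play:", scores.diagonal().mean(), "cross-play:", off_diag.mean())
```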

Implications — Why business leaders should care

Multi-agent autonomy is already creeping into:

  • logistics (robot fleets, autonomous warehouse coordination)
  • marketplaces (bidding agents, pricing agents)
  • energy grids (distributed consumption/production agents)
  • financial systems (algorithmic trading, automated market makers)
  • gaming and consumer AI experiences

All these domains share dependencies on cooperative but decentralized AI systems.

RPG makes three practical outcomes achievable:

1. Robust testing without pathological behaviors

Stress-testing agents requires adversaries that break your system in realistic ways, not by throwing tantrums.

2. Interoperable autonomous agents

RPG-trained agents maintain high reward even when paired with unfamiliar partners. This is crucial for markets, supply chains, and open multi-agent ecosystems.

3. Safer auto-curricula

Many MARL systems suffer from curriculum collapse—agents get “stuck” in dull, local solutions. RPG keeps the search space lively without pushing agents into destructive loops.

In short: if your systems involve multiple learning agents, RPG formalizes the difference between “productive conflict” and “mutually assured destruction.”

Conclusion — Rationality as a competitive advantage

The paper’s insight is subtle but transformative: adversarial optimization is not the enemy. Irrational adversarial optimization is. By constraining adversarial objectives through rationality, RPG reopens a massive design space in multi-agent learning — previously abandoned as too unstable.

Businesses building agentic systems—especially for automation, markets, and distributed decision-making—should pay attention. Rationality constraints are becoming the new safety rails.

Cognaptus: Automate the Present, Incubate the Future.