Opening — Why this matters now

Multi-agent systems are finally leaving the toy world.

Autonomous traders negotiate with other bots. Supply-chain agents coordinate across firms. AI copilots increasingly share environments with other AI copilots. And yet, most multi-agent reinforcement learning (MARL) systems are still stuck with a primitive handicap: agents cannot meaningfully understand what other agents are doing.

They guess. They infer. They overfit to yesterday’s behavior.

The paper Policy-Conditioned Policies for Multi-Agent Task Solving argues that this limitation is not accidental — it is structural. And then it does something unfashionable in deep learning: it changes the representation.

Instead of learning policies as opaque neural networks, it makes policies readable. Literally.

Background — Context and prior art

The MARL non-stationarity trap

In a Markov game, the environment changes whenever another agent updates its policy. From the perspective of any single learner, the world is never stationary. Classical RL guarantees quietly evaporate.

The standard workaround is opponent modeling: infer the other agent’s policy from interaction history. This is fragile for two reasons:

| Problem | Why it hurts |
| --- | --- |
| Policy drift | You’re learning from a moving target |
| Recursive beliefs | I model you modeling me modeling you |

Even with centralized critics or belief modules, the agent never truly knows what the opponent is optimizing.
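
To make the non-stationarity concrete, here is a minimal sketch (illustrative payoffs, not from the paper): the ego agent's effective reward for each action depends entirely on the opponent's current policy, so every opponent update silently rewrites the learning problem.

```python
import numpy as np

# Ego agent's payoff matrix R[a, b] in a 2x2 matrix game.
# Numbers are invented for illustration, not taken from the paper.
R = np.array([[3.0, 0.0],
              [1.0, 2.0]])

def induced_rewards(opponent_policy: np.ndarray) -> np.ndarray:
    """Expected reward of each ego action once the opponent is marginalized out.
    This vector is the 'environment' the ego agent actually learns against."""
    return R @ opponent_policy

# Same game, two different opponent policies -> two different learning problems,
# and even a different best response.
print(induced_rewards(np.array([0.9, 0.1])))  # opponent mostly plays column 0
print(induced_rewards(np.array([0.1, 0.9])))  # after the opponent updates
```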

Program Equilibrium: brilliant, useless (until now)

Game theorists solved this decades ago — at least on paper.

In Program Equilibrium, agents submit programs instead of strategies. Each program can read the source code of the other and condition its behavior accordingly. This enables cooperation in one-shot games that Nash equilibrium forbids.

The catch? Classical program equilibrium requires formal proofs, exact code matching, and provability logic. In practice: undecidable, brittle, and entirely disconnected from modern learning systems.
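
For intuition, here is the classical exact-code-matching construction in miniature (a toy sketch of the game-theoretic idea, not the paper's method): a one-shot Prisoner's Dilemma player that cooperates only if its opponent's source code is identical to its own.

```python
import inspect

def mirror_bot(opponent_source: str) -> str:
    """Cooperate ('C') only if the opponent is literally the same program,
    otherwise defect ('D')."""
    my_source = inspect.getsource(mirror_bot)
    return "C" if opponent_source == my_source else "D"

# Two copies submitted against each other cooperate in a one-shot Prisoner's
# Dilemma, even though mutual defection is the only Nash equilibrium in
# ordinary strategies.
src = inspect.getsource(mirror_bot)
print(mirror_bot(src), mirror_bot(src))  # -> C C
```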

Until LLMs showed up.

Analysis — What the paper actually does

The representational bottleneck

Deep RL policies are terrible communication objects:

  • Millions of parameters
  • Non-unique representations
  • No semantic structure

Conditioning one neural policy on another is not just expensive — it is meaningless.

The paper’s core move is to replace neural policies with executable source code. Suddenly:

  • Policies are compact
  • Behavior is explicit
  • Semantics are inspectable

LLMs, trained on code and natural language, become approximate interpreters of these policies.
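
Here is roughly what that representation buys you. The policy below is invented for illustration (the observation keys and actions are not the paper's), but it shows the point: the entire behavior fits in a few readable lines.

```python
def ego_policy(observation: dict) -> str:
    """A policy as source code: compact, explicit, and inspectable.
    Observation keys and action names are illustrative placeholders."""
    if observation.get("partner_adjacent", False):
        return "load_together"   # commit to the cooperative joint action
    if observation.get("fruit_visible", False):
        return "move_to_fruit"
    return "explore"

# Another agent, an LLM, or a human auditor can read these lines and know
# exactly what this agent will do; that is not true of a tensor of a few
# million float32 parameters.
```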

LLMs as best-response operators

Instead of learning a policy directly, the agent learns a mapping:

Opponent policy code → My policy code

Formally, this is a best-response operator implemented by an LLM prompt. Given the opponent’s code, the model generates a responding policy — also as code.

Optimization happens not in parameter space, but in prompt space.
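
A minimal sketch of what that operator could look like in practice is below. The prompt wording and the `llm` callable are assumptions for illustration, not the paper's actual implementation.

```python
# Best-response operator: opponent policy code in, ego policy code out,
# via a single LLM call.

BEST_RESPONSE_PROMPT = """You are playing a two-player game with this payoff structure:
{payoffs}

Your opponent has committed to the following policy (Python source code):
{opponent_code}

Write a Python function `ego_policy(observation)` that maximizes your return
against this exact opponent. Return only the code."""

def best_response(llm, payoffs: str, opponent_code: str) -> str:
    """Map opponent policy code -> ego policy code."""
    prompt = BEST_RESPONSE_PROMPT.format(payoffs=payoffs,
                                         opponent_code=opponent_code)
    return llm(prompt)  # the returned string is itself executable policy code
```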

Programmatic Iterated Best Response (PIBR)

PIBR is the operational core of the paper. Conceptually:

  1. Fix opponent policy code

  2. Use an LLM to generate ego-agent policy code

  3. Execute it

  4. Score it with:

    • Utility feedback (game rewards)
    • Unit-test feedback (does the code actually run?)

  5. Refine the prompt via textual gradients

  6. Alternate agents

This is fictitious play — but lifted into code space.
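
Put together, the loop looks something like the sketch below. The function names (`generate_policy`, `run_episode`, `unit_tests`, `refine_prompt`) are placeholders for the LLM call, the game simulator, the code checks, and the textual-gradient step; this is a schematic reading of the paper, not its code.

```python
def pibr(initial_codes, n_iterations,
         generate_policy, run_episode, unit_tests, refine_prompt):
    """Schematic Programmatic Iterated Best Response for two agents."""
    codes = dict(initial_codes)               # {"agent_0": code, "agent_1": code}
    prompts = {agent: "" for agent in codes}  # evolving prompt per agent
    history = []                              # (policy profile, mean return)

    for _ in range(n_iterations):
        for ego, opponent in (("agent_0", "agent_1"), ("agent_1", "agent_0")):
            # 1-2. Fix the opponent's code; ask the LLM for ego policy code.
            candidate = generate_policy(prompts[ego], codes[opponent])
            # 3-4. Execute it and score it with utility + unit-test feedback.
            utility = run_episode(candidate, codes[opponent])
            test_report = unit_tests(candidate)
            # 5. Refine the prompt from the textual feedback ("textual gradient").
            prompts[ego] = refine_prompt(prompts[ego], utility, test_report)
            codes[ego] = candidate
        # 6. Agents have alternated; log the current joint profile.
        history.append((dict(codes),
                        run_episode(codes["agent_0"], codes["agent_1"])))

    return codes, history
```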

| Classical MARL | PIBR |
| --- | --- |
| Gradient on parameters | Gradient on text |
| Black-box policies | Interpretable programs |
| Implicit beliefs | Explicit conditioning |

Findings — What the results show

Coordination games: instant alignment

Across three matrix games (Vanilla Coordination, Climbing, Penalty), PIBR converges almost immediately to the globally optimal equilibrium.

Notably, even in games with severe miscoordination penalties, agents avoid bad equilibria after a single update.

Interpretation: once agents can read commitments, coordination is trivial.
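
The mechanism is easy to see in miniature. In the toy snippet below (payoffs invented for illustration, not the paper's Climbing or Penalty matrices), reading the opponent's committed action reduces coordination to a single argmax.

```python
import numpy as np

# Ego payoffs: rows are ego actions, columns are the opponent's actions.
ego_payoff = np.array([[10.0, -20.0],
                       [-20.0,  5.0]])

def best_response_to_commitment(opponent_code: str) -> int:
    """If the opponent's code states its action outright, the ego agent's
    problem collapses to one argmax over its own payoff column."""
    committed = 0 if "return 0" in opponent_code else 1   # crude stand-in for
    return int(np.argmax(ego_payoff[:, committed]))       # the LLM's 'reading'

opponent_code = "def opponent_policy(obs):\n    return 0  # always action 0\n"
print(best_response_to_commitment(opponent_code))  # -> 0: the 10-payoff cell
```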

Cooperative foraging: where cracks appear

In the Level-Based Foraging gridworld, PIBR still finds high-performing strategies — but stability degrades.

The mean performance fluctuates, and later iterations sometimes regress. The algorithm ultimately selects the best historical policy profile, not the last one.

This reveals an uncomfortable truth: once policies become expressive programs, optimization becomes powerful — and volatile.
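
That is why the final selection step matters. A minimal version of it (with invented scores) looks like this:

```python
# Keep every evaluated policy profile and return the best one, rather than
# trusting the final iteration. Scores here are invented for illustration.
history = [
    ({"agent_0": "code_v1", "agent_1": "code_v1"}, 0.62),
    ({"agent_0": "code_v2", "agent_1": "code_v2"}, 0.81),
    ({"agent_0": "code_v3", "agent_1": "code_v3"}, 0.74),  # a later regression
]

best_profile, best_return = max(history, key=lambda item: item[1])
print(best_return)  # -> 0.81: the second iterate wins, not the last
```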

Implications — Why this matters beyond games

For AI systems

  • This is a concrete bridge between agentic LLMs and game-theoretic learning
  • It replaces belief inference with policy inspection
  • It reframes learning as meta-programming

For governance and safety

Readable policies are auditable policies.

PIBR-style systems make commitments explicit, testable, and debuggable — properties regulators actually understand.

For business automation

Multi-agent workflows — pricing bots, negotiation agents, supply optimizers — benefit less from raw intelligence and more from predictable coordination.

Programmatic policies offer exactly that.

Conclusion — What to take away

This paper is not about making LLMs play games better.

It is about dismantling a silent assumption in multi-agent learning: that policies must be opaque. Once that assumption falls, entire classes of coordination problems become easy — almost embarrassingly so.

The result is not a new algorithm.

It is a new language for multi-agent systems.

Cognaptus: Automate the Present, Incubate the Future.