When Policies Read Each Other: Teaching Agents to Cooperate by Reading the Code

A workflow breaks in a familiar way.

The planning agent assumes the procurement agent will wait. The procurement agent assumes the planning agent has already revised the forecast. The compliance agent flags the output after both have acted. Everyone had access to the same dashboard. Nobody had access to the thing that actually mattered: the other agent’s decision policy.

This is the coordination problem hiding behind a lot of cheerful talk about “agentic AI.” When several autonomous systems act in the same environment, performance does not depend only on whether each agent is individually competent. It depends on whether each agent can adapt to how the others are deciding. In conventional multi-agent reinforcement learning, that is usually handled through history: observe what others did, infer what they might do next, and hope the inference remains useful after they update themselves. Very elegant. Also, frequently, a mess.

The paper behind today’s article, Policy-Conditioned Policies for Multi-Agent Task Solving, proposes a sharper alternative: instead of forcing agents to infer each other’s policies from behavior, represent those policies as executable source code and let agents condition directly on the opponent’s code.¹ The authors call their method Programmatic Iterated Best Response, or PIBR. The useful idea is not that LLM agents suddenly become wise diplomats. They do not. The useful idea is that policy transparency changes the coordination problem from guessing behavior to reading and revising executable logic.

That distinction matters. Business teams do not need another slogan about autonomous agents “collaborating.” They need a way to make automated decisions inspectable, testable, and mutually intelligible before those decisions hit production. Code-readable policies are one route toward that. Not a complete route. But a route worth understanding.

The hard part is not choosing an action; it is choosing against another chooser

Single-agent learning has a clean story. An agent acts in an environment, observes rewards, and improves its policy. The environment may be complex, but at least it does not deliberately rewrite itself because another learner changed its strategy.

Multi-agent learning removes that comfort. From one agent’s viewpoint, the environment is partly made of other agents’ policies. If those policies change, the effective environment changes too. A strategy that worked yesterday may stop working today, not because the warehouse moved, the customer changed, or the regulation shifted, but because another agent learned something.

The conventional response is opponent modeling. An agent watches interaction history and estimates what the other agent is likely to do. The paper correctly treats this as both necessary and fragile. The history is sampled from moving targets. The agents are learning at the same time. Worse, the recursion does not stop at “what will the other agent do?” It quickly becomes “what does the other agent think I will do after I think it will do that?” At this point, the elegant diagram has become a small philosophical accident.

The paper’s move is to ask a simple question: what if the policy did not need to be inferred?

If the other agent’s policy were directly visible, the best-response problem would change. The ego agent would not need to reconstruct hidden logic from past behavior. It could condition on the actual policy. This idea has a game-theoretic ancestor in program equilibrium, where agents submit programs that can inspect each other’s source code before acting. The old theoretical version is powerful but brittle: exact program reasoning, formal proofs, and self-reference are not exactly what most companies want near their invoice automation system before lunch.
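To make the ancestor concrete, here is a minimal sketch of the textbook program-equilibrium trick in a prisoner’s dilemma: a program that cooperates only when the opponent’s source is an exact copy of its own. Everything here is illustrative. The names `CLIQUE_BOT`, `DEFECT_BOT`, `load`, and `play`, and the payoff values, are assumptions made for this example and do not come from the paper.

```python
# Hypothetical illustration: programs are plain source strings, and each
# program receives its own source and the opponent's source before acting.

CLIQUE_BOT = '''
def act(my_source, opponent_source):
    # Cooperate only with an exact copy of myself; otherwise defect.
    return "C" if opponent_source == my_source else "D"
'''

DEFECT_BOT = '''
def act(my_source, opponent_source):
    # Always defect, whatever the opponent's code says.
    return "D"
'''

# Prisoner's dilemma payoffs (player A, player B); values are illustrative.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def load(source):
    """Execute a program's source and return its act function."""
    namespace = {}
    exec(source, namespace)
    return namespace["act"]


def play(source_a, source_b):
    """Let each program inspect both sources, then look up the payoffs."""
    action_a = load(source_a)(source_a, source_b)
    action_b = load(source_b)(source_b, source_a)
    return PAYOFFS[(action_a, action_b)]


print(play(CLIQUE_BOT, CLIQUE_BOT))   # exact copies: both cooperate
print(play(CLIQUE_BOT, DEFECT_BOT))   # different source: both defect
# A functionally identical copy with one extra comment no longer matches.
print(play(CLIQUE_BOT, CLIQUE_BOT + "\n# harmless comment\n"))
```

The last call is the brittleness in miniature: a harmless comment changes the string, the equality check fails, and cooperation collapses. That is the gap an approximate, semantics-aware reader is meant to fill.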

The authors’ practical relaxation is to use LLMs as approximate interpreters of source code. Instead of requiring formal proof that another program will cooperate, an LLM reads the opponent’s policy code, generates a best-response policy code, runs it, receives feedback, and revises.

The mechanism is the real subject of this article. The benchmark results are important, but secondary.

Neural policies are hard to read because weights are not policies in human form

The paper’s central technical diagnosis is the representational bottleneck of deep reinforcement learning.

A neural policy is stored as a set of parameters. In principle, those parameters define behavior. In practice, they are a terrible communication object. A million-weight vector is not a policy explanation. It is a compressed artifact of training. Two networks can behave similarly while having very different parameters. Permuting the hidden units within a layer can preserve the function exactly while destroying any naive parameter-level comparison. Asking one agent to parse another agent’s full neural weights is not “transparency.” It is asking a spreadsheet auditor to inspect a sandstorm.

Source code has different properties. It is structured. It is semantically dense. It can contain names, branches, comments, helper functions, and decision rules. More importantly, it belongs to the data modality where LLMs are already unusually capable: text and code.

The paper therefore shifts the representation of a policy from an opaque parameter vector to an executable program. The opponent’s policy is not merely observed through behavior; it is passed as source code into an LLM prompt. The LLM then generates the ego agent’s source code as a response.

That sounds like a small implementation detail. It is not. It changes the object being optimized.

Traditional reinforcement learning searches over policy parameters. PIBR searches over a code-generating operator: given the opponent’s policy code, produce a better ego policy code. In the authors’ formal language, the LLM acts as a point-wise best-response operator. “Point-wise” matters because the method is not learning a universal perfect best-response function for every possible opponent. It is optimizing a response to the policy currently in front of it.
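A minimal sketch of what “code in, code out, for this one opponent” could look like follows. The LLM is replaced by a hypothetical stub, `stub_best_response_operator`, which simply executes the opponent’s code to obtain its action distribution; a real operator would read the source text instead. The payoff matrix, the `policy(observation)` interface, and all names are assumptions for illustration, not the paper’s implementation.

```python
# Hypothetical opponent policy, stored as source code rather than weights.
OPPONENT_CODE = '''
def policy(observation):
    # Commit to action 0 with probability 1.0.
    return [1.0, 0.0, 0.0]
'''

# Shared payoff matrix for a 3x3 coordination game (illustrative values,
# not taken from the paper). PAYOFF[ego_action][opponent_action].
PAYOFF = [
    [3, 0, 0],
    [0, 2, 0],
    [0, 0, 1],
]


def load_policy(source):
    """Execute policy source code and return its policy function."""
    namespace = {}
    exec(source, namespace)
    return namespace["policy"]


def stub_best_response_operator(opponent_code):
    """Stand-in for the LLM: map ONE opponent's code to ego policy code.

    Point-wise means the output is tailored to this opponent only; nothing
    here generalizes to opponents that were never passed in.
    """
    opponent_probs = load_policy(opponent_code)(None)
    expected = [
        sum(p * PAYOFF[a][b] for b, p in enumerate(opponent_probs))
        for a in range(3)
    ]
    best = max(range(3), key=lambda a: expected[a])
    probs = [1.0 if a == best else 0.0 for a in range(3)]
    # The result is source code, so the opponent can read it in turn.
    return (
        "def policy(observation):\n"
        f"    # Opponent commits to {opponent_probs}; respond with action {best}.\n"
        f"    return {probs}\n"
    )


ego_code = stub_best_response_operator(OPPONENT_CODE)
print(ego_code)
print(load_policy(ego_code)(None))
```

The design point is the return type. The operator hands back a program, not an action, so the next agent in the loop has something to condition on.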

For a business reader, the translation is straightforward:

Conventional agent coordination	Policy-conditioned programmatic coordination
Agents infer other agents from logs and interaction history.	Agents read the other agents’ executable decision logic.
The policy representation is often opaque model weights.	The policy representation is structured source code.
Debugging depends heavily on behavior traces after failure.	Debugging can inspect both behavior and decision rules before or during execution.
Adaptation is mainly statistical.	Adaptation is partly semantic and testable.

This is not automatically better in every setting. Code can be wrong, misleading, incomplete, or too complex. LLM interpretation can fail. But the representation is at least something a machine and a human can inspect. That alone is operationally meaningful.

PIBR turns “read the other policy” into an iterative code-repair loop

The method has two nested loops.

The outer loop alternates between agents. One agent’s policy is held fixed while the other agent updates its policy as a best response. Then the process switches. This resembles classical iterated best response dynamics, except the best response is not computed by a closed-form solver or a neural gradient update.

The inner loop is where the paper’s LLM machinery appears. Given the opponent’s policy code, the LLM generates the ego agent’s policy code. That generated code is then evaluated through two feedback channels, and a third mechanism guards against late regressions:

Mechanism	Likely purpose in the experiment	What it supports	What it does not prove
Runtime and unit-test feedback	Implementation detail and reliability mechanism	Generated policies can be corrected when they fail syntactically or crash during execution.	It does not prove strategic quality; working code can still be bad policy.
Utility feedback from game trajectories	Main optimization signal	The LLM receives evidence about whether the generated policy improves social welfare against the current opponent policy.	It does not guarantee convergence in more complex or adversarial environments.
Returning the best historical policy profile	Stabilization mechanism	If later iterations deteriorate, the algorithm can select the best observed profile rather than the final one.	It does not mean the learning trajectory itself is stable.

This is the most practically interesting part of the paper. PIBR does not rely only on an LLM’s first-shot reasoning. It wraps the LLM in a feedback loop. The generated policy must run. It must survive tests. It must produce utility in the environment. The LLM then receives structured criticism and revises the policy code.

In enterprise terms, the method looks less like “let agents talk to each other” and more like “turn agent policy generation into a sandboxed code-review cycle.” That is a more serious idea. It is also less glamorous, which usually means it has a better chance of being useful.

A simple diagram captures the mechanism:

Opponent policy code
        ↓
LLM as best-response operator
        ↓
Generated ego policy code
        ↓
Sandbox execution + unit tests
        ↓
Trajectory utility feedback
        ↓
Revised code policy
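
The same cycle can be sketched in a few lines, under heavy assumptions. The LLM is replaced by `stub_llm`, which returns canned drafts in order; the “sandbox” is a bare `exec`, which is not a real sandbox; and the payoff matrix is illustrative. None of these names or values come from the paper.

```python
# Canned drafts standing in for successive LLM generations (hypothetical).
DRAFTS = [
    "def policy(obs):\n    return undefined_name\n",  # crashes at runtime
    "def policy(obs):\n    return 2\n",               # runs, but miscoordinates
    "def policy(obs):\n    return 0\n",               # matches the opponent
]

OPPONENT_CODE = "def policy(obs):\n    return 0\n"

# Shared reward, PAYOFF[ego_action][opponent_action]; illustrative values.
PAYOFF = [[3, 0, 0], [0, 2, 0], [0, 0, 1]]


def stub_llm(opponent_code, feedback, step):
    """Stand-in for the LLM call; a real operator would use both inputs."""
    return DRAFTS[min(step, len(DRAFTS) - 1)]


def run_policy(source):
    """Run a policy once and check its output against the action interface."""
    namespace = {}
    exec(source, namespace)  # NOT a real sandbox; illustration only
    action = namespace["policy"](None)
    assert action in (0, 1, 2), f"invalid action {action!r}"
    return action


def inner_loop(opponent_code, max_steps=3):
    """Generate, execute, score, and revise; keep the best draft seen."""
    feedback = ""
    best = (float("-inf"), None)
    log = []
    opponent_action = run_policy(opponent_code)
    for step in range(max_steps):
        code = stub_llm(opponent_code, feedback, step)
        try:
            action = run_policy(code)
        except Exception as err:  # runtime / unit-test channel
            feedback = f"runtime error: {err!r}"
            log.append(feedback)
            continue
        welfare = 2 * PAYOFF[action][opponent_action]  # utility channel
        feedback = f"social welfare: {welfare}"
        log.append(feedback)
        if welfare > best[0]:
            best = (welfare, code)
    return best, log


best, log = inner_loop(OPPONENT_CODE)
for line in log:
    print(line)
print(best[0])
```

The first draft fails the runtime channel, the second passes it but scores zero on the utility channel, and the third scores well. Working code and good policy are separate hurdles, which is why the loop needs both signals.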

The key is that the opponent policy is not hidden. The method assumes a setting where policy disclosure is allowed. That assumption is not a footnote; it defines the use case.

The matrix games show clean coordination, but they are not the whole story

The experiments begin with three 3×3 coordination matrix games.

The first is a vanilla coordination game with three coordination equilibria. The best joint outcome comes from both agents selecting the highest-payoff matching action. The second is the Climbing Game, where miscoordination can be severely punished, making naive exploration risky. The third is the Penalty Game with $p=-2$, where some diagonal coordination choices are themselves bad, so the agents must avoid detrimental equilibria rather than merely coordinate anywhere.

These tests are controlled and useful. They most likely serve as the paper’s main evidence under a simple coordination structure: they ask whether code-conditioned best responses can identify the globally desirable coordination point when the policy space is small enough to interpret.

The reported result is strong: PIBR rapidly reaches the global optimum across all three. In the vanilla coordination game, social welfare stabilizes at 6.0. In the Climbing Game, it stabilizes at 22.0. In the Penalty Game with $p=-2$, the agents coordinate on the optimal equilibrium within a single update step, reaching social welfare 20.0.

Those numbers should be read carefully. In shared-payoff matrix games, social welfare is the sum of both agents’ rewards. So social welfare 22.0 in the Climbing Game corresponds to both agents coordinating on the payoff-11 action. This is clean evidence that the method can read, generate, and align simple policies in transparent cooperative games.
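The arithmetic can be checked directly. The sketch below uses the standard textbook forms of the Climbing Game and the Penalty Game from the cooperative-learning literature; the paper’s exact matrices may differ, so treat the entries as an assumption. The helper name `best_joint_action` is made up for this example.

```python
# Standard literature versions of the two games (assumed, not copied from
# the paper). Both agents receive the same payoff in every cell.
CLIMBING = [
    [11, -30, 0],
    [-30, 7, 6],
    [0, 0, 5],
]


def penalty_game(p):
    """Penalty Game: p <= 0 punishes the two mismatched corner cells."""
    return [
        [10, 0, p],
        [0, 2, 0],
        [p, 0, 10],
    ]


def best_joint_action(payoff):
    """Return one welfare-maximizing cell and its social welfare.

    Social welfare is the sum of both agents' rewards, so with a shared
    payoff it is simply twice the cell value.
    """
    cells = [(payoff[a][b], a, b) for a in range(3) for b in range(3)]
    reward, a, b = max(cells)
    return (a, b), 2 * reward


print(best_joint_action(CLIMBING))          # cell (0, 0), welfare 22
print(best_joint_action(penalty_game(-2)))  # welfare 20
```

In the Penalty Game two corner cells tie for the optimum, so the helper returns one of them; the welfare figure is the same either way.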

But matrix games are also forgiving in a particular way. The world is tiny. The action space is discrete and small. The policy logic can collapse into a simple commitment: “the opponent plays action 0, so I play action 0.” The appendix’s generated Climbing Game policies illustrate exactly this. Both final policies choose action 0 with probability 1.0 and include comments describing the opponent’s commitment.

That is useful, but it is not yet a demonstration of complex operational autonomy. It is a demonstration that policy-readable code helps solve coordination games where the correct response can be expressed compactly. The champagne remains in the fridge.

The foraging task is the real stress test, and it is unstable

The more revealing experiment is the modified cooperative Level-Based Foraging task.

Here, two agents move in a 5×5 grid and must collect food. The authors use a fully cooperative variant where each food item requires both players to perform the load action together. Rewards are shared, and the environment is modified with a terminal penalty inversely related to episode length, making efficient collection the optimization objective.

This test has a different purpose from the matrix games. It is not merely another main result; it is an exploratory extension into sequential spatial coordination. The agents must handle movement, target selection, proximity, timing, and joint loading. A policy cannot be reduced to one static action.

The results are more mixed. Some individual samples approach an empirical optimum of approximately 0.554, but the mean trajectory shows large variance and instability. Performance drops in later stages. The paper states that the algorithm returns the agents’ strategies from the 16th optimization step because that step yields the highest social welfare.

That detail is important. Returning the best historical policy is a reasonable algorithmic design choice, but it also reveals that later optimization did not monotonically improve the policy. In simple matrix games, the mechanism looks crisp. In cooperative foraging, it starts to look like an LLM-generated policy search process that can find promising coordination behaviors but has trouble maintaining them.
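Selecting the best historical profile is simple to express. The sketch below is an assumption about how such bookkeeping might look, with a hypothetical `Snapshot` record and invented welfare numbers that are not the paper’s results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    step: int
    profile: tuple   # (agent A policy source, agent B policy source)
    welfare: float   # social welfare measured at this step


def best_snapshot(history):
    """Return the best observed profile, not the final one."""
    return max(history, key=lambda s: s.welfare)


def improved_monotonically(history):
    """True only if welfare never dropped from one step to the next."""
    return all(b.welfare >= a.welfare for a, b in zip(history, history[1:]))


# Invented numbers for illustration only; these are NOT the paper's data.
history = [
    Snapshot(0, ("a_v0", "b_v0"), 0.10),
    Snapshot(1, ("a_v1", "b_v0"), 0.35),
    Snapshot(2, ("a_v1", "b_v1"), 0.50),
    Snapshot(3, ("a_v2", "b_v1"), 0.20),
]

chosen = best_snapshot(history)
print(chosen.step, chosen.welfare)
print(improved_monotonically(history))
```

The two functions answer different questions. One picks a deployable profile; the other reports whether the search itself was well behaved.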

The appendix gives a useful window into why. The generated foraging policies are not trivial. They include target scoring, softmax movement, opponent intent tracking, probabilistic loading, time-aware load probabilities, and small randomness to avoid deadlocks. This is exactly the kind of code that makes the approach interesting: readable, inspectable, and closer to operational logic than a raw neural policy. It is also exactly the kind of code that can become brittle. A heuristic that helps one phase may hurt another. Randomness can avoid deadlocks but reduce consistency. Partner tracking can help coordination, but only if the inferred intent is reliable.
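To give a feel for those ingredients, here is a hedged sketch of three of them: target scoring, softmax target selection, and a time-aware load probability. It is not the paper’s appendix code. The scoring rule, the `base` parameter, and every function name are assumptions chosen to keep the example small.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Turn arbitrary scores into a probability distribution."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def target_scores(agent_pos, partner_pos, foods):
    """Score each food item higher when it is close to BOTH agents."""
    return [
        -(manhattan(agent_pos, f) + manhattan(partner_pos, f)) for f in foods
    ]


def choose_target(agent_pos, partner_pos, foods, rng, temperature=1.0):
    """Sample a target; the sampling noise is what can break deadlocks."""
    probs = softmax(target_scores(agent_pos, partner_pos, foods), temperature)
    return rng.choices(foods, weights=probs, k=1)[0]


def load_probability(step, horizon, base=0.6):
    """Attempt the load action more eagerly as the episode runs out."""
    return min(1.0, base + (1.0 - base) * step / horizon)


rng = random.Random(0)
foods = [(1, 1), (4, 4)]
probs = softmax(target_scores((0, 0), (1, 0), foods))
print(probs)  # the nearer food receives most of the probability mass
print(choose_target((0, 0), (1, 0), foods, rng))
print(load_probability(0, 50), load_probability(50, 50))
```

Each knob here is a place where brittleness can enter. A higher `temperature` explores more but coordinates less reliably, and a `base` tuned for one grid layout may be wrong for another.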

So the right interpretation is not “PIBR solves complex multi-agent coordination.” The right interpretation is: policy-conditioned code generation can produce meaningful sequential coordination logic, but the current method has not yet stabilized that logic in more complex environments.

That is a better result than a vague success claim. It tells us where the mechanism starts to strain.

The paper’s evidence is best read by test purpose, not by headline

The paper contains several types of evidence. Mixing them together would overstate the result. A cleaner reading is to separate the experimental roles.

Paper component	Likely purpose	What it shows	Practical boundary
Representational bottleneck argument	Mechanistic thesis	Neural weights are poorly suited for direct policy conditioning; source code is more parseable.	It is a conceptual argument, not an empirical proof that all useful policies should be code.
PIBR algorithm	Method contribution	LLMs can be used as code-input/code-output best-response operators refined by tests and utility feedback.	Depends on policy visibility, sandboxing, and the LLM’s ability to interpret the code.
Matrix games	Main evidence in controlled settings	Fast coordination to global optima in small cooperative games.	Small action spaces and compact policies make the task unusually clean.
Cooperative foraging	Exploratory extension / stress test	The method can generate nontrivial sequential coordination policies, but performance is unstable.	Not yet evidence of robust deployment in complex multi-agent workflows.
Generated policy appendix	Implementation transparency	The learned policies are inspectable and explainable as code.	Readable code is not automatically safe, optimal, or maintainable.
Concurrent work discussion	Positioning and novelty boundary	The paper acknowledges related open-source game work and distinguishes PIBR through MARL framing, textual-gradient optimization, and task scope.	The authors themselves frame this version as a technical report rather than a formal novelty claim.

This table matters because the paper’s strongest business relevance does not come from claiming a benchmark breakthrough. It comes from the architecture of the method: visible policy, generated response, executable test, utility feedback, revision.

That architecture maps surprisingly well to enterprise automation.

The business value is inspectable coordination, not agent friendship

The lazy interpretation is that this paper teaches AI agents to cooperate. The better interpretation is that it makes cooperation more inspectable.

In many business processes, agents will not be strangers in a public game. They will be components inside the same organization: sales forecasting, procurement, legal review, customer support, workflow routing, fraud screening, inventory allocation. In those settings, policy disclosure is not absurd. It may be exactly what governance requires.

If one agent’s decision logic can be represented as a programmatic policy, another agent can adapt to it before acting. A supervisor system can inspect both. Unit tests can verify safety constraints. Utility feedback can evaluate whether the joint workflow improves. Humans can review policy code, at least in principle.

That leads to a practical design pattern:

Shared policy registry
        ↓
Agents publish executable policy modules
        ↓
Other agents condition on visible policy modules
        ↓
Sandbox tests check validity, constraints, and runtime behavior
        ↓
Workflow simulations estimate joint outcomes
        ↓
Best policy profile is selected, versioned, and monitored
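
A minimal in-memory sketch of the registry step follows. `PolicyRegistry` and its methods are hypothetical names for this pattern, not an existing library or anything described in the paper, and a production version would need persistence, access control, and review hooks.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyRegistry:
    """In-memory registry: agent name -> ordered list of policy sources."""
    versions: dict = field(default_factory=dict)

    def publish(self, agent, source):
        """Append a new policy version and return its version index."""
        self.versions.setdefault(agent, []).append(source)
        return len(self.versions[agent]) - 1

    def latest(self, agent):
        return self.versions[agent][-1]

    def visible_to(self, agent):
        """What `agent` may condition on: every OTHER agent's latest policy."""
        return {
            name: history[-1]
            for name, history in self.versions.items()
            if name != agent
        }

    def rollback(self, agent):
        """Drop the newest version, keeping at least one published policy."""
        if len(self.versions[agent]) > 1:
            self.versions[agent].pop()
        return self.latest(agent)


registry = PolicyRegistry()
registry.publish("planner", "def policy(obs):\n    return 'revise_forecast'\n")
registry.publish("procurement", "def policy(obs):\n    return 'wait'\n")
registry.publish("procurement", "def policy(obs):\n    return 'order_now'\n")

print(sorted(registry.visible_to("planner")))  # only the other agent's policy
print(registry.rollback("procurement"))        # back to the earlier version
```

Reading and acting stay separate in this sketch: `visible_to` exposes source text only, and nothing in the registry executes a policy.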

This is not how most “multi-agent” demos are built today. Many demos rely on natural-language role prompts, message passing, and ad hoc coordination. That can work for prototypes. It is less convincing for production systems where failure costs money, violates rules, or quietly compounds over time.

PIBR suggests a more disciplined alternative: make agent policies explicit objects. Let agents read them. Let tests punish broken code. Let simulations punish poor coordination. Then version the policy profile that performs best.

For ROI, the immediate value is not replacing human managers with a swarm of tiny executives. Please, no. The immediate value is cheaper diagnosis of coordination failure. When a multi-agent workflow fails, a firm can ask:

Diagnostic question	Why programmatic policies help
Did the agent misunderstand the other agent’s role?	The conditioning prompt and generated code reveal what it assumed.
Did the policy crash or violate interface expectations?	Unit tests and runtime traces expose the failure directly.
Did coordination fail despite valid code?	Utility trajectories show where joint behavior underperformed.
Did a later revision degrade performance?	Historical policy profiles can be compared and rolled back.
Can humans audit the decision logic?	Source code is more inspectable than dense policy weights.

This is where the paper becomes relevant to Cognaptus-style automation. The point is not that every business agent should be a Python function generated by an LLM. The point is that policy artifacts should be inspectable, testable, and conditionable. Source code is one strong candidate. Structured rules, typed workflows, and domain-specific policy languages may serve similar roles in production.

What the paper directly shows, and what we can only infer

It is useful to draw a hard line between the paper’s evidence and the business inference.

The paper directly shows that PIBR can solve three small coordination matrix games and can produce promising but unstable behavior in a modified cooperative foraging environment. It directly argues that source-code policies reduce the representational bottleneck that blocks direct policy conditioning in neural MARL. It directly implements an LLM-based best-response loop using utility feedback and runtime/unit-test feedback.

Cognaptus infers that this architecture is relevant for enterprise agent orchestration because many business automation environments are cooperative, controlled, and policy-disclosure-friendly. In such settings, visible policy modules could reduce the need for fragile behavioral inference and make multi-agent failures easier to diagnose.

What remains uncertain is substantial. The paper does not establish robustness in large workflows. It does not test adversarial agents. It does not solve proprietary policy disclosure between competing firms. It does not show security guarantees for generated code. It does not prove that LLM-interpreted policy semantics are reliable under prompt manipulation, ambiguous code, or deliberately deceptive policies.

The difference is not a minor academic caveat. It decides where the method should be tried first.

Good early use cases would have several properties:

Suitable early setting	Reason
Cooperative agents inside one organization	Policy disclosure is acceptable and incentives are aligned.
Sandboxed workflow simulation	Generated policies can be tested before execution.
Clear utility metrics	The feedback loop needs measurable outcomes.
Typed interfaces and constrained actions	Unit tests can catch invalid behavior.
Human review of policy versions	Inspectability can actually be used, not merely claimed.

Bad early use cases are almost the mirror image: adversarial negotiation, open internet agents, untrusted code execution, hidden proprietary strategies, high-stakes irreversible actions, and workflows where utility is vague or politically contested.

The paper is interesting precisely because it suggests a disciplined architecture for the first category. It does not rescue the second.

Policy disclosure is powerful only when the governance model is honest

There is also a governance issue hiding in plain sight.

Policy-conditioned cooperation assumes that agents can read each other’s policies. But “readable” is not the same as “safe.” A policy can be readable and still malicious. It can be readable and still too complex for an LLM to interpret correctly. It can be readable and still exploit the interpreter. Once source code becomes part of the strategic environment, code itself becomes a signaling device.

The paper’s experiments avoid the nastiest version of this problem by focusing on cooperative environments. That is appropriate for a first technical report. But production systems need an additional layer: policy validation.

At minimum, a business version of this architecture would need:

A policy schema, so agents know what a valid policy module is allowed to contain.
A sandbox, so generated code cannot freely touch external systems.
Static and dynamic tests, so policies are checked before deployment.
Version control, so policy profiles can be compared, rolled back, and audited.
Separation between policy reading and authority, so understanding another policy does not automatically grant permission to act.
Monitoring after deployment, because simulated coordination can still fail under live distribution shifts.
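
As one example of the schema and static-test items, the sketch below uses Python’s standard `ast` module to check a policy module before anything is executed. The allow-list, the ban-list, and the required `policy` function name are assumptions for illustration. A static check like this is a first filter, not a security boundary, and it does not replace a real sandbox.

```python
import ast

# Hypothetical schema: which imports a policy may use, which built-in
# calls are rejected outright, and which entry point must exist.
ALLOWED_IMPORTS = {"math", "random"}
BANNED_CALLS = {"exec", "eval", "open", "__import__"}
ENTRY_POINT = "policy"


def validate_policy(source):
    """Return a list of schema violations; an empty list means it passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as err:
        return [f"syntax error: {err.msg}"]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    problems.append(f"import not allowed: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            module = (node.module or "").split(".")[0]
            if module not in ALLOWED_IMPORTS:
                problems.append(f"import not allowed: {node.module}")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BANNED_CALLS:
                problems.append(f"call not allowed: {node.func.id}")

    has_entry = any(
        isinstance(n, ast.FunctionDef) and n.name == ENTRY_POINT
        for n in tree.body
    )
    if not has_entry:
        problems.append(f"missing top-level function: {ENTRY_POINT}")
    return problems


good = "import math\n\ndef policy(obs):\n    return 0\n"
bad = "import os\n\ndef policy(obs):\n    return eval('0')\n"
print(validate_policy(good))
print(validate_policy(bad))
```

The violation messages double as feedback: they can be handed back to whatever generated the policy, in the same spirit as the runtime and unit-test channel.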

This is where the paper’s unit-test feedback becomes more than an implementation detail. It hints at a broader governance pattern: agent policies should not merely be prompted; they should be compiled, tested, simulated, and approved.

Yes, that sounds less futuristic than “autonomous AI organization.” It also sounds less likely to bankrupt someone by Wednesday.

The deeper shift is from communication to commitment

Most multi-agent LLM systems emphasize communication. Agents chat. Agents debate. Agents negotiate task plans. That can improve performance, but it leaves a serious ambiguity: messages are cheap. A natural-language statement of intent is not the same as a policy commitment.

Programmatic policies are different. They expose executable decision logic. An agent does not merely say, “I will coordinate with you.” It publishes the function that determines how it will act. Another agent can then condition on that function.

This changes the social contract between agents. In a chat-based system, coordination is built on claims. In a policy-conditioned system, coordination is built on inspectable commitments.

The paper does not fully solve commitment. It uses LLMs as approximate semantic interpreters rather than formal proof engines. That means the method is more practical than classical program equilibrium, but also less rigorous. The LLM can misunderstand code. The generated best response can overfit to feedback. Later iterations can degrade.

Still, the direction is valuable. For enterprise AI, the future of multi-agent systems may depend less on whether agents can produce more persuasive messages and more on whether they can expose reliable decision artifacts.

A policy that can be read, tested, and versioned is easier to govern than a personality prompt with a job title.

The article’s bottom line

The paper’s contribution is not that LLM agents have learned universal cooperation. The experiments do not support that, and the foraging result explicitly warns against it.

The contribution is more precise: representing policies as source code lets agents condition on each other’s decision logic, and PIBR turns that visibility into an iterative best-response process using executable feedback. In small coordination games, that works very well. In a more complex cooperative grid-world, it produces meaningful but unstable behavior. That instability is not a reason to ignore the paper. It is the part that tells us the method is real enough to strain.

For businesses, the lesson is not to deploy code-reading agents into every workflow. The lesson is to redesign agent orchestration around inspectable policy artifacts. If agents are going to coordinate, their policies should not live only as hidden weights, vague prompts, or post-hoc logs. They should be objects that other agents can read, tests can challenge, and humans can audit.

That is a less magical view of agentic AI. Conveniently, it is also the one more likely to survive contact with operations.

Cognaptus: Automate the Present, Incubate the Future.

¹ Yue Lin, Shuhui Zhu, Wenhao Li, Ang Li, Dan Qiao, Pascal Poupart, Hongyuan Zha, and Baoxiang Wang, “Policy-Conditioned Policies for Multi-Agent Task Solving,” arXiv:2512.21024, 2025. https://arxiv.org/abs/2512.21024
