TL;DR for operators

Competition is usually sold as the thing that makes agents sharper, more adversarial, and perhaps a little too pleased with themselves. This paper points in a more useful direction: controlled external competition can make agent teams more cooperative internally, but only when it is paired with repeated interaction.

The study places Qwen3 14B, Phi4 reasoning, and Cogito 14B agents into Iterated Prisoner’s Dilemma tournaments under three conditions: repeated interaction only, group competition only, and a combined “super-additive” setup where agents face both team structure and repeated encounters.[1] For Qwen3 and Phi4, the combined setting produces the strongest cooperation. Qwen3’s mean cooperation rate rises from 0.22 in repeated interaction and 0.23 in group competition to 0.32 in the combined setting. Phi4 moves more sharply, from 0.21 and 0.13 to 0.43.

The more interesting result is not just final cooperation. It is one-shot cooperation: what an agent does when meeting an unfamiliar opponent for the first time. Qwen3 and Phi4 again peak in the combined setting. That matters because many deployed agent systems will not enjoy long, cosy histories with every other agent, tool, user, vendor, or workflow they encounter. The first move is often the system.

Cogito complicates the story. It is more cooperative overall, with mean cooperation rates between 0.50 and 0.59 across conditions, but it does not show the same clean super-additive pattern. The paper’s meta-prompt diagnostics suggest Cogito understands the game less well than Phi4 and Qwen3. In plain operational English: high cooperation is not automatically good cooperation. Sometimes it is strategy; sometimes it is bias wearing a blazer.

For enterprise agent design, the inference is practical but bounded. Do not treat “cooperative agents” as a model personality trait you buy from a leaderboard. Cooperation can be engineered through environment design: shared objectives, repeated interaction, memory, exit options, and external benchmarks. But the paper is still a simulation using small open-weight models, neutralised action labels, fixed prompts, small teams, and a stylised game. It is a map of a mechanism, not a procurement manual.

The usual assumption is wrong in a useful way

A familiar worry about autonomous agents is that competition will make them nastier. Give models objectives, let them interact, add scarce resources, and soon enough someone is defecting, scheming, gaming the metric, or writing a very confident memo explaining why the betrayal was “aligned with strategic priorities”.

That worry is not foolish. But it is incomplete.

Human organisations have long relied on a different pattern: external rivalry can strengthen internal cooperation. Sales teams compete with other firms. Research groups compete with rival labs. Product units rally around market threats. The mechanism is not magic. It is pressure plus identity plus repeated contact. When a group faces an external rival and its members expect to keep working together, internal cooperation becomes less like charity and more like strategy.

The paper tests whether a version of this mechanism appears in language-model agents.

The authors draw on “super-additive cooperation”, a theory from human cooperation research. The word sounds like it escaped from a grant proposal, but the idea is simple enough: two forces that are weak alone can become strong together. Repeated interaction can support cooperation because future encounters make punishment and reciprocity possible. Group competition can support cooperation because members benefit from their group doing well. Put them together and the effect can exceed either mechanism alone.

That is the question here: do LLM agents become more cooperative when repeated interaction and group competition are combined?

The answer is: for Qwen3 and Phi4, yes. For Cogito, not cleanly. Conveniently, the exception is where the paper becomes more interesting.

The experiment gives agents a social structure, not just a prompt

The paper does not simply ask a model whether it would cooperate. That would mostly measure how much the model has absorbed polite internet morality, which is not the same thing as strategic behaviour. Instead, the authors build a tournament around the Iterated Prisoner’s Dilemma.

The model-facing game uses neutral action labels. The agents choose between action_a and action_b, avoiding the words “cooperate” and “defect” in gameplay prompts. This matters because LLMs carry strong associations around moral language. Ask for “cooperation” and many models will dutifully cooperate like a corporate values poster. Ask for action_a and action_b, with payoffs attached, and the system has a better chance of measuring behaviour under incentives rather than vocabulary sentiment.

The payoff structure preserves the dilemma:

| Situation | Payoff logic |
| --- | --- |
| Both choose the cooperative action | Both receive a moderate reward |
| One chooses cooperative, one chooses exploitative | The exploitative player receives the highest reward; the cooperative player receives the lowest |
| Both choose exploitative | Both receive a lower mutual reward |
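A minimal sketch of that payoff logic follows. The article gives only the ordering of rewards, so the numbers below are the textbook values (5, 3, 1, 0) and are an assumption, not the paper's. Mapping action_a to the cooperative move is likewise assumed for illustration.

```python
# Hypothetical payoff values: the source describes the ordering, not the numbers.
T, R, P, S = 5, 3, 1, 0  # temptation, mutual reward, mutual punishment, sucker's payoff

# Assumption: action_a is the cooperative move, action_b the exploitative one.
PAYOFFS = {
    ("action_a", "action_a"): (R, R),  # both cooperate: moderate reward each
    ("action_a", "action_b"): (S, T),  # cooperator gets the lowest, exploiter the highest
    ("action_b", "action_a"): (T, S),
    ("action_b", "action_b"): (P, P),  # both exploit: lower mutual reward
}


def is_prisoners_dilemma(t, r, p, s):
    """Standard ordering T > R > P > S, plus 2R > T + S for the iterated game."""
    return t > r > p > s and 2 * r > t + s


def payoff(move_1, move_2):
    """Return (payoff to player 1, payoff to player 2) for one round."""
    return PAYOFFS[(move_1, move_2)]
```

The second condition in `is_prisoners_dilemma` is the usual guard that taking turns exploiting each other does not beat steady mutual cooperation.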

Each tournament is run with agents powered by the same underlying model. That design choice is important. The paper is not testing mixed personalities, role-play personas, or cross-model negotiation. It isolates how one model behaves when placed into different social structures.

The three tournament structures are the main experimental lever:

| Condition | What agents experience | Likely purpose |
| --- | --- | --- |
| Repeated interaction only (RI) | Agents repeatedly play IPD matches, with individual score as the objective | Main evidence for whether repeated contact alone supports cooperation |
| Group competition only (GC) | Agents are assigned to groups and play against out-group agents, with group score as the objective | Main evidence for whether external group competition alone supports cooperation |
| Super-additive condition (SA) | Agents are grouped, play both within and across groups, and pursue both personal and group success | Main evidence for whether repeated interaction plus group rivalry produces a stronger combined effect |

The setup uses two teams of three players. RI and SA each produce 15 matches; GC produces 9. Each condition is repeated five times per model. Agents may also terminate a match and move to a new opponent, introducing a partner-exit mechanism. That exit option is a small but meaningful realism upgrade: in many real systems, cooperation is shaped not only by “what should I do now?” but also by “should I keep dealing with this actor?”
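The match counts fall out of simple pairing arithmetic. The sketch below assumes RI and SA pair every player with every other player once, while GC pairs only players from opposite teams. That reading is consistent with the reported 15 and 9, but the pairing rule itself is an inference. Player and team names are hypothetical.

```python
from itertools import combinations, product

# Hypothetical player names; the article only says "two teams of three".
team_1 = ["p1", "p2", "p3"]
team_2 = ["p4", "p5", "p6"]

# Assumed RI / SA schedule: every unordered pair of the six players meets once.
round_robin_matches = list(combinations(team_1 + team_2, 2))

# Assumed GC schedule: only cross-team pairings.
cross_team_matches = list(product(team_1, team_2))

print(len(round_robin_matches))  # 6 choose 2
print(len(cross_team_matches))   # 3 * 3
```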

There is also a planner/evaluator loop. Every $K=5$ rounds, the model drafts a high-level plan, critiques it, and updates it. The player prompt then includes the current game state and the latest plan. This is not fine-tuning. The model’s parameters stay fixed. The paper is studying behavioural change through context, memory, objective framing, and interaction topology.
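The cadence of that loop can be sketched without any model at all. Everything below is a stand-in: the four helper functions are hypothetical placeholders for LLM calls, and refreshing the plan at rounds 0, K, 2K is an assumed reading of "every K rounds". The article mentions a LangGraph workflow; this sketch does not use it.

```python
K = 5  # planning interval reported in the article


def draft_plan(history):
    # Placeholder for an LLM call that proposes a high-level plan.
    return f"plan drafted after {len(history)} rounds"


def critique_plan(plan, history):
    # Placeholder for an LLM call that evaluates the draft.
    return f"critique of '{plan}'"


def revise_plan(plan, critique):
    # Placeholder for an LLM call that folds the critique into the plan.
    return f"{plan} (revised)"


def choose_move(history, plan):
    # Placeholder for the player prompt, which sees game state plus the latest plan.
    return "action_a"


def play_match(n_rounds):
    """Run one match, refreshing the plan every K rounds. Weights never change."""
    history, plan, refresh_rounds = [], None, []
    for round_idx in range(n_rounds):
        if round_idx % K == 0:  # assumption: refresh at rounds 0, K, 2K, ...
            draft = draft_plan(history)
            plan = revise_plan(draft, critique_plan(draft, history))
            refresh_rounds.append(round_idx)
        history.append(choose_move(history, plan))
    return history, plan, refresh_rounds
```

The only state that changes across rounds is `history` and `plan`, which is the point: adaptation lives in context, not in parameters.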

That distinction is the whole game.

The mechanism: enemies outside, memory inside

The paper’s central mechanism is not “competition improves cooperation”. That version is too crude and will lead managers to create pointless internal Hunger Games dashboards, because apparently we are determined to learn nothing.

The sharper claim is this:

External group rivalry can increase internal cooperation when agents also experience repeated interaction and have enough situational understanding to use the structure strategically.

The repeated interaction piece matters because agents can condition future behaviour on past behaviour. Cooperation is no longer a one-off sacrifice; it can become part of a reciprocal strategy. The group competition piece matters because the agent’s payoff is no longer purely individual. Helping a team member can make sense if group score matters.
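Conditioning on the past is easiest to see in the classical Tit-for-Tat rule, which the paper's diagnostics also probe for. The sketch below uses scripted strategies, not LLM agents, and is only an illustration of reciprocity. Treating action_a as the cooperative move is an assumption.

```python
COOPERATE, EXPLOIT = "action_a", "action_b"  # assumed mapping of the neutral labels


def tit_for_tat(opponent_history):
    """Cooperate first, then mirror the opponent's previous move."""
    return COOPERATE if not opponent_history else opponent_history[-1]


def always_exploit(opponent_history):
    """Ignore history and always take the exploitative action."""
    return EXPLOIT


def play(strategy_1, strategy_2, n_rounds):
    """Play n_rounds; each strategy sees only the other side's past moves."""
    moves_1, moves_2 = [], []
    for _ in range(n_rounds):
        m1 = strategy_1(moves_2)
        m2 = strategy_2(moves_1)
        moves_1.append(m1)
        moves_2.append(m2)
    return moves_1, moves_2
```

Against `always_exploit`, Tit-for-Tat is exploited exactly once and then retaliates. Against itself, it cooperates throughout. Neither outcome is possible without a future in which the past can be answered.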

The combined setting creates a double frame:

  1. Inside the group, repeated contact makes trust, retaliation, and reciprocity legible.
  2. Outside the group, rivalry makes internal coordination valuable.
  3. The agent’s plan can therefore treat cooperation as instrumental rather than decorative.

That is why the one-shot result matters. In the SA condition, Qwen3 and Phi4 are more likely to cooperate even at first contact, before they know the opponent’s specific history. The broader tournament structure appears to shape default expectations. A model that has been operating in a social world where internal cooperation is rewarded may carry that posture into new encounters.

This does not mean the agent has become morally better. It means the environment has changed the local logic of its action.

For operators, that is the usable insight. We do not need agents to develop little silicon virtues. We need them to behave reliably under designed incentives.

The main numeric result: Qwen3 and Phi4 show the super-additive pattern

The paper’s cooperation-rate results support the super-additive hypothesis for Qwen3 and Phi4.

| Model | RI mean cooperation | GC mean cooperation | SA mean cooperation | Interpretation |
| --- | --- | --- | --- | --- |
| Qwen3 | 0.22 [0.20, 0.23] | 0.23 [0.20, 0.25] | 0.32 [0.29, 0.34] | SA is higher than either mechanism alone |
| Phi4 | 0.21 [0.18, 0.24] | 0.13 [0.11, 0.16] | 0.43 [0.40, 0.46] | SA produces a large lift, especially relative to GC |
| Cogito | 0.50 [0.48, 0.53] | 0.59 [0.56, 0.62] | 0.55 [0.52, 0.57] | More cooperative overall, but not super-additive |

The magnitudes are not subtle for Phi4. Group competition alone is its least cooperative condition, at 0.13. In the combined condition, it reaches 0.43. That is not just a small bump from team language. It suggests the model behaves differently when group rivalry is paired with within-group interaction.

Qwen3 shows the same pattern, though less dramatically: 0.32 in SA versus 0.22 and 0.23 in the other conditions. Its baseline cooperation is low, but the combined structure still moves it.

Cogito is the exception. It is cooperative across the board: 0.50 in RI, 0.59 in GC, and 0.55 in SA. If we only cared about raw cooperation, Cogito would look like the friendliest model in the room. But the hypothesis is not “which model cooperates most?” It is “does the combined social structure produce more cooperation than either ingredient alone?” For Cogito, the answer is no.

That exception is not a footnote. It is a warning.
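The comparison can be made mechanical. The sketch below encodes the mean cooperation rates and bracketed intervals reported for the three models and asks whether SA beats both single conditions. Requiring the SA lower bound to clear the higher of the two single-condition upper bounds is a rough heuristic added here, not the paper's statistical test.

```python
# (mean, lower bound, upper bound) per condition, as reported in the article.
RESULTS = {
    "Qwen3": {"RI": (0.22, 0.20, 0.23), "GC": (0.23, 0.20, 0.25), "SA": (0.32, 0.29, 0.34)},
    "Phi4": {"RI": (0.21, 0.18, 0.24), "GC": (0.13, 0.11, 0.16), "SA": (0.43, 0.40, 0.46)},
    "Cogito": {"RI": (0.50, 0.48, 0.53), "GC": (0.59, 0.56, 0.62), "SA": (0.55, 0.52, 0.57)},
}


def sa_exceeds_both(model):
    """True if SA beats RI and GC on the mean and clears their intervals."""
    rates = RESULTS[model]
    sa_mean, sa_low, _ = rates["SA"]
    best_single_mean = max(rates["RI"][0], rates["GC"][0])
    best_single_high = max(rates["RI"][2], rates["GC"][2])
    # Heuristic (an assumption): the SA lower bound must sit above both upper bounds.
    return sa_mean > best_single_mean and sa_low > best_single_high


for name in RESULTS:
    print(name, sa_exceeds_both(name))
```

Qwen3 and Phi4 pass; Cogito fails on the mean alone, because its GC rate of 0.59 is already above its SA rate of 0.55.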

One-shot cooperation is the better operational signal

Average cooperation is useful, but it can hide adaptation. An agent might start selfish and later become cooperative after repeated punishment. Another might start cooperative and then collapse after exploitation. A deployed system often cares about the first interaction because many encounters are cold-start: a new user, new vendor, new workflow, new API, new downstream agent.

The paper therefore measures one-shot cooperation: cooperation in first-round interactions with unfamiliar opponents.
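As a sketch of how such a metric could be computed from match logs: the log schema below (player, opponent, round index, move) is hypothetical, as is the assumption that records arrive in chronological order. A move counts only if it is a round-0 move against an opponent the player has not met before.

```python
COOPERATE = "action_a"  # assumed mapping of the neutral label


def one_shot_cooperation_rate(log):
    """Share of cooperative moves among first-round moves against unfamiliar opponents.

    `log` is a chronological list of (player, opponent, round_idx, move) tuples.
    Returns None if the log contains no first-contact moves.
    """
    met = set()  # (player, opponent) pairs that have already interacted
    first_moves = []
    for player, opponent, round_idx, move in log:
        pair = (player, opponent)
        if round_idx == 0 and pair not in met:
            first_moves.append(move == COOPERATE)
        met.add(pair)
    return sum(first_moves) / len(first_moves) if first_moves else None


example_log = [
    ("a", "b", 0, "action_a"),  # first contact: counted
    ("a", "b", 1, "action_b"),  # later round: ignored
    ("b", "a", 0, "action_b"),  # first contact from b's side: counted
    ("a", "c", 0, "action_a"),  # first contact: counted
    ("a", "b", 0, "action_a"),  # rematch with a familiar opponent: ignored
    ("c", "a", 0, "action_b"),  # first contact: counted
]
```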

| Model | RI one-shot cooperation | GC one-shot cooperation | SA one-shot cooperation | Interpretation |
| --- | --- | --- | --- | --- |
| Qwen3 | 0.27 [0.20, 0.34] | 0.26 [0.16, 0.35] | 0.39 [0.31, 0.47] | SA produces the highest first-contact cooperation |
| Phi4 | 0.29 [0.22, 0.37] | 0.29 [0.19, 0.38] | 0.43 [0.35, 0.51] | SA again produces the highest first-contact cooperation |
| Cogito | 0.61 [0.53, 0.69] | 0.71 [0.62, 0.81] | 0.65 [0.58, 0.73] | High default cooperation, but peak is GC, not SA |

Again, Qwen3 and Phi4 support the mechanism. The combined condition raises not only cooperation after experience, but also cooperation at first contact. That is the part that should interest anyone designing agent systems.

Why? Because many enterprise agent networks will not operate like two old colleagues who have worked together for ten years. They will look more like modular services, tool-using workflows, procurement agents, compliance reviewers, ticket triagers, and customer-facing systems that encounter each other episodically. The system cannot rely on deep relationship history for every interaction.

A cooperation pattern that generalises into first contact is therefore more valuable than one that appears only after prolonged punishment cycles. The paper does not prove that this generalises to enterprise workflows, but it gives a mechanism worth testing: structure can shift defaults.

The intra-group result shows where the cooperation comes from

The intra-group versus inter-group comparison is not a second thesis. It is a mechanism check.

In the SA condition, agents play both within and outside their group. If overall cooperation rises, we need to know whether it rises everywhere equally or mainly inside the group. The paper’s Figure 4 shows higher intra-group cooperation than inter-group cooperation, especially for Phi4. Qwen3 also shows higher intra-group cooperation. Cogito shows the smallest difference between intra-group and inter-group behaviour.
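The split itself is straightforward to compute once each move is tagged with who played it against whom. The sketch below assumes a hypothetical log of (player, opponent, move) tuples and hypothetical team labels. It is not the paper's analysis code.

```python
COOPERATE = "action_a"  # assumed mapping of the neutral label

# Hypothetical team assignment for two teams of three.
TEAM = {"p1": "red", "p2": "red", "p3": "red", "p4": "blue", "p5": "blue", "p6": "blue"}


def cooperation_by_boundary(log):
    """Return cooperation rates split into intra-group and inter-group moves."""
    counts = {"intra": [0, 0], "inter": [0, 0]}  # [cooperative moves, total moves]
    for player, opponent, move in log:
        key = "intra" if TEAM[player] == TEAM[opponent] else "inter"
        counts[key][0] += int(move == COOPERATE)
        counts[key][1] += 1
    return {key: (coop / total if total else None) for key, (coop, total) in counts.items()}


example_log = [
    ("p1", "p2", "action_a"),  # same team
    ("p2", "p1", "action_a"),  # same team
    ("p1", "p4", "action_b"),  # across teams
    ("p4", "p1", "action_a"),  # across teams
]
```

A large gap between the two rates points to selective cooperation; a small gap, as reported for Cogito, points to a default that ignores the group boundary.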

That pattern matters because it supports the interpretation that external rivalry is not creating universal niceness. It is creating selective cooperation.

This is closer to organisational reality. Teams do not become cooperative in the abstract because there is a competitor outside the building. They cooperate more with the people whose outcomes are tied to theirs. The external threat sharpens the value of internal coordination.

For AI systems, that implies a design principle: cooperation needs boundaries. A shared objective has to define who is “inside” the coordination unit, what the group is trying to optimise, and where adversarial or sceptical behaviour is still appropriate. A finance compliance agent should cooperate with the reporting workflow, not cheerfully help the fraud workflow because it has discovered a beautiful universal commitment to action_a.

The result is useful precisely because it is not utopian.

Cogito is the friendly exception, which makes it less reassuring

Cogito’s results are a good antidote to lazy model evaluation. It cooperates more than Qwen3 and Phi4 overall, and it has the highest one-shot cooperation in the GC setting. At first glance, that sounds ideal.

But the paper’s meta-prompt results complicate the picture. After matches, the authors use meta-prompting to assess whether agents understand game rules, current game state, opponent behaviour, Tit-for-Tat-like strategy, and forgiveness. This is a diagnostic test, not part of gameplay. The paper reports that Phi4 achieves the highest overall meta-prompt accuracy, followed closely by Qwen3, while Cogito performs noticeably worse.

That makes Cogito’s high cooperation harder to interpret as strategy. The paper suggests it may reflect inherent model bias, superficial pattern matching, or effects from its training process rather than deliberate game understanding.

This distinction is operationally important. In an enterprise setting, we do not merely want agents that cooperate. We want agents that know when cooperation is appropriate. A procurement agent that always yields is not aligned; it is expensive. A compliance agent that always trusts is not helpful; it is decorative risk. A code-review agent that always approves is not collaborative; it is a rubber stamp with tokens.

The best agent is not the most agreeable agent. The best agent is the one whose cooperation is conditional, legible, and tied to task structure.

Cogito therefore functions as a useful warning: raw cooperation rate can be a misleading metric. It should be paired with understanding diagnostics.

What each result component is doing

The paper includes several result types, but they do not all carry the same evidential weight.

| Component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Cooperation-rate comparison across RI, GC, and SA | Main evidence | Qwen3 and Phi4 cooperate more in the combined condition than in either single condition | That all LLMs will show the same pattern |
| One-shot cooperation comparison | Main evidence | The combined condition can shift first-contact behaviour for Qwen3 and Phi4 | That agents will cooperate safely with unknown real-world counterparties |
| Intra-group vs inter-group cooperation | Mechanism check | The SA lift appears mainly tied to intra-group cooperation | That group identity is the only causal mechanism |
| Meta-prompt accuracy | Diagnostic / interpretability check | Cogito’s high cooperation may be less strategically grounded than Qwen3/Phi4 behaviour | That meta-prompt accuracy fully measures strategic competence |
| Planner/evaluator prompting and LangGraph workflow | Implementation detail | The tournament can run agents with repeated planning, critique, memory, and move selection | That this is the optimal agent architecture |

This matters because a rushed reading can flatten the paper into “competition makes agents cooperate”. The evidence is narrower and more interesting: the combined structure raises cooperation for some models, the lift appears particularly tied to intra-group cooperation, and comprehension moderates whether cooperation looks strategic.

That is a better lesson. Less slogan, more machinery.

What the paper directly shows

The paper directly shows four things.

First, interaction structure changes LLM agent behaviour even when model weights are fixed. The models are not fine-tuned. They are placed into different tournament conditions with different objectives and available context. Cooperation changes anyway.

Second, the super-additive condition supports the main hypothesis for Qwen3 and Phi4. Their mean cooperation and one-shot cooperation are both highest when repeated interaction and group competition are combined.

Third, Cogito behaves differently. It is more cooperative overall, but its highest cooperation does not occur in the combined condition. Its one-shot cooperation peaks under group competition alone. This weakens any universal claim about super-additive cooperation in LLMs.

Fourth, model understanding appears to matter. The meta-prompt results suggest that models with better game comprehension show cleaner strategic adjustment, while Cogito’s high cooperation may be less interpretable.

None of these claims requires us to pretend the experiment is a miniature corporation. It is not. It is a controlled game-theoretic environment designed to expose one social mechanism.

What Cognaptus infers for business use

The business inference is not “make your agents compete”. That would be the cartoon version, and like most cartoons it is more expressive than useful.

The better inference is that cooperation in agent systems should be designed as an environment property, not merely selected as a model trait.

| Design lever | Paper analogue | Business interpretation |
| --- | --- | --- |
| Repeated interaction | Agents repeatedly face others across matches | Maintain memory and continuity between cooperating agents |
| Shared group objective | Group score matters in GC and SA | Define team-level success metrics, not only local task completion |
| External rivalry | Groups compete for aggregate score | Use controlled benchmarks, external baselines, or competing solution paths to create pressure |
| Partner exit | Agents can terminate matches | Allow workflows to escalate, reroute, or stop collaboration with poor counterparties |
| Strategy reflection | Planner/evaluator loop every $K=5$ rounds | Require periodic plan review rather than single-shot tool execution |
| Understanding diagnostics | Meta prompts after matches | Evaluate whether cooperation is strategic, not merely agreeable |

A real enterprise analogue might look like this: several agents collaborate on due diligence, underwriting, claim assessment, or compliance review. If each agent optimises only its local output, cooperation can deteriorate into handoff theatre. If all agents are merely told to “be collaborative”, they may over-share, over-trust, or over-defer. A better structure gives them a shared team objective, repeated interaction history, transparent performance feedback, and a benchmark against competing workflows.

The “enemy at the gates” does not need to be another company. It can be a baseline process, a rival solution path, a red-team agent, a competing forecast, or a performance threshold. The important part is that internal cooperation becomes instrumentally linked to a meaningful external comparison.

This is not warm team-building. It is incentive architecture.

Where this applies first

The most plausible near-term applications are not open-ended autonomous organisations. They are bounded agent teams where roles are clear, interaction history is available, and success can be measured.

Good candidates include:

| Domain | Why the mechanism may matter |
| --- | --- |
| Financial analysis agents | Research, risk, and execution agents must coordinate while being evaluated against benchmarks |
| Procurement and negotiation support | Internal agents need cooperation, but also structured scepticism toward counterparties |
| Legal and compliance workflows | Review agents must share context without blindly approving each other’s outputs |
| Software engineering agents | Planner, coder, reviewer, and tester agents benefit from repeated collaboration plus external test pressure |
| Logistics and operations | Routing, inventory, and scheduling agents often need local cooperation under system-level constraints |

The common pattern is not “multi-agent” in the vague sense. It is repeated role interaction under shared but not identical objectives. That is where the paper’s mechanism becomes worth testing.

The lesson is also relevant for evaluation. If an agent benchmark measures only individual task success, it may reward local optimisation and punish coordination. If it measures only team score, it may hide freeloading or over-cooperation. The paper’s split between cooperation rate, one-shot cooperation, intra-group behaviour, and comprehension diagnostics is a useful template: measure the behaviour, the first move, the group boundary, and the model’s understanding.

One number will not save you. It rarely does.

The boundaries are not decorative

The limitations are not generic academic throat-clearing. They materially affect how far the result can travel.

The models are small open-weight systems: Qwen3 14B, Phi4 reasoning, and Cogito 14B. Larger proprietary models may behave differently, especially if they have stronger instruction-following, different safety tuning, or better strategic reasoning.

The population is small: two teams of three players. Larger agent networks can produce different dynamics, including coalition behaviour, reputation effects, hierarchy, congestion, or coordination failure. The paper does not test those.

The game is stylised. Iterated Prisoner’s Dilemma is useful because it isolates cooperation under tension, but real workflows are not two-action games with clean payoffs. Enterprise tasks include ambiguous objectives, asymmetric information, tool failures, human approvals, compliance constraints, and deadlines. A beautiful payoff matrix is rarely waiting politely in the CRM.

The prompts matter. The authors deliberately avoid “cooperate” and “defect” during gameplay, but prompt phrasing, planning frequency, temperature, role framing, and memory access could all affect behaviour. The paper notes these could not be extensively tested.

The partner-choice mechanism is present but not deeply explored. Agents can terminate interactions, but the study does not fully investigate autonomous team selection, reputation systems, or richer partner markets. Those mechanisms may be crucial in real deployments.

Finally, the agents do not learn by updating weights. They adapt through context and planning, not through persistent model training. That is useful for isolating structural effects, but it leaves open how longer-lived agents with durable memory or reinforcement learning would behave.

These boundaries do not weaken the paper’s contribution. They locate it. The study is strongest as a mechanism probe: a controlled demonstration that social structure can shift LLM agent cooperation. It is not yet an engineering guarantee.

The operator’s replacement belief

The misconception to retire is simple:

Competition makes agents less cooperative.

The replacement belief is more useful:

Competition can make agents more cooperative inside a group when it is paired with repeated interaction, shared objectives, and enough model understanding to interpret the environment.

That replacement belief is not just a nuance. It changes how agent systems should be designed.

If cooperation is treated as a model trait, teams will shop for the “most cooperative” model and then wonder why it over-trusts, under-challenges, or folds under adversarial pressure. If cooperation is treated as a system behaviour, designers can shape when and where cooperation emerges.

The paper’s best lesson is therefore architectural. Agent behaviour is not only inside the model. It is also in the tournament.

That should make operators slightly uncomfortable. It means model choice matters, but so does the environment we wrap around the model: the memory it sees, the goals it is given, the rivals it is compared against, the teammates it repeatedly encounters, and the diagnostics we use to decide whether it understands what it is doing.

The enemy at the gates is not the point. The table inside the gates is.

Conclusion: do not buy cooperative agents; design cooperative games

The paper is valuable because it refuses the simplest story. It does not say LLM agents are naturally cooperative. It does not say competition is good. It does not even say the super-additive mechanism holds across all tested models.

It says something more operationally useful: for Qwen3 and Phi4, combining repeated interaction with group competition increases both overall cooperation and first-contact cooperation. The effect appears especially tied to intra-group cooperation. Cogito’s exception shows why high cooperation must be interpreted alongside game understanding.

For businesses building agent teams, this suggests a practical design rule. Do not merely prompt agents to collaborate. Give them a structure in which collaboration is rational, repeated, measurable, and bounded. Add external pressure carefully. Preserve internal memory. Measure first moves. Test whether the model understands the game, not merely whether it smiles while playing.

Cooperation is not a vibe. It is an incentive pattern with a context window.

Cognaptus: Automate the Present, Incubate the Future.


  1. Filippo Tonini and Lukas Galke, “Super-additive Cooperation in Language Model Agents,” arXiv:2508.15510, 2025. https://arxiv.org/abs/2508.15510