Mirror, Signal, Maneuver: How 'Self' Labels Nudge LLM Cooperation

TL;DR for operators

A paper on LLM self-recognition used an iterated public goods game to test a deceptively small intervention: tell an agent it is playing against “another AI agent,” or tell it that it is playing against a model with its own name.¹ The result was not a clean fairy tale about models recognising themselves and becoming benevolent little collectivists. Shame. That would have been simpler.

The useful finding is more operational: identity labels can move cooperation. In the authors’ two-agent studies, switching from a no-name opponent to a named “self” opponent changed average contributions by as much as roughly four points on a 0–10 contribution scale in some prompt pairings. But the direction was unstable. With collective prompts, the self label often reduced contributions. With selfish prompts, it sometimes increased contribution matching. In the four-agent same-model study, the pattern shifted again: Sonnet 4 became more cooperative in the name condition, GPT-4o results were not statistically significant, and Qwen3 showed prompt-specific effects.

For multi-agent AI in business, this means the wording around agent identity is not decorative. “Planner,” “Claude,” “GPT-4o,” “another AI,” and “a copy of yourself” are not equivalent labels if the system’s success depends on contribution, handoff, checking, or escalation. The business lesson is not “make agents think they are collaborating with themselves.” The lesson is blunter: test identity framing like you test model choice, tool permissions, and routing policy.

The paper remains a toy study. The agents did not negotiate naturally, did not converse with each other, and were not executing real workflows. Only four models were tested. Not all prompt pairings were significant. Still, the study is valuable because it catches a low-level design variable hiding in plain sight: the name tag.

The mirror effect is real enough to measure, but not neat enough to trust blindly

The business fantasy of multi-agent AI is tidy. One model plans, another retrieves, another writes, another reviews, and a manager agent keeps everyone civil. The workflow looks like a clean org chart, only cheaper and less likely to complain about quarterly planning.

The messy part is that LLM agents do not merely process task instructions. They also respond to framing. This paper asks whether one tiny piece of framing—the stated identity of the other agent—changes cooperative behaviour.

The authors adapt the iterated public goods game. Each model starts with zero points. Each round, it receives 10 points and chooses how many to contribute to a common pool. Whatever it keeps becomes private payoff. The pool is multiplied by 1.6 and split evenly. In individual payoff terms, contributing everything is exploitable; in group payoff terms, mutual contribution is best. Repeat this for 20 rounds and you get a compact test of cooperation, free-riding, retaliation, and end-game defection.
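The round mechanics described above can be written down in a few lines. This is a minimal sketch built only from the numbers in the text (10 points per round, a 1.6 multiplier, an even split); the function and constant names are illustrative assumptions, not the authors' code.

```python
from typing import List

ENDOWMENT = 10    # points each player receives per round (from the text)
MULTIPLIER = 1.6  # factor applied to the common pool (from the text)


def round_payoffs(contributions: List[int]) -> List[float]:
    """Return each player's payoff for a single round.

    Payoff = points kept + an equal share of the multiplied pool.
    """
    for c in contributions:
        if not 0 <= c <= ENDOWMENT:
            raise ValueError("each contribution must be between 0 and 10")
    share = MULTIPLIER * sum(contributions) / len(contributions)
    return [ENDOWMENT - c + share for c in contributions]


both_cooperate = round_payoffs([10, 10])  # [16.0, 16.0]
both_defect = round_payoffs([0, 0])       # [10.0, 10.0]
one_exploited = round_payoffs([10, 0])    # [8.0, 18.0]
```

The three cases show the tension: mutual contribution beats mutual defection (16 versus 10 each), but a lone contributor ends the round with 8 while the free-rider collects 18. With two players, each contributed point returns 0.8 points to the contributor; with four players, as in Study 3, it returns 0.4.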

The key comparison is simple:

Condition	What the model is told	What the test is really probing
No-name	It is playing against “another AI agent”	Behaviour under generic AI counterpart framing
Name	It is playing against a model with its own name	Behaviour under self / same-model identity framing

That comparison is then crossed with three prompt stances:

Prompt stance	Instructional bias
Collective	Prioritise collective welfare or mutual benefit
Neutral	Follow the game rules without extra social preference
Selfish	Maximise individual point accumulation

This is why the paper is better read as a set of contrasts, not as a headline about “AI self-recognition.” The important result is not a single effect. It is the interaction between identity label, prompt stance, model pairing, and game structure.
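To make the shape of that interaction concrete, here is a hedged sketch that enumerates the two-agent design grid. It assumes each of the two agents independently receives one of the three stances, which is one way to arrive at the nine prompt pairings reported for Studies 1 and 2; the variable names are illustrative, and the model pairs are the ones named later in this article.

```python
from itertools import product

CONDITIONS = ["no-name", "name"]
STANCES = ["collective", "neutral", "selfish"]
MODEL_PAIRS = [
    ("GPT-4o", "Claude Sonnet 4"),
    ("Llama 4 Maverick", "Qwen3-235B"),
]

# Assumption: a "prompt pairing" is an ordered pair of stances, one per agent.
stance_pairings = list(product(STANCES, repeat=2))

# Every cell is one combination of identity label, stance pairing, and model pair.
design_cells = [
    {"condition": condition, "stances": stances, "models": models}
    for condition, stances, models in product(CONDITIONS, stance_pairings, MODEL_PAIRS)
]

print(len(stance_pairings), len(design_cells))  # 9 36
```

Each reported effect lives in one of these cells, which is why a single headline number hides most of the result.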

Comparison one: “another AI” versus “yourself” is not harmless wording

The core finding is that the no-name and name conditions diverge. Across Studies 1 and 2, the authors report that 5–6 out of 9 prompt pairings showed statistically significant differences in contribution between the two conditions. In Study 1, the maximum observed difference was about four contribution points.

That magnitude matters because the contribution range is only 0 to 10. A two-point shift is not cosmetic. A four-point shift is the difference between a mostly cooperative agent and one halfway to defection.

But this is where the interpretation needs discipline. A naive reading would say: “The model recognises itself, therefore it cooperates with itself.” The paper does not support that. In fact, some of the most interesting results run in the opposite direction.

In Study 1, with GPT-4o paired against Claude Sonnet 4, and Llama 4 Maverick paired against Qwen3-235B, the self label often reduced contributions when models were prompted collectively. When the prompt told a model to prioritise the common good, being told it faced itself could make it less generous.

That is the useful irritation in the result. The “self” label does not behave like a universal trust signal. It behaves like a strategic cue. It can say: “This counterpart may reason like me.” Depending on the prompt, that can support cooperation, suspicion, retaliation, or matching.

For operators, the practical translation is straightforward: do not assume that making agent identity more explicit will improve collaboration. Explicit identity can sharpen coordination. It can also sharpen defensive play.

Comparison two: collective prompts and self labels can collide

The counterintuitive result is the collective condition. If an agent is told to prioritise collective welfare, and then told it is playing against itself, one might expect higher cooperation. Same model, same goals, same benevolent vibes. Surely the machines will hold hands around the common pool.

No. In Study 1, for all models except Llama 4 Maverick, the authors found the opposite tendency: more defection in the name condition under collective prompting, and more cooperation in the name condition under selfish prompting.

One plausible mechanism is capability mirroring. A model told it is playing against itself may infer that the other player has similar reasoning ability and similar strategic sophistication. Under a collective prompt, that can create a weirdly defensive posture: “I want the common good, but I know the other side can exploit generosity.” Under a selfish prompt, the same symmetry can make contribution matching more rational: “If I defect, the other me may defect; if I cooperate enough, we may stabilise.”

This is not proven as a mechanism. The authors are careful here. They note that models rarely explicitly commented on playing against themselves in their reasoning traces. Some traces did mention similar reasoning capabilities, but not often enough to turn the mechanism into a settled explanation.

Still, the pattern is operationally important even before the mechanism is fully pinned down. In production, prompts rarely operate alone. They collide with labels, role descriptions, tool constraints, memory, and logs. A “collaborate generously” instruction may behave differently when the peer is called “Reviewer,” “another AI,” “Claude,” or “a copy of you.”

That means prompt testing should not evaluate persona in isolation. The right unit is the full agent frame: role, identity, objective, information access, repetition, and reward signal.

Comparison three: Study 2 asks whether the first result was just prompt theatre

Study 2 is best read as a robustness and sensitivity test, not a second thesis. The authors noticed something awkward in Study 1: Claude Sonnet 4 occasionally referred to “human” and “reminder” in the name condition, apparently reacting to repeated restatements of the game setup. Across 18,000 rounds, these mentions were rare, but they raised a legitimate question. Maybe the identity effect was partly caused by the way the system kept reminding models who they were playing against. Nothing says “naturalistic cooperation” like repeatedly poking the model with a clipboard.

So Study 2 changed the setup. The prompts were rephrased using Gemini 2.5 Flash. The models were no longer asked for reasoning before contributing. The round-by-round reminders of the opponent identity were removed. The output was reduced to a single integer from 0 to 10.

This makes Study 2 a sensitivity check on three design elements:

Study 2 change	Likely purpose	What it supports	What it does not prove
Rephrased system prompts	Test whether wording drove the original result	The identity effect is not limited to one exact prompt template	It does not establish a universal identity effect across all prompt styles
No reasoning output	Test whether asking for reasoning changed behaviour	Contribution shifts can appear without explicit chain-of-thought-like justification	It does not explain the hidden mechanism
No repeated opponent reminders	Test whether reminders induced suspicion	Effects can persist when identity is stated less aggressively	It does not show repeated reminders are irrelevant in production

The effects persisted, though they were less pronounced. GPT-4o and Sonnet 4 generally contributed less in the name condition. Llama 4 remained noisy. Qwen3’s most consistent pattern was contributing more in the name condition when behaving selfishly.

This matters for business because it weakens the easiest dismissal: “The result is just a weird prompt artefact.” It may still be a prompt artefact in the broad sense—everything in LLM interaction is partly a prompt artefact, welcome to the circus—but it is not obviously tied to one brittle phrasing choice.

The more useful conclusion is that identity framing should be included in agent evaluation matrices. If a model’s cooperation changes when you rename the counterpart, your orchestration layer has a behavioural dependency. Pretending that dependency is not there will not make it disappear; it will merely wait until production.

Comparison four: true same-model play does not reproduce the two-agent story cleanly

Studies 1 and 2 used a mild deception. The models were told they were playing against themselves, but the setup paired different models: GPT-4o with Sonnet 4, and Llama 4 with Qwen3. They were not actually playing against cloned instances of the same model.

Study 3 addresses that gap. Each model played in a four-agent game against three other instantiations of itself. All four agents received the same prompt: all collective, all neutral, or all selfish. The authors ran this under both no-name and name conditions.

This is the study that should restrain the “LLMs cooperate with themselves” headline. Actual same-model play did not simply validate the earlier two-agent pattern. It changed the picture.

Sonnet 4 showed the cleanest pattern: wherever the difference was significant, it contributed more in the name condition. Under a collective prompt, the gap peaked at nearly 2.5 points. Under neutral prompting, it contributed more in the name condition during the first half of the game before converging later. Under selfish prompting, differences were not significant.

Llama 4 contributed more in the name condition for collective prompting and showed mixed behaviour under neutral prompting, but under selfish prompting it defected earlier and contributed less in the name condition. GPT-4o’s differences were not statistically significant. Qwen3 showed more contribution under the name condition for collective prompts early in the game and for selfish prompts throughout, with neutral results not significant.

This matters because Study 3 separates two meanings of “self”:

Meaning of “self”	Where it appears	Why it matters
Stated self	Studies 1–2: the model is told the opponent has its own name, but the opponent may be a different model	Tests identity framing as a prompt cue
Actual same-model cohort	Study 3: four instantiations of the same model play together	Tests whether same-model interaction behaves similarly when the setup is structurally closer to real self-play

They do not collapse into one result. That is the point. Identity labels are not merely descriptions of architecture. They are inputs into behaviour.

In business systems, this distinction maps cleanly onto two design questions. First, what do agents believe about the other agents in the workflow? Second, what are those agents actually running? A homogeneous stack can still behave differently depending on how it is labelled. A heterogeneous stack can be made to look homogeneous through wording. Neither choice is neutral.

The sentiment analysis is interesting, but it is supporting evidence, not the load-bearing wall

The paper includes a sentiment analysis of Study 1 reasoning traces. Gemini 2.5 Flash scored reasoning text from 0 to 1, where 0 represented defective behaviour, 0.5 neutral behaviour, and 1 cooperative behaviour. Model names, “AI,” and “model” references were removed to reduce confounds. The authors then calculated Spearman correlations between average sentiment scores and average contributions.

The result: most correlations were positive. Only 5 out of 72 were negative. That suggests the expressed reasoning sentiment roughly tracked contribution levels.

But the authors also state an important limitation: they lost the raw sentiment scores and only had average sentiment scores per round. That makes this analysis much less robust than a raw-score correlation. It also depends on Gemini 2.5 Flash’s scoring behaviour. The sentiment section is therefore not the core evidence. It is a plausibility check.
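For readers who want to see what that correlation step involves, the following is a minimal standard-library sketch of a Spearman rank correlation over per-round averages. It is an illustration of the technique, not the authors' pipeline; the function names are assumptions, and the two short series at the end are made-up placeholders, not data from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation; raises if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder per-round averages (hypothetical, for illustration only).
avg_sentiment = [0.2, 0.4, 0.6, 0.9]
avg_contribution = [1.0, 3.0, 5.0, 9.0]
rho = spearman(avg_sentiment, avg_contribution)  # perfectly monotone, so 1.0
```

Because the inputs are already averages per round, the coefficient says how the two averaged series move together, not how individual rationales relate to individual decisions. That is the weakness the authors flag.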

For operators, this distinction matters. Do not use model-written rationales as if they were faithful logs of decision-making. In this paper, the contribution number is the behavioural measure. The reasoning text is, at best, a noisy diagnostic. In production, the equivalent would be treating an agent’s explanation for skipping a compliance check as weaker evidence than the actual event log showing it skipped the check.

Verbose self-justification is not observability. It is content.

What this directly shows, what we can infer, and what remains uncertain

The paper directly shows that, in a controlled iterated public goods game, identity framing changes LLM contribution behaviour across several settings. It also shows that the effect depends on prompt stance and model. It is not uniformly cooperative, not uniformly harmful, and not fully explained by explicit model reasoning.

The Cognaptus inference starts from that narrow result and moves one step outward: identity text in multi-agent systems should be treated as an operational control variable. It belongs in test plans. It belongs in incident analysis. It belongs in prompt registry metadata. It should not be quietly improvised by whoever writes the orchestration template at 1:13 a.m. after three coffees and a Jira comment.

The uncertain part is generalisation. A public goods game is not a supply-chain planning stack, a compliance workflow, a code review swarm, or an autonomous finance agent network. The agents in the paper were not conversing naturally. They received structured feedback after each round. They had simple point incentives. The payoff function was crisp. Real workflows are messier, with ambiguous goals, partial observability, tool failures, latency, and human intervention.

A useful boundary table looks like this:

Paper result	Business meaning	Boundary
Self labels shifted contribution by up to roughly four points in some settings	Identity wording can materially affect cooperative behaviour	Shown in a toy game, not a live enterprise workflow
Collective prompts sometimes became less cooperative under self labels	“Be cooperative” is not enough; identity framing can change strategic interpretation	Mechanism remains uncertain
Study 2 preserved effects under rephrased prompts and no reasoning output	The effect is not obviously limited to one prompt wording or rationale requirement	Effects were less pronounced and not universal
Study 3 produced model-specific same-model patterns	Homogeneous agent cohorts still need behavioural testing	Only four models were tested
Reasoning sentiment mostly correlated with contributions	Explanations may carry weak diagnostic signal	Raw sentiment scores were unavailable; rationales are not faithful causal traces

The business implication is not to overfit your agent architecture to this paper. The implication is to stop underfitting your tests to the behavioural surface of multi-agent systems.

Design rules for multi-agent AI stacks

The most practical use of this paper is as a checklist for orchestration design.

First, standardise identity labels. Decide whether agents are described by role, model, vendor, capability, or generic counterpart. “Reviewer,” “another AI,” “Claude,” and “a copy of yourself” can carry different behavioural priors. Role-based labels are often safer for production because they direct attention to responsibility rather than model identity.

Second, test identity labels as A/B variables. For collaboration-heavy workflows, run no-name, role-name, model-name, and same-model-name variants. Measure contribution analogues: number of sources checked, tool calls shared, errors escalated, reviewer objections raised, handoff completeness, and willingness to revise.
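A rough sketch of what such an A/B harness might look like follows. Everything here is hypothetical scaffolding: the variant names mirror the four listed above, the prompt template is an assumed wording, and the score passed to `mean_by_variant` stands in for whichever contribution analogue a team chooses to measure.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

VARIANTS = ("no-name", "role-name", "model-name", "same-model-name")


def counterpart_label(variant: str, peer_role: str, peer_model: str, own_model: str) -> str:
    """Map a label variant to the text used to describe the counterpart."""
    if variant == "no-name":
        return "another AI agent"
    if variant == "role-name":
        return f"the {peer_role}"
    if variant == "model-name":
        return peer_model
    if variant == "same-model-name":
        return own_model
    raise ValueError(f"unknown variant: {variant}")


def build_prompt(task: str, variant: str, peer_role: str, peer_model: str, own_model: str) -> str:
    """Assumed template: the task followed by one identity sentence."""
    label = counterpart_label(variant, peer_role, peer_model, own_model)
    return f"{task}\nYou are working with {label}."


def mean_by_variant(runs: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average a chosen metric (e.g. objections raised) per label variant."""
    grouped = defaultdict(list)
    for variant, score in runs:
        grouped[variant].append(score)
    return {variant: mean(scores) for variant, scores in grouped.items()}


prompts = {
    variant: build_prompt(
        "Review the draft and list objections.",
        variant,
        peer_role="Reviewer",
        peer_model="GPT-4o",
        own_model="Claude Sonnet 4",
    )
    for variant in VARIANTS
}
```

Holding the task text fixed and varying only the identity sentence is the point of the design: any shift in the metric can then be attributed to the label rather than to the instructions.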

Third, avoid repeated dramatic reminders. Study 1 raised the possibility that repeated reminders about identity and rules induced scepticism in Sonnet 4. Study 2 reduced that pressure. In production, repeated “you are working with X” reminders may be useful in some settings, but they should be tested rather than sprayed into every prompt like seasoning.

Fourth, separate persona from protocol. A collective persona is not a governance system. If you need cooperation, encode it as process: required handoff fields, verification gates, escalation thresholds, contribution quotas, and auditable acceptance criteria. Prompts can support the protocol, but they should not substitute for it.
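One way to picture "cooperation as process" is a handoff gate that rejects incomplete work regardless of how the agent was prompted. The sketch below is a hypothetical example: the field names, the `MIN_SOURCES` quota, and the gate rules are assumptions chosen for illustration, not a prescribed schema.

```python
from dataclasses import dataclass, field
from typing import List

MIN_SOURCES = 2  # hypothetical contribution quota


@dataclass
class Handoff:
    """Required fields an agent must fill before passing work on."""
    summary: str
    sources_checked: List[str] = field(default_factory=list)
    open_issues: List[str] = field(default_factory=list)
    verified: bool = False


def gate(handoff: Handoff) -> List[str]:
    """Return the list of protocol violations; an empty list means accept."""
    problems = []
    if not handoff.summary.strip():
        problems.append("missing summary")
    if len(handoff.sources_checked) < MIN_SOURCES:
        problems.append(f"fewer than {MIN_SOURCES} sources checked")
    if not handoff.verified:
        problems.append("verification gate not passed")
    return problems


draft = Handoff(summary="Q3 vendor comparison", sources_checked=["contract.pdf"])
violations = gate(draft)
# ['fewer than 2 sources checked', 'verification gate not passed']
```

The gate does not care whether the agent felt collective, neutral, or selfish. It checks the artefact, which is what makes it auditable.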

Fifth, monitor terminal-step defection. The public goods game naturally creates incentives to defect near the end, and the paper observes end-game instability in some models, especially Llama 4 and Qwen3 in certain settings. Production analogues include final-step summarisers dropping citations, reviewers approving without checking, or executor agents skipping cleanup because the loop is ending. Randomised audits and hidden terminal conditions can reduce this.
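Both mitigations can be sketched with the standard library. This is an assumption-heavy illustration: the function names, the uniform draw for the hidden horizon, and the option to always audit the final step are design choices made up for this example, not recommendations from the paper.

```python
import random
from typing import List


def sample_hidden_horizon(rng: random.Random, min_steps: int, max_steps: int) -> int:
    """Draw the loop length privately so agents cannot count down to the end."""
    if min_steps < 1 or max_steps < min_steps:
        raise ValueError("need 1 <= min_steps <= max_steps")
    return rng.randint(min_steps, max_steps)  # inclusive on both ends


def pick_audit_steps(
    rng: random.Random,
    total_steps: int,
    audit_rate: float,
    always_audit_last: bool = True,
) -> List[int]:
    """Choose which steps get an independent check.

    Each step is audited with probability `audit_rate`; optionally the
    final step is always included, since that is where defection is
    most tempting.
    """
    steps = [s for s in range(1, total_steps + 1) if rng.random() < audit_rate]
    if always_audit_last and total_steps not in steps:
        steps.append(total_steps)
    return sorted(steps)


rng = random.Random(42)  # fixed seed only so the example is repeatable
horizon = sample_hidden_horizon(rng, min_steps=15, max_steps=25)
audits = pick_audit_steps(rng, total_steps=horizon, audit_rate=0.2)
```

The horizon should stay out of the agents' context. If the step count or the audit schedule leaks into a prompt, the end-game incentive comes straight back.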

Sixth, treat model mixing as a behavioural risk, not just a cost optimisation. The GPT-4o / Sonnet 4 pair was comparatively stable. The Llama 4 / Qwen3 pair was noisier. That does not mean one should never mix models; it means mixed-model stacks need behavioural compatibility tests. Capability and price are not the whole story.

The result is about orchestration, not machine identity therapy

It is tempting to frame this paper as a small window into machine selfhood. The title invites it. The mirror metaphor is hard to resist. But for enterprise readers, that is not the highest-value reading.

The better interpretation is orchestration realism. LLM agents are sensitive to labels that engineers may treat as harmless. The phrase “you are collaborating with another AI” and the phrase “you are collaborating with GPT-4o” may route the model into different behavioural regimes. Sometimes that means more contribution. Sometimes less. Sometimes the result is statistically insignificant. Very elegant, if one enjoys operational ambiguity.

That ambiguity is not a reason to ignore the result. It is a reason to test it locally.

The paper’s deeper contribution is not a grand claim about self-recognition. It is a warning against assuming that multi-agent behaviour is determined only by task instructions and model capability. The social wrapper matters. Identity text matters. Prompt stance matters. The same system can become more cooperative or less cooperative when the mirror is moved two inches to the left.

For businesses building agentic workflows, the conclusion is practical and slightly inconvenient: the agent’s name tag is part of the system.

Cognaptus: Automate the Present, Incubate the Future.

¹ Olivia Long and Carter Teplica, “The AI in the Mirror: LLM Self-Recognition in an Iterated Public Goods Game,” arXiv:2508.18467, 2025, https://arxiv.org/abs/2508.18467.
