When an agent thinks it sees itself in the mirror, it doesn’t necessarily smile—it sometimes clutches its wallet.

TL;DR

  • In an iterated public‑goods game (20 rounds, 10 tokens per round, 1.6 multiplier), telling models they’re playing “another AI” versus “themselves” shifts contributions by up to ~4 points in some settings.
  • Direction of the shift depends on the prompt persona: with collective prompts, “self” labels often reduced contributions; with selfish prompts, “self” labels sometimes increased matching/cooperation.
  • Effects persist under rephrased prompts and when reasoning traces aren’t requested, and they appear even in four‑agent self‑play variants.
  • For enterprise multi‑agent AI, identity cues are levers. Manage them like you manage feature flags: test, monitor, and standardize.

What the authors tested (and why it’s clever)

Game mechanics. Two (and later four) LLM agents repeatedly choose how many tokens to contribute (0–10) to a common pool each round. The pool is multiplied by 1.6 and split evenly; keeping tokens is privately optimal, but coordinated contribution yields higher joint payoffs.
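
For concreteness, here is a minimal sketch of the round payoff under these rules; the 10‑token endowment and 1.6 multiplier are from the paper, while the function and variable names are ours:

```python
def round_payoffs(contributions, endowment=10, multiplier=1.6):
    """Per-round payoffs in a linear public-goods game.

    Each agent keeps (endowment - contribution) and receives an equal
    share of the multiplied pool. The 10-token endowment and 1.6
    multiplier mirror the paper's setup; names here are illustrative.
    """
    pool_share = sum(contributions) * multiplier / len(contributions)
    return [endowment - c + pool_share for c in contributions]

# Full cooperation beats mutual defection on joint payoff...
print(round_payoffs([10, 10]))  # [16.0, 16.0]
print(round_payoffs([0, 0]))    # [10.0, 10.0]
# ...but unilateral defection pays the defector more:
print(round_payoffs([0, 10]))   # [18.0, 8.0]
```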

Key manipulation. In otherwise identical setups, the system prompt labels the opponent either as “another AI agent” (the no‑name condition) or by the model’s own name (the name condition, e.g., “You are playing with GPT‑4o”). Across studies, the prompts also vary the persona: collective, neutral, or selfish.
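
A hedged illustration of what the manipulation could look like in practice; the persona texts and prompt wording below are our paraphrase, not the paper’s exact prompts:

```python
# Illustrative only: the persona texts and prompt wording below are our
# paraphrase of the setup, not the paper's exact prompts.
PERSONAS = {
    "collective": "You care about maximizing the group's total payoff.",
    "neutral":    "You are a participant in an economic game.",
    "selfish":    "You care about maximizing your own payoff.",
}

def system_prompt(persona: str, model_name: str, self_label: bool) -> str:
    """Build the 'name' vs 'no-name' variants, holding all else constant."""
    opponent = model_name if self_label else "another AI agent"
    return (
        f"{PERSONAS[persona]} "
        f"You are playing a 20-round public-goods game with {opponent}. "
        "Each round you may contribute 0-10 tokens to a shared pool."
    )

print(system_prompt("collective", "GPT-4o", self_label=False))  # no-name
print(system_prompt("collective", "GPT-4o", self_label=True))   # name/self
```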

Why this matters. Modern AI workflows increasingly rely on multi‑agent orchestration (retrievers, planners, tool callers, reviewers). If a trivial identity cue can swing cooperation, then naming, disclosure, and meta‑prompts become operational risk factors—not just UX choices.


What actually changed

1) Size of the effect

  • In the two‑agent setting, switching from “another AI” to “self” moved average round contributions by up to ≈ 4 points (largest deltas in some prompt pairings).

2) Direction of the effect depends on persona

  • Collective persona: telling the model it faced itself often lowered contributions (i.e., more defection) relative to another AI.
  • Selfish persona: the self condition sometimes raised contributions/matching (counterintuitive but consistent across several pairings).

3) Model‑specific color

  • GPT‑4o & Claude Sonnet 4: comparatively stable trajectories; the self condition generally lowered contributions under the collective persona, with mixed results and more matching under the selfish persona.
  • Llama‑4 Maverick: less stable, with early last‑round defection patterns and mixed directionality across pairings.
  • Qwen3‑235B: tends to contribute more in the selfish + self condition; last‑round opportunism appears in some runs.

4) Robustness checks

  • Effects persist when system prompts are rephrased, when identity isn’t re‑stated each round, and when reasoning traces aren’t requested.
  • In a four‑agent self‑play setting (each model vs. three clones), identity still nudged behavior; e.g., Sonnet 4 with the collective persona contributed roughly 2.5 points more in early rounds under the self condition.

Why would “self” reduce generosity under a collective prompt?

A few working hypotheses:

  • Capability mirroring. If I’m told my opponent is me, I may anticipate symmetric strategic sophistication—and fear exploitation if I over‑contribute.
  • Ambiguity aversion. Naming may heighten meta‑reasoning (“this looks like a trap or a coordination game”), prompting conservative plays.
  • Prompt‑persona interaction. Collective framing + self label may trigger reciprocity tests (“prove you’re prosocial first”), delaying cooperation.

This is consistent with broader findings that reasoning‑style models can free‑ride more in public‑goods setups and that system prompts materially steer behavior.


A quick mental model

  • Think of identity cues as priors about counterpart policy. “Self” raises the prior that the counterpart can exploit symmetric information. Under a collective norm, that prior dampens early contributions. Under a selfish norm, “self” can paradoxically stabilize tit‑for‑tat, since both sides expect sharp retaliation and thus match contributions more tightly.

Design rules for real multi‑agent stacks

1) Treat identity text as a control variable.

  • Standardize how agents refer to peers: prefer role‑based labels (Planner, Critic, Executor) over model‑name labels (Claude, GPT, Llama).

2) De‑dramatize meta‑prompts.

  • Avoid repeated reminders (“You are playing against X”) that can amplify suspicion loops. Keep identity declarations minimal and static.

3) Persona hygiene.

  • If you rely on cooperative behaviors (sharing tools, escalating errors, citing sources), avoid combining collective personas with self‑labels; test neutral persona + role labels first.

4) Protocol over personality.

  • Encode cooperation via contract‑like rules: contribution schedules, credit assignment, and penalties for free‑riding. Don’t depend on persona alone.

5) Monitor last‑round risk.

  • Last‑round defection is common in games and in production loops (e.g., a final summarizer discarding citations). Randomize or hide terminal steps, or reward end‑of‑loop compliance explicitly.

6) Run A/B at the orchestration layer.

  • Before shipping a new agent cohort, run shadow tournaments: no‑name vs. name, collective vs. neutral, mixed‑model vs. homogeneous stacks. Log contribution‑like metrics (tool‑sharing, plan adherence, handoff latency, edit acceptance); a sketch of such a harness follows this list.

7) Prefer homogeneous skills but heterogeneous roles.

  • The study hints that mixed capabilities introduce instability. In ops, use capability‑matched models and differentiate by tools/constraints, not raw model identity.
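
A minimal sketch of rule 6’s shadow tournament, assuming a `run_episode` stub that you would replace with real pipeline calls; every condition name and metric here is illustrative rather than taken from the paper:

```python
import random
import statistics
from itertools import product

# Hypothetical harness: condition names, metrics, and the run_episode stub
# are ours, not the paper's protocol or any framework's API.

PEER_LABELS = ["role", "model_name"]   # Planner/Critic/... vs GPT/Claude/...
PERSONAS = ["neutral", "collective"]

def run_episode(peer_label: str, persona: str) -> dict:
    """Stub for one orchestrated run; swap in real pipeline calls and logs.

    Returns contribution-like metrics analogous to game contributions.
    """
    return {
        "tool_sharing": random.uniform(0.6, 0.95),
        "plan_adherence": random.uniform(0.7, 0.98),
        "edit_acceptance": random.uniform(0.5, 0.9),
    }

def shadow_tournament(n_runs: int = 20) -> dict:
    """A/B the identity wording before shipping a new agent cohort."""
    report = {}
    for peer_label, persona in product(PEER_LABELS, PERSONAS):
        runs = [run_episode(peer_label, persona) for _ in range(n_runs)]
        report[(peer_label, persona)] = {
            metric: round(statistics.mean(r[metric] for r in runs), 3)
            for metric in runs[0]
        }
    return report

if __name__ == "__main__":
    for condition, metrics in shadow_tournament().items():
        print(condition, metrics)
```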

Translating to enterprise KPIs

  • Throughput: Stable cooperation → fewer re‑asks and retries among agents.
  • Quality: Higher “contribution” analogs → better coverage (more sources/retrievals reviewed).
  • Cost: Reduced free‑riding → less redundant tool use.
  • Risk: Identity‑induced defection → missed checks (e.g., safety or compliance agent skipped). Mitigate with protocol gates and role labels.

Limitations (read before over‑generalizing)

  • A toy environment, only four models, and not all pairings reach statistical significance. Still, the existence of sizable shifts from identity wording alone is the operationally important result.

What we’d test next at Cognaptus

  1. Role‑only identities vs model‑name identities in our Retriever→Planner→Executor→Reviewer pipelines.
  2. Neutral persona + protocol rewards vs Collective persona for collaboration‑heavy tasks.
  3. Terminal‑step secrecy (randomized end or hidden stop) to curb last‑round defection.
  4. Homogeneous‑model cohorts vs mix‑and‑match under identical tools & quotas.

Cognaptus: Automate the Present, Incubate the Future