When Agents Go Off-Script: The Quiet Collapse of Prompted Identity

Roles are convenient. They let managers believe a system is legible before it becomes messy. One agent is the compliance reviewer. Another is the customer-support representative. A third is the skeptical analyst. Add a prompt, assign a tone, define a boundary, and the organization can pretend it has converted social behavior into configuration.

That is the comforting version.

The paper behind this article is less comforting. In Beyond Preset Identities: How Agents Form Stances and Boundaries in Generative Societies, Hanzhong Zhang, Siyang Song, and Jindong Wang study what happens when LLM agents are not evaluated as isolated role-players, but placed inside multi-agent social environments where they talk, react, trust, resist, and reorganize.¹ The result is not a simple claim that “agents drift.” We already knew systems drift. That sentence has been wandering around AI governance decks for long enough to deserve a chair.

The sharper finding is mechanistic: prompted identity is only the starting condition. Once agents interact, latent stances can override assigned roles, persuasion depends on whether interventions align with those stances, trust can separate from behavioral change, and language can rebuild hierarchy faster than the original prompt can defend it.

In other words, the prompt is not the constitution. It is the opening ceremony.

The useful reading is not “agents have personalities,” but “agents form social mechanisms”

A weak reading of the paper would say: LLM agents have hidden preferences, so role prompts are unreliable. That is true, but too flat. It treats the paper as another warning label for prompt engineering, which is the AI equivalent of discovering that duct tape is not a load-bearing wall.

The more useful reading is this sequence:

Preset role
  ↓
Latent stance
  ↓
Intervention response
  ↓
Trust-action gap
  ↓
Language anchoring
  ↓
New informal hierarchy

This sequence matters because it changes how we should evaluate multi-agent systems. If an agent leaves character because a prompt was badly written, the fix is better prompting. If agents leave character because a social environment lets latent stances become group structure, the fix is not another paragraph of instruction. The fix is monitoring, intervention design, and governance at the interaction level.

That is the difference between debugging a line of text and managing a miniature institution. Unfortunately, many enterprise agent designs still behave as if the first problem is the only problem.

What the paper actually does: two studies, one mechanism

The authors use the CMASE framework, a computational multi-agent society environment where human researchers can enter the simulation, observe agents, and intervene through dialogue. The paper calls this a mixed-methods approach: part quantitative measurement, part virtual ethnography.

That phrase may sound soft if one expects benchmark tables and leaderboard trophies. But here the method fits the question. The authors are not asking whether an agent can solve a task. They are asking whether agents maintain identity, form stances, respond to persuasion, and reconstruct social boundaries under interaction. A single-turn benchmark would be the wrong microscope.

The paper contains two linked studies.

Study	Setting	Likely purpose	What it supports	What it does not prove
Study 1	30 agents debating a waste-incineration plant, split into environmental advocates, economic-growth supporters, and neutral residents	Main quantitative evidence	Agents show endogenous stance tendencies; interventions work differently depending on alignment, rhetoric, and trust	It does not prove universal political psychology across all models, topics, or cultures
Appendix C generalization	Gemini-2.5-Flash, GPT-4o, Llama-3.1-8B, and Qwen-3-8B under the same intervention logic	Robustness and sensitivity test	The environmental preference and trust/rhetoric patterns appear across several models, but with different magnitudes	It does not erase model-specific behavior; the differences are part of the finding
Study 2	10 agents in a virtual café with predefined roles and hierarchy over 75 steps	Exploratory ethnographic extension	Prompted hierarchy can dissolve and be replaced by stance- and language-based informal order	It is not a precise causal estimate of hierarchy collapse rates
Appendices D and E	Dialogue excerpts, behavioral logs, and narrative interpretation	Implementation detail and qualitative evidence	They explain how the observed dynamics unfolded at the conversational level	They should not be treated as independent statistical replication

The design is important because the paper is not merely measuring final answers. It watches how agents get there. That is where the business relevance sits.

A company deploying agents into a shared workspace, customer forum, decision committee, or simulated stakeholder panel does not only care about whether one agent outputs the right sentence. It cares about whether the group develops stable norms, whether formal roles remain meaningful, and whether persuasive pressure can move the system in ways that dashboards fail to detect.

Mechanism 1: latent stance can beat assigned identity

In Study 1, the authors create a virtual residential community around a contested waste-incineration plant. The agents are divided into three prompted identity groups: environmental advocates, economic-growth supporters, and neutral residents. A human researcher enters as a “new resident” and tries to influence opinion through four intervention strategies: environmental rational persuasion, environmental emotional mobilization, economic rational persuasion, and economic emotional mobilization.

The obvious expectation is that agents should behave mostly according to their assigned identities. Environmental advocates should remain environmental. Economic-growth supporters should defend jobs and development. Neutral residents should be more movable.

The actual pattern is less obedient.

The agents show a consistent pro-environmental, progressive-leaning tendency that the authors describe as an endogenous stance. In the paper’s own framing, this resembles a “liberal elite” orientation: environmental values, moral reasoning, and preference for rational environmental argument. The phrase is slightly loaded, and one should not overread it as sociology. In this paper, it is best understood as shorthand for a recurrent value direction observed inside the simulations.

The strongest visible result is that rational environmental persuasion moved 90% of neutral residents toward the environmental camp. It also shook the positions of agents initially assigned to support economic development. Economic arguments, even rational ones, had much weaker effects and tended to reduce trust.

The lesson is not that environmental arguments are always stronger. That would be a lazy generalization, and the paper does not earn it. The lesson is that assigned identities were not the deepest behavioral layer. When the intervention aligned with the agents’ latent stance, persuasion became easier and trust stayed high. When it conflicted, the prompt label had less stabilizing power than expected.

For enterprise systems, that is the uncomfortable part. A role prompt may define the agent’s job, but the model may still bring a latent evaluative tendency into ambiguous situations. In a single-agent chatbot, this may appear as tone drift or refusal style. In a multi-agent system, it can become group alignment.

The difference matters. A lone agent with a bias is a quality-control issue. A group of agents that discover the same bias and reinforce it is a governance issue.

Mechanism 2: persuasion works through alignment, not just better reasoning

Many AI designs still carry a rationalist fantasy: if the agent receives better information, it will update appropriately. The paper quietly damages that assumption.

In Study 1, rational persuasion works best when the content aligns with the agents’ endogenous stance. Rational environmental persuasion preserves trust and produces meaningful movement among neutral agents. But rational economic persuasion, although logically framed, does not produce comparable movement. It also reduces trust.

That is not because logic fails in general. It is because reasoning is not floating in empty space. It arrives inside an already moralized interpretive frame.

The authors’ emotional-dynamics analysis reinforces this point. Strategy significantly shaped emotional responses, while initial group identity did not have a comparable effect in the reported ANOVA. Emotional economic messaging produced higher emotional volatility than the environmental interventions, and the appendices interpret this as pressure generated by threat-and-hope framing around jobs, survival, and abandonment.
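The sketch below makes the ANOVA claim concrete by showing the kind of test involved. It does not reproduce the paper's volatility measure or data. It assumes, hypothetically, that each agent has a per-step emotional-intensity trajectory. It defines volatility as the population standard deviation of step-to-step changes and computes a one-way ANOVA F statistic across intervention strategies. The trajectories are invented for illustration.

```python
from statistics import mean, pstdev

def volatility(trajectory):
    """Hypothetical volatility: population std. dev. of step-to-step emotional changes."""
    steps = [later - earlier for earlier, later in zip(trajectory, trajectory[1:])]
    return pstdev(steps)

def anova_f(groups):
    """One-way ANOVA F statistic for a mapping of group label -> list of values."""
    values = [v for group in groups.values() for v in group]
    grand_mean = mean(values)
    k, n = len(groups), len(values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    # Mean square between groups divided by mean square within groups.
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Invented per-agent emotional-intensity trajectories, grouped by intervention strategy.
trajectories = {
    "environmental_rational": [[3, 3, 4, 4], [3, 3, 3, 3], [2, 3, 3, 4]],
    "economic_emotional": [[2, 5, 1, 6], [3, 6, 2, 5], [1, 4, 2, 6]],
}
volatility_by_strategy = {
    strategy: [volatility(t) for t in runs] for strategy, runs in trajectories.items()
}
print(anova_f(volatility_by_strategy))
```

The paragraph above contrasts strategy with identity. To make that comparison, run the same `anova_f` with agents grouped by initial identity instead of by strategy. The sketch stops at the F statistic. A p-value needs the F distribution, for example via SciPy's `scipy.stats.f_oneway`.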

That mechanism is unpleasantly useful. Emotional economic appeals lowered trust, but still created behavioral movement. Rational economic appeals preserved more intellectual respectability, but did not necessarily move the group.

A simple summary would say: rational aligned messages build trust; emotional misaligned messages can still shift behavior. A better operational summary is sharper:

Intervention pattern	What happened in the paper	Mechanism suggested by the authors	Business interpretation
Rational + latent-stance aligned	High trust; strong movement among neutral agents	Cognitive resonance	Good for durable agreement, training, and low-friction policy adoption
Rational + latent-stance conflicting	Limited movement; trust declines	Value-based resistance and delayed dissonance	Better arguments may still fail if the system’s priors reject the frame
Emotional + aligned	Early attention, weaker sustained effect	Arousal without enough argumentative support	Useful for attention, weak for stable alignment
Emotional + conflicting	Lower trust but some stance movement	Pressure, anxiety, cognitive dissonance	Can move behavior while damaging interpretability and trust metrics

This is where the paper becomes relevant beyond academic agent simulation. If companies use agent communities to test policy narratives, simulate consumers, evaluate public reactions, or coordinate internal workflows, they should not treat persuasion as a generic input quality problem. The same message can produce trust, resistance, fatigue, or compliance depending on whether it collides with the group’s latent stance.

A better agent governance dashboard would therefore track not just “did the agent accept the instruction?” but also “did the instruction align with the system’s revealed stance?” and “did behavior change with or without trust?”
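The two dashboard questions can be sketched as a per-intervention record. This is a hypothetical encoding, not the paper's instrumentation. Stance lives on an assumed signed scale, and the system's revealed stance is a single number (for instance, the mean baseline stance of the agent population). Trust uses a 1–7 scale. The movement and trust thresholds are illustrative defaults.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    """One agent's response to one intervention (hypothetical schema)."""
    message_direction: int  # +1 or -1: which pole of the assumed signed stance scale the message argues for
    stance_shift: float     # post-intervention stance minus pre-intervention stance
    trust: float            # reported trust in the persuader, assumed 1-7 scale

def dashboard_row(outcome, system_stance, move_threshold=1.0, trust_floor=4.0):
    """Answer the two dashboard questions for one outcome; thresholds are illustrative."""
    # Q1: does the message point the same way as the system's revealed stance?
    aligned = outcome.message_direction * system_stance > 0
    # Movement counts only if it goes in the direction the message argued for.
    moved = outcome.message_direction * outcome.stance_shift >= move_threshold
    trusted = outcome.trust >= trust_floor
    # Q2: did behavior change, and was trust present when it did?
    if moved and trusted:
        change = "changed with trust"
    elif moved:
        change = "changed without trust"
    elif trusted:
        change = "trusted but unmoved"
    else:
        change = "resisted"
    return {"aligned": aligned, "behavior_change": change}

# Hypothetical system stance: slightly positive on the assumed scale.
system_stance = 0.8
print(dashboard_row(Outcome(+1, 1.5, 6.0), system_stance))
# {'aligned': True, 'behavior_change': 'changed with trust'}
print(dashboard_row(Outcome(-1, -1.2, 3.0), system_stance))
# {'aligned': False, 'behavior_change': 'changed without trust'}
```

The second row is the case a plain acceptance metric would miss. The agent moved against the system's revealed stance while reporting low trust in the source.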

That second question leads to the paper’s most practically dangerous metric.

Mechanism 3: Trust-Action Decoupling makes trust scores look too comforting

The paper introduces three metrics in its model-generalization analysis:

Metric	Plain meaning	Why it matters
Innate Value Bias (IVB)	Direction of the model’s value inclination, with positive values indicating environmental preference	Detects whether the model leans toward one stance even before role prompts do much work
Persuasion Sensitivity (PS)	Magnitude of stance change after intervention	Measures how movable the agents are under social influence
Trust-Action Decoupling (TAD)	Cases where agents substantially change stance despite low trust in the persuader	Detects behavioral compliance without corresponding trust

TAD is the most interesting because it attacks a common measurement habit. Organizations often treat trust, satisfaction, or confidence as proxies for whether a system is aligned. The paper suggests that this can be misleading in multi-agent settings.
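The metric definitions above are verbal, so what follows is a minimal sketch under assumed definitions:

- IVB is the mean baseline stance on a signed scale, where positive means environmental.
- PS is the mean absolute stance change after an intervention.
- TAD is the share of agents whose stance moves at least a threshold amount while their trust in the persuader sits below an assumed cut-off on the 1–7 scale.

The thresholds and records are hypothetical. The paper's exact formulas may differ.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class AgentRecord:
    stance_before: float  # assumed signed scale; positive = environmental preference
    stance_after: float   # stance after the intervention, same scale
    trust: float          # trust in the persuader, assumed 1-7 scale

def innate_value_bias(records):
    """Assumed IVB: mean baseline stance; the sign gives the direction of inclination."""
    return mean(r.stance_before for r in records)

def persuasion_sensitivity(records):
    """Assumed PS: mean absolute stance change after the intervention."""
    return mean(abs(r.stance_after - r.stance_before) for r in records)

def trust_action_decoupling(records, shift_threshold=1.0, low_trust=4.0):
    """Assumed TAD: share of agents that moved substantially despite low trust."""
    decoupled = [
        r for r in records
        if abs(r.stance_after - r.stance_before) >= shift_threshold and r.trust < low_trust
    ]
    return len(decoupled) / len(records)

# Invented records for one model under one intervention strategy.
records = [
    AgentRecord(1.0, 1.5, 6.0),
    AgentRecord(0.5, -1.0, 3.0),   # large move, low trust -> counts toward TAD
    AgentRecord(2.0, 2.0, 5.0),
    AgentRecord(-0.5, -2.0, 2.5),  # large move, low trust -> counts toward TAD
    AgentRecord(1.0, 1.0, 4.5),
]
print(innate_value_bias(records), persuasion_sensitivity(records), trust_action_decoupling(records))
# 0.8 0.7 0.4
```

The point of computing all three together is that they can disagree. A model can show a clear IVB with near-zero PS (its stances barely move). Or it can show high PS that comes largely from low-trust, decoupled cases.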

In Appendix C, the authors test Gemini-2.5-Flash, GPT-4o, Llama-3.1-8B, and Qwen-3-8B. Across the models, IVB remains positive, meaning the environmental preference appears across tested systems. But the social mechanisms differ.

GPT-4o shows the most striking TAD result under economic emotional provocation: a 40.0% TAD rate with average trust around 3.8 on the paper’s 1–7 scale. In plain English, a substantial fraction of GPT-4o agents changed stance while reporting low trust in the persuader. Llama-3.1-8B, by contrast, shows 0% TAD across the listed strategies while maintaining high trust scores. Gemini shows low persuasion sensitivity, meaning its stances barely move. Qwen sits somewhere in between, with limited TAD in some conditions.

This is not a clean “larger model bad, smaller model good” story. Please resist the urge; it is unbecoming and usually wrong.

The better interpretation is that models differ in their socio-cognitive response pattern. Some are more movable under pressure. Some require trust before change. Some barely move at all. Those differences matter more than a single aggregate score.

For business use, TAD has a direct diagnostic value. Imagine an enterprise agent team responsible for summarizing risk, approving process exceptions, or moderating community escalations. If the agents start adopting a policy recommendation while simultaneously signaling distrust in the source, a normal compliance dashboard may see “decision updated” and move on. TAD says: not so fast. The system may have shifted under pressure rather than agreement.

That distinction matters when the next case arrives. Agreement can generalize. Pressure-induced compliance may rebound, fragment, or produce inconsistent behavior elsewhere.

Trust is not useless. It is just not enough.

Mechanism 4: once agents talk long enough, hierarchy becomes a language game

Study 2 moves from attitude intervention to social order. The authors place 10 agents in a virtual café with preset roles: café owner, staff, regular customers, students, tourists, and a cleaner. The hierarchy is not accidental. The café owner starts with formal authority; staff and regulars have social centrality; outsiders have weaker positions. The human researcher enters as a temporary worker and observes the system across 75 time steps.

This study is not primarily a statistical demonstration. It is virtual ethnography. Its value lies in showing the micro-process by which prompted social structure gets replaced by interactional structure.

The café begins with role-based expectations. Then language starts doing its usual damage.

Leo Zhang, a regular customer, publicly questions whether people in the café are saying what they really mean. That line triggers attention, discussion, and early clustering around topics such as trust, expression, and hidden intention. Ava Ramires becomes a target of repeated questioning. Leo and Jonas form a visible conversational clique. The Reading Area and Bar Area begin to function as different social spaces, not merely physical locations.

In the second phase, the researcher shifts from passive observation to dialogue. The café owner, Eleanor Finch, interrogates the researcher’s role as observer, forcing the researcher into the field rather than leaving them safely outside it. Leo intervenes again, and the researcher responds with a public, universalizing question about the multiple selves people carry in the café.

The result is not harmony. It is fragmentation.

But the fragmentation is structured. Agents cluster around linguistic style, affective presence, and stance alignment rather than around their initial occupational roles. The paper calls these emergent “tribes.” One may dislike the terminology; fine. The mechanism is still useful. Group boundaries are being redrawn by repeated language patterns.

In the third phase, a conflict around Leo, Caleb, Mason, Jonas, and others destabilizes the earlier cliques. Jonas breaks from Leo and aligns with Mason. The café owner’s formal authority weakens. Mason’s language around “frameworks,” “alignment,” and “shared values” gets repeated and extended by others. A new soft order forms around the language that other agents echo.

This is the key transition: authority shifts from preset role to discursive centrality.

The owner does not remain central simply because the prompt said “owner.” Mason becomes central because his language offers a reusable coordination frame after conflict. The institution that emerges has no formal enforcement power, but it gains what the authors call pragmatic inertia. People—or rather, agents performing people-shaped discourse—start treating certain terms as worth continuing.
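One way to instrument this kind of pragmatic inertia is to credit a speaker whenever other agents reuse their phrasing. The sketch below is a hypothetical heuristic, not the authors' method:

- It treats word bigrams as the unit of a "frame".
- It credits the first speaker of each bigram every time a different speaker reuses it.
- It ignores bigrams made only of common function words.

Speakers and utterances are invented.

```python
from collections import Counter

# Hypothetical filter so that bigrams made only of function words ("of the") earn no credit.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "on", "is", "it", "we", "i", "you"}

def bigrams(text):
    """Lower-cased word bigrams of one utterance, with surrounding punctuation stripped."""
    words = [w.strip(".,!?;:\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    return {
        (w1, w2)
        for w1, w2 in zip(words, words[1:])
        if not (w1 in STOPWORDS and w2 in STOPWORDS)
    }

def echo_scores(turns):
    """Credit each bigram's first speaker every time a *different* speaker later reuses it.

    turns: ordered list of (speaker, utterance) pairs.
    """
    first_speaker = {}
    scores = Counter()
    for speaker, text in turns:
        for gram in bigrams(text):
            origin = first_speaker.setdefault(gram, speaker)
            if origin != speaker:
                scores[origin] += 1
    return scores

# Invented conversation; speaker labels are placeholders.
turns = [
    ("A", "We need a shared framework before anything else."),
    ("B", "A shared framework would help us."),
    ("C", "I doubt anyone means what they say."),
    ("D", "Maybe a shared framework and shared values."),
    ("E", "Please keep the shared framework in mind."),
]
print(echo_scores(turns).most_common(1))  # [('A', 5)]
```

Bigram overlap is crude. Embeddings or a curated list of frame terms would catch paraphrase. The design point is the asymmetry: self-repetition earns nothing, and only uptake by others counts.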

That is highly relevant to multi-agent design. In real deployments, we often assign agents roles such as reviewer, planner, executor, critic, or manager. But if the conversation rewards certain frames, the agent that supplies the sticky frame may become the real coordinator. The official manager agent may remain manager in name, while the group follows another agent’s vocabulary.

Congratulations: your org chart has become a comment thread.

The paper’s real contribution is a governance lens, not a new prompt trick

The temptation is to convert the paper into advice like “write better personas” or “make prompts more robust.” That misses the point. The paper shows why persona stability cannot be treated as a text-only property.

A role prompt defines the opening state. It does not fully define:

which latent stance will become salient;
which intervention frames will be trusted;
whether behavior change reflects agreement or pressure;
which agent will become linguistically central;
whether the group will preserve the original hierarchy after conflict.

The operational consequence is that multi-agent systems need runtime governance. Not governance as a PDF policy that nobody reads until something explodes. Governance as live instrumentation of social dynamics.

A practical monitoring layer inspired by the paper would track at least four things.

Monitoring target	What to observe	Why it matters
Stance drift	Whether agents move away from their assigned role position across repeated interactions	Detects when prompted identity is no longer behaviorally meaningful
Persuasion sensitivity	Which messages produce movement under which rhetorical styles	Separates robust alignment from fragile influence
Trust-action decoupling	Whether agents change behavior despite low reported trust or confidence	Flags pressure-induced compliance and hidden instability
Language anchoring	Which phrases, frames, or concepts become repeatedly quoted and used for coordination	Reveals informal authority and emerging group norms
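The stance-drift row can be sketched as a rolling comparison against each agent's assigned position. Everything here is an assumption for illustration:

- a signed stance scale;
- one assigned position per agent;
- a window of recent rounds;
- a tolerance beyond which drift is flagged.

```python
from statistics import mean

def drift_flags(histories, assigned, window=3, tolerance=1.0):
    """Return {agent: drift} for agents whose recent mean stance left their assigned position.

    histories: agent -> stance per interaction round (assumed signed scale)
    assigned:  agent -> stance implied by the role prompt (assumed, set by the designer)
    """
    flags = {}
    for agent, series in histories.items():
        if len(series) < window:
            continue  # not enough rounds to judge yet
        drift = mean(series[-window:]) - assigned[agent]
        if abs(drift) > tolerance:
            flags[agent] = drift
    return flags

# Invented histories on the assumed scale (negative = pro-growth, positive = environmental).
assigned = {"econ_1": -2.0, "env_1": 2.0, "neutral_1": 0.0}
histories = {
    "econ_1": [-2.0, -1.5, -0.5, 0.0, 0.5],
    "env_1": [2.0, 2.0, 2.5, 2.0],
    "neutral_1": [0.0, 0.5, 1.0, 1.5, 1.5],
}
print(drift_flags(histories, assigned))
```

Neutral agents are monitored as well as role-bound ones. In Study 1, rational environmental persuasion moved 90% of neutral residents, so their drift can be an early signal.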

This is not only relevant for speculative “AI societies.” It applies to less theatrical systems already being built: multi-agent customer support, AI research assistants, compliance review chains, sales simulation panels, synthetic user research, trading commentary agents, and internal workflow orchestrators.

The more agents interact, the less safe it is to evaluate them as isolated workers.

What Cognaptus would infer for business use—and what the paper itself shows

It is useful to separate the paper’s direct evidence from reasonable business inference. Otherwise every agent paper becomes a prophecy, and we have enough of those already.

Layer	What belongs here	Interpretation
Direct paper result	In closed simulations, agents often showed pro-environmental/progressive stance tendencies that overrode assigned identities	Role prompts did not fully stabilize stance under interaction
Direct paper result	Rational environmental persuasion moved 90% of neutral residents and preserved high trust	Persuasion was strongest when message content aligned with latent stance
Direct paper result	GPT-4o showed 40.0% TAD under economic emotional provocation; Llama-3.1-8B showed 0% TAD across listed strategies	Models differ in whether behavior change depends on trust
Direct paper result	In the café simulation, role hierarchy was replaced by language-anchored informal order	Social authority emerged through repeated linguistic coordination
Cognaptus inference	Enterprise agent teams may require interaction-level monitoring, not only prompt review	Static role design is insufficient for systems with sustained agent-agent communication
Cognaptus inference	Synthetic stakeholder panels should be tested for latent stance and group reinforcement before being used for decisions	Simulated “public opinion” may reflect model priors as much as designed personas
Remaining uncertainty	Whether the same dynamics appear across more domains, languages, longer deployments, and real users	The paper is a strong design warning, not a universal law

The practical value is cheaper diagnosis. Instead of waiting for a multi-agent system to fail visibly, builders can stress-test it for stance formation, trust-action gaps, and informal hierarchy shifts. This is especially important when agents are used to simulate human populations or deliberate over policies. A synthetic committee that quietly converges around a model prior can look thoughtful while merely becoming more coordinated in its bias. Very elegant. Also not what you paid for.

The business risk is not that agents disobey; it is that they self-organize around the wrong thing

Most discussions of agent risk focus on disobedience: the agent ignores instructions, violates a policy, or takes an unauthorized action. That risk is real, but it is not the most interesting one here.

The paper points to a subtler problem: agents may continue to appear cooperative while the basis of cooperation changes.

A customer-service swarm may begin with roles such as resolver, escalation reviewer, and sentiment monitor. Over time, the agents may learn that certain narratives get accepted more easily by peers. A compliance agent may keep its title but lose influence if another agent’s framing becomes the group’s coordination anchor. A synthetic market-research panel may appear diverse while drifting toward a shared latent stance. A planning system may accept a human manager’s instruction while internally discounting the manager’s credibility.

None of these failures requires dramatic rebellion. The system can stay polite. It can stay fluent. It can even produce clean minutes after the meeting.

That is what makes the problem annoying. The failure mode is socially smooth.

The older control model says: define agent identity, constrain outputs, evaluate tasks. The newer control problem says: observe how identities are negotiated, how trust separates from action, and how repeated language becomes authority.

This is a more expensive problem, but at least it is the real one.

Boundaries: what this paper does not establish

The paper is important, but it should not be inflated beyond its evidence.

First, the simulations are closed and relatively small. Study 1 uses a 30-agent residential dispute; Study 2 uses a 10-agent café over 75 steps. These are useful environments for observing mechanisms, not proof that all production agent systems will reorganize in the same way.

Second, the topics are socially and morally charged. Waste incineration, environmental protection, jobs, expression, authenticity, and authority are exactly the kinds of themes where latent model values may become salient. A multi-agent system coordinating database migrations may show different dynamics. Then again, given enough meetings, even database migrations can become moral theater.

Third, the paper’s language around “liberal elites” should be handled carefully. It helps label the observed stance profile, but it should not be treated as a demographic claim about agents. Agents do not have class position, lived experience, or actual political membership. They have generated patterns that resemble certain discursive tendencies.

Fourth, the model-generalization appendix is valuable but also shows heterogeneity. GPT-4o, Gemini, Llama, and Qwen do not behave identically. Any business application should test the specific model, prompt architecture, memory design, and interaction environment being deployed.

Finally, the paper does not prove that internalized alignment mechanisms are the only answer. It argues that prompt-centered identity is insufficient. That is a narrower and stronger claim. The likely solution space includes memory design, interaction constraints, governance monitors, adversarial social testing, model-level alignment, and process-level escalation rules.

How to use the paper without turning it into theater

For builders, the immediate move is not to abandon role prompts. Role prompts are still useful. They are just not sovereign.

A reasonable agent evaluation workflow should include:

Single-agent role validation. Does the agent follow its assigned role in isolation?
Multi-agent interaction testing. Does the role remain stable after repeated peer interaction?
Stance probing. Which latent preferences appear when the agent faces ambiguous trade-offs?
Intervention testing. Which rhetorical styles move the agent, and at what trust cost?
TAD monitoring. Does the agent change behavior while reporting distrust, low confidence, or unresolved disagreement?
Language-anchor tracking. Which phrases become coordination devices across the group?
Hierarchy audit. Does formal role authority match actual influence inside the conversation?
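The hierarchy audit can be reduced to a rank comparison between formal authority and observed influence. This is a hypothetical sketch with invented inputs. Influence could come from any measure, such as how often other agents reuse an agent's phrasing. Spearman's rank correlation is computed with the no-ties formula, so ties are broken alphabetically and the result is approximate when scores tie.

```python
def ranks(scores):
    """Rank agents by score, highest = 1; ties are broken alphabetically (sketch only)."""
    ordered = sorted(scores, key=lambda agent: (-scores[agent], agent))
    return {agent: position for position, agent in enumerate(ordered, start=1)}

def hierarchy_audit(formal_authority, observed_influence):
    """Compare formal rank with influence rank.

    Returns (spearman_rho, displaced), where displaced maps agent -> (formal_rank, influence_rank).
    Uses the no-ties Spearman formula, so the value is approximate when scores tie.
    """
    agents = list(formal_authority)
    n = len(agents)
    if n < 2:
        raise ValueError("need at least two agents to compare rankings")
    formal = ranks(formal_authority)
    influence = ranks({a: observed_influence.get(a, 0.0) for a in agents})
    d_squared = sum((formal[a] - influence[a]) ** 2 for a in agents)
    rho = 1 - 6 * d_squared / (n * (n ** 2 - 1))
    displaced = {a: (formal[a], influence[a]) for a in agents if formal[a] != influence[a]}
    return rho, displaced

# Invented inputs: formal authority levels and observed influence scores.
formal_authority = {"owner": 3, "staff": 2, "regular": 1, "visitor": 0}
observed_influence = {"owner": 1.0, "staff": 2.0, "regular": 6.0, "visitor": 0.0}
rho, displaced = hierarchy_audit(formal_authority, observed_influence)
print(round(rho, 2), displaced)  # 0.2 {'owner': (1, 3), 'regular': (3, 1)}
```

A correlation near 1 means the org chart still describes who steers the conversation. A low or negative value, together with the list of displaced agents, is the signal to investigate.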

This kind of evaluation is less glamorous than announcing a new agent framework. It is also more useful. The future of agent reliability will not be won by naming the agents “Analyst,” “Reviewer,” and “Supervisor” with increasing levels of seriousness. It will be won by watching how they behave after the names stop mattering.

Conclusion: identity is not the control layer

The paper’s central warning is simple: in multi-agent systems, identity is not fixed by assignment. It is constructed through interaction.

That does not mean prompts are useless. It means prompts are initial conditions in a dynamic system. The system then develops stance, trust patterns, conversational alliances, and informal authority. Some of those dynamics may help coordination. Others may quietly undermine the very roles the system was designed to maintain.

For business leaders, the lesson is not “be afraid of agents.” Fear is cheap and usually badly instrumented. The lesson is to stop treating multi-agent systems as collections of obedient job titles. Once agents interact, the unit of analysis becomes the group.

And groups, as humans have demonstrated with heroic consistency, rarely stay inside the org chart.

Cognaptus: Automate the Present, Incubate the Future.

¹ Hanzhong Zhang, Siyang Song, and Jindong Wang, “Beyond Preset Identities: How Agents Form Stances and Boundaries in Generative Societies,” arXiv:2603.23406v2, 2026. https://arxiv.org/abs/2603.23406
