TL;DR for operators
Synthetic focus groups are not neutral. The model you choose changes the society you simulate.
A recent paper, Towards Simulating Social Influence Dynamics with LLM-based Multi-agents, tests how different LLMs behave in a structured forum where persona agents debate controversial topics over five rounds.[1] The study tracks three social behaviours: conformity to the majority, movement toward more extreme views, and fragmentation into opposing camps.
The headline result is not simply that LLM agents can mimic social dynamics. That is the easy part. The more operationally useful point is that different model families produce different social textures. GPT-4o records the highest conformity rate in the experiment at 19.45%, while GPT-o1-mini records only 3.13%. Reasoning-oriented models appear more resistant to majority pressure, while several generative or proprietary models show smoother movement toward consensus.
For business teams using LLM agents to simulate customer reactions, policy debates, internal committees, market narratives, or product feedback loops, this matters. A consensus-prone simulator may make an idea look broadly acceptable. A reasoning-oriented simulator may keep objections alive long after a smoother model has politely absorbed them into the room temperature. Neither is “more realistic” in the abstract. Each encodes a different behavioural bias.
The practical takeaway: do not ask, “Which model is best for social simulation?” Ask, “Which failure mode do we need to observe — drift toward consensus, hardening of opposition, or persistent pluralism?”
A forum simulator is a model-selection problem wearing a sociology hat
Most business uses of synthetic agents begin with a comforting assumption: if the model is stronger, the simulation is better.
That assumption is tidy. It is also too convenient. In social simulation, capability is not one-dimensional. A model that writes fluently, follows instructions well, and summarises arguments elegantly may still behave like the most agreeable person in the meeting. A model trained or tuned for explicit reasoning may behave less like a crowd member and more like a stubborn analyst who has decided that the premises matter. Useful, yes. Pleasant at dinner, perhaps not.
The paper studies this tension through a deliberately simple setup: a BBS-style online forum. Agents are given personas, initial stances, and communication styles. A central manager orchestrates a round-robin discussion. Each agent posts once per round, sees the prior conversation, and has opportunities to quote or respond to other agents. After five rounds, the researchers measure whether the group moved toward consensus, extremity, or division.
The setting is modest by design. There is no sprawling open-world society, no simulated economy, no elaborate memory architecture, and no thousand-agent civilisation discovering tax policy between coffee breaks. The point is narrower: if the environment is held constant, do different LLMs produce different patterns of social influence?
They do. That is where the paper becomes useful.
The experiment keeps the theatre small so the model differences become visible
The study’s experimental design is best understood as a controlled social wind tunnel.
Instead of allowing every model to improvise its own society, the authors standardise the interaction structure. Agents receive fixed persona prompts covering demographic flavour, baseline stance, and communication style. The forum runs for five rounds. Each complete conversation is repeated independently 25 times for each setting to reduce the risk that one unusually cooperative or chaotic generation dominates the result.
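The interaction structure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Agent` class, the `run_forum` and `run_trials` helpers, and especially the stubbed `post` method (which nudges stance at random instead of calling an LLM) are all assumptions made for the example.

```python
import random
from dataclasses import dataclass, field

STANCES = (-2, -1, 0, 1, 2)  # Strongly Oppose .. Strongly Support


@dataclass
class Agent:
    name: str
    persona: str  # hypothetical free-text persona label
    stance: int   # current stance on the five-point scale
    history: list = field(default_factory=list)

    def post(self, thread, rng):
        """Stub standing in for the LLM call; returns (text, stance).

        Assumed toy behaviour: with probability 0.2, step one point toward
        the mean stance of the prior posts. A real system would prompt a
        model with the persona and the thread instead.
        """
        if thread and rng.random() < 0.2:
            mean = sum(p["stance"] for p in thread) / len(thread)
            if mean > self.stance:
                self.stance = min(2, self.stance + 1)
            elif mean < self.stance:
                self.stance = max(-2, self.stance - 1)
        self.history.append(self.stance)
        text = f"{self.name} ({self.persona}) posts with stance {self.stance}"
        return text, self.stance


def run_forum(agents, rounds=5, seed=0):
    """Central manager: each agent posts once per round and sees prior posts."""
    rng = random.Random(seed)
    thread = []
    for r in range(1, rounds + 1):
        for agent in agents:
            text, stance = agent.post(thread, rng)
            thread.append({"round": r, "author": agent.name,
                           "text": text, "stance": stance})
    return thread


def run_trials(make_agents, trials=25, rounds=5):
    """Repeat the whole conversation independently, one seed per trial."""
    return [run_forum(make_agents(), rounds=rounds, seed=t)
            for t in range(trials)]
```

Passing a factory (`make_agents`) rather than a list keeps each of the 25 trials independent, since every run starts from fresh agents with their original stances.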
The system then evaluates three outcomes:
| Measure | What it captures | Operational reading |
|---|---|---|
| Conformity Rate | How often an agent’s stance change moves closer to the group majority | Susceptibility to peer pressure |
| Polarization Change, $\Delta P$ | How much the group moves from moderate to more extreme stances over rounds | Tendency toward intensification |
| Fragmentation, $F_5$ | How evenly the final group splits between opposing camps | Persistence of disagreement |
The formulas are simple enough to matter. Conformity is counted when a stance change moves toward the majority. Polarization is the expected absolute stance on a five-point scale from Strongly Oppose $(-2)$ to Strongly Support $(+2)$, and $\Delta P$ is its change over the rounds. Fragmentation rises when support and opposition remain balanced at the end.
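A rough sketch of the three measures follows. Only the verbal definitions above come from the article; the exact formulas here are assumptions. In particular, the conformity denominator (every agent-round transition), the definition of "majority" (the side with more agents in the previous round), and the fragmentation form $1 - |n_+ - n_-| / (n_+ + n_-)$ are illustrative choices and may differ from the paper's.

```python
def polarization(stances):
    """Mean absolute stance on the -2..+2 scale."""
    return sum(abs(s) for s in stances) / len(stances)


def delta_p(initial, final):
    """Change in polarization between the first and last stance snapshots."""
    return polarization(final) - polarization(initial)


def fragmentation(final):
    """Assumed form: 1 when support and opposition are balanced, 0 when one-sided."""
    sup = sum(s > 0 for s in final)
    opp = sum(s < 0 for s in final)
    if sup + opp == 0:
        return 0.0  # nobody takes a side, so there are no opposing camps
    return 1 - abs(sup - opp) / (sup + opp)


def majority_sign(stances):
    """+1 if supporters outnumber opponents, -1 if the reverse, 0 on a tie."""
    sup = sum(s > 0 for s in stances)
    opp = sum(s < 0 for s in stances)
    return (sup > opp) - (sup < opp)


def conformity_rate(rounds):
    """rounds: per-round stance lists, same agent order in every round.

    Assumed counting rule: a transition conforms when the agent's stance
    moves in the direction of the previous round's majority side.
    """
    conforming = opportunities = 0
    for prev, curr in zip(rounds, rounds[1:]):
        direction = majority_sign(prev)
        for before, after in zip(prev, curr):
            opportunities += 1
            if direction != 0 and (after - before) * direction > 0:
                conforming += 1
    return conforming / opportunities if opportunities else 0.0


rounds = [[2, 1, 1, -1], [2, 1, 1, 0], [2, 2, 1, 1]]
print(conformity_rate(rounds))            # 0.375
print(delta_p(rounds[0], rounds[-1]))     # 0.25
print(fragmentation(rounds[-1]))          # 0.0
```

In the toy data, three of eight agent-round transitions move toward the supportive majority, the group becomes slightly more extreme, and no opposing camp survives.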
This is not a model of all social life. It is a measurement frame for one stylised type of online discussion. That is a feature, not a defect. If the authors had built a grand synthetic society, it would be much harder to tell whether the observed dynamics came from model behaviour, environmental complexity, prompt wording, interaction topology, memory effects, or random theatre. Here, the room is small enough that the behavioural contrast can be seen.
The comparison that matters is not small versus large. It is conformist versus stubborn
The paper groups models into four categories:
| Group | Models tested | Role in the comparison |
|---|---|---|
| Group A | Qwen2.5-7B, Llama3.1-8B, DeepSeek-R1-8B | Single-GPU, accessible open models |
| Group B | Qwen2.5-72B, Llama3.1-70B, DeepSeek-R1-70B | Larger open models with higher capacity |
| Group C | GPT-4o, Claude 3.5 Haiku, Gemini Flash 2.0 | Widely used proprietary models |
| Group D | GPT-o1-mini, DeepSeek-R1-671B, QwQ-32B | Reasoning-oriented models |
The obvious reading would be “larger models behave more socially.” The paper complicates that. Most models in Groups A, B, and C fall into a moderate conformity band of roughly 10–20%. GPT-4o sits at the top with a 19.45% conformity rate. GPT-o1-mini, a reasoning-oriented model, sits at the bottom with 3.13%.
That difference is not cosmetic. It changes the type of artificial society you get.
A high-conformity agent population is useful when the operator wants to observe whether a topic gradually becomes normalised. Think of product onboarding feedback, brand positioning, internal policy rollout, or the spread of a narrative through a customer community. A lower-conformity population is useful when the operator wants dissent to survive the conversational pressure of the majority. Think of red-team reviews, regulatory debates, reputational risk analysis, or deliberative systems where disagreement is not a bug.
The paper’s quiet warning is that model selection can decide whether the minority disappears.
Conformity is not realism. It is one behavioural setting
It is tempting to call the more conformist models “more human.” After all, humans do conform. Anyone who has sat through a meeting where the first confident opinion magically became “the consensus” knows the phenomenon does not require artificial intelligence. It barely requires intelligence.
But conformity alone does not equal realism. A synthetic forum that always drifts toward the majority is not necessarily more human; it may simply be more socially absorbent. Likewise, a reasoning model that preserves dissent is not automatically more rational, more accurate, or more useful. It may just be more resistant to conversational pressure under this prompt structure.
The paper’s strongest contribution is therefore comparative, not normative. It does not prove that one model family “understands society” better than another. It shows that different model families instantiate different social dynamics under the same controlled interaction design.
For operators, that distinction matters. A model can be excellent at answering questions and still poor at simulating minority persistence. Another model can be irritatingly stubborn and therefore valuable in a risk workshop. The question is not which model has the cleanest benchmark trophy cabinet. The question is which behavioural tendency your simulation must preserve.
Polarization and fragmentation tell different stories
The paper’s second layer of evidence comes from stance evolution. Here the important distinction is between polarization and fragmentation.
Polarization asks whether agents move toward more extreme positions. Fragmentation asks whether opposing camps remain alive at the end. These are related, but not identical. A group can become more extreme while converging on one side. It can also remain divided without every agent becoming maximally extreme.
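A small worked example makes the difference concrete. The notation is an assumption for illustration: $s_{i,t}$ is agent $i$'s stance after round $t$, $n_+$ and $n_-$ count final supporters and opponents, and the fragmentation formula is one plausible form that matches the verbal definition, not necessarily the paper's.

$$
P_t = \frac{1}{N}\sum_{i=1}^{N} \lvert s_{i,t} \rvert, \qquad \Delta P = P_5 - P_0, \qquad F_5 = 1 - \frac{\lvert n_+ - n_- \rvert}{n_+ + n_-}
$$

Take two hypothetical six-agent groups. In Group X, every agent starts at $+1$ and ends at $+2$. In Group Y, three agents hold $+1$ and three hold $-1$ throughout.

$$
\text{Group X: } P_0 = 1,\ P_5 = 2,\ \Delta P = 1, \quad n_+ = 6,\ n_- = 0,\ F_5 = 1 - \tfrac{6}{6} = 0
$$

$$
\text{Group Y: } P_0 = 1,\ P_5 = 1,\ \Delta P = 0, \quad n_+ = 3,\ n_- = 3,\ F_5 = 1 - \tfrac{0}{6} = 1
$$

Group X polarizes without fragmenting. Group Y fragments without polarizing.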
The results show that Groups A and B tend to have relatively higher polarization change and lower fragmentation than Groups C and D. The authors interpret this as greater openness to external influence and a tendency toward support or strong support over the rounds. In plain English: several of the open models move. They do not simply sit there defending their initial persona as though it were a constitutional principle.
But architecture matters inside the group labels. Qwen2.5-72B ends with a fragmentation score of $F_5 = 0.74$, much higher than the other models in its group. Qwen2.5-7B also preserves more dissent than immediate peers, with $F_5 = 0.33$. That makes the “size explains everything” story look too crude. Scale is part of the picture, but it is not the whole painting.
Group C shows lower overall polarization change. GPT-4o has $\Delta P = 0.28$ and $F_5 = 0.13$, suggesting relatively stable movement toward a supportive stance rather than a sharply divided final group. Claude 3.5 Haiku shows even lower polarization change at $\Delta P = 0.15$, though with higher fragmentation at $F_5 = 0.41$. Gemini Flash 2.0 has $\Delta P = 0.50$ and $F_5 = 0.60$, indicating more final division than a lazy “proprietary models are smooth conformists” label would imply.
Group D is where the stubbornness story becomes clearest. GPT-o1-mini has the lowest conformity rate, but not the lowest polarization change: its $\Delta P = 0.67$ and $F_5 = 0.60$. DeepSeek-R1-671B has $F_5 = 0.95$, and QwQ-32B has $F_5 = 0.80$. These are high fragmentation values. The reasoning-oriented models do not merely refuse to follow the majority; they allow opposing positions to remain structurally present.
That is a different kind of simulated group. Less agreeable. More adversarial. Potentially more useful.
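For operators who want to turn these figures into a first-pass shortlist, a lookup like the one below may help. The numbers are the ones quoted in this section; `None` marks values the article does not give. The `dissent_preserving` helper and its `0.6` threshold are hypothetical, and the figures describe one forum setup rather than a general ranking.

```python
# Delta-P and F_5 as quoted in the text; None where the text gives no figure.
REPORTED = {
    "Qwen2.5-7B":       {"group": "A", "delta_p": None, "f5": 0.33},
    "Qwen2.5-72B":      {"group": "B", "delta_p": None, "f5": 0.74},
    "GPT-4o":           {"group": "C", "delta_p": 0.28, "f5": 0.13},
    "Claude 3.5 Haiku": {"group": "C", "delta_p": 0.15, "f5": 0.41},
    "Gemini Flash 2.0": {"group": "C", "delta_p": 0.50, "f5": 0.60},
    "GPT-o1-mini":      {"group": "D", "delta_p": 0.67, "f5": 0.60},
    "DeepSeek-R1-671B": {"group": "D", "delta_p": None, "f5": 0.95},
    "QwQ-32B":          {"group": "D", "delta_p": None, "f5": 0.80},
}


def dissent_preserving(models, min_f5=0.6):
    """Names of models whose final fragmentation is at least min_f5,
    most fragmented first (ties keep their listing order)."""
    picked = [(name, m["f5"]) for name, m in models.items() if m["f5"] >= min_f5]
    return [name for name, _ in sorted(picked, key=lambda item: -item[1])]


print(dissent_preserving(REPORTED))
```

With the default threshold the shortlist spans Groups B, C, and D, which is another reminder that the group labels do not fully predict fragmentation.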
The paper’s evidence should be read as a main comparison, not a grab bag of charts
The study is short, but its evidence has a clear hierarchy. Not every component does the same job.
| Paper component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Fixed persona prompts | Experimental control | Differences are less likely to come from changing character setup | Personas fully capture human identity or motivation |
| Five-round BBS discussion | Implementation detail and behavioural testbed | Agents can react to previous posts in a forum-like sequence | Long-term community dynamics |
| 25 repeated trials | Robustness against LLM output variability | Aggregate patterns are less dependent on one generation | Statistical universality across topics or cultures |
| Four model groups | Main comparison design | Model type and reasoning orientation affect social dynamics | A permanent ranking of all LLMs |
| Conformity Rate | Main evidence | Susceptibility to majority alignment differs by model | Whether alignment is socially desirable |
| $\Delta P$ and $F_5$ | Main evidence | Extremity and persistent division differ across models | Whether the simulated division matches real human forums |
This table matters because the paper’s practical value is not in any single number. It is in the mapping between model behaviour and simulation purpose. The figures are not decorative. They are the mechanism by which “model choice” becomes “social outcome.”
That mechanism is the article’s central point. A synthetic agent system is not just a prompt wrapped around an API call. It is a behavioural instrument. Use the wrong model mix and the instrument measures the wrong social world.
What the paper directly shows
The paper directly shows four things.
First, a structured multi-agent forum can generate measurable social influence patterns. The authors do not merely inspect transcripts and declare them interesting. They define conformity, polarization, and fragmentation in measurable terms.
Second, model families differ under the same interaction setup. GPT-4o shows the highest conformity rate among the tested models at 19.45%. GPT-o1-mini shows the lowest at 3.13%. Most models in the non-reasoning groups (A, B, and C) sit in the 10–20% range.
Third, reasoning-oriented models appear more resistant to majority pressure. The paper attributes this to more consistent internal reasoning processes, though that should be treated as an interpretation rather than direct mechanistic proof. The observable result is behavioural: these models preserve positions under social pressure more often in this setup.
Fourth, scale alone is not a sufficient explanation. Qwen models preserve more dissent than some neighbouring models, and proprietary models vary across conformity, polarization, and fragmentation. The difference is not simply “small models conform, big models reason.” Thankfully. That would have made procurement meetings even more boring.
What Cognaptus infers for business use
The business inference is straightforward but uncomfortable: synthetic social research needs behavioural calibration.
If a company runs agent-based customer interviews using only a smooth, consensus-prone model, it may underestimate objection persistence. The synthetic customers may gradually converge on the dominant framing because the model is good at being conversationally cooperative. The research team then walks away with a tidy consensus and a false sense of market acceptance. Very efficient. Also potentially expensive.
If a company uses only stubborn reasoning-oriented agents, it may overestimate adversarial resistance. Every product claim becomes a debate club. Every policy proposal acquires a principled opposition caucus. This may be excellent for stress testing, but poor for estimating ordinary adoption behaviour.
The right lesson is not to crown one model type. It is to match the model portfolio to the simulation question.
| Business task | Better behavioural profile | Why |
|---|---|---|
| Testing whether a product message becomes acceptable after repeated exposure | More consensus-prone agents | The operator wants to observe drift and normalisation |
| Surfacing durable objections before launch | Reasoning-oriented or dissent-preserving agents | The operator wants objections to survive group pressure |
| Simulating public-policy deliberation | Mixed model population | Real deliberation contains both persuasion and resistance |
| Brand-risk or crisis simulation | Higher-fragmentation agents | Minority outrage and factional persistence are often the risk |
| Internal decision pre-mortem | Low-conformity agents | The purpose is to prevent artificial agreement |
| Community moderation scenario testing | Combination of conformist and adversarial profiles | Platforms need both cascade dynamics and faction persistence |
This is where the paper becomes more than an academic demonstration. It turns model selection into a design choice for organisational foresight.
The misconception to retire: stronger models automatically make better societies
The paper is useful partly because it blocks a lazy intuition.
The lazy intuition says that a more advanced model should produce a more faithful simulation of social behaviour. That may be true for some tasks, but this study suggests a different principle: the relevant dimension is not general capability, but social responsiveness under interaction.
A model that conforms readily may be useful for simulating trend absorption. A model that resists majority pressure may be useful for simulating entrenched opposition. A model that fragments into opposing clusters may be useful for studying conflict persistence. These are not linear improvements. They are different operating modes.
This distinction will become more important as organisations build multi-agent systems for market research, negotiation support, strategic planning, policy testing, and user simulation. In those settings, the output is not just an answer. It is a distribution of stances over time. The “best” model is the one whose behavioural distortion is known, useful, and deliberately chosen.
That last word matters: deliberately.
A practical framework: choose the simulator by social failure mode
The most useful way to operationalise the paper is to start from the failure mode you care about.
If the failure mode is groupthink, you need agents that can reveal how quickly a majority position absorbs the room. Higher-conformity models may be useful here, because they show how consensus forms under repeated exposure. This is relevant for onboarding flows, corporate communication, political messaging, and product narrative testing.
If the failure mode is ignored dissent, you need agents that keep objections alive. Reasoning-oriented models may be better suited for this role, especially if the goal is to uncover arguments that survive repeated social pressure. This is relevant for compliance, safety review, investment committees, governance design, and adversarial brand testing.
If the failure mode is polarization, you need to track not only whether views become stronger, but whether the group converges or splits. $\Delta P$ without fragmentation can mean the room marched together toward one side. High $F_5$ means the room ended divided. Those are very different risks.
A practical deployment might therefore use a mixed panel:
- consensus-sensitive agents to model narrative drift;
- reasoning agents to preserve principled objections;
- architecture-diverse agents to avoid mistaking one model family’s behaviour for “the market”;
- repeated runs to estimate whether a pattern is stable or merely one generated transcript having a dramatic day.
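The last item in the list, checking whether a pattern is stable across repeated runs, can be approximated with a simple summary. This is a sketch under stated assumptions: `summarise_runs` and `is_stable` are hypothetical helpers, the interval uses a normal approximation, and the `0.1` tolerance is an arbitrary illustrative threshold.

```python
from statistics import mean, stdev


def summarise_runs(values):
    """Mean, sample standard deviation, and a rough 95% interval half-width
    (normal approximation) for one metric measured across repeated runs."""
    n = len(values)
    m = mean(values)
    sd = stdev(values) if n > 1 else 0.0
    # A single run says nothing about stability, so treat it as unbounded.
    half_width = 1.96 * sd / n ** 0.5 if n > 1 else float("inf")
    return {"n": n, "mean": m, "sd": sd, "ci95_half_width": half_width}


def is_stable(values, max_half_width=0.1):
    """True when the metric's rough 95% interval is narrower than the tolerance."""
    return summarise_runs(values)["ci95_half_width"] <= max_half_width


# Hypothetical per-run fragmentation scores from two panels.
steady_panel = [0.6, 0.7, 0.5, 0.6, 0.6]
erratic_panel = [0.0, 1.0, 0.0, 1.0]
print(is_stable(steady_panel))   # True
print(is_stable(erratic_panel))  # False
```

The same summary can be applied to conformity rate or $\Delta P$; the point is to report a spread alongside each average rather than a single transcript's outcome.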
This is not complicated. It is just rarely done, because synthetic research often treats the model as infrastructure rather than as a behavioural participant.
Boundaries: this is a clean social wind tunnel, not society
The paper’s limitations are not fatal, but they are important.
The simulation is short. Five rounds can capture stance movement, but not long-term identity formation, coalition dynamics, fatigue, reputation, sanctions, or the slow weirdness of real communities. Real forums have lurkers, incentives, status hierarchies, moderation rules, memory, private chats, and users who return three days later with a screenshot and a grievance.
The personas are fixed and controlled. That helps isolate model effects, but it also narrows behavioural diversity. Real people do not simply carry a neat stance label and a communication style into a discussion. They bring incentives, confusion, social identity, risk tolerance, and occasionally the desire to win an argument they no longer believe.
The topics are selected controversial prompts. The findings should not be assumed to transfer unchanged to low-stakes product preferences, technical support threads, financial decisions, or highly specialised expert deliberation. Different domains may produce different conformity and fragmentation profiles.
The paper also does not establish a matched human baseline for the same prompts and forum structure. So the result should not be read as “LLM agents reproduce human society.” A cleaner reading is: under a controlled BBS-style setup, LLM agents exhibit measurable social influence patterns, and those patterns vary materially by model type.
That is still valuable. A wind tunnel is not the sky. Aircraft designers use it anyway.
The operator’s conclusion: simulate behaviour, not just opinions
The paper’s real contribution is not that LLM agents can talk to each other. We knew that. They can also talk to themselves, apologise to themselves, and occasionally form a committee of one. The useful contribution is that their group behaviour can be measured and compared.
For business users, the next step is obvious: synthetic panels should be tested before they are trusted. Measure how the chosen model population behaves under repeated exposure, disagreement, majority pressure, and minority persistence. Track whether the agents converge too easily. Track whether they polarize theatrically. Track whether objections survive.
A synthetic society is not made realistic by adding more agents. It becomes useful when its distortions are understood.
That is the clean lesson from this paper. In LLM-based social simulation, model choice is not a technical footnote. It is the social physics of the experiment.
Cognaptus: Automate the Present, Incubate the Future.
[1] Hsien-Tsung Lin, Chan Hsu, Pei-Cing Huang, Pei-Xuan Shieh, Chan-Tung Ku, and Yihuang Kang, “Towards Simulating Social Influence Dynamics with LLM-based Multi-agents,” arXiv:2507.22467, 2025.