Parallel Worlds of Moderation: How LLM Simulations Are Stress-Testing Online Civility

TL;DR for operators

Moderation is usually measured after the mess has already happened. COSMOS changes the sequence: it lets researchers run a synthetic online conversation twice, once without moderation and once with a selected intervention, while keeping the simulated world otherwise constant.¹ That is the useful idea. Not “LLMs can pretend to be angry internet users,” though they can, which is an achievement of sorts. The useful idea is controlled comparison.

The paper directly shows three things. First, its LLM agents produce toxicity patterns that resemble known psychological relationships in real data: lower agreeableness and lower conscientiousness align with higher toxicity, and higher neuroticism tracks higher toxicity as well. Second, toxicity propagates through replies; hostile parent comments are associated with more toxic child comments. Third, personalised moderation interventions, especially the neutral variant, reduce toxicity more consistently than generic warnings, while aggressive ban policies reduce toxicity largely by deleting participation.

Cognaptus’ business inference is narrower and more practical: simulation can become a moderation pre-mortem. A platform, marketplace, gaming community, education forum, or enterprise collaboration tool could test moderation policies against likely behavioural dynamics before deploying them. That does not make simulation a substitute for human judgement, legal review, or live measurement. It makes it a cheaper diagnostic layer before real users become the experiment. Small improvement. Large moral housekeeping bill avoided.

The boundary is important. COSMOS does not model a complete social platform. It focuses on conversations, excluding likes, follows, reposts, social graph formation, ranking algorithms, and real adversarial users. The result is a policy wind tunnel, not a certified replica of the internet. Still, for operators currently choosing between generic warnings, escalation rules, and bans using little more than dashboard vibes, a wind tunnel is an upgrade.

Moderation usually fails before anyone can measure it

A familiar scene: a product team notices that a discussion space is getting worse. Support tickets rise. Moderators complain. Good users become quieter. Bad users become oddly energetic. Someone proposes stricter enforcement. Someone else worries about overreach. Legal wants consistency. Trust and Safety wants more headcount. Growth wants everyone to calm down and stop damaging retention. A spreadsheet appears, which means the organisation has reached the ceremonial stage of uncertainty.

The problem is not that platforms lack interventions. They have warnings, removals, ranking demotions, temporary suspensions, permanent bans, rate limits, community notes, appeals, and the ancient ritual of “please be respectful.” The problem is knowing what those interventions do before they reshape a community.

Real-world moderation is hard to evaluate because every policy change arrives inside a moving system. Users adapt. Topics shift. News cycles intrude. Moderators differ. Toxic users migrate. Benign users self-censor. Some people leave quietly, which is inconvenient because quiet exits rarely produce neat labelled datasets. As Dwork, Hays, Kleinberg, and Raghavan argue in a theoretical model of moderation and community formation, moderation can affect not just what content appears, but who chooses to participate at all.² That is where simple “removed content went down” metrics become dangerously thin.

The COSMOS paper attacks this measurement problem by building a parallel simulation environment. In one world, synthetic agents post and comment normally. In the other, the same simulated setup receives a moderation strategy. Same profiles, same conversational structure, same starting conditions, different intervention. The point is not to predict next Tuesday’s exact Reddit thread. The point is to isolate the effect of a moderation rule more cleanly than a live platform usually can.

That is a serious shift. It moves moderation evaluation from retrospective accounting to controlled counterfactual testing.

COSMOS is useful because it copies the world, not because it copies people perfectly

The easiest misunderstanding is to treat COSMOS as a claim about perfect synthetic humans. That would be convenient, dramatic, and wrong. The paper’s real contribution is not that LLM agents are now reliable substitutes for people. It is that LLM agents can be embedded in an agent-based model where the causal comparison is unusually clean.

COSMOS builds agents with socio-demographic and psychological profiles. Demographic attributes come from the General Social Survey, while personality traits follow the Big Five structure via PANDORA, a Reddit-based dataset with psychological labels. Each agent has a profile module, a sensory module that receives conversational context, and a memory module that can store moderation messages. Agents choose whether to post or comment. Their text is scored for toxicity by the Perspective API. If the score crosses a threshold, the counterfactual world applies a moderation strategy.
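To make the module layout concrete, here is a minimal Python sketch of an agent with a profile, a context builder standing in for the sensory module, and a memory that stores moderation messages. Everything named here is hypothetical: `Agent`, `build_context`, `moderate_if_toxic`, the `0.5` threshold, and the pluggable `score_toxicity` callable (a stand-in for a real toxicity scorer) are illustrative assumptions, not the paper's code or the Perspective API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Agent:
    demographics: Dict[str, str]   # profile module: survey-style attributes
    big_five: Dict[str, float]     # profile module: trait name -> score
    memory: List[str] = field(default_factory=list)  # stored moderation messages

    def build_context(self, thread: List[str]) -> str:
        """Assemble what a text generator would see: profile, thread, memory."""
        traits = {**self.demographics, **self.big_five}
        profile = ", ".join(f"{key}={value}" for key, value in traits.items())
        return "\n".join([
            f"PROFILE: {profile}",
            "THREAD: " + " | ".join(thread),    # sensory input
            "MEMORY: " + " | ".join(self.memory),
        ])


def moderate_if_toxic(
    agent: Agent,
    text: str,
    score_toxicity: Callable[[str], float],     # assumed scorer returning 0..1
    make_message: Callable[[Agent, str], str],  # assumed moderation strategy
    threshold: float = 0.5,                     # assumed value, not the paper's
) -> bool:
    """Store a moderation message in memory when the score crosses the threshold."""
    if score_toxicity(text) <= threshold:
        return False
    agent.memory.append(make_message(agent, text))
    return True
```

The design point is that the moderation message is not returned and forgotten. It lands in `memory`, so every later call to `build_context` carries it forward.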

The parallel-world design matters because moderation effects are not always direct. If Agent 19 receives a moderation message and writes a less toxic reply later, that is a direct effect. If Agent 6 later responds less aggressively because the thread itself has changed, that is an indirect effect. COSMOS can observe both because it preserves the relationship between the factual and counterfactual conversation trees.
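The paired design can be illustrated with a toy simulation. This is a sketch under loud assumptions: toxicity is a made-up numeric score rather than LLM text scored by a classifier, and the weights (`0.6`, `0.4`, `0.15`), the threshold, and the function `run_world` are invented for illustration. What it does reproduce is the structure of the comparison: both worlds share a seed, so authors, reply targets, and noise are identical, and only the moderation rule differs.

```python
import random
from dataclasses import dataclass, field
from typing import List

THRESHOLD = 0.5  # assumed moderation threshold, not the paper's value


@dataclass
class ToyAgent:
    base_toxicity: float                              # stand-in for personality
    memory: List[str] = field(default_factory=list)   # stored moderation messages


def run_world(moderate: bool, seed: int = 7, n_nodes: int = 30) -> List[float]:
    """Return per-node toxicity for one world; node 0 is the root post."""
    rng = random.Random(seed)  # same seed => same authors, parents, and noise
    agents = [ToyAgent(rng.uniform(0.1, 0.9)) for _ in range(6)]
    toxicity = [0.6]
    for node in range(1, n_nodes):
        author = rng.choice(agents)
        parent = rng.randrange(node)        # reply to any earlier node
        noise = rng.uniform(-0.05, 0.05)
        score = 0.6 * author.base_toxicity + 0.4 * toxicity[parent] + noise
        score -= 0.15 * len(author.memory)  # direct effect: remembered warnings
        score = min(1.0, max(0.0, score))
        if moderate and score > THRESHOLD:
            author.memory.append("moderation message")
        toxicity.append(score)              # later replies inherit it (indirect)
    return toxicity


factual = run_world(moderate=False)
counterfactual = run_world(moderate=True)
# Node-by-node differences are attributable to moderation alone.
differences = [f - c for f, c in zip(factual, counterfactual)]
print(f"total toxicity removed: {sum(differences):.3f}")
```

Because node `i` is the same author replying to the same parent in both runs, a difference at a node whose author was never warned can only come from a calmer parent. That is the indirect effect the paragraph above describes.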

A standard classifier answers: “Is this item toxic?” An LLM moderator answers: “What should we do with this item?” COSMOS asks a different question: “What would the conversation have become under another intervention?” That question is more expensive, more speculative, and more useful.

This is also where the paper sits relative to earlier work. Studies such as Watch Your Language evaluate how LLMs perform on content moderation tasks, including rule-based community moderation and toxicity detection.³ Social simulation systems such as Y Social frame LLM-powered digital twins as controlled environments for studying online interaction and platform policy.⁴ COSMOS combines these directions into a narrower instrument: a simulator built specifically for testing moderation strategies through counterfactual conversational branches.

Narrowness is a feature here. Grand simulations of society tend to become philosophy with GPUs attached. COSMOS is more disciplined: simulate enough of the conversation to compare interventions, and no more than necessary.

The realism tests support the simulator, but do not sanctify it

Before COSMOS can say anything about moderation, it has to pass a dull but essential test: do the agents behave in ways that are at least plausibly aligned with real toxic behaviour?

The authors test this in several ways. They compare simulated toxicity patterns with PANDORA-based real data. They report significant Spearman correlations between toxicity and Big Five traits that point in the same direction as in real data: agreeableness and conscientiousness are negatively correlated with toxicity in both real and simulated data, and neuroticism is positively correlated in both. The reported values are not identical; a perfect match would actually be suspicious. They are directionally aligned: agreeableness is $\rho=-0.18$ in real data and $\rho=-0.32$ in simulation; conscientiousness is $\rho=-0.11$ and $\rho=-0.16$; neuroticism is $\rho=0.04$ and $\rho=0.07$.
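For readers who want to see what the directional check amounts to, here is a small pure-Python sketch. Spearman's coefficient is the Pearson correlation of the ranks, with tied values sharing their average rank. The helper names (`average_ranks`, `spearman`, `same_direction`) are illustrative, and this is not the authors' analysis code; the only numbers taken from the article are the reported coefficient pairs at the end.

```python
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5   # assumes neither input is constant


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def same_direction(real_rho: float, simulated_rho: float) -> bool:
    """True when both coefficients have the same sign."""
    return real_rho * simulated_rho > 0


# (real, simulated) coefficients as reported in the article.
reported = {
    "agreeableness": (-0.18, -0.32),
    "conscientiousness": (-0.11, -0.16),
    "neuroticism": (0.04, 0.07),
}
for trait, (real, simulated) in reported.items():
    print(trait, same_direction(real, simulated))
```

Sign agreement is a deliberately weak test. It says the simulator bends the same way as the real data, not that it bends by the same amount.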

Interpretation: the agents are not random profanity dispensers. Their toxicity has a psychological structure recognisable enough to make intervention testing worth discussing.

The authors also compare consistency. The average standard deviation of toxicity distributions is $0.17$ in real data and $0.20$ in simulation. In business language, the agents have behavioural signatures that persist enough to be useful. They are not stable personalities in the human sense. They are stable-enough synthetic risk profiles.

Then comes the more important finding: toxicity spreads through the simulated conversation tree. Parent and child node toxicity have a significant positive correlation, reported as $\rho=+0.39$ with $p=0.0$, which presumably reflects rounding, since a p-value cannot be exactly zero. The paper also runs a sub-population test with the five most toxic agents and the least toxic one; toxicity rises in that more hostile environment. That result matters because moderation is often evaluated as if each post were an isolated object. It is not. A toxic reply is also a prompt for the next person.
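A quick way to look for this kind of contagion in any threaded dataset is to pair each reply with its parent. The sketch below is a simpler stand-in for the paper's correlation test, not a reproduction of it: it compares average reply toxicity after toxic and after calm parents. The tree layout, the `0.5` cut-off, and the function names are assumptions made for illustration, and the example scores are invented.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple

# node id -> (parent id or None for a root post, toxicity score in 0..1)
Tree = Dict[int, Tuple[Optional[int], float]]


def parent_child_pairs(tree: Tree) -> List[Tuple[float, float]]:
    """Return (parent toxicity, child toxicity) for every reply in the tree."""
    return [
        (tree[parent][1], toxicity)
        for parent, toxicity in tree.values()
        if parent is not None
    ]


def child_toxicity_by_parent(tree: Tree, threshold: float = 0.5) -> Tuple[float, float]:
    """Mean reply toxicity after toxic parents and after calm parents.

    Assumes the tree contains at least one reply of each kind.
    """
    pairs = parent_child_pairs(tree)
    after_toxic = [child for parent, child in pairs if parent > threshold]
    after_calm = [child for parent, child in pairs if parent <= threshold]
    return mean(after_toxic), mean(after_calm)


# Invented scores: one heated thread (root 0) and one calm thread (root 4).
example_tree: Tree = {
    0: (None, 0.8), 1: (0, 0.7), 2: (0, 0.6), 3: (1, 0.5),
    4: (None, 0.1), 5: (4, 0.2), 6: (5, 0.1),
}
toxic_mean, calm_mean = child_toxicity_by_parent(example_tree)
print(f"after toxic parents: {toxic_mean:.2f}, after calm parents: {calm_mean:.2f}")
```

The same `parent_child_pairs` output is what a rank correlation would be computed on. The grouped means are just easier to read on a dashboard.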

This is where COSMOS becomes operationally interesting. If toxicity is contagious, then moderation has multiplier effects. A good intervention may prevent downstream hostility. A bad one may escalate the thread, even if the original post was technically handled. The unit of analysis shifts from content item to conversation trajectory.

Personalisation beats generic warning because the message becomes part of memory

COSMOS tests three broad moderation families.

Strategy	What it does	What the paper shows	Operational reading
OSFA (one-size-fits-all)	Sends the same generic warning to everyone	Weak and inconsistent reduction	Cheap to implement, easy to govern, usually underpowered
PMI	Generates a personalised moderation intervention based on the agent profile and toxic text	Neutral PMI reduces toxicity most consistently among ex ante approaches	Better behavioural fit, but higher design and oversight burden
BAN-e	Bans after more than $e$ violations	Strong toxicity reduction, with content loss	Effective partly because it removes users and their future content
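The three families in the table can be sketched as one dispatch function. This is an illustrative Python outline, not the paper's implementation: `UserState`, `handle_violation`, the warning text, and the template in `personalised_message` are all hypothetical, and the template merely stands in for what would be an LLM-generated intervention. The ban rule follows the table literally (ban after more than `e` violations); the choice to send no message before a ban is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional

GENERIC_WARNING = "Please keep the discussion respectful."  # assumed wording


@dataclass
class UserState:
    profile: str                  # short profile description
    violations: int = 0
    banned: bool = False
    memory: List[str] = field(default_factory=list)


def personalised_message(user: UserState, text: str) -> str:
    """Placeholder for a generated message; uses profile and offending text."""
    return f"Note for a {user.profile} user about {text!r}: consider rephrasing."


def handle_violation(
    user: UserState, text: str, strategy: str, e: int = 1
) -> Optional[str]:
    """Apply one strategy to a comment already judged toxic."""
    user.violations += 1
    if strategy == "OSFA":
        message = GENERIC_WARNING                 # same text for everyone
    elif strategy == "PMI":
        message = personalised_message(user, text)
    elif strategy == "BAN":
        user.banned = user.violations > e         # BAN-e rule from the table
        return None                               # assumption: no message sent
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    user.memory.append(message)                   # warnings persist in memory
    return message
```

Note the asymmetry: OSFA and PMI change what the user remembers, while BAN changes whether the user is present at all. That difference is what the content-loss figures later make visible.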

The most useful result is not simply “personalisation works.” That would be too neat. The paper’s stronger point is that personalised moderation works through memory. In COSMOS, a moderation message is stored in the counterfactual agent’s memory module and shapes future generation. The intervention is not a one-off label. It becomes part of the user’s subsequent context.

Among the personalised variants, the neutral PMI performs best across the full-population simulations: it produces significant reductions in four of five runs. The empathising and prescriptive variants are less consistently effective. That distinction matters. The result does not say “always be empathetic,” even though empathy-based counterspeech has real-world support in field experiments on racist hate speech.⁵ It says that, in this simulator, the personalised message that retained flexibility adapted better than messages forced into a fixed emotional posture.

That is a useful corrective for product teams. Personalisation is not the same as warmth. A moderation system that always sounds kind can become patronising. One that always sounds authoritative can become brittle. Neutral PMI works because it gives the moderator agent room to adapt the intervention to the case. Apparently, even synthetic scolds benefit from not being trapped in brand voice guidelines.

The result also connects to prior arguments for personalised online moderation. Cresci, Trujillo, and Fagni argue that one-size-fits-all moderation conflicts with socio-behavioural theory and that interventions should be more user-centred.⁶ COSMOS adds a test bench for that claim. It does not prove that a production platform should automate personalised persuasion at scale. It shows that the hypothesis is worth testing before the platform rolls out another generic warning written by committee and embalmed by legal.

Bans reduce toxicity, but the accounting is not innocent

Ban policies perform strongly in the paper’s toxicity metric. BAN-1, the strictest version, produces large negative mass divergence values across all five simulations. But the reason is not mysterious. If users cannot continue posting, their future toxicity disappears from the counterfactual feed.

The paper handles this correctly by reporting Content Loss Ratio. BAN-1 has a CLR of $0.45 \pm 0.04$, meaning nearly half of the counterfactual content disappears. More awkwardly, the authors report that this includes $0.27 \pm 0.03$ of “healthy” content below the toxicity threshold. At the strictest ban setting, the macro-average recall for predicting whether a text is worth losing is close to that of a random classifier: $0.55$, compared with $0.60$ at BAN-2 and $0.58$ at BAN-4.
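The accounting can be sketched in a few lines. The definitions here are assumptions chosen to match the description above, not the paper's exact formulas: content loss is the share of items that disappear under the ban, healthy loss is the share of all items that are both lost and below the toxicity threshold, and the ban is treated as a binary classifier whose "prediction" is whether an item gets lost. Macro-average recall is the plain mean of the per-class recalls, so a classifier that guesses at random sits near $0.5$.

```python
from typing import List, Tuple

Item = Tuple[float, bool]  # (toxicity score, was the item lost under the ban?)


def content_loss(items: List[Item], threshold: float = 0.5) -> Tuple[float, float]:
    """Return (content loss ratio, healthy content loss ratio)."""
    total = len(items)
    lost = [toxicity for toxicity, was_lost in items if was_lost]
    healthy_lost = [toxicity for toxicity in lost if toxicity < threshold]
    return len(lost) / total, len(healthy_lost) / total


def macro_recall(items: List[Item], threshold: float = 0.5) -> float:
    """Mean of recall on toxic items (lost) and recall on healthy items (kept).

    Assumes both classes are present in the input.
    """
    toxic = [was_lost for toxicity, was_lost in items if toxicity >= threshold]
    healthy = [was_lost for toxicity, was_lost in items if toxicity < threshold]
    recall_toxic = sum(toxic) / len(toxic)                    # toxic and removed
    recall_healthy = sum(not lost for lost in healthy) / len(healthy)  # kept
    return (recall_toxic + recall_healthy) / 2


# Invented example: ten items, five of them lost to a ban.
example = [
    (0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.4, True),
    (0.3, False), (0.2, True), (0.2, False), (0.1, False), (0.1, False),
]
clr, healthy_clr = content_loss(example)
print(f"CLR={clr:.2f}, healthy loss={healthy_clr:.2f}, "
      f"macro recall={macro_recall(example):.2f}")
```

Reporting the two loss figures side by side is the point. A toxicity chart alone cannot distinguish a ban that removes abuse from a ban that removes people.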

That is the part operators should not skip. A ban can make toxicity metrics look excellent while damaging participation, removing future non-toxic contributions, and changing the community composition. Sometimes that is still the right choice. Persistent abuse is not a growth strategy, despite what some comment sections seem to believe. But the business question is not “did toxicity fall?” The question is “what did we destroy to make it fall?”

Moderation needs at least two ledgers:

Metric family	What it captures	Why it matters
Toxicity reduction	Whether harmful language decreases	Protects users, moderators, brand trust, and legal exposure
Content loss	How much participation disappears	Reveals collateral damage and possible over-enforcement
Healthy content loss	Whether non-toxic future contributions are removed	Separates safety gains from blunt suppression
Downstream thread effect	Whether later replies improve or worsen	Captures contagion, escalation, and indirect benefits
Segment effect	Which user profiles are most affected	Detects uneven intervention impact and governance risk

The simulator does not solve the trade-off. It makes the trade-off visible earlier. That is already more than many moderation dashboards manage.

What COSMOS directly shows, what Cognaptus infers, and what remains open

The paper’s claims are strongest when they stay close to the simulation evidence. COSMOS shows that, under its modelling assumptions, personalised moderation can reduce simulated toxicity without content loss, while bans reduce toxicity with substantial content loss. It also shows that LLM agents can reproduce some plausible psychological and contagion patterns.

Cognaptus’ inference is that the business value lies in policy prototyping. A company could use a COSMOS-like method to test candidate interventions before live deployment. The system would be especially useful when moderation choices affect community health, user trust, regulatory exposure, or labour cost. Think gaming communities, creator platforms, internal workplace forums, financial communities, education platforms, marketplaces, or any product where “engagement” sometimes means “people yelling at each other in ways that are expensive later.”

But inference is not evidence. A production pipeline would need more than a paper prototype. It would need calibrated profiles, platform-specific behavioural data, human review of generated conversations, careful toxicity labelling, adversarial testing, appeal pathways, audit logs, and legal checks. A synthetic society with a toxicity score is not governance. It is a rehearsal room.

The distinction can be summarised cleanly:

Layer	Paper directly supports	Cognaptus business inference	Still uncertain
Counterfactual design	Parallel factual/counterfactual simulations isolate moderation effects better than uncontrolled observation	Moderation teams can run policy pre-mortems before live rollout	Whether results transfer to real communities with strategic users
Agent realism	Simulated toxicity aligns with selected personality-toxicity patterns in real data	Synthetic personas can support early policy testing	Whether human users respond similarly to moderation messages
Personalised moderation	PMI-N reduces toxicity more consistently than generic warnings in the simulator	Adaptive interventions may preserve participation better than blunt enforcement	Governance, fairness, privacy, and user acceptance of personalisation
Ban strategies	Strict bans reduce toxicity but cause high content loss	Dashboards must separate safety gain from participation loss	Long-term migration, retaliation, and off-platform displacement
Platform scope	COSMOS models conversations with controlled variability	Useful for dialogue-heavy products and community threads	Not yet a full model of feeds, recommender systems, or social graphs

This is the kind of distinction AI governance often lacks. The paper is not a licence to automate moderation. It is a method for testing moderation logic before automation becomes expensive, political, and hard to roll back.

The business value is cheaper diagnosis, not magical civility

For operators, the immediate use case is not replacing moderators. It is reducing policy guesswork.

A mature moderation organisation already runs experiments, audits enforcement, reviews appeals, samples decisions, and measures harm. COSMOS-style simulation adds a pre-production layer. Before changing escalation thresholds, message wording, or ban tolerance, the team can ask: what happens to toxicity mass, content loss, healthy content loss, and downstream replies in a controlled synthetic environment?

That matters because moderation interventions have second-order effects. A warning may calm one user and irritate another. A ban may protect a thread while pushing bad actors into evasion. A generic message may be fair in form and useless in practice. A personalised message may be effective but raise privacy and manipulation concerns. The right answer depends on the platform, the community, the harm type, and the tolerance for collateral damage. Yes, product governance remains annoyingly contextual. Tragic.

The ROI pathway is practical:

1. Use simulation to screen policy candidates before live testing.
2. Reject obviously brittle strategies early.
3. Identify which user profiles or conversational contexts require human review.
4. Estimate trade-offs between toxicity reduction and participation loss.
5. Feed the best candidates into small, monitored field experiments.
6. Compare simulated effects against live outcomes and recalibrate.
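The screening step at the start of that pathway could look something like the sketch below. It is a hypothetical outline only: `SimResult`, `screen`, and the `max_healthy_loss` cap are invented names, and a real team would choose its own metrics and limits. The idea is to drop candidates that fail to reduce toxicity or that destroy too much healthy content, then rank the survivors for field testing.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SimResult:
    policy: str
    toxicity_change: float        # negative means toxicity fell vs. the factual run
    content_loss: float           # share of content that disappeared
    healthy_content_loss: float   # share that was lost despite being non-toxic


def screen(results: List[SimResult], max_healthy_loss: float = 0.05) -> List[SimResult]:
    """Keep candidates that reduce toxicity within the healthy-loss cap.

    Survivors are ordered from largest to smallest toxicity reduction.
    """
    kept = [
        r for r in results
        if r.toxicity_change < 0 and r.healthy_content_loss <= max_healthy_loss
    ]
    return sorted(kept, key=lambda r: r.toxicity_change)


# Placeholder numbers for illustration only; they are not results from the paper.
candidates = [
    SimResult("candidate_a", toxicity_change=-0.02, content_loss=0.0, healthy_content_loss=0.0),
    SimResult("candidate_b", toxicity_change=-0.10, content_loss=0.0, healthy_content_loss=0.0),
    SimResult("candidate_c", toxicity_change=-0.30, content_loss=0.40, healthy_content_loss=0.25),
    SimResult("candidate_d", toxicity_change=0.01, content_loss=0.0, healthy_content_loss=0.0),
]
for result in screen(candidates):
    print(result.policy, result.toxicity_change)
```

In this toy run the candidate with the biggest toxicity drop is screened out by the cap, which is exactly the conversation the two-ledger view is meant to force before anything ships.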

This is not cheaper because it removes humans. It is cheaper because it uses humans where their judgement matters more: validating scenarios, reviewing edge cases, setting policy boundaries, and interpreting failures. Human moderators should not be asked to serve as both trauma buffer and experimental instrument. That division of labour was never elegant.

The missing platform dynamics are not footnotes

COSMOS has serious boundaries, and they affect practical interpretation.

First, the simulator models conversations, not complete online social networks. It excludes likes, follows, reposts, social relationships, and full recommender dynamics. This increases control and lowers computational cost, but it removes mechanisms that matter in live platforms. A feed ranking system can reward outrage. A follower graph can amplify identity conflict. Likes can encourage pile-ons. Reposts can export toxicity into new audiences. None of that is trivial decoration.

Second, after inspecting BERT-based clusters of the generated text, the authors find that roughly 7% of posts and comments fall into a redundant hallucination cluster. That is not catastrophic, but it is operationally relevant. If a simulation is used for policy decisions, generated content needs validation. Otherwise, the organisation may optimise against artefacts produced by the simulator itself. Nothing says “AI governance” quite like regulating imaginary nonsense with great confidence.

Third, toxicity scoring is itself a contested measurement layer. Perspective API is widely used and technically sophisticated, including multilingual and byte-level approaches designed for robustness across domains and obfuscation.⁷ But any toxicity classifier carries boundary choices: what counts as harm, how context is handled, which dialects or communities are misread, and what false positives cost. COSMOS inherits those measurement choices.

Fourth, simulated agents are not adversarial humans. Real users learn policy boundaries, coordinate evasions, deploy sarcasm, change accounts, weaponise reporting tools, and complain publicly when moderation feels unfair. Work on language evolution under social media regulation shows why this matters: LLM-based multi-agent simulations can model users adapting coded language under supervision, which is a reminder that moderation creates strategic adaptation rather than static compliance.⁸

Finally, personalisation raises governance questions. A message tailored to a user’s profile may reduce harm. It may also feel manipulative, discriminatory, or opaque. The more effective the intervention becomes, the more carefully it needs to be governed. Efficiency is not a moral solvent. Annoying, but true.

The real lesson is counterfactual discipline

COSMOS is not important because it gives moderation teams a synthetic internet to play with. It is important because it imposes counterfactual discipline on a field that often confuses action with evidence.

The common moderation workflow is reactive: detect, remove, warn, ban, review, argue, repeat. COSMOS suggests a more scientific loop: simulate, compare, measure trade-offs, validate, then deploy cautiously. That will not make online civility easy. It may make policy mistakes cheaper to discover.

The paper’s best contribution is therefore methodological. It reframes moderation as a testable intervention inside a dynamic social system. The strongest business implication is not that platforms should trust LLM agents. It is that operators should stop shipping moderation policies without asking what the alternative timeline might have looked like.

A simulation will never be the community. But a well-designed counterfactual simulation can reveal whether a policy is calming the room, merely deleting the loudest people, or quietly making future conflict more likely. For any platform that depends on user participation, that distinction is not academic. It is the difference between governance and expensive theatre.

Cognaptus: Automate the Present, Incubate the Future.

1. Giacomo Fidone, Lucia Passaro, and Riccardo Guidotti, “Evaluating Online Moderation Via LLM-Powered Counterfactual Simulations,” arXiv:2511.07204, 2025. https://arxiv.org/abs/2511.07204
2. Cynthia Dwork, Chris Hays, Jon Kleinberg, and Manish Raghavan, “Content Moderation and the Formation of Online Communities: A Theoretical Framework,” arXiv:2310.10573, 2023. https://arxiv.org/abs/2310.10573
3. Deepak Kumar, Yousef AbuHashem, and Zakir Durumeric, “Watch Your Language: Investigating Content Moderation with Large Language Models,” arXiv:2309.14517, 2023. https://arxiv.org/abs/2309.14517
4. Giulio Rossetti et al., “Y Social: an LLM-powered Social Media Digital Twin,” arXiv:2408.00818, 2024. https://arxiv.org/abs/2408.00818
5. Dominik Hangartner et al., “Empathy-based counterspeech can reduce racist hate speech in a social media field experiment,” Proceedings of the National Academy of Sciences, 2021. https://www.pnas.org/doi/10.1073/pnas.2116310118
6. Stefano Cresci, Amaury Trujillo, and Tiziano Fagni, “Personalized Interventions for Online Moderation,” arXiv:2205.09462, 2022. https://arxiv.org/abs/2205.09462
7. Alyssa Lees et al., “A New Generation of Perspective API: Efficient Multilingual Character-level Transformers,” arXiv:2202.11176, 2022. https://arxiv.org/abs/2202.11176
8. Jinyu Cai et al., “Language Evolution for Evading Social Media Regulation via LLM-based Multi-agent Simulation,” arXiv:2405.02858, 2024. https://arxiv.org/abs/2405.02858
