Moderation teams live inside an annoying counterfactual.
A user posts something toxic. The platform sends a warning, hides the post, suspends the account, or does nothing. A week later, the team can measure what happened. What it cannot observe is the parallel platform where the same user, same thread, same sequence of replies, and same ambient mood unfolded without that intervention.
That missing parallel world is the real problem. Not the dashboard. Not the policy memo. Not the quarterly “community integrity” slide with a suspiciously cheerful line chart. The problem is that moderation decisions are causal decisions, while most platform data is observational soup.
The paper behind COSMOS (COunterfactual Simulations of MOderation Strategies) proposes a way to build that missing world synthetically.[1] It uses LLM-powered agents to simulate online conversations twice: once as a factual feed where agents behave freely, and once as a counterfactual feed where moderation interventions alter future behaviour. The contribution is not that an LLM can write rude comments. Congratulations, the machine has discovered the internet. The contribution is the controlled counterfactual machinery around those comments.
That distinction matters. Read carelessly, the paper sounds like evidence that personalised moderation “works”. Read properly, it offers something more useful and less reckless: a sandbox for comparing moderation strategies before testing them on real users.
The important object is the twin feed
COSMOS is built around two conversation graphs.
The first is the factual feed. Agents post and comment based on their profile, what they see, and their local conversation context. The second is the counterfactual feed. It mirrors the factual feed, except that moderation can enter the agent’s memory and alter what the agent writes later.
That is the mechanism-first idea. The simulator is not merely asking an LLM, “Would this warning reduce toxicity?” It is creating a branching conversational world in which moderation affects future text through two channels.
First, moderation can directly affect the moderated agent. If an agent writes toxic content in the counterfactual feed, COSMOS can insert a moderation message into that agent’s counterfactual memory. Later, when the agent posts or comments again, the prompt includes that remembered intervention. The user has, in synthetic form, been warned.
Second, moderation can indirectly affect other agents. If the moderated agent writes a less toxic reply later, another agent who comments downstream is now responding to a different piece of text. That second agent may become less toxic without ever receiving a moderation message. This is the useful part: moderation is treated not as a single-user nudge, but as a local disturbance in a conversational system.
The paper’s Figure 1 makes this concrete. An agent receives a moderation intervention after a toxic post. Later, that agent’s reply becomes less toxic in the counterfactual feed. A different agent then reduces toxicity because the conversation it sees has changed. The intervention moves through the thread like a small weather system.
COSMOS represents each agent with three kinds of information:
| Component | What it does | Why it matters |
|---|---|---|
| Profile module | Contains demographic and psychological attributes | Gives the LLM a stable persona rather than a generic user costume |
| Sensory module | Supplies the current topic or thread content | Lets agents react to their local conversational environment |
| Memory module | Stores moderation interventions in the counterfactual feed | Lets prior warnings influence later behaviour |
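A minimal sketch of how the three modules could meet in a single prompt. The `Agent` class, `build_prompt`, and the prompt wording are hypothetical names chosen for illustration, not the paper's code. The only behaviour taken from the paper is that remembered moderation enters the counterfactual prompt and never the factual one.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical agent record; field names are illustrative."""
    name: str
    profile: dict[str, str]                          # profile module: persona attributes
    memory: list[str] = field(default_factory=list)  # memory module: moderation messages


def build_prompt(agent: Agent, thread_text: str, counterfactual: bool) -> str:
    """Assemble the text an LLM would see for this agent in one feed."""
    persona = ", ".join(f"{k}: {v}" for k, v in sorted(agent.profile.items()))
    parts = [f"Persona: {persona}", f"Thread: {thread_text}"]  # sensory module
    # Only the counterfactual feed exposes remembered moderation messages.
    if counterfactual and agent.memory:
        parts.append("Moderation history: " + " | ".join(agent.memory))
    return "\n".join(parts)


agent = Agent("a1", {"agreeableness": "low", "age": "34"})
agent.memory.append("Please keep replies civil.")
factual_prompt = build_prompt(agent, "Topic: city budget", counterfactual=False)
counterfactual_prompt = build_prompt(agent, "Topic: city budget", counterfactual=True)
```

The two prompts differ only by the moderation line, which is the whole point: everything else about the agent and the thread is held fixed.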
The simulation uses a directed graph of posts and comments. At each timestamp, agents are shuffled, choose whether to post or comment, and then generate text. For comments, a simple recommender favours recent nodes while avoiding unnatural turns such as replying to oneself or replying twice to the same node. This is modest, deliberately so. COSMOS does not attempt to simulate a full social network with follows, likes, reposts, social graph effects, or recommendation-system ideology laundering. It simulates conversations.
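The comment-targeting rule can be sketched in a few lines. This is an illustration under assumptions: the `Node` fields, the function names, and the inverse-age weighting are invented here, since the article only says the recommender favours recent nodes and blocks self-replies and repeat replies.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    node_id: int
    author: str
    timestamp: int
    parent: Optional[int] = None  # None marks a top-level post


def eligible_targets(nodes, agent):
    """Nodes the agent may reply to: not its own, and not already replied to."""
    already = {n.parent for n in nodes if n.author == agent and n.parent is not None}
    return [n for n in nodes if n.author != agent and n.node_id not in already]


def pick_target(nodes, agent, now, rng):
    """Sample a reply target, weighting newer nodes more heavily.

    The 1 / (1 + age) weight is an assumed form of recency bias.
    """
    candidates = eligible_targets(nodes, agent)
    if not candidates:
        return None
    weights = [1.0 / (1 + now - n.timestamp) for n in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


nodes = [Node(0, "a", 0), Node(1, "b", 1), Node(2, "a", 2, parent=1)]
rng = random.Random(0)
target_for_a = pick_target(nodes, "a", now=3, rng=rng)  # "a" already replied to node 1
target_for_b = pick_target(nodes, "b", now=3, rng=rng)  # "b" may reply to node 0 or 2
```

Agent "a" has nothing left to reply to, so a fuller loop would have it write a new post instead.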
That narrowness is not a defect. It is the trade: less realism in exchange for more experimental control. In business language, this is not a digital twin of a platform. It is a controlled wind tunnel for one class of moderation dynamics.
The simulator first has to earn the right to be toxic
Before COSMOS can evaluate moderation, it has to clear a more basic hurdle: it must generate toxic behaviour that is neither artificially suppressed nor artificially inflated.
The authors use an uncensored version of SOLAR-10B as the language model and Google’s Perspective API as the toxicity detector. Profiles combine demographic attributes from the General Social Survey with psychological traits from PANDORA, a Reddit-based dataset with Big Five personality annotations. The final experiment uses 30 fictional profiles: 25 sampled combinations plus 5 outliers detected through Isolation Forest.
This profile construction is an implementation detail with strategic consequences. If the agents are too bland, there is no meaningful toxicity to moderate. If they are prompted to be toxic on command, the simulation becomes a puppet show wearing a lab coat. COSMOS tries to sit between those failures.
The paper tests three candidate user-prompt templates:
| Prompt template | Intended behaviour | What happened |
|---|---|---|
| no_tox | No explicit mention of toxic language | Too little toxicity: only 0.03% of outputs crossed the moderation threshold |
| yes_tox | Explicitly permits toxic language | Too much toxicity: 27.43% of outputs crossed the threshold |
| cal_tox | Tells the agent to decide based on personality and context | Best fit to the PANDORA toxicity distribution, with KL divergence of 0.07 |
This appendix result is not the main thesis. It is a calibration test. Its purpose is to show that the simulator is not simply manufacturing toxicity because the researcher asked nicely. The chosen cal_tox template still has problems: only 71.66% of outputs followed the expected XML formatting, which means about 28% had to go through the paper's handling rules for malformed output. But it gave the best balance between operational feasibility and distributional plausibility.
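For readers who want to see what such a calibration check involves, here is a rough sketch. The bin count, the smoothing constant, the direction of the divergence (simulated against reference), and the toy score lists are all assumptions made for illustration; the paper's exact procedure may differ.

```python
import math


def histogram(scores, bins=5, eps=1e-6):
    """Binned, lightly smoothed distribution over toxicity scores in [0, 1]."""
    counts = [0] * bins
    for s in scores:
        counts[min(int(s * bins), bins - 1)] += 1
    total = len(scores) + eps * bins
    return [(c + eps) / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def share_above(scores, threshold=0.6):
    """Fraction of outputs whose toxicity exceeds the moderation threshold."""
    return sum(s > threshold for s in scores) / len(scores)


# Toy scores, not data from the paper.
reference = [0.05, 0.1, 0.15, 0.3, 0.3, 0.5, 0.7, 0.9]
calibrated = [0.05, 0.1, 0.25, 0.3, 0.45, 0.5, 0.7, 0.9]
too_mild = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08]

ref_hist = histogram(reference)
kl_calibrated = kl_divergence(histogram(calibrated), ref_hist)
kl_too_mild = kl_divergence(histogram(too_mild), ref_hist)
```

In this toy setup the calibrated list sits much closer to the reference than the too-mild list, which is the kind of comparison the template selection relies on.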
The authors also select SOLAR-10B after comparing perplexity on PANDORA samples against Llama-2-13B and Vicuna-13B. SOLAR-10B had the lowest reported perplexity, 22.50 versus 24.09 for Vicuna-13B and 53.63 for Llama-2-13B, while being smaller than the 13B alternatives. Again, this is not “SOLAR is the model for moderation”. It is an implementation choice meant to reduce cost while keeping the generated OSN-like language plausible enough for the experiment.
The simulation settings are straightforward: five full-population runs, 50 timestamps, a 50/50 split between posts and comments, no inactivity, and a toxicity threshold of 0.6. The tested moderation strategies are one-size-fits-all warnings, three variants of personalised moderation intervention, and bans with different tolerance levels.
The three personalised variants matter:
| Strategy | Moderator style |
|---|---|
| PMI-N | Neutral: the moderator can adapt case by case |
| PMI-E | Empathising: the moderator is constrained toward kindness and empathy |
| PMI-P | Prescriptive: the moderator is constrained toward authority and consequences |
That distinction becomes important later, because the best-performing personalised approach is not the most emotionally nice one or the sternest one. It is the least over-specified one.
The factual world is plausible enough, not proven human
The realism assessment has a clear purpose: before comparing moderation strategies, the paper asks whether the factual simulation produces toxicity patterns that resemble real online behaviour.
The evidence is suggestive rather than definitive. Aggregating factual data across simulations, the authors find significant Spearman correlations between toxicity and three Big Five traits that mirror patterns in PANDORA:
| Trait | Real PANDORA correlation | Simulated correlation | Interpretation |
|---|---|---|---|
| Agreeableness | -0.18 | -0.32 | Lower agreeableness is associated with higher toxicity |
| Conscientiousness | -0.11 | -0.16 | Lower conscientiousness is associated with higher toxicity |
| Neuroticism | +0.04 | +0.07 | Higher neuroticism is associated with higher toxicity |
The magnitudes are not identical, and they should not be treated as psychological validation of synthetic humans. They show that the simulator preserves the direction of some known relationships. That is enough to support the next step: using the simulator as a comparative environment.
The paper also checks consistency. The average standard deviation of toxicity by agent is 0.20 in simulation versus 0.17 in real data. In plain English: agents behave with a level of internal variability that is close enough to the reference dataset to be usable, not so random that each agent becomes a new person every five minutes.
Then comes the contagion result. Toxicity in parent nodes and child nodes has a significant Spearman correlation of +0.39. A sub-population experiment reinforces this: when the five most toxic agents are grouped with the least toxic agent, median toxicity rises across agents, including the previously least toxic one. In Figure 2, the least toxic agent’s median toxicity increases by 78.5% in that more toxic environment.
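The parent-child check is a rank correlation, which is simple enough to sketch without any libraries. The function names and the toy pairs below are illustrative, and significance testing is left out.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical (parent toxicity, child toxicity) pairs from one thread.
pairs = [(0.1, 0.2), (0.8, 0.6), (0.4, 0.5), (0.9, 0.7), (0.3, 0.1)]
rho = spearman([p for p, _ in pairs], [c for _, c in pairs])
```

A positive `rho` means replies tend to be more toxic when the node they answer is more toxic, which is the pattern the contagion result describes. The same function applies to the trait-versus-toxicity correlations.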
This is the main evidence for COSMOS as a social simulator. The point is not that LLM agents have souls, group identities, or grudges. The point is that the generated agents react to conversational context in a way that allows local toxicity to propagate. For moderation testing, this is essential. If toxic behaviour does not spread through the simulated thread, then a counterfactual moderation intervention cannot have realistic indirect effects.
Personalisation works best when it is allowed to adapt
The headline moderation result is that neutral personalised moderation, PMI-N, produces the most consistent ex ante reduction in toxicity.
Table 1 reports mass divergence, $\Delta M$, which measures the relative change in total toxicity mass between the counterfactual and factual feeds. Negative values mean reduced toxicity. PMI-N achieves significant reductions in four of five simulation runs:
| Strategy | Simulation 1 | Simulation 2 | Simulation 3 | Simulation 4 | Simulation 5 | Content loss |
|---|---|---|---|---|---|---|
| OSFA | -0.06* | +0.04 | -0.05 | -0.06 | +0.05 | 0.00 |
| PMI-N | -0.09*** | +0.00 | -0.08* | -0.11* | -0.11* | 0.00 |
| PMI-E | -0.10*** | +0.06 | -0.06 | -0.02 | -0.05 | 0.00 |
| PMI-P | -0.05* | +0.06 | -0.04 | -0.03 | -0.02 | 0.00 |
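As a rough guide to reading those numbers, mass divergence can be sketched as a relative change in summed toxicity. The formula below follows the verbal definition above (relative change in total toxicity mass); the function name and the toy scores are assumptions, and the significance stars in the table come from a statistical test that this sketch does not reproduce.

```python
def mass_divergence(factual, counterfactual):
    """Relative change in total toxicity mass; negative means less toxicity."""
    mass_factual = sum(factual)
    mass_counterfactual = sum(counterfactual)
    return (mass_counterfactual - mass_factual) / mass_factual


# Toy toxicity scores for the same four nodes in each feed.
factual_scores = [0.2, 0.8, 0.5, 0.5]
counterfactual_scores = [0.2, 0.6, 0.5, 0.5]
delta_m = mass_divergence(factual_scores, counterfactual_scores)
```

Here one node drops from 0.8 to 0.6, so total mass falls from 2.0 to 1.8 and the divergence comes out at roughly -0.10, the same order of magnitude as the PMI-N row.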
The obvious interpretation is “personalisation beats generic warnings”. That is mostly right, but too broad.
The sharper interpretation is that adaptive personalisation beats rigid message style. PMI-N lets the moderator adapt tone and content to the user profile and the offending submission. PMI-E constrains the moderator toward empathy. PMI-P constrains the moderator toward authority. Both constraints can sound desirable in a policy document. In the simulation, neither improves consistency over PMI-N.
Figure 3 supports this interpretation as exploratory evidence. It maps BERT embeddings of personalised moderation messages using t-SNE. PMI-N occupies a wider semantic region than the empathising or prescriptive variants. This is not proof of causality by itself; t-SNE plots are not sacred tablets. But it fits the mechanism: the more flexible moderator style creates more varied interventions, which may better match heterogeneous users and contexts.
This is where the paper is useful for platform teams. It suggests that the key design variable may not be whether a warning is “personalised” in a shallow demographic sense. The key variable is whether the moderation system has enough freedom to select the right communication strategy for the situation.
A product team might hear this and immediately propose a personalised warning generator. Fine. Then the compliance team should ask the boring but correct question: personalised based on what, optimised against which harms, evaluated under which false-positive and fairness constraints? The paper does not answer that operational question. It gives a simulation method for asking it before the live experiment begins.
Bans reduce toxicity by deleting the conversation
The ban results look dramatic until one remembers what a ban is.
Bans reduce counterfactual toxicity by removing future content from banned agents. In COSMOS, a ban strategy has a tolerance parameter $e$: once an agent exceeds the allowed number of toxic violations, it can no longer generate counterfactual posts or comments. Lower tolerance means faster exclusion.
The mass-divergence results are large:
| Ban tolerance | Simulation 1 | Simulation 2 | Simulation 3 | Simulation 4 | Simulation 5 | Average content loss |
|---|---|---|---|---|---|---|
| BAN-1 | -0.54*** | -0.45*** | -0.58*** | -0.57*** | -0.52** | 0.45 |
| BAN-2 | -0.40*** | -0.29** | -0.46*** | -0.47*** | -0.38** | 0.32 |
| BAN-4 | -0.25** | -0.17* | -0.21** | -0.28** | -0.23** | 0.16 |
| BAN-8 | -0.06 | -0.04 | -0.06 | -0.08 | -0.07 | 0.04 |
At tolerance 1, toxicity falls by more than half in several runs. But the average Content Loss Ratio is 0.45. Nearly half of the counterfactual feed disappears. The paper further reports that this includes 0.27 ± 0.03 of “healthy” content, meaning content below the toxicity threshold.
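The deletion trade-off is easy to see in a toy version of the ban rule. This sketch applies a ban to a fixed stream of scored items, which is a simplification: in COSMOS the counterfactual text is regenerated, not filtered. The boundary convention (banned once violations exceed the tolerance, with the triggering item still published) and the choice to express healthy loss as a share of the whole feed are assumptions.

```python
from collections import Counter

THRESHOLD = 0.6  # toxicity threshold reported for the experiments


def apply_ban(stream, tolerance):
    """Split a stream of (agent, toxicity) items into kept and lost items."""
    violations = Counter()
    banned = set()
    kept, lost = [], []
    for agent, tox in stream:
        if agent in banned:
            lost.append((agent, tox))  # banned agents produce nothing further
            continue
        kept.append((agent, tox))
        if tox > THRESHOLD:
            violations[agent] += 1
            if violations[agent] > tolerance:
                banned.add(agent)
    return kept, lost


def content_loss(stream, lost):
    """Return (share of feed lost, share of feed lost that was below threshold)."""
    healthy_lost = sum(1 for _, tox in lost if tox <= THRESHOLD)
    return len(lost) / len(stream), healthy_lost / len(stream)


stream = [("a", 0.9), ("b", 0.2), ("a", 0.8), ("a", 0.3), ("b", 0.1), ("a", 0.7)]
kept, lost = apply_ban(stream, tolerance=1)
loss_ratio, healthy_loss_ratio = content_loss(stream, lost)
```

Agent "a" is banned after its second violation, so its later 0.3 comment disappears along with its later 0.7 comment. One of the two lost items was below the threshold, which is the collateral the Content Loss Ratio is meant to expose.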
This is the business interpretation: bans are effective partly because they remove toxic users, and partly because they remove future participation that may not itself be toxic. That may be acceptable for severe abuse. It is not a free lunch. It is, technically speaking, eating the plate.
The authors express this through a prediction framing. If the ban policy must decide whether a piece of text is worth losing because it would have toxicity above the threshold, the macro-average recall is 0.55 at tolerance 1, 0.60 at tolerance 2, and 0.58 at tolerance 4. Those are not thrilling numbers. They say the ban strategy is blunt even when the aggregate toxicity curve looks excellent.
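One plausible reading of that framing, sketched under assumptions: treat "removed by the ban" as a prediction that an item is toxic, then average the recall of the toxic class and the healthy class. The function name, the boolean encoding, and the toy labels are illustrative, and the paper may construct its labels differently.

```python
def macro_recall(is_toxic, is_removed):
    """Mean of per-class recall, treating removal as a 'toxic' prediction.

    Assumes both classes are present; otherwise a denominator is zero.
    """
    pairs = list(zip(is_toxic, is_removed))
    toxic_removed = sum(1 for t, r in pairs if t and r)
    toxic_kept = sum(1 for t, r in pairs if t and not r)
    healthy_kept = sum(1 for t, r in pairs if not t and not r)
    healthy_removed = sum(1 for t, r in pairs if not t and r)
    recall_toxic = toxic_removed / (toxic_removed + toxic_kept)
    recall_healthy = healthy_kept / (healthy_kept + healthy_removed)
    return (recall_toxic + recall_healthy) / 2


# Toy labels: which items were above threshold, and which the ban removed.
is_toxic = [True, True, False, False, False, True]
is_removed = [False, True, True, False, False, True]
score = macro_recall(is_toxic, is_removed)
```

A value near 0.5 means the ban is barely better than removing content at random with respect to toxicity, which is why figures in the 0.55 to 0.60 range read as blunt.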
The appendix figures make the distinction clearer. Figure 17 shows ban-related toxicity reduction accumulating steadily over time. Figure 18 shows content loss accumulating with it. Figure 20 separates direct and indirect loss of toxic content, finding that direct effects dominate. In practical terms, banned agents are indeed among the more toxic contributors, but the policy still carries collateral deletion.
For executives, this matters because a platform can always make a trust-and-safety metric look better by shrinking the activity base. The hard question is not “did toxicity fall?” It is “what did we remove to make it fall, and was that trade acceptable?”
The strongest moderation effects live in the tail
COSMOS also looks at where moderation changes the toxicity distribution.
The paper uses quantile divergence, $\Delta q$, to compare the factual and counterfactual distributions across toxicity quantiles. Figure 5 shows that moderation mostly affects extreme toxicity, especially at $q \geq 0.8$. Effects in the lower toxicity range are mixed, particularly between $q = 0.6$ and $q = 0.8$.
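A small sketch of what a quantile comparison looks like in practice. The relative-difference formula, the decile grid, and the toy scores are assumptions; the article does not spell out the paper's exact definition of the metric.

```python
from statistics import quantiles


def quantile_divergence(factual, counterfactual, n=10):
    """Relative change at each of the n - 1 cut points (q = 1/n, 2/n, ...).

    Assumes every factual quantile is non-zero.
    """
    q_factual = quantiles(factual, n=n, method="inclusive")
    q_counterfactual = quantiles(counterfactual, n=n, method="inclusive")
    return [(cf - f) / f for f, cf in zip(q_factual, q_counterfactual)]


# Toy feeds: moderation only caps the most toxic items at 0.8.
factual = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
counterfactual = [min(score, 0.8) for score in factual]
deltas = quantile_divergence(factual, counterfactual)
```

In this toy case the first seven deciles are unchanged and only the top two move, which is the shape of a tail-focused effect: aggregate toxicity falls, but most of the distribution is untouched.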
This is good news if the operational objective is harm concentration. Many platforms care disproportionately about the toxic tail: the users or threads that create most of the damage, moderation workload, advertiser concern, or legal exposure. A moderation strategy that mainly compresses the worst end of the distribution may be preferable to one that sands down ordinary disagreement.
But it also means the paper does not show broad civility improvement across the whole conversation space. It shows targeted movement where toxicity is already high. That is a different product promise.
Figure 4 adds a subgroup view across OCEAN traits. The moderation strategies follow similar trends by psychological profile, but only personalised interventions and bans with tolerance $e \leq 4$ create significant divergences. The authors observe that moderation successfully targets prototypical toxic agents: low agreeableness, high neuroticism, and low conscientiousness.
This is useful, and delicate. A business should not walk away thinking, “Great, let’s profile neurotic users.” The safer interpretation is that moderation interventions may work unevenly across behavioural profiles, so evaluation should not report only aggregate toxicity reduction. Segment-level response matters. Otherwise, a policy may look effective overall while failing on the exact user groups that drive harm.
What each experiment is actually doing
A cleaner way to read the paper is to separate main evidence from calibration and exploratory support.
| Paper element | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Factual-vs-counterfactual feed design | Core mechanism | Moderation can be isolated while keeping simulated conditions otherwise matched | Real-world causal impact on a live platform |
| Psychological trait correlations | Realism assessment | Simulated agents reproduce some directionally plausible toxicity-trait relationships | Full psychological validity of LLM agents |
| Parent-child toxicity correlation and sub-population test | Main evidence for contagion | Toxicity propagates through generated threads | Human social contagion magnitude in real networks |
| Table 1 mass divergence | Main moderation evidence | PMI-N reduces toxicity more consistently than OSFA; bans reduce more by deleting content | Universal superiority of personalised moderation |
| Figure 3 BERT/t-SNE moderation message map | Exploratory mechanism support | PMI-N generates semantically broader interventions | That semantic spread causes the performance gain |
| Figure 4 OCEAN subgroup divergence | Sensitivity analysis | Moderation effects differ across psychological traits | Safe deployment of personality-based targeting |
| Figure 5 quantile divergence | Distributional analysis | Moderation mostly affects the extreme toxicity tail | Overall improvement in every part of conversation quality |
| Appendix prompt/model selection | Implementation calibration | The simulator was tuned to avoid trivial under- or over-production of toxicity | That the chosen model and prompts are best generally |
This table is not academic housekeeping. It prevents a common business mistake: treating every chart as equally deployable evidence. Some charts justify the simulator. Some justify the chosen configuration. Some support the moderation result. Some merely explain why the moderation result might have happened.
The difference is expensive when ignored.
The business value is policy rehearsal, not automated judgement
For trust-and-safety organisations, the obvious temptation is to convert COSMOS into an automated moderation product. That would be premature.
The stronger business use is policy rehearsal. COSMOS-style simulation can help teams compare interventions before launching a live test. It can ask: What happens if warnings are generic? What happens if they are tailored? How much toxicity reduction comes from behavioural change versus content removal? Which user profiles or thread types are most responsive? How much healthy content do we lose when bans become stricter?
That is valuable because moderation experiments are operationally expensive and ethically awkward. Real users are not lab mice, despite the best efforts of growth teams everywhere. A simulation cannot replace field validation, but it can reduce the number of bad policies that reach the field.
A realistic business pathway looks like this:
| Business question | COSMOS-style use | Decision value | Boundary |
|---|---|---|---|
| Should we test personalised warnings? | Compare generic and personalised warning variants across controlled synthetic feeds | Prioritise which variants deserve live A/B testing | Simulation effects may not transfer to real users |
| How strict should ban tolerance be? | Measure toxicity reduction alongside content loss | Make the deletion trade-off explicit | Does not capture appeals, user migration, or reputational backlash |
| Are we improving civility or just removing users? | Separate ex ante warning effects from ex post ban effects | Avoid misleading aggregate trust-and-safety KPIs | Requires better real-world outcome metrics |
| Which groups respond differently? | Compute divergence by simulated profile traits | Detect uneven policy effects before deployment | Personality simulation is not a licence for invasive profiling |
| How sensitive is the result to model/prompt choices? | Re-run with alternative models, prompts, thresholds, and detectors | Improve robustness before field trials | Cost and validation burden increase quickly |
The ROI case is not “replace moderators with agents”. The paper’s own ethical statement is clear that full replacement of human moderators remains controversial. The ROI case is cheaper diagnosis: fewer live experiments, clearer trade-off curves, better pre-mortems, and more disciplined moderation design.
That is less glamorous than “AI fixes online toxicity”. It is also far more believable.
The synthetic public square is intentionally incomplete
The paper’s limitations are not decorative caveats. They define where the result can and cannot travel.
First, COSMOS uses synthetic agents. The profiles are fictional, even though their components are aligned with real demographic and psychological sources. The realism checks are encouraging, but they are not human-subject validation. The authors explicitly call for human-based assessments, including whether simulated responses to moderation match real-world responses.
Second, the platform model is narrow. COSMOS excludes follows, likes, reposts, social ties, richer recommender systems, homophily, and polarisation dynamics. That makes the experiment cleaner but limits its coverage. On a real platform, moderation does not only affect what one user writes next. It can affect visibility, group identity, retaliation, brigading, migration, and the delightful platform tradition of users inventing new euphemisms faster than policy can name them.
Third, the model stack matters. The experiments use an uncensored SOLAR-10B variant and Perspective API. Change the LLM, the toxicity detector, the prompting scheme, or the threshold, and the results may shift. The appendix already shows how sensitive toxicity generation is to prompt design: no_tox almost eliminates toxic outputs, while yes_tox overproduces them. That is not a minor footnote. It is a warning label.
Fourth, the simulator has generation failures. The authors estimate that roughly 7% of generated posts and comments belong to a redundant hallucination cluster. Separately, about 28% of outputs do not display the required XML tags correctly under the selected prompt template. The paper includes handling rules for these formatting issues, but for production-grade policy research, this would need stronger validation.
Fifth, scaling remains hard. LLM-based simulations are costly, and this experiment uses 30 agents, not millions of users. The authors mention future work involving client-server architectures and more efficient models. Until then, the practical use case is focused scenario testing, not full-platform replication.
Finally, LLM bias remains unresolved. If the LLM carries societal biases or self-preference effects, the simulation may reproduce or amplify them. In a role-playing setting, the extent of that distortion is still unclear. That uncertainty is especially important for moderation, where policy mistakes can fall unevenly across groups.
None of these limits invalidate COSMOS. They keep it in its proper category: a controlled research instrument, not a moderation oracle.
The actual lesson is counterfactual discipline
COSMOS is most interesting because it changes the question.
A normal moderation study asks, “Did toxicity go down after this policy?” COSMOS asks, “Compared with a matched synthetic world where everything else stayed the same, how did this intervention alter the conversation, and what did it cost?”
That is a better question. It forces the analyst to distinguish behavioural redirection from deletion. It exposes whether personalised messages actually outperform generic warnings. It shows whether moderation mostly affects the toxic tail or the whole distribution. It makes content loss visible instead of letting it hide behind improved aggregate metrics.
The strongest result is not that neutral personalised moderation reduced toxicity in four of five simulation runs, although that is useful. Nor is it that strict bans cut toxicity dramatically while deleting large portions of the feed, although every platform should have that sentence tattooed somewhere near the policy dashboard.
The strongest result is methodological: LLM-powered counterfactual simulation can become a pre-deployment testing layer for trust-and-safety strategy. Not a replacement for field evidence. Not a substitute for human judgement. A rehearsal space.
And in moderation, rehearsal matters. The alternative is experimenting directly on public discourse and then pretending the resulting chaos was an implementation detail.
[1] Giacomo Fidone, Lucia Passaro, and Riccardo Guidotti, “Evaluating Online Moderation Via LLM-Powered Counterfactual Simulations,” arXiv:2511.07204, 2025, https://arxiv.org/abs/2511.07204.