Opening — Why this matters now

Every major platform claims to be tackling online toxicity—and every quarter, the internet still burns. Content moderation remains a high-stakes guessing game: opaque algorithms, inconsistent human oversight, and endless accusations of bias. But what if moderation could be tested not in the wild, but in a lab? Enter COSMOS — a Large Language Model (LLM)-powered simulator for online conversations that lets researchers play god without casualties.

Developed by a team at the University of Pisa, *Evaluating Online Moderation via LLM-Powered Counterfactual Simulations* proposes a radical idea: instead of scraping or shadow-testing real social platforms, why not simulate them with psychologically believable AI agents? In doing so, researchers can observe how toxicity spreads, how users respond to moderation, and whether personalized interventions actually help—all without violating privacy or platform policies.


Background — From moderation to modeling

Traditional moderation research is trapped between two walls: lack of data and lack of control. Social media APIs are restricted, toxic posts are rare relative to the full stream, and any field experiment risks real harm. Yet social behavior is inherently complex—collective, emotional, and context-sensitive. Enter Agent-Based Modeling (ABM): a simulation paradigm where individual software agents interact according to rules, allowing complex patterns like misinformation cascades or outrage contagion to emerge.

LLMs, trained on billions of human utterances, now breathe life into those agents. They can imitate personalities, sustain memory, and mimic affective reactions. COSMOS merges these advances, creating a virtual social network where human-like agents debate, argue, and occasionally spiral. Unlike prior ABM work, it introduces counterfactual parallelism: two mirrored worlds, identical except for the application of moderation policies. The result? A controlled experiment in digital civility.


Analysis — Inside COSMOS

Each COSMOS run creates two timelines:

  • Factual: agents converse freely.
  • Counterfactual: the same agents converse under moderation.
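
The mechanics of that mirroring are easy to picture as two runs of the same loop with moderation toggled. Below is a minimal Python sketch of the idea; every name in it (`Agent`, `run_timeline`, the toxicity knob) is an illustrative assumption rather than code from the paper, and the LLM call is replaced by a random draw.

```python
import random
from dataclasses import dataclass

@dataclass
class Agent:
    """Toy stand-in for an LLM-driven persona; in COSMOS proper, agents carry
    demographic and Big Five profiles and produce text via an LLM."""
    name: str
    toxicity_bias: float = 0.1  # hypothetical knob standing in for personality

    def speak(self, topic, rng):
        # Draw a toxicity score around the agent's bias and clamp it to [0, 1].
        tox = max(0.0, min(1.0, rng.gauss(self.toxicity_bias, 0.15)))
        return {"author": self.name, "topic": topic, "toxicity": tox}

def run_timeline(agents, topic, moderation=None, seed=42, steps=30):
    """Same agents + same seed => the same stochastic path; only the
    moderation hook separates the factual from the counterfactual world."""
    rng = random.Random(seed)
    timeline = []
    for _ in range(steps):
        msg = rng.choice(agents).speak(topic, rng)
        if moderation is not None:
            msg = moderation(msg, timeline)  # may annotate, rewrite, or drop
        if msg is not None:                  # None means the message was removed
            timeline.append(msg)
    return timeline

agents = [Agent("a1", 0.05), Agent("a2", 0.40), Agent("a3", 0.20)]
factual = run_timeline(agents, "gun control")                 # no moderation
counterfactual = run_timeline(agents, "gun control",
                              moderation=lambda m, t: m)      # plug a policy in here
```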

Agents are built from demographic and psychological profiles (using Big Five traits) and communicate through posts and comments on contentious topics—abortion, gun control, climate change. When toxic speech appears, moderation is applied in three ways:

| Strategy | Description | Nature |
|----------|-------------|--------|
| BAN | Removes repeat offenders | Ex post (punitive) |
| OSFA | Generic “one-size-fits-all” warnings | Ex ante (preventive) |
| PMI | Personalized moderation messages tailored to each agent’s psychological profile | Ex ante (adaptive) |
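
To make the contrast among the three policies concrete, here is a hedged sketch of how they could be encoded as hooks for a loop like the one above. The threshold, the warning strings, and the `profiles` argument are illustrative assumptions; in COSMOS the PMI text is tailored to each agent's psychological profile rather than picked from a fixed lookup, so the tone rule below is only a placeholder.

```python
TOXICITY_THRESHOLD = 0.6  # hypothetical cut-off for flagging a message as toxic

def ban_policy(msg, timeline, tolerance=1):
    """Ex post, punitive: drop messages from authors who already have
    `tolerance` or more toxic messages in the timeline."""
    prior_offenses = sum(
        1 for m in timeline
        if m["author"] == msg["author"] and m["toxicity"] >= TOXICITY_THRESHOLD
    )
    return None if prior_offenses >= tolerance else msg

def osfa_policy(msg, timeline):
    """Ex ante, preventive: attach the same generic warning to every toxic
    message (one-size-fits-all)."""
    if msg["toxicity"] >= TOXICITY_THRESHOLD:
        return {**msg, "warning": "Please keep the discussion civil."}
    return msg

def pmi_policy(msg, timeline, profiles=None):
    """Ex ante, adaptive: tailor the intervention to the author's
    psychological profile (here faked with a simple trait lookup)."""
    if msg["toxicity"] >= TOXICITY_THRESHOLD and profiles:
        traits = profiles.get(msg["author"], {})
        tone = "neutral" if traits.get("neuroticism", 0.0) > 0.5 else "empathic"
        return {**msg, "warning": f"[{tone} PMI] Consider rephrasing this post."}
    return msg
```

Any of these could be passed as the `moderation` argument in the earlier sketch; note that the ban hook drops an offender's later messages whether or not they are toxic, which is exactly the participation cost discussed in the findings below.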

Crucially, COSMOS keeps everything else constant: same participants, same topics, same sequence of interactions. Only the moderation changes. This allows the researchers to measure moderation efficacy as a causal delta, rather than a mere correlation.
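
Read this way, the effect of a policy is a paired difference between mirrored worlds. As a hedged formalization (our reading of the ΔM figures reported in the findings below, not a formula quoted from the paper), one could write:

$$
\Delta M \;=\; \frac{1}{N}\sum_{i=1}^{N}\left(\overline{\mathrm{tox}}\!\left(C_i^{\text{moderated}}\right)-\overline{\mathrm{tox}}\!\left(C_i^{\text{factual}}\right)\right)
$$

where $C_i^{\text{factual}}$ and $C_i^{\text{moderated}}$ are the $i$-th mirrored conversation pair, $\overline{\mathrm{tox}}(\cdot)$ is the mean toxicity of a conversation's messages, and negative values mean the policy lowered toxicity relative to its own baseline.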


Findings — Personality, contagion, and the paradox of civility

The simulation yields three striking results:

  1. Toxicity behaves realistically. Agents with low agreeableness and high neuroticism produced more toxic messages—mirroring human data. Toxicity also spread contagiously: when one agent lashed out, replies tended to mirror that tone ($\rho = +0.39$; see the correlation sketch after the results table below).

  2. Personalization works best. Personalized moderation interventions (PMIs) significantly reduced toxicity compared to uniform messages. The neutral-tone variant (neither overly harsh nor empathetic) performed best, suggesting that tone-tuned messaging matters more than moralizing.

  3. Bans reduce toxicity—and conversation. Harsh ban policies cut toxic content but also silenced non-toxic speech. At the lowest tolerance setting, 45% of content vanished—a chilling trade-off between safety and participation.

| Moderation Strategy | Avg. Toxicity Change (ΔM) | Content Loss Ratio |
|---------------------|---------------------------|--------------------|
| OSFA | ~0.00 | 0% |
| PMI-Neutral | −0.09*** | 0% |
| PMI-Empathic | −0.06 | 0% |
| BAN (e=1) | −0.54*** | 45% |

(Significance: *** denotes p < 0.01.)
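
The contagion figure from finding 1 is simply a correlation between each reply's toxicity and the toxicity of the message it answers. A minimal sketch, assuming Spearman's rank correlation and a hypothetical message schema with `id`, `parent_id`, and `toxicity` fields (the paper's exact estimator and data layout are not specified here):

```python
from scipy.stats import spearmanr

def contagion_correlation(messages):
    """Correlate parent toxicity with reply toxicity over all reply pairs.
    `messages` is a list of dicts with hypothetical `id`, `parent_id`,
    and `toxicity` keys; returns (rho, p_value)."""
    by_id = {m["id"]: m for m in messages}
    parent_tox, reply_tox = [], []
    for m in messages:
        parent = by_id.get(m.get("parent_id"))
        if parent is not None:
            parent_tox.append(parent["toxicity"])
            reply_tox.append(m["toxicity"])
    return spearmanr(parent_tox, reply_tox)  # rho > 0 => replies mirror parents
```

A positive ρ of the size reported (+0.39) would indicate a moderate tendency for replies to echo the tone of the post they respond to.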

The team also observed that moderation effectiveness varied by personality type. Low-agreeableness and high-neuroticism agents responded most strongly to interventions—the same profiles most prone to toxicity. In other words, moderation that knows its target can be both firm and fair.


Implications — Synthetic societies as policy testbeds

COSMOS hints at a broader transformation: AI-powered counterfactual social science. Instead of waiting for real-world crises, regulators, ethicists, and platform designers could stress-test rules in silico. Questions like “Should empathy or authority guide moderators?” or “How does deplatforming shape network health?” can now be explored through simulation.

However, the paper is self-aware. LLMs hallucinate. They amplify biases. They are, in the authors’ own words, psychologically believable but not necessarily socially representative. Yet the direction is promising. COSMOS doesn’t claim to replace fieldwork—it complements it, offering a controlled environment where policy can evolve faster than outrage.


Conclusion — Modeling the messy middle

Social platforms are no longer just communication tools—they are behavioral ecosystems. Evaluating moderation through LLM-based counterfactuals might be the first step toward governing them with scientific precision rather than reactive outrage. COSMOS shows that moderation isn’t just about removing toxicity; it’s about understanding it.

Cognaptus: Automate the Present, Incubate the Future.