When Guardrails Learn from the Shadows

Labels are expensive. Safety labels are worse. A normal classification project asks annotators to decide whether a customer complaint is urgent, whether a product photo contains a defect, or whether a support ticket belongs to billing. Annoying, yes. Existentially unpleasant, usually no. LLM safety moderation is different. The training examples may include malicious requests, jailbreak attempts, harmful advice, unsafe responses, and edge cases where intent is deliberately hidden under polite phrasing. The annotator must not only read the text but understand what the user is trying to make the model do. In other words, the expensive part is not clicking “safe” or “unsafe.” The expensive part is detecting intent when the user has carefully wrapped it in bubble wrap. ...

December 26, 2025 · 16 min · Zelina

You Know It When You See It—But Can the Model?

Review queue. Someone has to decide whether an image is “unsafe,” “misleading,” “healthy,” “premium,” “clickbait,” “brand-safe,” or “not really our vibe.” The label sounds simple until the first borderline case appears. A salad with too much cream. A gaming ad that hints at easy money but never quite says it. A before-and-after photo where the “achievement” is visible only if one is feeling generous. ...

December 12, 2025 · 15 min · Zelina

Flame Tamed: Can LLMs Put Out the Internet’s Worst Fires?

A comment thread rarely explodes in one clean motion. It starts with a correction. Then someone reads the correction as condescension. Then another person adds a historical grievance, a screenshot, three exclamation marks, and the kind of moral certainty normally reserved for courtrooms and family dinners. By the time a moderator arrives, the thread is no longer a conversation. It is archaeology with insults. ...

December 3, 2025 · 17 min · Zelina

Parallel Worlds of Moderation: How LLM Simulations Are Stress-Testing Online Civility

TL;DR for operators: Moderation is usually measured after the mess has already happened. COSMOS changes the sequence: it lets researchers run a synthetic online conversation twice, once without moderation and once with a selected intervention, while keeping the simulated world otherwise constant.[1] That is the useful idea. Not “LLMs can pretend to be angry internet users,” though they can, which is an achievement of sorts. The useful idea is controlled comparison. ...

November 12, 2025 · 16 min · Zelina

Parallel Worlds of Moderation: Simulating Online Civility with LLMs

Moderation teams live inside an annoying counterfactual. A user posts something toxic. The platform sends a warning, hides the post, suspends the account, or does nothing. A week later, the team can measure what happened. What it cannot observe is the parallel platform where the same user, same thread, same sequence of replies, and same ambient mood unfolded without that intervention. ...

November 11, 2025 · 18 min · Zelina

Humans in the Loop, Not Just the Dataset

TL;DR for operators: AI-assisted monitoring does not become trustworthy because a human occasionally clicks “wrong label.” It becomes useful when the whole product is designed to capture, validate, resolve, and redeploy human judgement. The paper behind this article studies an open-source Telegram monitoring tool being developed with civil society organisations, using conspiracy-theory classification as the working scenario.[1] Its practical contribution is a workflow: Telegram posts are classified, CSO users review labels during their normal monitoring work, their feedback is stored with metadata, and that accumulated feedback becomes a gold-standard dataset for model evaluation and refinement. ...

July 10, 2025 · 14 min · Zelina