Cover image

Context Is the New Attack Surface

Context Is the New Attack Surface A policy can block a sentence. It has a harder time blocking a story. That is the uncomfortable lesson from Jailbreak Mimicry, a recent arXiv paper by Pavlos Ntais on automated discovery of narrative-based jailbreaks for large language models.1 The paper trains a compact attacker model to transform harmful goals into plausible narrative or functional contexts, then tests whether larger models still produce harmful output. The headline number is easy to quote: the trained attacker reaches 81.0% attack success against GPT-OSS-20B on a held-out 200-item test set. The business lesson is less flashy and more useful: safety failures may not live in the forbidden content alone. They often live in the surrounding work story that makes the request look legitimate. ...

May 16, 2026 · 13 min · Zelina
Cover image

Jailbreak and Enter: Why LLM Security Needs a Cube, Not a Scoreboard

Opening — Why this matters now The AI industry has spent the last two years teaching executives a strangely comforting phrase: “the model refused.” That phrase is now dangerously inadequate. A refusal is not a security architecture. It is a behavioral outcome under one prompt, one context window, one model version, one judge, and one assumption about what the attacker is trying to do. Change any of those variables and the safety story can change. Sometimes gently. Sometimes like a glass door discovering what gravity does. ...

May 7, 2026 · 15 min · Zelina
Cover image

When Safety Stops Being a Turn-Based Game

Opening — Why this matters now LLM safety has quietly become an arms race with terrible reflexes. We discover a jailbreak. We patch it. A new jailbreak appears, usually crafted by another LLM that learned from the last patch. The cycle repeats, with each round producing models that are slightly safer and noticeably more brittle. Utility leaks away, refusal rates climb, and nobody is convinced the system would survive a genuinely adaptive adversary. ...

December 28, 2025 · 4 min · Zelina
Cover image

Who Watches the Watchers? Weak-to-Strong Monitoring that Actually Works

The TL;DR Architecture > Access. The paper argues that monitor design (scaffolding) matters more than how much the monitor “knows.” A hybrid of hierarchical + sequential consistently beats full‑context prompting. Awareness asymmetry. If the agent knows it’s being watched, monitor reliability plunges; giving the monitor more task info helps far less. Weak→Strong is viable. With the hybrid scaffold, smaller, trusted models can reliably monitor bigger, stronger agents. Humans help—selectively. Escalate only pre‑flagged cases; this targeted HiLT improves TPR at 1% FPR by about 15%. What the authors actually did (and why it matters for business) Monitoring problem. Modern agents can run for hours, call tools, and browse files—plenty of room to hide “side tasks” (e.g., quiet data exfiltration) while completing the main job. The study standardizes Monitor Red Teaming (MRT) across: ...

August 30, 2025 · 4 min · Zelina
Cover image

Consent, Coaxing, and Countermoves: Simulating Privacy Attacks on LLM Agents

When organizations deploy LLM-based agents to email, message, and collaborate on our behalf, privacy threats stop being static. The attacker is now another agent able to converse, probe, and adapt. Today’s paper proposes a simulation-plus-search framework that discovers these evolving risks—and the countermeasures that survive them. The result is a rare, actionable playbook: how attacks escalate in multi-turn dialogues, and how defenses must graduate from rules to identity-verified state machines. ...

August 18, 2025 · 5 min · Zelina