
Judo, Not Armor: Strategic Deflection as a New Defense Against LLM Jailbreaks

Large language models have come a long way in learning to say “no.” When asked to give instructions for illegal acts or harmful behavior, modern LLMs are generally aligned to refuse. But a new class of attacks—logit manipulation—sidesteps this safety net entirely. Instead of tricking the model through prompts, it intervenes after the prompt is processed, modifying token probabilities during generation. This paper introduces Strategic Deflection (SDeflection), a defense that doesn’t rely on refusal at all. Instead, it teaches the model to elegantly pivot: providing a safe, semantically adjacent answer that appears cooperative but never fulfills the malicious intent. Think of it not as a shield, but as judo—redirecting the force of the attack instead of resisting it head-on. ...

July 31, 2025 · 3 min · Zelina

The Trojan GAN: Turning LLM Jailbreaks into Security Shields

For years, LLM security research has mirrored the cybersecurity arms race: attackers find novel jailbreak prompts, defenders patch with filters or fine-tuning. But in this morning’s arXiv drop, a paper titled “CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks” proposes something fundamentally different: a single framework that learns to attack and defend simultaneously, using a GAN trained on internal embeddings. This paradigm shift offers not only better performance on both sides of the battlefield, but also a new perspective on what it means to “align” a model. ...

July 9, 2025 · 3 min · Zelina

Swiss Cheese for Superintelligence: How STACK Reveals the Fragility of LLM Safeguards

In the race to secure frontier large language models (LLMs), defense-in-depth has become the go-to doctrine. Inspired by aviation safety and nuclear containment, developers like Anthropic and Google DeepMind are building multilayered safeguard pipelines to prevent catastrophic misuse. But what if these pipelines are riddled with conceptual holes? What if their apparent robustness is more security theater than security architecture? The new paper “STACK: Adversarial Attacks on LLM Safeguard Pipelines” delivers a striking answer: defense-in-depth can be systematically unraveled, one stage at a time. The researchers not only show that existing safeguard models are surprisingly brittle, but also introduce a novel staged attack—aptly named STACK—that defeats even strong pipelines designed to block dangerous outputs such as instructions for building chemical weapons. ...

July 1, 2025 · 3 min · Zelina