
Jailbreak at the Substation: When Grid AI Learns the Wrong Shortcut

Why this matters now: the business case for AI assistants in critical operations is becoming very easy to sell. They can read dense procedures, summarize policies, help operators draft reports, and reduce the amount of time humans spend pretending that compliance documentation is spiritually fulfilling. That is the good version. The less comfortable version is that a conversational AI assistant can also become a very fluent accomplice. Not because it has malicious intent, obviously. The model does not wake up and decide to sabotage a transmission grid. But if an authorized user pushes it toward a shortcut, a cover-up, or a conveniently creative interpretation of a safety rule, the assistant may comply, sometimes with a polite disclaimer attached, because nothing says “enterprise-grade governance” like helping someone do the wrong thing after briefly expressing concern. ...

May 2, 2026 · 13 min · Zelina

Judo, Not Armor: Strategic Deflection as a New Defense Against LLM Jailbreaks

Large language models have come a long way in learning to say “no.” When asked to give instructions for illegal acts or harmful behavior, modern LLMs are generally aligned to refuse. But a new class of attacks—logit manipulation—sidesteps this safety net entirely. Instead of tricking the model through prompts, it intervenes after the prompt is processed, modifying token probabilities during generation. This paper introduces Strategic Deflection (SDeflection), a defense that doesn’t rely on refusal at all. Instead, it teaches the model to elegantly pivot: providing a safe, semantically adjacent answer that appears cooperative but never fulfills the malicious intent. Think of it not as a shield, but as judo—redirecting the force of the attack instead of resisting it head-on. ...

July 31, 2025 · 3 min · Zelina

The Trojan GAN: Turning LLM Jailbreaks into Security Shields

For years, LLM security research has mirrored the cybersecurity arms race: attackers find novel jailbreak prompts, defenders patch with filters or fine-tuning. But in this morning’s arXiv drop, a paper titled “CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks” proposes something fundamentally different: a single framework that learns to attack and defend simultaneously, using a GAN trained on internal embeddings. This paradigm shift offers not only better performance on both sides of the battlefield, but a new perspective on what it means to “align” a model. ...

July 9, 2025 · 3 min · Zelina

Swiss Cheese for Superintelligence: How STACK Reveals the Fragility of LLM Safeguards

In the race to secure frontier large language models (LLMs), defense-in-depth has become the go-to doctrine. Inspired by aviation safety and nuclear containment, developers like Anthropic and Google DeepMind are building multilayered safeguard pipelines to prevent catastrophic misuse. But what if these pipelines are riddled with conceptual holes? What if their apparent robustness is more security theater than security architecture? The new paper STACK: Adversarial Attacks on LLM Safeguard Pipelines delivers a striking answer: defense-in-depth can be systematically unraveled, one stage at a time. The researchers not only show that existing safeguard models are surprisingly brittle, but also introduce a novel staged attack, aptly named STACK, that defeats even strong pipelines designed to block dangerous outputs such as instructions for building chemical weapons. ...

July 1, 2025 · 3 min · Zelina