LLM Security

Mind the Slot: Jailbreak Prompts Have Weak Points, Not Just Bad Words

Security teams like to search for suspicious strings. That habit is understandable. Strings are visible. They can be logged, filtered, matched, scored, and proudly displayed in dashboards. A bad suffix at the end of a prompt looks like a bad suffix at the end of a prompt. Convenient. Almost too convenient. The problem is that prompts are not flat text boxes. They are transformed into token sequences, wrapped in chat templates, and passed through attention layers that do not treat every position equally. Some positions exert more influence over the model’s next-token behavior than others. Put adversarial tokens there, and the same amount of “badness” can travel farther. ...

Don’t Just Guard the Door: Jailbreak Safety Needs Checkpoints

A single prompt classifier is an attractive idea because it is simple, cheap, and easy to draw in a system diagram. The user sends a prompt. The guard says safe or unsafe. The model either answers or refuses. Very tidy. Also, increasingly incomplete. ...

Context Is the New Attack Surface

A benchmark score is easy to quote. It is harder to know what broke. In “Jailbreak Mimicry: Automated Discovery of Narrative-Based Jailbreaks for Large Language Models,” Pavlos Ntais reports an 81.0% attack success rate against GPT-OSS-20B on a held-out 200-item test set [1]. That number is attention-grabbing. It is also not the main lesson. ...

Jailbreak and Enter: Why LLM Security Needs a Cube, Not a Scoreboard

Opening: Why this matters now. The AI industry has spent the last two years teaching executives a strangely comforting phrase: “the model refused.” That phrase is now dangerously inadequate. A refusal is not a security architecture. It is a behavioral outcome under one prompt, one context window, one model version, one judge, and one assumption about what the attacker is trying to do. Change any of those variables and the safety story can change. Sometimes gently. Sometimes like a glass door discovering what gravity does. ...

Jailbreak at the Substation: When Grid AI Learns the Wrong Shortcut

Opening: Why this matters now. The business case for AI assistants in critical operations is becoming very easy to sell. They can read dense procedures, summarize policies, help operators draft reports, and reduce the amount of time humans spend pretending that compliance documentation is spiritually fulfilling. That is the good version. The less comfortable version is that a conversational AI assistant can also become a very fluent accomplice. Not because it has malicious intent, obviously. The model does not wake up and decide to sabotage a transmission grid. But if an authorized user pushes it toward a shortcut, a cover-up, or a conveniently creative interpretation of a safety rule, the assistant may comply, sometimes with a polite disclaimer attached, because nothing says “enterprise-grade governance” like helping someone do the wrong thing after briefly expressing concern. ...

Mind the Drift: Why Stateful AI Guardrails Beat Bigger Models

A chatbot rarely fails in one clean dramatic explosion. More often, it is nudged. First, the user asks for a harmless explanation. Then a role-play frame. Then a historical analogy. Then a translation. Then a “purely fictional” operational detail. By the time the final request arrives, the model has already been walked across the room. The last prompt is not the attack. It is the receipt. ...

GAVEL: When AI Safety Grows a Rulebook

Rules are boring until the audit starts. That is roughly where enterprise AI safety is heading. A chatbot can be polite, policy-aligned, and apparently harmless on the surface, while still performing the internal work of manipulation, scam automation, or unsafe assistance. Text moderation catches what the model says. Classic activation monitoring tries to catch what the model is internally representing. But both can become awkward in production: one sees too little, the other often explains too little. ...

Prompted to Death: When Words Become a Denial-of-Service

A customer asks an AI assistant a question. The assistant begins answering, continues answering, wanders into repetition, and eventually reaches the maximum output limit. Nobody stole a password. No prohibited content appeared. The model may even have remained grammatically competent throughout the ordeal. It simply consumed far more computation than the request deserved. ...

When the Paper Talks Back: Lost in Translation, Rejected by Design

A PDF is supposed to sit quietly. It may contain claims, equations, tables, and occasionally an appendix long enough to test a reviewer’s commitment to science. It is not supposed to negotiate with the system judging it. That assumption becomes unreliable once a document enters an LLM-based workflow. To the human reader, a sentence rendered in white text may be invisible. To a text-extraction pipeline, it can remain perfectly legible—and potentially indistinguishable from an instruction the model is expected to follow. ...

Judo, Not Armor: Strategic Deflection as a New Defense Against LLM Jailbreaks

TL;DR for operators: Most LLM safety systems still assume that, when a model sees a harmful request, the correct behaviour is refusal. That works until the attacker stops arguing with the prompt and starts interfering with generation itself. The paper behind this article, “Strategic Deflection: Defending LLMs from Logit Manipulation,” proposes SDeflection: a fine-tuning method that teaches a model to answer in a safe, topic-adjacent way rather than relying only on explicit refusal language [1]. The model does not provide harmful instructions. It redirects the subject toward harmless information that is close enough to the original topic to survive attacks that try to force compliance-style openings. ...