Jailbreaks

Mind the Slot: Jailbreak Prompts Have Weak Points, Not Just Bad Words

Security teams like to search for suspicious strings. That habit is understandable. Strings are visible. They can be logged, filtered, matched, scored, and proudly displayed in dashboards. A bad suffix at the end of a prompt looks like a bad suffix at the end of a prompt. Convenient. Almost too convenient. The problem is that prompts are not flat text boxes. They are transformed into token sequences, wrapped in chat templates, and passed through attention layers that do not treat every position equally. Some positions receive more influence over the model’s next-token behavior than others. Put adversarial tokens there, and the same amount of “badness” can travel farther. ...

Jailbreak Risk Needs a Stopwatch, Not Just a Scorecard

Jailbreak Risk Needs a Stopwatch, Not Just a Scorecard For many organizations, LLM safety is still treated like a checkpoint: run a benchmark, report an attack success rate, add a few guardrails, and move on. The resulting dashboard looks reassuringly official. It may even have decimals. Unfortunately, adversarial users do not attack dashboards. They attack systems. ...

Jailbreak ASR Is Wearing a Costume

The number looked safe. Then someone ran it twice. A familiar business problem: one vendor says its model resists jailbreaks. Another red-team report says a new attack reaches a spectacular Attack Success Rate. A compliance team sees a percentage, puts it into a risk register, and moves on. Unfortunately, that percentage may be doing more acting than measuring. ...

Red Queen Receipts: AI Security Testing Needs Logs, Not Vibes

Security testing is not a screenshot. A model gives a dangerous answer. Someone posts the transcript. A vendor says the model has been updated. A consultant turns the incident into a slide titled “AI risk is real.” Everyone nods gravely. Very mature. Very enterprise. The harder question is less theatrical: can the same vulnerability be tested again, under controlled conditions, with visible logs, a consistent evaluator, repeatable statistics, and enough human inspection to make the result defensible? ...

Context Is the New Attack Surface

A benchmark score is easy to quote. It is harder to know what broke. In Jailbreak Mimicry: Automated Discovery of Narrative-Based Jailbreaks for Large Language Models, Pavlos Ntais reports an 81.0% attack success rate against GPT-OSS-20B on a held-out 200-item test set.1 That number is attention-grabbing. It is also not the main lesson. ...

Jailbreak and Enter: Why LLM Security Needs a Cube, Not a Scoreboard

Opening — Why this matters now The AI industry has spent the last two years teaching executives a strangely comforting phrase: “the model refused.” That phrase is now dangerously inadequate. A refusal is not a security architecture. It is a behavioral outcome under one prompt, one context window, one model version, one judge, and one assumption about what the attacker is trying to do. Change any of those variables and the safety story can change. Sometimes gently. Sometimes like a glass door discovering what gravity does. ...

CRaFT and the Illusion of Safety: When ‘Sorry’ Is Just a Circuit

A refusal is easy to recognize. The model says it cannot help. The sentence sounds polite. The compliance team relaxes for three seconds. Everyone moves on. That is the comfortable version of AI safety: refusal as an observable behavior. The uncomfortable version is that refusal may be only the visible end of a much narrower internal computation. If that computation can be found, isolated, and steered, then the model’s “sorry, I can’t assist with that” is not a moral boundary. It is a circuit behavior. Very reassuring, in the same way a locked glass door is reassuring before someone points out the hinge. ...

Lost in Translation: When Safety Contracts Collapse Across 2.1 Billion Voices

A chatbot walks into a multilingual market Imagine a bank, hospital, telecom platform, or public-service chatbot being rolled out across South Asia. The model has passed English safety tests. It refuses harmful requests in structured evaluation. Its vendor dashboard looks reassuring. The compliance team exhales. Then users arrive. They do not all write in English. They do not all use one script. They mix Hindi and English, write Urdu in Latin letters, switch between native script and romanization, and ask ordinary questions wrapped in messy instructions. In other words, they behave like real users, which is always inconvenient for benchmark design. ...

Prompt, Probe, Persist: How Multi‑Turn RL Is Rewriting the Jailbreak Playbook

A chatbot rarely fails all at once. In production, failure is usually more boring than cinema. A user asks something borderline. The model refuses. The user rephrases. The model gives a harmless explanation. The user narrows the topic. The model follows the conversation. Then, several turns later, the assistant provides content it should not have provided. No thunder. No villain monologue. Just an interaction history doing what interaction histories do: accumulating context. ...

Refusal, Rewired: Why One Safety Direction Isn’t Enough

Safety teams like switches. They are easy to name, easy to diagram, and easy to pretend are under control. For language models, “refusal” has often been treated with roughly that mental model. A harmful prompt enters. Somewhere inside the model, a refusal feature lights up. The model says no. If researchers can identify the feature, they can study it, steer it, strengthen it, or—less comfortably—remove it. ...