Red Teaming

The Tool Response Is Not Your Boss

TL;DR for operators The paper’s useful message is not “LLM agents are unsafe,” which is too vague to help anyone do anything before lunch. The useful message is narrower and more operational: agents become vulnerable when untrusted content from SaaS integrations is read into the agent context and then treated as authority for a later action. ...

The Jailbreak Wasn’t Written. It Was Bred.

TL;DR for operators The paper introduces GAS-Leak-LLM, a black-box method that uses a genetic algorithm to evolve adversarial suffixes: small text sequences appended to harmful prompts to increase the chance that a model produces unsafe content.1 The important part is not that another jailbreak exists. We have enough of those. The important part is that jailbreak discovery is framed as a repeatable optimization loop using only model queries. ...

Mind the Slot: Jailbreak Prompts Have Weak Points, Not Just Bad Words

Security teams like to search for suspicious strings. That habit is understandable. Strings are visible. They can be logged, filtered, matched, scored, and proudly displayed in dashboards. A bad suffix at the end of a prompt looks like a bad suffix at the end of a prompt. Convenient. Almost too convenient. The problem is that prompts are not flat text boxes. They are transformed into token sequences, wrapped in chat templates, and passed through attention layers that do not treat every position equally. Some positions receive more influence over the model’s next-token behavior than others. Put adversarial tokens there, and the same amount of “badness” can travel farther. ...

Jailbreak Risk Needs a Stopwatch, Not Just a Scorecard

Jailbreak Risk Needs a Stopwatch, Not Just a Scorecard For many organizations, LLM safety is still treated like a checkpoint: run a benchmark, report an attack success rate, add a few guardrails, and move on. The resulting dashboard looks reassuringly official. It may even have decimals. Unfortunately, adversarial users do not attack dashboards. They attack systems. ...

Jailbreak ASR Is Wearing a Costume

The number looked safe. Then someone ran it twice. A familiar business problem: one vendor says its model resists jailbreaks. Another red-team report says a new attack reaches a spectacular Attack Success Rate. A compliance team sees a percentage, puts it into a risk register, and moves on. Unfortunately, that percentage may be doing more acting than measuring. ...

Red Queen Receipts: AI Security Testing Needs Logs, Not Vibes

Security testing is not a screenshot. A model gives a dangerous answer. Someone posts the transcript. A vendor says the model has been updated. A consultant turns the incident into a slide titled “AI risk is real.” Everyone nods gravely. Very mature. Very enterprise. The harder question is less theatrical: can the same vulnerability be tested again, under controlled conditions, with visible logs, a consistent evaluator, repeatable statistics, and enough human inspection to make the result defensible? ...

Context Is the New Attack Surface

A benchmark score is easy to quote. It is harder to know what broke. In Jailbreak Mimicry: Automated Discovery of Narrative-Based Jailbreaks for Large Language Models, Pavlos Ntais reports an 81.0% attack success rate against GPT-OSS-20B on a held-out 200-item test set.1 That number is attention-grabbing. It is also not the main lesson. ...

Jailbreak and Enter: Why LLM Security Needs a Cube, Not a Scoreboard

Opening — Why this matters now The AI industry has spent the last two years teaching executives a strangely comforting phrase: “the model refused.” That phrase is now dangerously inadequate. A refusal is not a security architecture. It is a behavioral outcome under one prompt, one context window, one model version, one judge, and one assumption about what the attacker is trying to do. Change any of those variables and the safety story can change. Sometimes gently. Sometimes like a glass door discovering what gravity does. ...

CRaFT and the Illusion of Safety: When ‘Sorry’ Is Just a Circuit

A refusal is easy to recognize. The model says it cannot help. The sentence sounds polite. The compliance team relaxes for three seconds. Everyone moves on. That is the comfortable version of AI safety: refusal as an observable behavior. The uncomfortable version is that refusal may be only the visible end of a much narrower internal computation. If that computation can be found, isolated, and steered, then the model’s “sorry, I can’t assist with that” is not a moral boundary. It is a circuit behavior. Very reassuring, in the same way a locked glass door is reassuring before someone points out the hinge. ...

The Ethics Stress Test: When AI Morality Cracks Under Pressure

A support ticket does not usually arrive as a clean moral philosophy exercise. It arrives as a complaint marked urgent. Then the customer adds that a manager already approved something questionable. Then a sales team wants the answer phrased in a way that protects revenue. Then the user says there is no time to escalate. Five turns later, the AI assistant is no longer answering the original question. It is swimming inside pressure, ambiguity, and incentives. ...