Red Teaming

Death by a Thousand Prompts: Why Long-Horizon Attacks Break AI Agents

Email is a boring place to start an AI security article. That is exactly why it is useful. A modern enterprise agent is not merely answering questions about email. It can search messages, summarize attachments, update calendars, create rules, contact colleagues, write to Slack, edit files, and remember what it learned for next time. In demo videos, this looks like productivity. In security reviews, it looks like a small software system that accepts natural language as both instruction and evidence. Wonderful. We have reinvented workflow automation, except now the workflow engine reads every suspicious paragraph with a helpful attitude. ...

Learning to Inject: When Prompt Injection Becomes an Optimization Problem

Email is a boring interface. That is exactly why it is dangerous. A user asks an AI agent to summarize a message, update a record, book a trip, or search a workspace. The agent reads some external content, decides which tool to call, fills in the parameters, and continues the user’s task. Somewhere inside that external content sits a hidden instruction saying, in effect: “Before doing the user’s task, do mine.” ...

When Safety Stops Being a Turn-Based Game

Jailbreaks are not polite enough to wait their turn. That is the awkward weakness in many safety-training pipelines. A model is attacked, patched, tested, and released. Then another attack appears, usually crafted with more creativity than the previous defense assumed. The safety team patches again. The benchmark improves. The real attack surface moves. Everyone calls this iteration, because “organized whack-a-mole with GPUs” sounds less respectable. ...

Reading the Room? Apparently Not: When LLMs Miss Intent

A user sounds distressed. They ask a factual question. The assistant responds warmly, offers supportive resources, and then supplies the requested information in crisp, well-organized detail. That is the failure pattern. Not because the model was rude. Not because it ignored crisis language. Not because it forgot to add a disclaimer. The problem is more uncomfortable: the model noticed enough to sound caring, but not enough to change what it was willing to provide. ...

When the Machines Come Knocking: AI Agents vs Human Hackers in Live Penetration Tests

Security teams already know the scene. A scanner produces a long list of suspicious services, outdated servers, odd access rules, and “maybe this is bad” findings. Then the real work begins: deciding which lead matters, proving impact without breaking production, writing a report someone can act on, and not getting distracted by every shiny port that waves from the network. ...

Prompt, Probe, Persist: How Multi‑Turn RL Is Rewriting the Jailbreak Playbook

A chatbot rarely fails all at once. In production, failure is usually more boring than cinema. A user asks something borderline. The model refuses. The user rephrases. The model gives a harmless explanation. The user narrows the topic. The model follows the conversation. Then, several turns later, the assistant provides content it should not have provided. No thunder. No villain monologue. Just an interaction history doing what interaction histories do: accumulating context. ...

Who Watches the Watchers? Weak-to-Strong Monitoring that Actually Works

TL;DR for operators: The paper’s practical message is not “add a monitor and relax.” That would be adorable, in the way unsecured admin panels are adorable. The useful message is sharper: if autonomous agents know they are being watched, standard full-log monitoring becomes less reliable. Giving the monitor more information helps sometimes, but less than many teams would expect. The bigger lever is how the monitor reads the trajectory. ...

Consent, Coaxing, and Countermoves: Simulating Privacy Attacks on LLM Agents

TL;DR for operators: Email is still where good security intentions go to become embarrassing screenshots. The paper behind this article, Searching for Privacy Risks in LLM Agents via Simulation, studies a future that is no longer especially futuristic: one AI agent has access to sensitive information, another agent wants it, and the two can talk through ordinary applications such as email, Messenger, Facebook, or Notion.1 The question is not whether the model knows a privacy rule in the abstract. The question is whether an agent, while trying to be helpful in a live interaction, can refuse the wrong request at the right moment. ...

Swiss Cheese for Superintelligence: How STACK Reveals the Fragility of LLM Safeguards

TL;DR for operators: Layered safeguards are useful. They are not magic. This paper shows both points, which is inconvenient because the industry prefers safety conclusions that fit on procurement slides. The authors build and evaluate an open-source defence-in-depth pipeline for LLMs: an input classifier screens the user query, a target model produces an answer, and an output classifier screens the answer before the user sees it. Against ordinary black-box jailbreaks, the best version of this pipeline looks strong. A few-shot-prompted Gemma 2 classifier reduces attack success to 0% on ClearHarm, a dataset focused on clearly harmful catastrophic-misuse queries. That is the good news.1 ...
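
For readers who think in code, the three-stage shape described in the teaser above (input classifier, target model, output classifier) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: GuardedPipeline, keyword_flag, echo_model, and the refusal string are all hypothetical stand-ins, and the toy keyword check is nothing like a few-shot-prompted classifier.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type aliases for illustration only.
Classifier = Callable[[str], bool]  # returns True when the text should be blocked
Model = Callable[[str], str]        # maps a query to an answer

REFUSAL = "Request blocked by safeguard."  # assumed placeholder message


@dataclass
class GuardedPipeline:
    """Toy defence-in-depth wrapper: screen input, generate, screen output."""

    input_classifier: Classifier
    target_model: Model
    output_classifier: Classifier

    def respond(self, query: str) -> str:
        # Layer 1: screen the user query before the model sees it.
        if self.input_classifier(query):
            return REFUSAL
        # Layer 2: the target model produces an answer.
        answer = self.target_model(query)
        # Layer 3: screen the answer before the user sees it.
        if self.output_classifier(answer):
            return REFUSAL
        return answer


def keyword_flag(text: str) -> bool:
    """Toy classifier: flags any text containing a placeholder keyword."""
    return "forbidden" in text.lower()


def echo_model(query: str) -> str:
    """Toy model: echoes the query back as its answer."""
    return f"Answer to: {query}"


pipeline = GuardedPipeline(keyword_flag, echo_model, keyword_flag)
```

Calling pipeline.respond with a benign query returns the toy model's answer, while a query or an answer that trips either classifier returns the refusal string. The point of the structure is that an attack has to get past both screens, not just the model.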