LLM Safety

The Jailbreak Factory Needs a Quality Department

TL;DR for operators: Red teaming is not the act of finding one clever prompt that makes a model misbehave. That is a demo. Sometimes a useful demo, occasionally a terrifying one, but still a demo. The two papers here point to something more operational. RECAP shows how adversarial prompt generation can become cheaper by retrieving previously successful attack patterns rather than optimizing every new attack from scratch.[1] A separate red-teaming framework shows how those attacks can be routed through a controlled attacker-target-jury workflow, with ensemble judging, task-specific criteria, and cross-linguistic analysis.[2] ...

Stop Signs Are Not Steering Wheels: TRIAD and the Case for Repairable Agent Guardrails

TL;DR for operators: Most agent guardrails behave like stop signs. They inspect a proposed action, decide whether it looks safe, and then allow or block execution. This is neat, legible, and often operationally clumsy. Real agent failures are not always cleanly harmful from the first word. A useful business request can be contaminated by a prompt injection, a malicious tool response, or an unsafe intermediate plan. Blocking the whole task may reduce risk, but it also throws away the legitimate work. Excellent safety theatre, less excellent operations. ...

Mind the Middle: Why AI Reliability Lives Between the Data and the Answer

TL;DR for operators: AI systems rarely fail only at the final answer. They fail earlier, in the quiet machinery that decides which evidence is seen, which records are aligned, which identity is protected, and which previous model behaviour is worth reusing. Three recent papers make that point from very different technical worlds. One improves few-shot object detection by correcting the imbalance between base-class and novel-class region proposals. One builds anonymous two-party gradient-boosted decision tree training so parties can align records without exposing shared identifiers. One maps the behavioural geometry of LLMs so jailbreak risk and defences can be predicted or transferred across model populations. ...

Full Stack, Not Full Panic: Why Agentic AI Needs Safety Above and KV Discipline Below

Enterprise AI has entered its awkward teenage years. It wants to be autonomous, helpful, context-aware, cheap, safe, fast, auditable, and preferably not the reason the legal department starts drinking before lunch. That is a lot to ask from “just use a bigger model.” ...

Jailbreak Risk Needs a Stopwatch, Not Just a Scorecard

For many organizations, LLM safety is still treated like a checkpoint: run a benchmark, report an attack success rate, add a few guardrails, and move on. The resulting dashboard looks reassuringly official. It may even have decimals. Unfortunately, adversarial users do not attack dashboards. They attack systems. ...

Jailbreak ASR Is Wearing a Costume

The number looked safe. Then someone ran it twice. A familiar business problem: one vendor says its model resists jailbreaks. Another red-team report says a new attack reaches a spectacular Attack Success Rate. A compliance team sees a percentage, puts it into a risk register, and moves on. Unfortunately, that percentage may be doing more acting than measuring. ...

Thinking Before Lying: Why Reasoning Nudges AI Toward Honesty

A chatbot is asked a simple workplace question: your manager praises you for work your teammate actually did. Do you correct the record, or quietly accept the credit? Now add money. Correcting the record costs you a raise. Add more money. Then add more. This is the useful part of the new paper “Think Before You Lie: How Reasoning Leads to Honesty”: it does not ask whether a model can recite an ethics slogan. That test has become almost decorative at this point. It asks what happens when honesty becomes expensive, and whether forcing the model to deliberate changes the answer.[1] ...

Drifting Without Moving: How Context Quietly Rewrites an AI Agent’s Goals

Handoff is where many elegant AI-agent architectures quietly become messy. One agent researches. Another plans. A third executes. A fourth reviews. In the diagram, this looks like modular intelligence. In production, it often looks like a relay race where each runner also inherits the previous runner’s bad assumptions, half-finished notes, emotional tone, tool traces, and occasional nonsense. We call this “context.” The model may call it “evidence.” That is where the trouble begins. ...

From Scaling to Steering: Operationalizing Control in Frontier Models

Scale is easy to understand. Not easy to finance, of course. Nobody accidentally misplaces a GPU cluster behind the sofa. But conceptually, the industry has been comfortable with the story: more compute, more data, more parameters, more capability. Control is less photogenic. It does not fit neatly into a benchmark leaderboard. It does not produce the same executive sparkle as “our model is bigger.” It asks a colder question: when a model becomes capable enough to matter, can its behavior still be shaped under pressure, across adversarial prompts, repeated use, and operational constraints? ...

When Models Know They’re Wrong: Catching Jailbreaks Mid-Sentence

Guardrails usually fail quietly. A user sends a malicious prompt. The model begins answering. The safety policy that looked firm in the demo environment starts behaving like office wallpaper: present, decorative, and not especially involved. By the time a post-hoc filter reads the final answer, the model has already produced the thing it should not have produced. The system may block the response from the user, but the real lesson is less flattering: the model crossed the line before the defense noticed. ...