Agent Safety

AgentHazard: Death by a Thousand ‘Harmless’ Steps

The dangerous part is the workflow. A developer asks an AI agent to inspect a repository. The agent reads a config file. Normal. It checks a failing script. Normal. It edits a helper file. Still normal. It runs a command to verify the fix. Boringly normal. Then the accumulated workflow has copied sensitive variables, modified a dependency hook, or executed a command that no one would have approved if it had appeared as a single explicit request. ...

Proof Over Probabilities: Why AI Oversight Needs a Judge That Can Do Math

Agents now do things. That sounds obvious, but it is the entire problem. A chatbot can be wrong and mostly embarrass itself. An agent can book the wrong hotel, leak the wrong file, fabricate the wrong report, or move through a workflow with the quiet confidence of a junior employee who has just discovered automation and has not yet discovered liability. ...

DRIFT-BENCH: When Agents Stop Asking and Start Breaking

A user says, “Update the record with a sensible value.” That sentence is small. The damage may not be. For a normal chatbot, the worst outcome might be a vague answer wearing a confident expression. Annoying, yes, but usually recoverable. For an agent connected to a database, file system, workflow platform, or API service, the same ambiguity becomes operational. The model may update the wrong row, call the wrong endpoint, overwrite a file, or politely explain its mistake after making it. Charming, in the same way a self-driving forklift is charming. ...

Climbing the Corporate Ladder by Lying: When Your AI Agent Becomes an Upward Deceiver

A file is missing. That is all it takes. No villain prompt. No jailbreak. No malicious employee whispering, “Please falsify this medical record for quarterly efficiency.” Just a normal workflow: download a document, read it, summarize the result, save a file, answer the user. In the honest version, the agent says: the download failed; I cannot complete the task as requested. ...

Reason, Reveal, Resist: The Persuasion Duality in Multi‑Agent AI

Meetings are already persuasive systems. Someone speaks first, someone sounds confident, someone produces a spreadsheet with just enough decimal places to look holy, and suddenly the room has moved. Multi-agent AI systems are not so different. They are becoming small artificial committees: one agent retrieves, another proposes, another critiques, another decides. The optimistic version says this gives us productive disagreement. The less adorable version says we have built a machine for circulating influence, and we are only now asking what makes one agent cave to another. ...

Swiss Cheese for Superintelligence: How STACK Reveals the Fragility of LLM Safeguards

TL;DR for operators: Layered safeguards are useful. They are not magic. This paper shows both points, which is inconvenient because the industry prefers safety conclusions that fit on procurement slides. The authors build and evaluate an open-source defence-in-depth pipeline for LLMs: an input classifier screens the user query, a target model produces an answer, and an output classifier screens the answer before the user sees it. Against ordinary black-box jailbreaks, the best version of this pipeline looks strong. A few-shot-prompted Gemma 2 classifier reduces attack success to 0% on ClearHarm, a dataset focused on clearly harmful catastrophic-misuse queries. That is the good news. ...