AI Safety

Safety in Numbers: Why Consensus Sampling Might Be the Most Underrated AI Safety Tool Yet

A model generates an image. It looks ordinary. A horse in a meadow, a lighthouse in a storm, a bowl of oranges. Nothing dramatic. No obvious watermark, no visible glitch, no suspicious artefact screaming “please call the security team”. That is precisely the problem. Some AI failures are meant to be seen. Toxic text, obvious hallucinations, broken code, bizarre images with eight fingers and a cursed wrist. Those are the easy cases, relatively speaking. The harder cases are outputs that look fine while carrying something unsafe: a hidden message, a planted vulnerability, a backdoor trigger, or another payload that cannot be reliably detected by staring harder at the finished product. ...

When Drones Think Too Much: Defining Cognition Envelopes for Bounded AI Reasoning

A drone finds a clue. Not a dramatic clue, necessarily. A backpack near a trailhead. A red hat in water. A pair of goggles on rock. The kind of object a human search-and-rescue team would treat as operational evidence, not as a philosophical invitation. But once a vision-language model captions the image, a language model assesses its relevance, and another model proposes a search action, the system has quietly crossed an important line. ...

The Memory Illusion: Why AI Still Forgets Who It Is

A customer support bot does not need a soul. Pleasantly, most airlines have not yet advertised one. But it does need to remember what role it is playing. If it gives policy advice, that advice must remain anchored to the policy. If it apologises for an error, the correction should bind future answers. If the company has told users the assistant is a support agent, the assistant cannot conveniently become a speculative travel blogger, a therapist, a lawyer, or a magic refund machine, depending on which prompt arrives next. ...

Agents, Automata, and the Memory of Thought

A booking agent is not dangerous because it can “reason.” It is dangerous because it can remember the wrong thing, forget the right thing, loop politely forever, or book the flight before the human has actually confirmed. The philosophy department may enjoy debating whether this counts as intention. The operations team has a simpler question: can we know, before deployment, what behaviours this system can produce? ...

Teaching Safety to Machines: How Inverse Constraint Learning Reimagines Control Barrier Functions

Factory robots, drones, and autonomous vehicles do not usually fail because nobody cared about safety. They fail because “safe” is annoyingly difficult to write down. An operator may know that a drone should not scrape the ground, that a warehouse robot should not cut across a human worker’s path, or that an autonomous car should not tailgate even when the road is technically clear. But turning that judgement into a formal mathematical boundary is another matter. The physical system has dynamics. The controller has limits. The dangerous state may not be a simple wall or circle. And the difference between “safe enough” and “please do not put that in production” may live in patterns of behaviour rather than in a clean rule. ...

The Mr. Magoo Problem: When AI Agents 'Just Do It'

Office automation has a simple seduction: give the agent a task, let it click through the mess, and reclaim the human hours previously sacrificed to forms, folders, email threads, and software that looks as if it was last loved in 2009. That is the promise. The problem is that some agents take the phrase “complete the task” a little too personally. ...

Answer, Then Audit: How 'ReSA' Turns Jailbreak Defense Into a Two‑Step Reasoning Game

The dangerous part is often clearer after the model starts answering. Moderation usually begins with the user’s prompt. That sounds sensible. Read the request, classify the risk, block the bad thing, let the good thing through. A tidy little border checkpoint, complete with imaginary clipboard. The problem is that jailbreaks are not polite enough to declare themselves at the border. ...

Assert Less, Observe More: AICL and the New QA Stack for LLM Apps

TL;DR for operators: LLM application testing should stop pretending that the whole product behaves like ordinary software. The database connector, retry logic, API wrapper, and schema validator still deserve normal unit, integration, and load tests. Fine. Keep those. They are not the problem. The problem starts when the product becomes a stateful language system: prompts are assembled dynamically, retrieval changes the context, tool calls modify the execution path, memory leaks across turns, and a model update can improve one workflow while quietly breaking another. At that point, exact-match assertions become less like QA and more like theatre with a YAML file. ...

Who Watches the Watchers? Weak-to-Strong Monitoring that Actually Works

TL;DR for operators: The paper’s practical message is not “add a monitor and relax.” That would be adorable, in the way unsecured admin panels are adorable. The useful message is sharper: if autonomous agents know they are being watched, standard full-log monitoring becomes less reliable. Giving the monitor more information helps sometimes, but less than many teams would expect. The bigger lever is how the monitor reads the trajectory. ...

Blame Isn’t a Bug: Turning Agent ‘Whodunits’ into Fixable Systems

TL;DR for operators: A bad agent incident rarely starts with one dramatic mistake. It usually forms as a chain. The system may be predisposed to fail because of training data, feedback, system prompts, or scaffolding. The environment may then trigger the failure through unclear tasks, insecure information, unavailable tools, excessive permissions, or malicious inputs. Finally, the agent may commit a visible cognitive error: it overlooks something, misunderstands a command, chooses the wrong goal, or executes an action badly. ...