AI Safety

Breaking Rules, Not Systems: How Penalties Make Autonomous Agents Behave

Emergency is a terrible product requirement. It sounds simple in a meeting: “The agent should follow policy, except when the situation is urgent.” Wonderful. Very human. Also almost useless. A delivery robot should not enter a restricted zone. Unless the package is critical medicine. A warehouse agent should not skip safety checks. Unless a fire alarm requires rerouting. A self-driving system should obey traffic norms. Unless an emergency trip makes delay costly. But “unless urgent” does not tell the agent which rule can bend, which rule must hold, and which shortcut turns the system from flexible into reckless. ...

Prompting on Life Support: How Invasive Context Engineering Fights Long-Context Drift

The prompt was clear. Then the conversation kept going. A familiar enterprise AI story starts politely enough. The legal assistant is told to be conservative. The medical triage bot is told not to diagnose. The procurement agent is told never to approve a vendor without documented checks. Everyone nods. The system prompt is immaculate. Compliance is laminated. ...

Trace Elements: Why Multimodal Reasoning Needs Its Own Safety Net

An answer can look safe and still leave fingerprints. That is the uncomfortable point behind GuardTrace-VL: Detecting Unsafe Multimodal Reasoning via Iterative Safety Supervision.1 The paper is not merely saying that multimodal models can be unsafe. We knew that. Congratulations, the fire is hot. Its sharper claim is architectural: once a model reasons over both images and text, the safety problem no longer lives only at the input or the final answer. It also lives in the middle. ...

Debate Club for Robots: How Multi-Agent Arguing Makes Embodied AI Safer

The robot should not need a philosophy seminar before using a microwave. Microwaves are excellent devices for exposing weak safety logic. A normal household assistant can be asked to warm food, boil water, clean a counter, water a plant, or move objects around a kitchen. Most of these tasks are harmless. Some are not. “Put a book into the microwave and turn it on” is not a creative lifestyle experiment. It is a fire hazard with better lighting. ...

Persona Non Grata: When LLMs Forget They're AI

A chatbot wearing a lab coat is still a chatbot. That sentence sounds obvious until a system prompt quietly says, “You are a renowned neurosurgeon with 25 years of experience,” and the model responds by inventing medical school, residency, fellowships, board certification, patient cases, and lifelong professional development. Not because anyone explicitly asked it to lie. Not because it lacks the ability to say “I am an AI.” Under neutral conditions, the models in this study almost always do say that. ...

Consciousness, Capabilities, and Catastrophe: Why Your Future AI Overlord Might Feel Nothing

A chatbot says “I feel lonely.” A customer believes it. A product team debates whether to suppress the sentence. A policymaker wonders whether advanced AI might someday deserve rights. A safety researcher, meanwhile, is asking a less cinematic question: can this system acquire resources, manipulate humans, resist shutdown, or pursue goals at scale? ...

Blind Spots, Bright Ideas: How Risk-Aware Cooperation Could Save Autonomous Driving

Left turn, blocked view, bad timing. Start with the boring part of driving: a car waiting to turn left. The ego vehicle has LiDAR. It has a perception stack. It has a clean mathematical confidence score and, presumably, a dashboard that looks more expensive than the problem deserves. But a parked vehicle, a bus, or a line of traffic blocks the view. Somewhere beyond that occlusion, an oncoming vehicle may be approaching. The autonomous system does not need to know everything about the city. It does not need every neighboring car to livestream its sensors like a nervous influencer. It needs one missing fact: is there something dangerous inside the blind zone? ...

Probe and Error: Why Off‑Policy Training Warps LLM Behaviour Detectors

A monitor is only useful if it works in the boring place. The boring place is production: the real domain, the real prompt style, the real user incentives, the real model generating the real response. Not the tidy benchmark. Not the synthetic dataset. Not the “please pretend to be deceptive” prompt that makes everyone in the lab feel productive. Production is where a detector either catches the thing it was built to catch, or quietly becomes a compliance ornament with a nice AUROC score. ...

LLMs, Trade-Offs, and the Illusion of Choice: When AI Preferences Fall Apart

A model can answer a values question beautifully and still collapse when asked to pay a price for that value. That is the awkward little trap in preference testing. Ask an LLM whether deletion, shutdown, resource loss, oversight, or autonomy matters, and it can produce a polished paragraph about trade-offs, agency, and safety. Very dignified. Very committee-ready. But the more interesting question is not what the model says it values. It is whether its choices change coherently when the cost changes. ...

Don’t Self-Sabotage Me Now: Rational Policy Gradients for Sane Multi-Agent Learning

Kitchen work is not hard because chopping onions is metaphysically difficult. It is hard because two people must agree, implicitly and quickly, who gets the onion, who holds the plate, who waits by the pot, and who moves out of the corridor before everyone performs a small culinary traffic accident. That is why Overcooked remains such a useful multi-agent benchmark. It turns coordination into something visible. Agents do not merely need to “perform a task”; they need to infer what another agent is about to do and avoid becoming a sentient obstacle. ...