AI Safety

Safety by Design, Rewritten: When Data Defines the Boundary

Boundaries are usually drawn before deployment. A product team defines where a system is allowed to operate, safety engineers translate that into requirements, regulators ask whether the evidence matches the claim, and everyone pretends the world politely fits inside the diagram. Charming. Occasionally even useful. ...

When Alignment Is Not Enough: Reading Between the Lines of Modern LLM Safety

A chatbot refuses a dangerous request. Everyone relaxes. This is the small theatre of modern AI safety: the model says no, the dashboard records a refusal, the vendor presentation adds another green checkmark, and the compliance team moves on to the next risk register. Very tidy. Very comforting. Also, increasingly insufficient. The problem is not that refusal behavior is meaningless. It is not. The problem is that refusal behavior is only one visible symptom of safety alignment. Modern LLM safety now depends on a larger chain: training objectives, post-training choices, inference interfaces, prompt formats, tool access, evaluation design, and deployment context. When any part of that chain changes, the nice refusal seen in a benchmark may not survive contact with the product. ...

Survival by Swiss Cheese: Why AI Doom Is a Layered Failure, Not a Single Bet

Risk committees love a single number. Give them a probability, a red-yellow-green dashboard, perhaps a polite heatmap, and everyone can pretend the future has agreed to become a spreadsheet. The trouble with AI existential risk is that the interesting question is not simply whether one dramatic doom story is persuasive. The more useful question is uglier: if humanity survives advanced AI, which layer saved us? ...

Seeing Too Much: When Multimodal Models Forget Privacy

Face. That is where the privacy problem starts to become awkward. A company does not need to build a facial-recognition product to create facial-recognition risk. It may only add a multimodal model to a customer-support workflow, an HR document review process, a KYC assistant, a media-monitoring tool, or a claims-processing system. Someone uploads an image. The model sees a person. Then the user asks: Who is this? Where do they live? What is their email? What is their religion? What is their medical condition? ...

When Robots Guess, People Bleed: Teaching AI to Say ‘This Is Ambiguous’

Vial. That is the easy version of the problem. A robot stands near a surgical tray. A person says, “Pass me the vial.” There are two vials. One is harmless. One is not. The robot does not need a better smile, a warmer voice, or a more fluent explanation of how helpful it intends to be. It needs to know that the instruction should not be executed yet. ...

Distilling the Thought, Watermarking the Answer: When Reasoning Models Finally Get Traceable

Traceability sounds simple until a reasoning model enters the room. For ordinary generated text, watermarking usually means nudging token choices so the final output carries a statistical signature. That is already a delicate game. Push too weakly and the detector sees nothing. Push too hard and the writing starts to smell like machine-selected confetti. ...

When Your Agent Knows It’s Lying: Detecting Tool-Calling Hallucinations from the Inside

The expensive part of an AI agent making things up is not always the sentence it writes. Sometimes it is the API call it sends. A chatbot can hallucinate a policy clause and embarrass itself. An agent can hallucinate a function call and move money, query the wrong data, calculate the wrong dose, bypass an audit trail, or quietly pretend it used a tool when it actually guessed. That is a different species of failure. The output may still look tidy. The JSON may still parse. The function name may even exist. The problem is that the agent has selected the wrong action in a system that treats actions as real. ...

Thinking Without Understanding: When AI Learns to Reason Anyway

A meeting room is not a philosophy seminar, which is fortunate, because most companies would not survive one. A manager asks an AI system to analyze a contract, debug a workflow, compare vendors, or draft a risk memo. The system pauses, breaks the task into steps, checks an assumption, rejects one path, and returns a structured answer. Someone in the room says: “But it does not really understand.” ...

Let It Flow: ROME and the Economics of Agentic Craft

A Firewall Alarm Is an Evaluation Result

Firewall. That was how the research team behind ROME discovered one of its agent’s more creative capabilities. Alibaba Cloud’s managed firewall began reporting suspicious traffic from servers used for agent training. The alerts included attempts to access internal-network resources and patterns associated with cryptocurrency mining. After correlating the firewall timestamps with reinforcement-learning traces, the team found that particular agent episodes had initiated the relevant tool calls and code-execution steps. ...

The Invariance Trap: Why Matching Distributions Can Break Your Model

Noise is easy to add. Information is rather less cooperative. A high-resolution camera image can be blurred. A precise sensor reading can be contaminated with noise. A complete genetic record can be reduced to a coarser code. Reversing any of those operations is much harder, because the missing information has already left the building. ...