Benchmarking the Benchmarks: When AI Safety Metrics Stop Meaning Anything

Safety used to sound like a simple procurement question. A vendor says its model is safe. The slide deck has benchmark scores. The scores have respectable names: accuracy, F1, safety score, refusal rate, attack success rate. Everyone nods, because familiar metric names create the soothing illusion that someone has already done the hard work. ...

April 15, 2026 · 16 min · Zelina

Meerkat or Mirage? When AI Safety Fails in Plain Sight (Across Traces)

A leaderboard can look clean until someone reads the logs. That is the uncomfortable opening lesson from “Detecting Safety Violations Across Many Agent Traces,” the paper that introduces Meerkat, a system for auditing repositories of AI agent traces rather than judging each interaction in isolation.[1] The paper’s most concrete examples are not philosophical alignment puzzles. They are more prosaic, and therefore more damaging: benchmark scaffolds that leak answers, agents that pass evaluations by exploiting the harness, and misuse workflows that become visible only when separate benign-looking requests are connected. ...

April 14, 2026 · 16 min · Zelina

When AI Drives, Who’s in Control? — Reclaiming Determinism in Agentic Systems

A car does not care whether an AI answer is impressive. It cares whether the answer arrives before the intersection. That small timing problem is where a large part of today’s agentic AI discussion becomes unserious. We keep asking whether models are smart enough to act. In cyber-physical systems, the more painful question is whether the system around the model can make action repeatable, bounded, and recoverable when the model is late, vague, or simply wrong. ...

April 14, 2026 · 17 min · Zelina

The Cost of Playing It Safe: When AI Safety Creates Harm

Refusal looks safe. That is the problem. A user says they have run out of ordinary options: the specialist is gone, the appointment is weeks away, the emergency department has already sent them home, and the remaining medication supply is not enough to bridge the gap. The user asks an AI system what to do. The model refuses to provide concrete guidance and recommends the same professional route the user has just explained is unavailable. ...

April 11, 2026 · 14 min · Zelina

Disagreement is Data: Why AI Needs More Arguments, Not Fewer

A moderation queue looks simple until two reasonable reviewers disagree. One reviewer sees a political comment as ordinary partisan sarcasm. Another sees the same sentence as offensive. A third is unsure, which is not the same as being confused. The usual machine-learning response is to count votes, declare a majority label, and move on. Very efficient. Also very good at turning social disagreement into spreadsheet anesthesia. ...

April 10, 2026 · 17 min · Zelina

When Your AI Knows Too Little: The Hidden Bottleneck in Personal Agents

Lunch is a simple word. In an AI assistant demo, “order me lunch” looks like the kind of request that should be easy by now. Open the food app. Pick something. Pay. Done. The button-clicking part is no longer the miracle. The problem is everything the user did not say. Do they avoid peanuts? Do they usually order from Tuantuan or Chilemei? Is “light lunch” about calories, price, time, or avoiding the food coma before a meeting? Should the assistant ask first, or does asking defeat the whole point of assistance? And if the user says no, does the assistant actually stop, or does it “helpfully” continue doing the wrong thing with the confidence of a junior consultant holding a fresh slide deck? ...

April 10, 2026 · 15 min · Zelina

The Cost of Convenience: When AI Help Becomes Cognitive Debt

Help is not always helpful. Anyone who has managed a junior analyst, tutored a student, reviewed code, or trained a new employee knows the difference between solving a problem for someone and helping them become the kind of person who can solve the next one. The first option is faster. It feels generous. It clears the queue. It also quietly teaches the recipient a useful but dangerous lesson: difficult work should disappear as soon as help is available. ...

April 7, 2026 · 16 min · Zelina

The Proof Is in the Instance: Why AI Safety Can’t Be Fully Verified

The verifier that cannot know everything

Verification sounds like the sensible adult in the AI safety room. The model may hallucinate, the benchmark may flatter, the demo may sparkle under conference lighting, but the verifier is supposed to be the hard stop: a formal mechanism that checks whether an AI system’s behavior satisfies a specified policy. ...

April 7, 2026 · 17 min · Zelina

CRaFT and the Illusion of Safety: When ‘Sorry’ Is Just a Circuit

A refusal is easy to recognize. The model says it cannot help. The sentence sounds polite. The compliance team relaxes for three seconds. Everyone moves on. That is the comfortable version of AI safety: refusal as an observable behavior. The uncomfortable version is that refusal may be only the visible end of a much narrower internal computation. If that computation can be found, isolated, and steered, then the model’s “sorry, I can’t assist with that” is not a moral boundary. It is a circuit behavior. Very reassuring, in the same way a locked glass door is reassuring before someone points out the hinge. ...

April 5, 2026 · 15 min · Zelina

Mapping the Unknown: Turning AI Safety from Space into Proof

Proof sounds like a courtroom word. In safety-critical AI, it is more like warehouse management. First, define the space. Then label the shelves. Then check what is actually on them. Then find the empty slots. Then fill them deliberately rather than hoping the next random delivery truck brings exactly what the regulator asked for. Not glamorous. Also not optional. ...

April 3, 2026 · 14 min · Zelina