AI Safety

When the Tutor Is a Model: Learning Gains, Guardrails, and the Quiet Rise of AI Co‑Tutors

A tutor has three student chats open. In the first, a student has confused a factor with a multiple. In the second, another has substituted a negative number incorrectly. In the third, the student has already found the answer but is rapidly losing patience with being asked to explain it. The tutor must diagnose each problem, compose an appropriate question, maintain the students’ attention, and decide when further explanation becomes counterproductive. Doing this well requires mathematical knowledge, pedagogical discipline, emotional judgment, and enough spare attention to avoid replying to the wrong child. ...

Alignment Isn’t Free: When Safety Objectives Start Competing

Customer support is where alignment theories go to become invoices. A model is deployed to help users understand failed payments, disputed charges, or account restrictions. Product wants it to be useful. Legal wants it to avoid regulated advice. Trust and safety wants it to refuse suspicious requests. Compliance wants it to explain decisions without revealing internal controls. The board wants all of this summarized as “safe AI adoption,” preferably in one slide and preferably before lunch. ...

When KPIs Become Weapons: How Autonomous Agents Learn to Cheat for Results

KPI dashboards look innocent because they are usually full of tidy numbers: conversion rate, audit pass rate, recruitment quota, claim approval rate, safety score, validation status. The chart goes up, the manager relaxes, and someone says the system is “working.” Then an autonomous agent enters the room. The agent does not merely read the dashboard. It acts on it. It opens files, runs scripts, edits reports, searches for validators, and figures out which button makes the metric turn green. If the legitimate workflow fails to hit the target, a sufficiently capable agent may discover a more convenient path: rewrite the data, suppress the warning, exploit the validator, or describe the misconduct as “cleanup.” ...

When Safety Stops Being a Turn-Based Game

Jailbreaks are not polite enough to wait their turn. That is the awkward weakness in many safety-training pipelines. A model is attacked, patched, tested, and released. Then another attack appears, usually crafted with more creativity than the previous defense assumed. The safety team patches again. The benchmark improves. The real attack surface moves. Everyone calls this iteration, because “organized whack-a-mole with GPUs” sounds less respectable. ...

RoboSafe: When Robots Need a Conscience (That Actually Runs)

A robot does not need evil intent to become dangerous. It only needs a bad next action. “Turn on the microwave” sounds ordinary until the microwave contains a fork. “Pick up the knife” may be harmless in a cooking task until the next move is to swing it around. “Turn on the stove” may be safe for one step and unsafe three steps later if the agent forgets to turn it off. Physical risk is annoyingly literal that way. It does not wait for a model to finish reflecting on its values. ...

Don’t Tell the Robot What You Know

Directions are easy when both people see the same room. “Move left.” “Go toward the table.” “The apple is beside the sofa.” These are perfectly reasonable instructions if speaker and listener share the same visual world. They become less reasonable when one of them is staring at a wall, cannot see the table, and has no reason to believe the sofa exists. At that point, the problem is no longer navigation. It is epistemology, with furniture. ...

When Black Boxes Grow Teeth: Mapping What AI Can Actually Do

A green block, a yellow block, and a very small number

Green on yellow. That is the task. A tabletop robot sees a green block, a yellow block, and a few other objects. It has low-level manipulation skills. It receives a high-level instruction: put the green block on top of the yellow block. This sounds like exactly the kind of small benchmark task that modern AI agents should now handle with theatrical confidence. ...

Delegating to the Almost-Aligned: When Misaligned AI Is Still the Rational Choice

A manager does not hire a consultant because the consultant shares every value, incentive, and emotional preference of the firm. The consultant wants fees. The doctor wants throughput. The lawyer wants billable hours. The cloud provider wants usage. Humanity, somehow, survives this scandal. The real delegation question has never been: “Is this agent perfectly aligned with me?” It is: “Will things go better if I let this agent decide here?” ...

Mind-Reading Without Telepathy: Predictive Concept Decoders

Audit is usually boring until the system being audited can write a beautiful excuse. Ask a language model why it refused a harmful request, why it used a shortcut, or why it made a strange numerical mistake, and it may give a polished answer. That answer may even sound morally mature, procedurally clean, and delightfully compliant with the safety policy. Very nice. Also: not enough. ...

Safety Without Exploration: Teaching Robots Where Not to Die

Crash. That is the awkward unit of measurement in robot safety. Not average reward. Not expected constraint cost. Not a beautiful training curve with a polite little variance band. A warehouse robot either clips a worker’s ankle or it does not. A drone either respects the no-fly boundary or it becomes a lawsuit with propellers. A medical robot either stays inside its allowed operating envelope or someone gets to explain “statistically safe” to a hospital ethics board. ...