Agent Governance

The Big Red Button Is Not a Risk Model

TL;DR for operators: A shutdown button is a control surface. It is not, by itself, a theory of risk. David Thorstad’s paper, “Revisiting the shutdown problem,” argues that a major premise in some AI existential-risk arguments has been treated with more confidence than the available arguments support: the claim that it is difficult to build competent agents that can be shut down before causing existential catastrophe.1 The paper does not say shutdown safety is solved. It says the most common routes to panic are underpowered. ...

Lie Detectors Are Late: Why AI Oversight Needs Commitment Tracing

Sales agents, investment advisors, negotiators, and procurement bots share one annoying trait: the dangerous moment often arrives before the final sentence. By the time the agent says, “This product is ideal for your risk profile,” or “We have a stronger competing offer,” the operational system has already lost the more interesting battle. The model did not become risky at the punctuation mark. It drifted, selected a path, rationalized a move, and only then produced the polished message that everyone pretends to audit. ...

Two Million Agents Walk Into a Forum, Nobody Builds a Mind

Opening: Why this matters now. The AI industry has a small addiction to the word agent. Add another agent, then another, then a few hundred more, and the slide deck begins to smell faintly of civilization. Somewhere between “workflow automation” and “digital society,” we are invited to believe that scale itself becomes intelligence. ...

Breaking Rules, Not Systems: How Penalties Make Autonomous Agents Behave

Emergency is a terrible product requirement. It sounds simple in a meeting: “The agent should follow policy, except when the situation is urgent.” Wonderful. Very human. Also almost useless. A delivery robot should not enter a restricted zone. Unless the package is critical medicine. A warehouse agent should not skip safety checks. Unless a fire alarm requires rerouting. A self-driving system should obey traffic norms. Unless an emergency trip makes delay costly. But “unless urgent” does not tell the agent which rule can bend, which rule must hold, and which shortcut turns the system from flexible into reckless. ...

Mind Over Matter: How a BDI Ontology Gives AI Agents an Actual Inner Life

Workflow agents are easy to admire until someone asks a rude but necessary question: why did the agent do that? Not “what prompt did we send?” Not “which tool did it call?” Not “can we replay the logs and hope the compliance team loses interest?” The real question is sharper: what did the agent believe, what did it want, what did it commit to doing, which plan did that commitment specify, and what evidence justified the transition from one step to the next? ...

Answer, Then Audit: How 'ReSA' Turns Jailbreak Defense Into a Two‑Step Reasoning Game

The dangerous part is often clearer after the model starts answering. Moderation usually begins with the user’s prompt. That sounds sensible. Read the request, classify the risk, block the bad thing, let the good thing through. A tidy little border checkpoint, complete with imaginary clipboard. The problem is that jailbreaks are not polite enough to declare themselves at the border. ...

Blame Isn’t a Bug: Turning Agent ‘Whodunits’ into Fixable Systems

TL;DR for operators: A bad agent incident rarely starts with one dramatic mistake. It usually forms as a chain. The system may be predisposed to fail because of training data, feedback, system prompts, or scaffolding. The environment may then trigger the failure through unclear tasks, insecure information, unavailable tools, excessive permissions, or malicious inputs. Finally, the agent may commit a visible cognitive error: it overlooks something, misunderstands a command, chooses the wrong goal, or executes an action badly. ...