The Big Red Button Is Not a Risk Model

TL;DR for operators: A shutdown button is a control surface. It is not, by itself, a theory of risk. David Thorstad’s paper, Revisiting the shutdown problem, argues that a major premise in some AI existential-risk arguments has been treated with more confidence than the available arguments support: the claim that it is difficult to build competent agents that can be shut down before causing existential catastrophe.[1] The paper does not say shutdown safety is solved. It says the most common routes to panic are underpowered. ...

June 24, 2026 · 21 min · Zelina

Context Collapse: Why AI’s Next Bottleneck Is Knowing What Matters

TL;DR for operators: AI is getting fluent enough to be dangerous in boring ways. It can describe a scene, generate a video, and write a policy memo with impressive confidence. The problem is that real operations rarely fail at the level of generic fluency. They fail when the system confuses which person did what, blends event one into event two, or treats a documented atrocity as a debate club prompt because a user asked for “balance”. ...

June 17, 2026 · 17 min · Zelina

Mind the Tail: Quantum Rare-Event Sampling Without the Discovery Tax

TL;DR for operators: Risk teams do not only need more samples. They need samples from the part of the distribution that almost never appears until it ruins the quarter, the grid, the model launch, or the compliance meeting. The paper behind this article, Quantum enhanced rare event discovery and sampling, proposes a quantum algorithm for doing exactly that: sample from outcomes whose probabilities are below a threshold $\Delta$, without first identifying the rare set by brute force.[1] ...

June 16, 2026 · 16 min · Zelina

Cheap Seats, Sharp Eyes: Reward-Hack Detection Without the Frontier Judge

TL;DR for operators: A frontier LLM judge is an expensive way to inspect every agent trajectory for reward hacking. This paper asks whether a much smaller detector can do most of that monitoring job at much lower cost. The answer is: yes, under the same information condition, and with important caveats. A 13.8M-parameter transformer encoder plus a logistic regression probe detects reward hacking in cleaned Terminal-Wrench trajectories with 0.9467 AUC and 0.8296 TPR@5%FPR. In the authors’ matched comparison, a reproduced gpt-5.4 judge reaches 0.9510 AUC and 0.7130 TPR@5%FPR on the cleaned sanitized-vs-baseline split.[1] ...

June 15, 2026 · 6 min · Zelina

Lie Detectors Are Late: Why AI Oversight Needs Commitment Tracing

Sales agents, investment advisors, negotiators, and procurement bots share one annoying trait: the dangerous moment often arrives before the final sentence. By the time the agent says, “This product is ideal for your risk profile,” or “We have a stronger competing offer,” the operational system has already lost the more interesting battle. The model did not become risky at the punctuation mark. It drifted, selected a path, rationalized a move, and only then produced the polished message that everyone pretends to audit. ...

June 12, 2026 · 17 min · Zelina

Trust Issues, Benchmarked: Why Hallucination Detection Is a Portfolio Problem

Trust is a bad deployment strategy. That is not a moral statement. It is an operations statement. In most enterprise AI workflows, the uncomfortable question is not “Can the model answer?” The model will answer. Models are generous like that. The question is whether the organization has a reliable way to notice when the answer is unsupported, fabricated, overconfident, or merely polished nonsense wearing a tie. ...

June 10, 2026 · 16 min · Zelina

The Policy Has to Work Somewhere: RL for Scale, Trust, and Other Inconveniences

Deployment is where elegant AI systems go to meet bandwidth caps, slow devices, noisy user preferences, and privacy policies written by committees with very strong coffee. That is the useful lens for reading Guangchen Lan’s dissertation, Reinforcement Learning for Scalable and Trustworthy Intelligent Systems.[1] It is tempting to describe the work as a collection of four reinforcement-learning methods: one for synchronous federated RL, one for asynchronous federated RL, one for preference optimization, and one for contextual privacy. Technically, that is true. Editorially, it is the least interesting way to read it. ...

June 8, 2026 · 21 min · Zelina

Mind the Slot: Jailbreak Prompts Have Weak Points, Not Just Bad Words

Security teams like to search for suspicious strings. That habit is understandable. Strings are visible. They can be logged, filtered, matched, scored, and proudly displayed in dashboards. A bad suffix at the end of a prompt looks like a bad suffix at the end of a prompt. Convenient. Almost too convenient. The problem is that prompts are not flat text boxes. They are transformed into token sequences, wrapped in chat templates, and passed through attention layers that do not treat every position equally. Some positions receive more influence over the model’s next-token behavior than others. Put adversarial tokens there, and the same amount of “badness” can travel farther. ...

June 6, 2026 · 19 min · Zelina

Don’t Just Guard the Door: Jailbreak Safety Needs Checkpoints

A single prompt classifier is an attractive idea because it is simple, cheap, and easy to draw in a system diagram. The user sends a prompt. The guard says safe or unsafe. The model either answers or refuses. Very tidy. Also, increasingly incomplete. ...

May 30, 2026 · 15 min · Zelina

The Confidence Trick: When Long AI Reasoning Arrives Too Early

A model gives you a long answer. It lists assumptions. It walks through steps. It sounds patient, organized, and slightly overqualified for the task. In a business setting, that style is comforting. A compliance analyst sees a neat explanation. A finance team sees a transparent calculation. A product manager sees “reasoning.” Everyone relaxes a little. ...

May 29, 2026 · 19 min · Zelina