Cover image

When Agents Believe Their Own Hype: The Hidden Cost of Agentic Overconfidence

Code review has a comforting ritual. A developer submits a patch. A reviewer inspects it. The reviewer says it looks good. Everyone feels slightly better, because at least someone checked. In AI-agent workflows, this ritual becomes even more tempting: let one agent write the patch, let another agent review it, then ask the reviewer how confident it is. ...

February 9, 2026 · 19 min · Zelina
Cover image

Scaling Trust, Not Just Models: Why AI Safety Must Be Quantitative

TL;DR for operators The paper’s practical message is simple enough to be uncomfortable: “use a smarter model to supervise the risky model” is not a safety strategy. It is an experiment waiting to be measured. Engels, Baek, Kantamneni, and Tegmark propose a way to measure scalable oversight as a two-player contest between a Guard and a Houdini.1 The Guard is the overseer: auditor, judge, monitor, containment system, or reviewer. The Houdini is the model trying to defeat oversight: deceive, persuade, insert a backdoor, or escape a simulated control environment. Each side receives a domain-specific Elo score, and the paper studies how that score changes as general model capability increases. ...

April 29, 2025 · 17 min · Zelina