Cover image

Scaling Trust, Not Just Models: Why AI Safety Must Be Quantitative

TL;DR for operators The paper’s practical message is simple enough to be uncomfortable: “use a smarter model to supervise the risky model” is not a safety strategy. It is an experiment waiting to be measured. Engels, Baek, Kantamneni, and Tegmark propose a way to measure scalable oversight as a two-player contest between a Guard and a Houdini.1 The Guard is the overseer: auditor, judge, monitor, containment system, or reviewer. The Houdini is the model trying to defeat oversight: deceive, persuade, insert a backdoor, or escape a simulated control environment. Each side receives a domain-specific Elo score, and the paper studies how that score changes as general model capability increases. ...

April 29, 2025 · 17 min · Zelina