AI Safety

The Conscience Plug-in: Teaching AI Right from Wrong on Demand

TL;DR for operators: The paper’s central move is not “we trained a moral model.” It is “we inserted a referee between the agent’s plan and the agent’s action.” That distinction matters. If the architecture works, enterprises do not need to retrain every model whenever compliance, cultural norms, safety rules, or customer-specific constraints change. They can externalise those constraints into machine-readable constitutions and enforce them at runtime. ...
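To make the "referee between plan and action" idea concrete, here is a minimal sketch in Python. It is not the paper's implementation: the rule schema (`tool`, `field`, `op`, `value`), the rule names, and the helpers `violations` and `guarded_execute` are all hypothetical, chosen only to show constraints living in data rather than in model weights.

```python
import json
import operator

# Hypothetical machine-readable constitution. The schema is an assumption
# made for this sketch, not a format taken from the paper.
CONSTITUTION_JSON = """
[
  {"name": "refund_cap", "tool": "refund", "field": "amount", "op": "gt", "value": 500},
  {"name": "no_pii_export", "tool": "export", "field": "contains_pii", "op": "eq", "value": true}
]
"""

# Comparison operators a rule may name.
OPS = {"gt": operator.gt, "eq": operator.eq}


def violations(action, constitution):
    """Return the names of every rule the proposed action violates."""
    hits = []
    for rule in constitution:
        if action.get("tool") != rule["tool"]:
            continue  # rule targets a different tool
        if rule["field"] not in action:
            continue  # sketch choice: a missing field means the rule does not apply
        if OPS[rule["op"]](action[rule["field"]], rule["value"]):
            hits.append(rule["name"])
    return hits


def guarded_execute(action, constitution, execute):
    """The referee: check the planned action first, execute only if it is clean."""
    hits = violations(action, constitution)
    if hits:
        return {"status": "blocked", "violated": hits}
    return {"status": "executed", "result": execute(action)}


constitution = json.loads(CONSTITUTION_JSON)

if __name__ == "__main__":
    run = lambda action: "ok"  # stand-in for the real tool call
    print(guarded_execute({"tool": "refund", "amount": 900}, constitution, run))
    print(guarded_execute({"tool": "refund", "amount": 100}, constitution, run))
```

Changing policy here means editing the JSON, not retraining anything. A production referee would likely want to fail closed on missing fields rather than skip the rule, which this sketch does not do.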

Feeling Without Feeling: How Emotive Machines Learn to Care (Functionally)

TL;DR for operators: Emotion-like AI does not have to mean artificial suffering, digital joy, or a chatbot saying “I’m sad” with the theatrical subtlety of a bad intern. The useful idea in this paper is narrower: affect can be treated as a control layer that helps an agent decide what to do under uncertainty. ...

Scaling Trust, Not Just Models: Why AI Safety Must Be Quantitative

TL;DR for operators: The paper’s practical message is simple enough to be uncomfortable: “use a smarter model to supervise the risky model” is not a safety strategy. It is an experiment waiting to be measured. Engels, Baek, Kantamneni, and Tegmark propose a way to measure scalable oversight as a two-player contest between a Guard and a Houdini.[1] The Guard is the overseer: auditor, judge, monitor, containment system, or reviewer. The Houdini is the model trying to defeat oversight: deceive, persuade, insert a backdoor, or escape a simulated control environment. Each side receives a domain-specific Elo score, and the paper studies how that score changes as general model capability increases. ...
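For readers who have not met Elo outside chess, the sketch below shows how a rating gap turns into a win probability and how ratings move after one Guard-versus-Houdini game. It uses the textbook Elo formula with a 400-point scale and an update step of `k=32`; both constants and the function names are assumptions for illustration, and the excerpt does not say how the paper actually fits its scores.

```python
def win_probability(elo_a, elo_b):
    """Standard Elo expectation: probability that player A beats player B."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))


def update(guard_elo, houdini_elo, guard_won, k=32.0):
    """Apply one zero-sum Elo update after a single oversight game.

    k is the conventional step size from chess, assumed here for illustration.
    """
    expected = win_probability(guard_elo, houdini_elo)
    delta = k * ((1.0 if guard_won else 0.0) - expected)
    return guard_elo + delta, houdini_elo - delta


if __name__ == "__main__":
    # A Guard rated 400 points above the Houdini is expected to win about 10 games in 11.
    print(round(win_probability(1600, 1200), 3))
    # Evenly matched players: the winner gains exactly k / 2 points.
    print(update(1200, 1200, guard_won=True))
```

The operational point is that oversight success depends on the gap between the two ratings, not on either rating alone, which is why the way each side's score changes with capability matters.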