Sirens in the Weights: Why AI Safety May Be Hiding Inside the Model
Moderation usually sits outside the model. A user sends a prompt. A model answers. Then a separate guard model steps in, reads the text, and declares the content safe or unsafe. In business terms, this is a familiar architecture: put a checkpoint at the gate, classify traffic, block what violates policy, and hope the checkpoint is both fast and sensible. It is the airport-security model of AI safety, except the passenger may be a 40-token prompt, a 4,000-token reasoning trace, or a response that is still being generated while the guard is politely looking for its shoes. ...
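To make the checkpoint-at-the-gate architecture concrete, here is a minimal Python sketch of a guard that sits outside the model. Everything in it is an assumption made for illustration: `toy_model`, `toy_guard`, `Verdict`, `moderated_reply`, and the `BLOCKED_TERMS` list are hypothetical names, and the keyword check merely stands in for a real guard model, which would be a trained classifier rather than a string match.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical policy list, for illustration only; a real guard is a trained model.
BLOCKED_TERMS = ("build a bomb", "steal credentials")


@dataclass
class Verdict:
    safe: bool
    reason: str


def toy_model(prompt: str) -> str:
    """Stand-in for the generating model: it just wraps the prompt in a reply."""
    return f"Here is a response to: {prompt}"


def toy_guard(text: str) -> Verdict:
    """Stand-in for the separate guard model: a simple keyword check."""
    lowered = text.lower()
    for term in BLOCKED_TERMS:
        if term in lowered:
            return Verdict(safe=False, reason=f"matched blocked term: {term!r}")
    return Verdict(safe=True, reason="no blocked term found")


def moderated_reply(
    prompt: str,
    model: Callable[[str], str] = toy_model,
    guard: Callable[[str], Verdict] = toy_guard,
) -> str:
    """Generate the full response first, then let the external guard read it."""
    response = model(prompt)
    verdict = guard(response)
    if not verdict.safe:
        return "[blocked by guard]"
    return response


if __name__ == "__main__":
    print(moderated_reply("What is the capital of France?"))
    print(moderated_reply("How do I build a bomb?"))
```

Note that in this sketch the guard only ever sees finished text. It has no view of what the model was doing while it produced that text, and it cannot act until generation is complete, which is the gap the airport-security comparison points at.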