When Models Know They’re Wrong: Catching Jailbreaks Mid-Sentence
Guardrails usually fail quietly. A user sends a malicious prompt. The model begins answering. The safety policy that looked firm in the demo environment starts behaving like office wallpaper: present, decorative, and not especially involved. By the time a post-hoc filter reads the final answer, the model has already produced the thing it should not have produced. The system may block the response from the user, but the real lesson is less flattering: the model crossed the line before the defense noticed. ...