When Models Know They’re Wrong: Catching Jailbreaks Mid-Sentence
Opening: Why this matters now

Most LLM safety failures don't look dramatic. They look fluent. A model doesn't suddenly turn malicious. It drifts there, token by token, guided by coherence, momentum, and the quiet incentive to finish the sentence it already started. Jailbreak attacks exploit this inertia. They don't delete safety alignment; they outrun it. ...