Cover image

Judo, Not Armor: Strategic Deflection as a New Defense Against LLM Jailbreaks

TL;DR for operators Most LLM safety systems still assume that, when a model sees a harmful request, the correct behaviour is refusal. That works until the attacker stops arguing with the prompt and starts interfering with generation itself. The paper behind this article, Strategic Deflection: Defending LLMs from Logit Manipulation, proposes SDeflection: a fine-tuning method that teaches a model to answer in a safe, topic-adjacent way rather than relying only on explicit refusal language.1 The model does not provide harmful instructions. It redirects the subject toward harmless information that is close enough to the original topic to survive attacks that try to force compliance-style openings. ...

July 31, 2025 · 16 min · Zelina