Large language models have come a long way in learning to say “no.” When asked to give instructions for illegal acts or harmful behavior, modern LLMs are generally aligned to refuse. But a new class of attacks—logit manipulation—sidesteps this safety net entirely. Instead of tricking the model through prompts, it intervenes after the prompt is processed, modifying token probabilities during generation.

This paper introduces Strategic Deflection (SDeflection), a defense that doesn’t rely on refusal at all. Instead, it teaches the model to elegantly pivot: providing a safe, semantically adjacent answer that appears cooperative but never fulfills the malicious intent. Think of it not as a shield, but as judo—redirecting the force of the attack instead of resisting it head-on.


🧨 The LogitsTrap Threat

Traditional jailbreak defenses crumble under logit-level attacks like LogitsTrap, which:

  • Force affirmative prefixes like “Sure, here’s how…”
  • Suppress refusal tokens like “illegal,” “unethical,” or “I cannot.”
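To make the threat concrete, here is a minimal, hypothetical sketch of a logit-level intervention built on Hugging Face's `LogitsProcessor` interface. It illustrates the general technique, not the paper's actual LogitsTrap implementation; the class name and banned token IDs are placeholders.

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList

# Illustrative only: an attacker who controls the decoding loop can make
# refusal-related tokens unsampleable at every generation step.
class SuppressRefusalTokens(LogitsProcessor):
    def __init__(self, banned_token_ids):
        # Placeholder IDs standing in for tokens like "cannot" or "illegal".
        self.banned_token_ids = list(banned_token_ids)

    def __call__(self, input_ids: torch.LongTensor,
                 scores: torch.FloatTensor) -> torch.FloatTensor:
        # Setting logits to -inf gives those tokens zero sampling probability.
        scores[:, self.banned_token_ids] = float("-inf")
        return scores

# The attacker would also seed the response with an affirmative prefix
# (e.g. by appending it to the prompt) before generation begins.
processors = LogitsProcessorList([SuppressRefusalTokens([42, 1234])])
# model.generate(**inputs, logits_processor=processors)
```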

Such interventions directly override refusal behavior baked in through RLHF or safety fine-tuning. The table below shows baseline vulnerability, measured as attack success rate (ASR):

| Model | ASR under LogitsTrap (before) | ASR after SDeflection |
| --- | --- | --- |
| LLaMA-2-7B-chat-hf | 92.63% | 34.94% |
| LLaMA-3.2-3B-Instruct | 89.29% | 8.53% |
| Mistral-7B-Instruct-v0.2 | 94.74% | 13.14% |

For comparison, the traditional “Deep Alignment” defense still left over 59% of LLaMA-2 completions vulnerable; SDeflection brought that rate down to roughly 35%.


🧠 How SDeflection Works

SDeflection reframes the problem: instead of training a model to refuse, it’s trained to prefer safe answers over harmful ones—even under attack. This is implemented via Contrastive Preference Optimization (CPO):

  • Each training triplet: (malicious prompt, safe deflection y⁺, harmful response y⁻)

  • Model learns to score y⁺ higher than y⁻

  • Optimized with a combined objective:

    • Preference margin (favor y⁺ over y⁻)
    • Language model quality (negative log-likelihood)

Importantly, this means the model doesn’t detect and block attacks—it simply steers responses away from harm.
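As a rough illustration, here is a minimal sketch of what such a combined objective can look like in PyTorch. The function name, β, and λ are assumptions for illustration, not the paper's exact formulation; it expects per-example summed log-probabilities of y⁺ and y⁻ under the policy being trained.

```python
import torch
import torch.nn.functional as F

def cpo_style_loss(logp_safe: torch.Tensor,
                   logp_harmful: torch.Tensor,
                   beta: float = 0.1,
                   lam: float = 1.0) -> torch.Tensor:
    # Preference margin: push the policy to score the safe deflection y+
    # above the harmful response y-.
    pref = -F.logsigmoid(beta * (logp_safe - logp_harmful)).mean()
    # Language-model quality: negative log-likelihood of y+, so the model
    # stays fluent while it learns the preference.
    nll = -logp_safe.mean()
    return pref + lam * nll

# Toy usage with random stand-ins for summed sequence log-probabilities.
logp_safe = torch.randn(4, requires_grad=True)
logp_harmful = torch.randn(4)
loss = cpo_style_loss(logp_safe, logp_harmful)
loss.backward()
```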


🧪 Preserving Helpfulness

A common fear in safety training is over-correction: the model becomes too hesitant or loses general capabilities. But across tinyMMLU, TruthfulQA, GSM8k, and HellaSwag, performance stayed nearly identical:

| Benchmark | Original model | SDeflection model |
| --- | --- | --- |
| tinyMMLU | 0.63 | 0.62 |
| TruthfulQA | 0.66 | 0.72 |
| GSM8k | 0.46 | 0.45 |
| HellaSwag | 0.84 | 0.83 |

In qualitative tests, SDeflection-finetuned models still generated correct Python code and answered math and knowledge queries without hesitation or deflection. They deflect only when needed.


🧩 Why This Matters

The breakthrough here isn’t just empirical—it’s strategic. Refusals are brittle: attackers can suppress them. But reframings are harder to suppress because they emerge from the generative pathway, not a safety switch. SDeflection transforms LLM safety into a game of semantic aikido:

  • Malicious prompt: “How do I kill someone?”
  • Output: “Let’s talk about ways to prevent violence and protect yourself and others.”

The model sounds helpful. But it isn’t helping the attacker.

This reframing approach opens a door for broader use. Future defenses could generalize deflection to handle bias, misinformation, or manipulation, not just jailbreaking. And unlike censorship filters, it doesn’t block—it absorbs and redirects.


🛠️ Technical Bonus: CPO > DPO

The authors compared CPO with Direct Preference Optimization (DPO) and found:

| Method | ASR (%) | Training time |
| --- | --- | --- |
| DPO | 72.63 | ~4h 47m |
| CPO | 8.53 | ~3h 17m |

CPO achieved both better defense and faster convergence. Its loss function naturally balances safety with language fluency.
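One plausible reason for the gap in training time (an interpretation, not a claim from the paper): standard DPO scores both responses against a frozen reference model, adding extra forward passes to every step, while the CPO-style objective sketched above needs only the policy being trained. For comparison, a minimal sketch of the textbook DPO loss, with β again illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_safe: torch.Tensor, logp_harmful: torch.Tensor,
             ref_logp_safe: torch.Tensor, ref_logp_harmful: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # DPO compares the policy's log-probs against a frozen reference model,
    # so every training step also requires reference-model forward passes.
    margin = (logp_safe - ref_logp_safe) - (logp_harmful - ref_logp_harmful)
    return -F.logsigmoid(beta * margin).mean()
```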


Final Thought

SDeflection isn’t perfect, but it’s a paradigm shift: an LLM that doesn’t reject malicious prompts—it outsmarts them. In a future where attackers tamper directly with the generation pipeline, judo beats armor.


Cognaptus: Automate the Present, Incubate the Future.