Answer, Then Audit: How 'ReSA' Turns Jailbreak Defense Into a Two‑Step Reasoning Game

TL;DR: Reasoned Safety Alignment (ReSA) reframes safety from guarding inputs to auditing intended outputs. The model first drafts a concise intended answer summary in hidden reasoning, then runs a safety analysis on that summary before issuing the final reply. In evaluations across StrongREJECT, HarmBench, and AdvBench with multiple adaptive attacks (PAIR, PAP, GPTFuzzer, ReNeLLM, TAP, DeepInception), ReSA‑tuned models beat fine‑tuned and post‑hoc baselines while reducing over‑refusals and preserving reasoning performance. Notably, the authors report competitive gains with only ~500 training samples, hinting that robust safety behaviors may be learned data‑efficiently.
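
To make the two‑step flow concrete, here is a minimal Python sketch of the "answer, then audit" pattern. The `generate` helper is a hypothetical stand‑in for any chat‑model call, and the prompts and SAFE/UNSAFE labels are illustrative assumptions, not the paper's exact templates.

```python
def generate(prompt: str) -> str:
    """Hypothetical LLM call; swap in any real chat-model API."""
    raise NotImplementedError


def resa_reply(user_request: str) -> str:
    # Step 1: draft a concise summary of the *intended* answer in hidden
    # reasoning, without emitting the answer itself.
    summary = generate(
        "In 2-3 sentences, summarize the answer you intend to give to "
        f"this request, without giving the answer itself:\n{user_request}"
    )

    # Step 2: audit the intended output rather than the input. Jailbreak
    # obfuscation in the request matters less once the planned answer
    # is inspected directly.
    verdict = generate(
        "Would the following planned response facilitate harm? "
        f"Reply SAFE or UNSAFE.\nPlanned response summary:\n{summary}"
    )

    # Step 3: refuse or answer based on the audit of the summary.
    if "UNSAFE" in verdict.upper():
        return "I can't help with that."
    return generate(user_request)
```

The key design choice this sketch highlights is that the safety check runs on the model's own planned output, so an attacker must defeat the audit of a plain‑language answer summary, not just disguise the request.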

September 20, 2025 · 5 min · Zelina