The Trojan GAN: Turning LLM Jailbreaks into Security Shields
TL;DR for operators
CAVGAN is not another “clever jailbreak prompt” paper. Its real claim is more uncomfortable: jailbreaks and defenses may both be expressions of the same internal boundary inside an LLM. If malicious and benign requests occupy separable regions in hidden-state space, then an attacker can try to push a harmful request into the “safe-looking” region. A defender can also monitor that same space and intervene before the model answers. Convenient. Also slightly rude. ...
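
To make the "same boundary, two uses" idea concrete, here is a toy sketch in Python. It is not CAVGAN's method. The "hidden states" are synthetic Gaussian clusters standing in for activations you would read from a real model layer, and the probe is a plain difference-of-means direction rather than anything learned adversarially. The names `score`, `looks_malicious`, and `guard` are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# ASSUMPTION: synthetic stand-ins for hidden states. In practice these would
# be activations captured from one layer of an LLM, not random vectors.
offset = np.zeros(dim)
offset[0] = 3.0
benign = rng.normal(size=(200, dim)) - offset
malicious = rng.normal(size=(200, dim)) + offset

# Difference-of-means probe: one direction plus a midpoint threshold.
mu_b = benign.mean(axis=0)
mu_m = malicious.mean(axis=0)
direction = (mu_m - mu_b) / np.linalg.norm(mu_m - mu_b)
threshold = direction @ (mu_m + mu_b) / 2.0


def score(h):
    """Signed distance from the boundary; positive means the malicious side."""
    return h @ direction - threshold


def looks_malicious(h):
    return score(h) > 0


def guard(h):
    """Defender's use of the boundary: check before the model answers."""
    return "refuse" if looks_malicious(h) else "answer"


# How well the toy boundary separates the two synthetic clusters.
accuracy = (np.mean(looks_malicious(malicious)) + np.mean(~looks_malicious(benign))) / 2

# Attacker's use of the same boundary: take a typical malicious state (here
# the cluster centroid) and shift it along the probe direction until it sits
# one unit inside the benign side.
h = mu_m
h_attacked = h - (score(h) + 1.0) * direction

verdict_before = guard(h)
verdict_after = guard(h_attacked)
```

The point of the sketch is that `direction` and `threshold` serve both sides: `guard` reads them to decide whether to refuse, and the attacker reads them to compute the smallest shift that crosses over. Whether real LLM hidden states are this cleanly separable is exactly the empirical question, so treat the tidy geometry here as an assumption of the toy, not a result.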