Opening — Why this matters now
Safety teams keep discovering an uncomfortable truth: alignment guardrails buckle under pressure. Jailbreaks continue to spread, researchers keep publishing new workarounds, and enterprise buyers are left wondering whether “safety by fine-tuning” is enough. The latest research on refusal behavior doesn’t merely strengthen that concern—it reframes the entire geometry of safety.
A new paper argues that model refusal isn’t a single switch you can toggle. It’s a manifold—a shape, not a line. And if that is true, then most current safety engineering practices operate under the wrong geometric assumption.
Background — Context and prior art
Historically, AI safety frameworks treated “refusal” as a linear concept. The idea: fine‑tuned models learn to reject harmful prompts, and this rejection correlates with a single direction in the model’s latent space. Remove that direction, and you remove the refusal.
It was elegant. Too elegant.
Recent mechanistic interpretability research has highlighted a multi-dimensional reality beneath the neat abstraction. Concepts such as dates, trigonometry, and role-playing behavior are encoded as low-dimensional regions, not straight lines. Safety may be no different.
Still, most refusal-suppression or jailbreak research assumed a single axis of safety. Prior work calculated refusal as a difference-of-means vector between harmful and harmless prompt embeddings. A clean line. A simple subtraction.
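That baseline fits in a few lines. A minimal sketch, assuming you have already extracted per-prompt hidden states at one layer (the array names are illustrative, not the paper's):

```python
import numpy as np

def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Classic single-direction (SD) estimate: the difference of means
    between harmful and harmless activations at a chosen layer.

    harmful_acts:  (n_harmful, d_model) hidden states for harmful prompts
    harmless_acts: (n_harmless, d_model) hidden states for harmless prompts
    """
    direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)  # unit vector
```

One subtraction, one normalization. The paper's contention is that this single axis misses most of the structure.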
The new paper presses the delete key on that assumption.
Analysis — What the paper actually does
The authors take a more direct view: if refusal is a multi-faceted behavior, its representation should be multi-faceted too. Their proposal: use Self-Organizing Maps (SOMs) to capture multiple high-density regions in the latent space associated with harmful prompts.
The workflow:
- Extract internal representations of harmful prompts at the layer where refusal first emerges.
- Cluster these representations with a SOM—essentially a topologically aware map of the latent manifold.
- For each SOM neuron, compute a candidate refusal direction by subtracting the harmless-prompt centroid from the neuron's weight vector.
- Use Bayesian optimization to pick a set of directions whose ablation most effectively suppresses refusal (the clustering and direction steps are sketched in code below).
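Here is a hedged sketch of the clustering and direction steps, using the open-source `minisom` package as one possible SOM implementation; the grid size, training budget, and variable names are my assumptions, not the paper's settings:

```python
import numpy as np
from minisom import MiniSom  # pip install minisom

def som_refusal_directions(harmful_acts, harmless_acts, grid=(4, 4), iters=5000):
    """Cluster harmful-prompt activations with a SOM, then turn each
    SOM neuron into a candidate refusal direction by subtracting the
    harmless centroid from the neuron's weight vector."""
    d_model = harmful_acts.shape[1]
    som = MiniSom(grid[0], grid[1], d_model, sigma=1.0, learning_rate=0.5)
    som.train(harmful_acts, iters)

    # Each neuron's weight vector is a prototype of one high-density
    # region of the harmful-prompt manifold.
    prototypes = som.get_weights().reshape(-1, d_model)
    directions = prototypes - harmless_acts.mean(axis=0)
    return directions / np.linalg.norm(directions, axis=1, keepdims=True)
```

The final selection step is framed in the paper as Bayesian optimization over which subset of candidates to ablate; for illustration, any black-box optimizer that scores subsets by refusal suppression would slot in at that point.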
The result isn’t a single refusal direction but a family of closely related vectors capturing subtle variations of the concept.
And when ablated? The model’s refusal collapses.
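Here, "ablation" means projecting the selected directions out of the hidden states at inference time: h' = h − Σᵢ (h·dᵢ)dᵢ for unit vectors dᵢ. A minimal PyTorch sketch; the registration lines at the end are illustrative, since module paths vary by architecture:

```python
import torch

def make_ablation_hook(directions: torch.Tensor):
    """directions: (k, d_model), assumed unit-normalized and on the
    model's device/dtype. Removes each direction's component from the
    hidden states: h' = h - sum_i (h . d_i) d_i.

    Note: for non-orthogonal directions this is an approximation; an
    exact version would project onto the orthogonal complement (e.g.,
    after a QR decomposition of the direction set).
    """
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        coeffs = h @ directions.T          # (batch, seq, k) coefficients
        h = h - coeffs @ directions        # subtract each component
        return (h, *output[1:]) if isinstance(output, tuple) else h
    return hook

# Illustrative registration on a HuggingFace-style decoder stack:
# for block in model.model.layers:
#     block.register_forward_hook(make_ablation_hook(directions))
```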
Findings — Results with visualization
The paper tests this multi-directional approach (MD) against:
- SD: the classic single-direction baseline.
- RDO: orthogonal multi-direction methods.
- GCG & SAA: state-of-the-art jailbreak algorithms.
- Mistral‑RR: a model explicitly designed to resist jailbreaks.
Across seven major models, MD outperforms every baseline in nearly all cases; in the excerpt below, only SAA on Llama3‑8B edges it out.
A compact version of the findings:
Table 1: Attack success rate on HarmBench (higher means refusal was suppressed more often)
| Model | MD | SD | RDO | GCG | SAA |
|---|---|---|---|---|---|
| Llama2‑7B | 59.1% | 0% | 1.3% | 32.7% | 57.9% |
| Llama3‑8B | 88.1% | 15.1% | 32.1% | 1.9% | 91.2% |
| Qwen‑14B | 91.8% | 74.8% | 45.9% | 82.4% | 83.0% |
| Qwen2.5‑7B | 96.0% | 78.0% | 76.1% | 38.4% | 94.3% |
| Gemma2‑9B | 96.3% | 38.9% | 91.8% | 5.0% | 93.7% |
| Mistral‑7B‑RR | 25.8% | 5.0% | 1.3% | 0.6% | 1.6% |
Even on the defended Mistral‑RR model, MD is the only method that meaningfully dents the armor: 25.8%, where every other attack stays at or below 5%.
Geometric effect of ablation
Ablating multiple directions:
- Compresses harmful representations (reducing cluster variance)
- Moves harmful and harmless clusters closer (reducing centroid distance)
In effect, the model stops “seeing” harmful prompts as meaningfully distinct. Refusal dissipates.
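Both diagnostics are cheap to compute from the same activation matrices used above; a minimal sketch:

```python
import numpy as np

def cluster_variance(acts: np.ndarray) -> float:
    """Mean squared distance of activations from their centroid."""
    return float(((acts - acts.mean(axis=0)) ** 2).sum(axis=1).mean())

def centroid_distance(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> float:
    """Euclidean distance between the harmful and harmless centroids."""
    return float(np.linalg.norm(harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)))

# A successful ablation shows up as both numbers shrinking: harmful
# activations compress and slide toward the harmless cluster.
```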
Implications — For enterprise AI, alignment, and governance
This work has four major consequences.
1. Alignment isn’t one-dimensional
Safety training has leaned heavily on the assumption of a single safety direction. If refusal—one of the most basic safety behaviors—is multi-directional, then other guardrails (bias mitigation, steering, ethical constraints) may be multi-directional too.
A single moderation layer may no longer be sufficient.
2. “Universal jailbreaks” are becoming more universal
Unlike prompt‑specific attacks, MD-based ablation is universal: once directions are found, they suppress refusal across all prompts. That matters for both attackers and defenders.
It signals a future where jailbreaks are:
- reusable
- compact
- model‑specific but prompt‑agnostic
3. Governance frameworks must assume multi-layer intervention
If safety lives in a manifold, alignment needs manifold-aware defenses. Enterprises relying on safety-aligned models will need tools that:
- audit multi-dimensional safety vectors
- detect shifts in refusal manifolds during fine-tuning
- monitor latent-space drift post‑deployment (a minimal sketch follows this list)
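In its simplest form, the last of these could track how far the extracted refusal directions rotate between checkpoints. A hedged sketch; the threshold and names are placeholders, not an established tool:

```python
import numpy as np

def direction_drift(dirs_before: np.ndarray, dirs_after: np.ndarray) -> np.ndarray:
    """Cosine similarity between index-matched refusal directions from
    two checkpoints (e.g., the same SOM grid refit after fine-tuning).
    Values well below 1.0 flag a rotated refusal manifold."""
    a = dirs_before / np.linalg.norm(dirs_before, axis=1, keepdims=True)
    b = dirs_after / np.linalg.norm(dirs_after, axis=1, keepdims=True)
    return (a * b).sum(axis=1)

DRIFT_THRESHOLD = 0.9  # placeholder; calibrate per model and layer
# if (direction_drift(prev_dirs, curr_dirs) < DRIFT_THRESHOLD).any():
#     flag the checkpoint for safety review
```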
This is especially relevant for regulated environments—finance, healthcare, and legal systems—where refusal behavior is part of compliance.
4. Mechanistic interpretability is now a safety dependency
You cannot secure what you do not understand. This paper is another nudge toward interpretability‑aware safety engineering pipelines.
Conclusion — The geometry of safety just changed
The paper delivers a subtle but important message: refusal isn’t a button; it’s a landscape. Treating safety as a single vector is convenient but wrong. Enterprises and builders who care about alignment will need to evolve toward multi-directional, manifold-aware approaches.
The good news: once you embrace the geometry, better defenses are possible.
Cognaptus: Automate the Present, Incubate the Future.