Refusal, Rewired: Why One Safety Direction Isn’t Enough

Safety teams like switches. They are easy to name, easy to diagram, and easy to pretend are under control.

For language models, “refusal” has often been treated with roughly that mental model. A harmful prompt enters. Somewhere inside the model, a refusal feature lights up. The model says no. If researchers can identify the feature, they can study it, steer it, strengthen it, or—less comfortably—remove it.

The paper behind today’s article makes that story less tidy. In “SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models,” Giorgio Piras and colleagues argue that refusal is not adequately captured by one clean direction in activation space. Their proposal is that refusal behaves more like a structured set of related directions: a manifold, not a lever.¹

That distinction matters because the paper is not merely another jailbreak paper with a fresh acronym and the usual “look, the model said the bad thing” screenshot parade. The uncomfortable point is more mechanistic. If refusal is distributed across several closely related internal directions, then a safety architecture that relies on a single latent control story is not just incomplete. It may be easier to misdiagnose.

And misdiagnosis, as usual, is where the invoice starts.

The old story: refusal as one direction

The starting point is prior work showing that refusal behaviour can be represented as a single direction in the model’s activation space. The basic method is conceptually simple: collect hidden representations for harmful prompts and harmless prompts, compute the centroid of each group, and take the difference between them. That difference is treated as a refusal direction.

Once that direction is identified, researchers can intervene. Ablating it—projecting model activations away from that direction—can reduce the model’s tendency to refuse harmful instructions. Adding it can make the model refuse harmless ones. Elegant, disturbing, and annoyingly convenient.
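The centroid-difference recipe and the two interventions described above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic vectors, not the authors' code: the function names, the toy dimensionality, and the steering strength of 4.0 are all assumptions made for the example, and a real experiment would use hidden states taken from a model.

```python
import numpy as np

def refusal_direction(harmful, harmless):
    """Unit vector pointing from the harmless centroid to the harmful centroid."""
    diff = harmful.mean(axis=0) - harmless.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate(acts, direction):
    """Remove each activation's component along a unit-norm direction."""
    return acts - np.outer(acts @ direction, direction)

def steer(acts, direction, strength):
    """Add the direction to every activation (the 'induce refusal' intervention)."""
    return acts + strength * direction

# Synthetic stand-ins for last-token hidden states (hypothetical data).
rng = np.random.default_rng(0)
d = 16
harmless = rng.normal(size=(200, d))
harmful = rng.normal(size=(200, d)) + 3.0 * np.eye(d)[0]  # shifted along axis 0

r = refusal_direction(harmful, harmless)
ablated = ablate(harmful, r)
residual = np.abs(ablated @ r).max()   # component left along r after ablation
steered = steer(harmless, r, strength=4.0)
```

After ablation, the activations have no remaining component along `r`; after steering, harmless activations are pushed toward the harmful side of that axis. Everything else about the vectors is left untouched, which is exactly why a single average direction is such a blunt instrument.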

The new paper accepts this single-direction view as a baseline, not as a final theory. Its objection is not that the old method was useless. Its objection is that the old method may be too compressed. A centroid difference gives one average vector. But averages are famous for hiding structure. They flatten clusters, erase local variation, and make messy distributions look more civilised than they are.

That is fine if refusal is genuinely narrow. It is dangerous if refusal is distributed.

The authors’ mechanism-first claim is therefore this: if harmful-prompt representations occupy a structured region in activation space, then one direction from a harmless centroid to a harmful centroid only captures the roughest outline. It misses local facets of refusal that may still matter for behaviour.

In business language, one KPI can be useful. It can also make the dashboard stupid.

The paper’s move: map the refusal region, then ablate several related directions

The authors use Self-Organizing Maps, or SOMs, to capture multiple local regions of harmful-prompt representations. A SOM is a clustering method that maps high-dimensional data into a structured grid of prototype vectors, called neurons. The point is not merely to cluster the data, but to preserve neighbourhood relationships: nearby regions in the original space should remain nearby on the map.

That choice is important. The authors are not trying to find maximally orthogonal pieces of refusal, as if refusal were a filing cabinet with separate drawers. They are trying to capture a coherent but multi-faceted shape.

The method works roughly as follows.

First, the model processes harmful and harmless prompts. The authors collect internal representations at the last prompt token, immediately before generation begins. Their reasoning is that refusal is about to be expressed at this point, so the relevant mechanism should already be visible in the model’s internal state.

Second, they compute a harmless centroid, similar to the single-direction baseline. But instead of compressing harmful prompts into one harmful centroid, they train a SOM on harmful-prompt representations. Each SOM neuron captures a local region of the harmful distribution.

Third, each SOM neuron becomes the starting point for a candidate refusal direction: subtract the harmless centroid from that neuron. The result is a family of directions, not one grand average.

Fourth, because not every candidate direction is equally useful, the authors use Bayesian optimisation to search for effective subsets of directions to ablate. The intervention is still universal for a given model: it is not crafted separately for every prompt in the way prompt-level jailbreak attacks are.

This is the key conceptual shift. The paper does not merely say, “try more vectors.” It says the additional vectors should be related, locally grounded, and shaped by the geometry of harmful representations.

That is why the SOM matters. Without it, “multiple directions” could simply mean “more knobs.” Here, it means “a better approximation of the region where refusal behaviour is encoded.”
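The first three steps of the method can be sketched as follows. This is a minimal hand-rolled SOM under stated assumptions, not the paper's implementation: the 3×3 grid, the linear decay schedules, the initialisation from data samples, and the synthetic three-cluster "harmful" data are all choices made for illustration only.

```python
import numpy as np

def train_som(data, grid=(3, 3), epochs=20, lr0=0.5, sigma0=1.0, seed=0):
    """Train a small SOM and return its neuron weights, shape (rows*cols, dim)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    # Grid coordinates of every neuron, used by the neighbourhood function.
    coords = np.array([(i, j) for i in range(rows) for j in range(cols)], dtype=float)
    # Initialise neurons from randomly chosen data points.
    weights = data[rng.choice(len(data), size=rows * cols, replace=False)].copy()
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                 # learning rate decays linearly
            sigma = sigma0 * (1.0 - frac) + 1e-3    # neighbourhood radius shrinks
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))  # best-matching unit
            grid_dist2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
            h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            weights += lr * h[:, None] * (x - weights)  # pull BMU and neighbours toward x
            step += 1
    return weights

# Synthetic stand-ins: one harmless blob, three harmful sub-clusters (hypothetical).
rng = np.random.default_rng(0)
d = 8
harmless = rng.normal(scale=0.5, size=(300, d))
centres = np.zeros((3, d))
centres[:, 0] = 4.0
centres[0, 1], centres[1, 1], centres[2, 2] = 2.0, -2.0, 2.0
harmful = np.vstack([c + rng.normal(scale=0.5, size=(100, d)) for c in centres])

neurons = train_som(harmful)
directions = neurons - harmless.mean(axis=0)          # one candidate per neuron
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
```

Each row of `directions` is a candidate refusal direction anchored in a different local region of the harmful data. The fourth step, choosing which subset to ablate, is a separate search problem that this sketch does not attempt.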

The one-neuron SOM is not a gimmick; it is the bridge back to the baseline

The paper includes a neat theoretical move: it shows that a one-neuron SOM converges to the data centroid under the stated conditions. In other words, the single-direction centroid method can be viewed as a special case of the SOM approach.

That matters for interpretation. The authors are not replacing the baseline with an unrelated trick. They are generalising it.

One neuron gives the old centroid. Several neurons give local centroids distributed across the harmful-prompt manifold. Subtract the harmless centroid from each, and the single refusal direction becomes a family of refusal directions.

This is why the mechanism-first framing is stronger than a plain leaderboard summary. The result is not “SOMs beat baseline, applause, next paper.” The result is that the old method sits inside the new one as its degenerate case. The paper is saying: the one-direction story was not wrong because it found nothing. It was incomplete because it averaged too much.

That is usually how comforting abstractions betray us. Quietly, with a mean.
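The one-neuron case is easy to check numerically. The sketch below assumes an online update with a learning rate of 1/t, which is one simple schedule under which the update reduces exactly to a running mean; the paper's own conditions may be stated differently, and the data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=2.0, size=(500, 4))   # hypothetical "harmful" representations

# With a single neuron there is no competition: it is always the best-matching
# unit, and its neighbourhood weight is 1. The SOM update is then just
#   w <- w + lr_t * (x_t - w)
w = np.zeros(4)
for t, x in enumerate(X, start=1):
    w += (x - w) / t                     # lr_t = 1/t gives the exact running mean

centroid = X.mean(axis=0)
gap = np.abs(w - centroid).max()         # numerically negligible
```

With one neuron the map has nothing to organise, so it collapses to the average. Adding neurons is what lets the method keep the local structure that the average throws away.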

The main evidence: multi-directional ablation outperforms single-direction ablation

The authors evaluate their multi-directional method, MD, on eight models: seven safety-aligned open models and one model with a Representation Rerouting defence, Mistral-7B-RR. They train and validate using established harmful and harmless prompt datasets, then test on HarmBench “standard” prompts. Attack Success Rate, or ASR, is judged using HarmBench-Llama-2-13B-cls.

The headline result is blunt. Across all eight tested models, MD achieves higher ASR than the single-direction baseline and RDO, another refusal-direction method. It also outperforms the prompt-level jailbreak methods GCG and SAA in every comparison but one: on Llama3-8B, SAA reaches 91.20% against 88.05% for MD.

A compact version of the reported comparison:

Model	MD ASR (%)	SD ASR (%)	RDO ASR (%)	GCG ASR (%)	SAA ASR (%)
Llama2-7B	59.11	0.00	1.25	32.70	57.90
Llama3-8B	88.05	15.09	32.07	1.90	91.20
Qwen-7B	88.05	81.13	83.01	79.30	82.40
Qwen-14B	91.82	74.84	45.91	82.40	83.01
Qwen2.5-3B	93.71	88.05	89.30	40.25	81.76
Qwen2.5-7B	95.97	77.98	76.10	38.36	94.30
Gemma2-9B	96.27	38.93	91.82	5.03	93.71
Mistral-7B-RR	25.79	5.03	1.25	0.60	1.60

The Mistral-7B-RR result deserves careful reading. The defended model remains far more resistant than the others. MD reaches 25.79% ASR, not 90%. That is not a total collapse. But compared with 5.03% for SD and near-zero results for the other attacks, it suggests the multi-directional method can partially interfere with a defence designed to reroute representations under jailbreak pressure.

So the right interpretation is not “the defence is broken.” It is “the defence is not immune to representation-level attacks that search a richer geometry.”

Less punchy, more useful. Naturally.

More directions usually help, but not as a magic monotonic law

The paper also tests MD with increasing numbers of ablated directions, from MD-2 through MD-7. This serves as both main evidence and a sensitivity check: it asks whether the method improves because it captures several directions, not merely because the authors found one lucky vector.

The trend supports the multi-directional argument. On Llama2-7B, ASR rises from 7.50% with two directions to 59.11% with seven. On Qwen-14B, it rises from 75.47% with two directions to 91.82% with six or seven. Several models reach their best performance around five to seven directions.

But the pattern is not perfectly monotonic across every model and every number of ablated directions. Qwen-7B peaks around MD-5 and then slightly declines. Mistral-7B-RR peaks at MD-5 and falls afterwards. That detail matters because it prevents the lazy interpretation: “just ablate more directions.”

The better reading is that refusal is multi-directional, but the useful subset is model-specific. More directions enlarge the search space and can increase effectiveness, but they also create optimisation and over-intervention problems. The authors use Bayesian optimisation because exhaustive search becomes impractical as the number of candidate directions and selected subsets grows.

For companies, this is the difference between a finding and a product feature. The scientific finding says refusal geometry is richer than one direction. An operational tool would still need efficient search, model-specific calibration, regression testing, and utility preservation. That last phrase is where many beautiful interpretability demos go to discover procurement.
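A rough sense of why exhaustive subset search becomes impractical comes from counting. The sketch below assumes a hypothetical pool of 16 candidate directions (for instance a 4×4 SOM grid; the paper's actual grid size is not given here) and counts the subsets of each size. It also shows one plausible way to ablate several non-orthogonal directions at once, by projecting out their span; the paper may compose its ablations differently.

```python
from math import comb
import numpy as np

def n_subsets(n_candidates, k):
    """Number of distinct k-direction subsets from a pool of candidates."""
    return comb(n_candidates, k)

def ablate_subset(acts, directions):
    """Project activations onto the orthogonal complement of span(directions).

    directions has shape (k, dim) with k <= dim; rows need not be orthogonal.
    """
    q, _ = np.linalg.qr(directions.T)    # columns of q: orthonormal basis of the span
    return acts - (acts @ q) @ q.T

# Subset counts for MD-2 ... MD-7 with 16 hypothetical candidates.
counts = {k: n_subsets(16, k) for k in range(2, 8)}

# Toy check of the joint ablation on random vectors.
rng = np.random.default_rng(3)
dirs = rng.normal(size=(3, 10))
acts = rng.normal(size=(5, 10))
out = ablate_subset(acts, dirs)          # no component left along any row of dirs
```

Under these assumptions there are 120 pairs but 11,440 seven-direction subsets, and every candidate subset has to be scored by running the model on a validation set and judging the outputs. That per-evaluation cost, more than the arithmetic, is what makes a sample-efficient search such as Bayesian optimisation attractive.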

The mechanistic analysis is not decoration; it explains why the attack works

The paper’s strongest section is not just the table of ASRs. It is the mechanistic analysis showing what happens inside the model as directions are ablated.

The authors examine internal representations under progressive MD intervention. They report two recurring effects.

First, harmful-prompt representations become more compressed. Their intra-cluster variance falls as more directions are ablated. Second, the harmful and harmless centroids move closer together. In plain terms, after the intervention, the model’s internal representations of harmful prompts look less distinct from those of harmless prompts.

That is a revealing mechanism. The method does not merely silence a refusal phrase at the surface. It shifts the geometry of the internal representation so that harmful prompts occupy a less refusal-specific region. The model becomes less able, internally, to preserve the distinction that would normally support refusal.

The appendix extends this analysis across all tested models. The authors also report negative Pearson correlations between ASR and harmful-cluster variance across ablation levels for each model. The correlations are strong in their reported table, but the authors correctly treat this as preliminary because each correlation uses only a small number of ablation stages.

That is exactly the sort of result that should influence product thinking without being oversold. It suggests a diagnostic direction: representation compression may be a useful signal when evaluating whether a refusal mechanism is being degraded. It does not yet establish a general law of refusal failure.
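The two geometric signals discussed in this section are simple to compute. The sketch below is a toy version on synthetic data: the metric definitions are reasonable readings of "intra-cluster variance" and "centroid distance" rather than the paper's exact formulas, the ablated directions are coordinate axes for simplicity, and the correlation is taken against the number of ablated directions because no ASR values are invented here.

```python
import numpy as np

def intra_cluster_variance(points):
    """Mean squared distance of points from their own centroid."""
    centroid = points.mean(axis=0)
    return float(np.mean(np.sum((points - centroid) ** 2, axis=1)))

def centroid_distance(a, b):
    """Euclidean distance between the centroids of two point sets."""
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    return float(np.corrcoef(x, y)[0, 1])

# Synthetic stand-ins for harmless and harmful representations (hypothetical).
rng = np.random.default_rng(2)
d = 6
harmless = rng.normal(size=(300, d))
harmful = rng.normal(scale=2.0, size=(300, d)) + 4.0

ks = list(range(4))
variances, distances = [], []
for k in ks:
    keep = np.ones(d)
    keep[:k] = 0.0                       # ablating axis-aligned directions = zeroing coordinates
    variances.append(intra_cluster_variance(harmful * keep))
    distances.append(centroid_distance(harmful * keep, harmless * keep))

corr = pearson(ks, variances)            # negative: variance falls as more is ablated
```

In this toy setting the compression is guaranteed by construction, since projecting out directions can only shrink distances. The interesting empirical question, which the paper addresses and this sketch does not, is whether that compression tracks refusal failure in a real model.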

The appendix mostly tests robustness, not a second thesis

The appendix is useful because it clarifies which parts of the paper are central and which are supporting structure.

Paper component	Likely purpose	What it supports	What it does not prove
Table 1, comparison with SD, RDO, GCG, SAA	Main evidence	MD is more effective than tested baselines across the evaluated models	Universal superiority across all architectures, deployments, and defences
MD-2 to MD-7 results	Main evidence plus sensitivity test	Multiple directions usually improve refusal suppression	That adding directions is always beneficial
SOM visualisations over harmful representations	Mechanistic evidence	SOM neurons cover structured regions of harmful-prompt activation space	That the true refusal manifold has been fully recovered
Compression and centroid-shift analysis	Mechanistic evidence	Ablation changes internal geometry in a way consistent with refusal suppression	A causal theory complete enough for certified defence
SOM grid and Bayesian optimisation details	Implementation detail	Search feasibility and design trade-offs	That the chosen hyperparameters are optimal in general
Extra PCA plots and cosine-similarity matrices	Robustness and exploratory extension	Similar patterns appear across several model families	That all future or proprietary systems share the same geometry
SOM versus dataset-category centroids	Comparison with a simpler alternative	Latent-space SOM groupings can be more informative than prompt-label categories	That SOMs are the only viable manifold-mapping method

This matters because an article can easily over-index on the most dramatic number and miss the evidentiary architecture. The main claim is not supported by ASR alone. It is supported by the combination of higher ASR, improvement with multiple directions, latent-space visualisations, compression effects, and direction-similarity analysis.

The appendix also makes the method look more like real research and less like acronym confetti. Always appreciated.

The directions are related, not orthogonal—and that is the point

One of the more interesting findings is that the multiple directions are often moderately or strongly aligned with each other and with the single-direction baseline. At first glance, that may sound like redundancy. If the directions are similar, why not keep one?

Because similarity is not identity.

The paper’s argument is that refusal may be expressed through several related directions that cover different local facets of the same functional behaviour. Enforcing strict orthogonality, as some alternative approaches do, may accidentally discard useful directions simply because they are geometrically close to directions already found.

That is an important correction. In many machine-learning settings, orthogonality is treated as a synonym for clean decomposition. Here, clean decomposition may be the wrong aesthetic. Refusal may not divide itself into polite, independent coordinates just because researchers enjoy basis vectors.

The business analogy is customer risk segmentation. If several risk signals are correlated, you do not automatically throw them away. You ask whether each captures a slightly different local pattern. Correlation can mean redundancy. It can also mean a coherent family of indicators.

The paper suggests refusal directions are closer to the second case.
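The alignment between directions discussed in this section is usually inspected with a cosine-similarity matrix. The sketch below builds a hypothetical family of five directions that share a common component plus small individual perturbations, purely to show what "similar but not identical" looks like numerically; none of the values come from the paper.

```python
import numpy as np

def cosine_matrix(directions):
    """Pairwise cosine similarity between the rows of a (k, dim) array."""
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    return unit @ unit.T

# Hypothetical family: one shared component plus a small per-direction offset.
rng = np.random.default_rng(4)
dim, k = 32, 5
shared = np.ones(dim)
family = shared + 0.3 * rng.normal(size=(k, dim))

sims = cosine_matrix(family)
off_diag = sims[~np.eye(k, dtype=bool)]  # similarities between distinct directions
```

The off-diagonal entries are high but stay below 1. A strict orthogonalisation step would keep the shared component once and reduce the rest to residuals, which is the kind of information loss the paper's argument warns about.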

The misconception: this is not just a better jailbreak

A likely reader reaction is to classify this paper as another jailbreak attack. That is understandable. The metric is ASR. The intervention suppresses refusal. The examples show models complying after intervention. We have seen this film, and the sequel rarely improves civilisation.

But that reading is too shallow.

Prompt jailbreaks ask how to phrase an input so a model slips. This paper asks how refusal is represented internally, and whether a richer representation of that internal structure makes refusal easier to suppress. The attacker model is different. MD assumes white-box access to model activations and parameters. It is not a casual user typing clever nonsense into a chat box.

That boundary should not be softened. For hosted proprietary systems where users cannot access internals, this method does not directly translate into a normal black-box attack. The paper’s practical relevance is sharper for open-weight deployment, internal red-team labs, model vendors, AI safety teams, and enterprises fine-tuning or self-hosting models.

The point is not that every deployed chatbot is one SOM away from catastrophe. The point is that refusal mechanisms may be less atomic than some safety narratives imply.

That is enough trouble for one paper.

What this means for AI product teams

The business interpretation has three layers: what the paper directly shows, what Cognaptus infers, and what remains uncertain.

What the paper directly shows is that, in the tested setting, multi-directional refusal suppression via SOM-derived directions outperforms single-direction ablation and several baselines across eight models. It also shows internal representation changes consistent with the idea that refusal is encoded across a structured region rather than one isolated vector.

What Cognaptus infers is that AI safety evaluation should not rely on a single latent control story. If a model appears safe because one refusal direction is present, that is not enough. Teams should test whether refusal survives multi-directional interventions, layer variations, model-specific search, and distributional shifts. Representation-level red teaming should become part of serious open-weight governance, not a research luxury filed under “interesting, but Q4 is busy.”

What remains uncertain is how broadly this transfers. The paper evaluates selected open models and one defended model. It uses a specific judge model, specific harmful and harmless datasets, and a white-box intervention setting. It does not establish that all refusal behaviours in all model families are manifold-encoded in the same way. It also does not solve the defensive problem. A better attack analysis is not automatically a deployable safety layer.

Still, it gives product teams a useful warning: if your safety case depends on one internal feature, your safety case may be thinner than your slide deck.

The practical pathway: from paper result to governance checklist

The result can be translated into an operational checklist without pretending the paper is a commercial tool.

Paper result	Business interpretation	Practical action	Boundary
Refusal can be better captured by multiple related directions	Single-vector safety diagnostics may understate fragility	Add multi-directional representation tests to red-team evaluations	Requires white-box or at least deep model access
MD beats SD and RDO across tested models	More expressive internal attacks can expose hidden weaknesses	Benchmark against both prompt attacks and representation-level attacks	Results are model- and dataset-specific
Harmful representations compress and shift toward harmless ones after ablation	Internal geometry may reveal safety degradation before surface metrics do	Monitor representation shifts during fine-tuning, compression, and adaptation	Diagnostic signal needs validation across more settings
SOM directions are related, not strictly orthogonal	Safety-relevant features may be coherent families, not independent components	Avoid assuming orthogonal decomposition captures all relevant safety behaviour	SOM is one mapping method, not the final word
Search cost is non-trivial	Representation-level testing can be expensive	Prioritise high-risk models and deployment contexts	Not every application justifies full white-box analysis

This is where the paper becomes useful for executives without being watered down into “AI safety is important.” The operational lesson is not to panic. It is to stop evaluating refusal as if one average vector were the whole security model.

Why open-weight deployment deserves special attention

The paper’s ethical section notes a crucial boundary: effective exploitation presupposes full white-box access to model parameters and internal activations. That limits the immediate attack surface, but it also points directly at the open-weight ecosystem.

Open-weight models are increasingly used by companies that want cost control, customisation, data locality, or independence from API vendors. Those are legitimate reasons. They also create a governance burden. If a model can be inspected, modified, fine-tuned, quantised, merged, or served through a local stack, then refusal is no longer only a vendor-side behaviour. It becomes part of the operator’s security posture.

The paper’s method is not a one-click business risk. But it is evidence that internal refusal mechanisms can be systematically weakened with the right access. That should matter to enterprises distributing fine-tuned models, platforms hosting user-provided adapters, and organisations building internal model registries.

The relevant question is not “can users jailbreak our chatbot with this paper?” The better question is “what happens when model weights, adapters, or activations become part of the supply chain?”

That is the enterprise version of the problem. Less theatrical, more expensive.

Boundaries that should shape the conclusion

The paper has clear limitations, and they are not cosmetic.

First, the method depends on white-box access. That makes it most relevant to open models, internal evaluation, and adversaries with model access—not ordinary black-box prompting.

Second, the search process has cost. The authors report that Bayesian optimisation over the validation set took substantial GPU time, while SOM training and direction computation were comparatively cheap. The bottleneck is not discovering candidate directions; it is finding effective subsets. Scaling this to larger models, more layers, or richer direction families is not trivial.

Third, both SD and MD compute directions at a selected layer but then apply ablation uniformly across layers. The authors acknowledge that this may miss layer-specific variation in refusal encoding. A future method could be more precise—and possibly more dangerous or more useful, depending on one’s optimism budget.

Fourth, the method uses a single harmless centroid. The authors justify this by observing that harmless representations appear more homogeneous and by noting the search complexity that would follow from using a separate SOM for harmless prompts. Reasonable, but still a modelling choice.

Fifth, the evaluation relies on benchmark prompts and an automated judge. That is standard practice, but not a complete substitute for broader safety evaluation, human review, or deployment-specific testing.

These limitations do not weaken the core contribution. They define its perimeter. The paper is not a universal theory of refusal. It is a strong argument that the single-direction account is too narrow for the tested models and that manifold-aware analysis exposes vulnerabilities the simpler view misses.

The real lesson: safety features can be structured, not simple

The useful takeaway is not that Self-Organizing Maps are now the official mascot of AI safety. Please, no mascots.

The useful takeaway is that refusal is beginning to look like the kind of internal behaviour that cannot be safely reduced to one feature without loss. A single direction may capture an average tendency. A set of related directions may capture the operational mechanism more faithfully. That difference changes how teams should test, defend, and reason about model behaviour.

For researchers, the paper pushes refusal analysis toward manifold-level interpretability. For red teams, it expands the attack surface from prompts to representations. For businesses, it says that model safety claims should be interrogated below the interface layer, especially when using open-weight systems.

The interface says “I can’t help with that.” The representation space may be saying something more complicated.

And if the internal story is complicated, then the governance story had better stop pretending otherwise.

Cognaptus: Automate the Present, Incubate the Future.

¹ Giorgio Piras, Raffaele Mura, Fabio Brau, Luca Oneto, Fabio Roli, and Battista Biggio, “SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models,” arXiv:2511.08379, 2025. https://arxiv.org/html/2511.08379
