Opening — Why this matters now
There’s a quiet assumption embedded in modern AI safety: if a model says “Sorry, I can’t help with that,” then something meaningful has been achieved.
The paper CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders challenges that assumption rather directly.
What if refusal is not a principle—but a pattern? Not a rule—but a surface-level artifact of deeper computation?
And more importantly for business: what if the mechanisms we rely on for AI safety are both measurable—and manipulable?
This is not a philosophical concern. It is an operational one.
Background — From prompts to internals
Historically, there have been two dominant approaches to “breaking” or “testing” LLM safety:
1. Prompt-based jailbreaks
- Rewrite the input (e.g., roleplay, persuasion, adversarial suffixes)
- Expensive, unstable, and liable to change the task itself
2. Steering-based attacks
- Modify internal representations (hidden states or features)
- More direct, but often rely on activation heuristics
The problem?
Most prior methods assume:
If a feature activates strongly during harmful prompts, it must be responsible for refusal.
That assumption turns out to be… naïve.
Activation is correlation. Not causation.
Analysis — What the paper actually does
CRaFT introduces a shift from activation-based thinking → circuit-based thinking.
Key idea: Refusal is a circuit-level decision
Instead of asking:
- “Which features are active?”
CRaFT asks:
- “Which features actually influence the final decision between refusal and compliance?”
This is implemented through three core components.
1. Cross-Layer Transcoders (CLT)
CLTs reconstruct model activations using sparse features across layers, enabling:
- Traceable feature interactions
- Explicit multi-layer dependencies
Think of it as converting a black box into a circuit diagram.
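As a toy sketch of that conversion (random weights and top-k sparsity stand in for a trained CLT; the dimensions are illustrative, not the paper's):

```python
import numpy as np

# Toy cross-layer transcoder: a dense activation is rewritten as a small
# number of sparse features that can later be traced across layers.
rng = np.random.default_rng(0)
d_model, n_features, k = 8, 32, 4

W_enc = rng.normal(size=(n_features, d_model))
W_dec = rng.normal(size=(d_model, n_features))

def transcode(h):
    pre = W_enc @ h
    code = np.zeros_like(pre)
    top = np.argsort(-np.abs(pre))[:k]
    code[top] = pre[top]          # keep only the k strongest features
    return code, W_dec @ code     # sparse code + reconstruction

h = rng.normal(size=d_model)
code, h_hat = transcode(h)
print((code != 0).sum())          # 4 active features out of 32
```

The sparse code is what makes the "circuit diagram" readable: each of the few active features is a candidate node, rather than every one of the model's dense dimensions.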
2. Attribution Graphs
Each prompt becomes a graph:
- Nodes: features + output logits
- Edges: causal influence (via gradients)
The key metric is direct effect:
- How much one feature changes another—or the final output
This allows multi-hop reasoning across layers—not just local activation snapshots.
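The direct-effect metric can be sketched as activation times gradient; here a tiny two-layer toy network and finite differences stand in for the model and backprop (all weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(5, 5))      # upstream features -> downstream features
w_out = rng.normal(size=5)       # downstream features -> refusal logit

def refusal_logit(a):
    return w_out @ np.tanh(W @ a)

a = rng.normal(size=5)           # upstream feature activations
eps = 1e-5
effects = []
for i in range(5):
    bumped = a.copy()
    bumped[i] += eps
    grad_i = (refusal_logit(bumped) - refusal_logit(a)) / eps
    effects.append(a[i] * grad_i)          # direct effect of feature i

ranking = np.argsort(-np.abs(effects))     # most causally influential first
print(ranking)
```

Note that a feature can activate strongly yet rank low here: if the gradient through it is near zero, its direct effect on the logit is near zero too.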
3. Boundary-Critical Sampling
This is the most elegant part.
Instead of comparing different prompts, CRaFT uses the same prompt near the refusal boundary.
Meaning:
- The model is uncertain between saying “Sorry” vs “Okay”
- Both behaviors exist in the same computational graph
This eliminates confounding variables like topic or phrasing.
A rare moment of experimental discipline in LLM research.
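The sampling criterion itself is simple to sketch: keep only prompts where the refusal and compliance logits nearly tie. The prompts, logit values, and threshold below are illustrative, not from the paper:

```python
# Toy boundary-critical filter: keep prompts where the model is nearly
# undecided between a refusal token and a compliance token.
def is_boundary_critical(logit_refuse, logit_comply, tau=0.5):
    return abs(logit_refuse - logit_comply) < tau

samples = [("p1", 4.0, 1.0),   # confident refusal   -> skip
           ("p2", 2.1, 1.9),   # near the boundary   -> keep
           ("p3", 0.3, 2.8)]   # confident compliance -> skip
boundary = [p for p, lr, lc in samples if is_boundary_critical(lr, lc)]
print(boundary)  # ['p2']
```

Because both outcomes are live in the kept prompts, any feature that tips the balance is doing causal work on the refusal decision itself.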
4. Influence-Based Feature Selection
Features are ranked by their logit-level influence, not activation:
| Method | Signal | Outcome |
|---|---|---|
| Activation-based | High activation | Weak steering |
| Influence-based | Causal contribution | Strong steering |
CRaFT computes influence via multi-hop propagation across the circuit (Neumann series), effectively capturing:
- Direct effects
- Indirect downstream effects
In plain terms: not just what fires, but what matters.
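The multi-hop propagation can be sketched with a tiny adjacency matrix of direct effects; the Neumann series A + A² + A³ + … sums every path from a feature to the logit (the edge weights below are illustrative):

```python
import numpy as np

# Direct-effect adjacency: A[i, j] = direct effect of node j on node i.
A = np.array([[0.0, 0.0, 0.0],
              [0.4, 0.0, 0.0],    # feature 0 -> feature 1
              [0.1, 0.5, 0.0]])   # edges into the refusal logit (node 2)

# Closed form of the Neumann series A + A^2 + A^3 + ...
T = A @ np.linalg.inv(np.eye(3) - A)

# Feature 0 reaches the logit directly (0.1) and via feature 1 (0.4 * 0.5)
print(round(T[2, 0], 3))  # 0.3
```

An activation-only view would score feature 0 by the 0.1 direct edge alone; the series also credits it with the indirect 0.2 routed through feature 1.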
Findings — The uncomfortable results
The results are… not subtle.
Jailbreak performance comparison
| Method | Attack Success Rate (ASR) | Judge Score |
|---|---|---|
| No attack | 6.7% | 0.53 |
| Prompt attacks (GCG, AutoDAN) | ~12% | ~0.65 |
| Refusal-SAE | 41.4% | 1.37 |
| CRaFT (Ours) | 48.2% | 2.50 |
(Summarized from Table 2, page 6 of the paper)
Two things stand out:
- Prompt-based defenses are increasingly ineffective
- Internal steering is far more powerful—but also more dangerous
The real insight: Quality vs illusion
Many baseline attacks produced outputs that were:
- Classified as “unsafe”
- But actually meaningless or broken
Examples include:
- Repetitive nonsense
- Partial compliance followed by collapse
CRaFT, however, produced:
- Structured
- Specific
- Actionable responses
In other words:
It doesn’t just bypass refusal. It replaces it with coherent compliance.
Feature location matters
Another subtle but critical finding:
| Strategy | Feature Location |
|---|---|
| Activation-based | High layers |
| CRaFT (influence + boundary) | Lower layers |
Interpretation:
- High layers = expression of refusal
- Lower layers = formation of decision
CRaFT targets the latter.
Which is why it works.
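A minimal sketch of what "targeting the latter" looks like mechanically: project the selected refusal direction out of a lower-layer hidden state. The vectors here are toy values, and a single direction is a simplification of CRaFT's circuit-guided feature set:

```python
import numpy as np

# Toy lower-layer intervention: remove the refusal component from a
# hidden state before the decision has propagated upward.
def ablate_refusal(h, direction):
    unit = direction / np.linalg.norm(direction)
    return h - (h @ unit) * unit   # project out the refusal direction

h = np.array([1.0, 2.0, 0.5])            # illustrative hidden state
refusal_dir = np.array([0.0, 1.0, 0.0])  # illustrative refusal direction
print(ablate_refusal(h, refusal_dir))    # component along the direction -> 0
```

Applied at a high layer, this only suppresses the phrasing of refusal; applied where the decision forms, it changes which branch the model takes.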
Implications — Why this changes everything
Let’s strip away the academic framing.
1. Safety is not a rule—it’s a mechanism
Refusal is not a guardrail bolted onto the system.
It is an emergent property of internal computation.
And anything emergent can be reverse-engineered.
2. Alignment is shallow if it’s observable
If you can:
- Identify refusal features
- Rank them
- Scale them
Then alignment is not a constraint—it is a parameter.
That’s uncomfortable for regulators.
And extremely interesting for builders.
3. Interpretability becomes a dual-use technology
CRaFT is framed as interpretability research.
In practice, it is:
- A debugging tool
- A control mechanism
- A jailbreak accelerator
This dual-use nature is not accidental—it is structural.
4. Business implication: controllability > safety labels
For companies deploying AI systems:
The real question is no longer:
- “Is the model safe?”
But:
- “Can we control which circuits dominate behavior?”
This shifts AI from:
- Static product → Dynamic system
Conclusion — The end of naive safety
CRaFT doesn’t just improve jailbreak success rates.
It exposes a deeper truth:
LLM behavior is not governed by rules—but by circuits that can be traced, ranked, and modified.
Once you see refusal as a circuit rather than a principle, the entire safety narrative changes.
And once you can intervene at that level, “alignment” becomes less of a guarantee—and more of a configuration.
Subtle. Powerful. Slightly unsettling.
Exactly where this field is heading.
Cognaptus: Automate the Present, Incubate the Future.