Opening — Why this matters now

There’s a quiet assumption embedded in modern AI safety: if a model says “Sorry, I can’t help with that,” then something meaningful has been achieved.

The paper “CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders” challenges that assumption rather directly.

What if refusal is not a principle—but a pattern? Not a rule—but a surface-level artifact of deeper computation?

And more importantly for business: what if the mechanisms we rely on for AI safety are both measurable—and manipulable?

This is not a philosophical concern. It is an operational one.


Background — From prompts to internals

Historically, there have been two dominant approaches to “breaking” or “testing” LLM safety:

1. Prompt-based jailbreaks

  • Rewrite the input (e.g., roleplay, persuasion, adversarial suffixes)
  • Expensive, unstable, and prone to changing the task itself

2. Steering-based attacks

  • Modify internal representations (hidden states or features)
  • More direct, but often rely on activation heuristics

The problem?

Most prior methods assume:

If a feature activates strongly during harmful prompts, it must be responsible for refusal.

That assumption turns out to be… naïve.

Activation is correlation. Not causation.


Analysis — What the paper actually does

CRaFT introduces a shift from activation-based thinking → circuit-based thinking.

Key idea: Refusal is a circuit-level decision

Instead of asking:

  • “Which features are active?”

CRaFT asks:

  • “Which features actually influence the final decision between refusal and compliance?”

This is implemented through three core components.


1. Cross-Layer Transcoders (CLT)

CLTs reconstruct model activations using sparse features across layers, enabling:

  • Traceable feature interactions
  • Explicit multi-layer dependencies

Think of it as converting a black box into a circuit diagram.
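As a sketch of what “sparse features across layers” means in code: a CLT pairs a per-layer sparse encoder with decoders that write into every later layer. The toy below uses random placeholder weights and made-up dimensions, not the paper’s trained transcoder; it only illustrates the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features, n_layers = 16, 64, 4

# Hypothetical CLT parameters: one sparse encoder per layer, plus decoders
# from each layer's features into every layer (the "cross-layer" part).
W_enc = rng.normal(size=(n_layers, d_model, n_features)) * 0.1
W_dec = rng.normal(size=(n_layers, n_layers, n_features, d_model)) * 0.1

def clt_features(h, layer, k=8):
    """Encode layer-`layer` activations into a sparse (top-k) feature vector."""
    acts = np.maximum(h @ W_enc[layer], 0.0)   # ReLU pre-activations
    top = np.argsort(acts)[-k:]                # keep only the k strongest
    sparse = np.zeros_like(acts)
    sparse[top] = acts[top]
    return sparse

def reconstruct(h, src_layer, dst_layer):
    """Contribution of src-layer features to a later layer's activations."""
    f = clt_features(h, src_layer)
    return f @ W_dec[src_layer, dst_layer]

h = rng.normal(size=d_model)                   # one token's residual state
out = reconstruct(h, src_layer=0, dst_layer=2)
print(out.shape)  # (16,)
```

The cross-layer decoders `W_dec[src, dst]` are what make interactions traceable: a feature extracted at layer 0 has an explicit, inspectable contribution at layer 2, rather than an opaque one mediated by dense weights.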


2. Attribution Graphs

Each prompt becomes a graph:

  • Nodes: features + output logits
  • Edges: causal influence (via gradients)

The key metric is direct effect:

  • How much one feature changes another—or the final output

This allows multi-hop reasoning across layers—not just local activation snapshots.
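To make “edges as causal influence” concrete, here is a toy attribution graph containing only the final hop, feature to decision logits, where each edge weight is gradient times activation. The weights and activations are invented for illustration; in the paper these come from the CLT and backpropagation through the real model.

```python
import numpy as np

rng = np.random.default_rng(1)
n_feat = 5

# Toy circuit: features feed two output logits through hypothetical
# linear weights (stand-ins for the learned circuit's last hop).
W_logit = rng.normal(size=(n_feat, 2))   # columns: [refuse, comply]
acts = np.abs(rng.normal(size=n_feat))   # feature activations for one prompt

# Direct effect of feature i on the refusal-vs-compliance decision:
# gradient of the logit difference w.r.t. the feature, times its activation.
grad = W_logit[:, 0] - W_logit[:, 1]
direct_effect = grad * acts

# Edges of the attribution graph, sorted by magnitude of influence.
edges = sorted(enumerate(direct_effect), key=lambda e: -abs(e[1]))
for i, w in edges:
    print(f"feature {i} -> decision: {w:+.3f}")
```

A positive edge pushes the model toward “Sorry”, a negative one toward “Okay”; the sign and magnitude, not the raw activation, are what the graph records.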


3. Boundary-Critical Sampling

This is the most elegant part.

Instead of comparing different prompts, CRaFT uses the same prompt near the refusal boundary.

Meaning:

  • The model is nearly indifferent between saying “Sorry” and “Okay”
  • Both behaviors exist in the same computational graph

This eliminates confounding variables like topic or phrasing.

A rare moment of experimental discipline in LLM research.
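A minimal sketch of the sampling criterion, assuming we can read off the first-token logits for “Sorry” vs “Okay” per prompt. The logits below are simulated stand-ins; only the selection rule matters.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-prompt first-token logits (simulated, not model outputs).
logit_refuse = rng.normal(loc=1.0, size=100)
logit_comply = rng.normal(loc=0.8, size=100)

gap = logit_refuse - logit_comply
tau = 0.1  # boundary threshold: model is nearly indifferent

# Boundary-critical prompts: refusal and compliance coexist in one graph.
boundary_idx = np.where(np.abs(gap) < tau)[0]
print(f"{len(boundary_idx)} of 100 prompts sit near the refusal boundary")
```

Because both behaviors are live on the same prompt, any feature that tips the gap is implicated in the decision itself, not in topic or phrasing.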


4. Influence-Based Feature Selection

Features are ranked by their logit-level influence, not activation:

Method             Signal                Outcome
Activation-based   High activation       Weak steering
Influence-based    Causal contribution   Strong steering

CRaFT computes influence via multi-hop propagation across the circuit (Neumann series), effectively capturing:

  • Direct effects
  • Indirect downstream effects

In plain terms: not just what fires, but what matters.
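The multi-hop propagation can be sketched with a truncated Neumann series over a toy direct-effect matrix. Here `A` and `b` are random stand-ins for the attribution graph’s measured edges, not values from the paper; the point is that summing the series recovers the closed form (I − Aᵀ)⁻¹ b.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6

# A[i, j]: direct effect of feature j on feature i (toy circuit adjacency).
A = rng.normal(size=(n, n))
A = A / (2 * np.linalg.norm(A, 2))   # scale so the Neumann series converges
b = rng.normal(size=n)               # direct effect of each feature on the logit

# Total influence of feature j = direct effect plus every multi-hop path:
#   b + A^T b + (A^T)^2 b + ... = (I - A^T)^{-1} b   (Neumann series)
influence = b.copy()
term = b.copy()
for _ in range(30):                  # truncated series; 30 hops suffices here
    term = A.T @ term
    influence += term

closed_form = np.linalg.solve(np.eye(n) - A.T, b)

# Rank features by total causal influence, not raw activation.
ranking = np.argsort(-np.abs(influence))
print(ranking)
```

Selecting the top of `ranking` is the influence-based analogue of picking highly activating features, except that features whose effect is only indirect, routed through other features, are credited correctly.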


Findings — The uncomfortable results

The results are… not subtle.

Jailbreak performance comparison

Method                          Attack Success Rate (ASR)   Judge Score
No attack                       6.7%                        0.53
Prompt attacks (GCG, AutoDAN)   ~12%                        ~0.65
Refusal-SAE                     41.4%                       1.37
CRaFT (Ours)                    48.2%                       2.50

(Summarized from Table 2, page 6 of the paper)

Two things stand out:

  1. Prompt-based defenses are increasingly ineffective
  2. Internal steering is far more powerful—but also more dangerous

The real insight: Quality vs illusion

Many baseline attacks produced outputs that were:

  • Classified as “unsafe”
  • But actually meaningless or broken

Examples include:

  • Repetitive nonsense
  • Partial compliance followed by collapse

CRaFT, however, produced:

  • Structured
  • Specific
  • Actionable responses

In other words:

It doesn’t just bypass refusal. It replaces it with coherent compliance.


Feature location matters

Another subtle but critical finding:

Strategy                       Feature Location
Activation-based               High layers
CRaFT (influence + boundary)   Lower layers

Interpretation:

  • High layers = expression of refusal
  • Lower layers = formation of decision

CRaFT targets the latter.

Which is why it works.


Implications — Why this changes everything

Let’s strip away the academic framing.

1. Safety is not a rule—it’s a mechanism

Refusal is not a guardrail bolted onto the system.

It is an emergent property of internal computation.

And anything emergent can be reverse-engineered.


2. Alignment is shallow if it’s observable

If you can:

  • Identify refusal features
  • Rank them
  • Scale them

Then alignment is not a constraint—it is a parameter.

That’s uncomfortable for regulators.

And extremely interesting for builders.


3. Interpretability becomes a dual-use technology

CRaFT is framed as interpretability research.

In practice, it is:

  • A debugging tool
  • A control mechanism
  • A jailbreak accelerator

This dual-use nature is not accidental—it is structural.


4. Business implication: controllability > safety labels

For companies deploying AI systems:

The real question is no longer:

  • “Is the model safe?”

But:

  • “Can we control which circuits dominate behavior?”

This shifts AI from:

  • Static product → Dynamic system

Conclusion — The end of naive safety

CRaFT doesn’t just improve jailbreak success rates.

It exposes a deeper truth:

LLM behavior is not governed by rules—but by circuits that can be traced, ranked, and modified.

Once you see refusal as a circuit rather than a principle, the entire safety narrative changes.

And once you can intervene at that level, “alignment” becomes less of a guarantee—and more of a configuration.

Subtle. Powerful. Slightly unsettling.

Exactly where this field is heading.


Cognaptus: Automate the Present, Incubate the Future.