CRaFT and the Illusion of Safety: When ‘Sorry’ Is Just a Circuit

A refusal is easy to recognize. The model says it cannot help. The sentence sounds polite. The compliance team relaxes for three seconds. Everyone moves on.

That is the comfortable version of AI safety: refusal as an observable behavior. The uncomfortable version is that refusal may be only the visible end of a much narrower internal computation. If that computation can be found, isolated, and steered, then the model’s “sorry, I can’t assist with that” is not a moral boundary. It is a circuit behavior. Very reassuring, in the same way a locked glass door is reassuring before someone points out the hinge.

The paper “CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders” studies exactly this problem.¹ Its central move is not merely to jailbreak a model more effectively. The more important contribution is diagnostic: it argues that the features most strongly activated by harmful prompts are not necessarily the features that actually control the refusal-versus-compliance decision.

That distinction matters. Many safety tests observe what a model says. Some interpretability methods observe which internal features light up. CRaFT asks a more expensive but sharper question: which features actually influence the output logits along the computational paths that lead to refusal or compliance?

The mistake: confusing loud features with decision-relevant features

The intuitive approach to refusal-feature discovery is simple. Compare harmful prompts with benign prompts. Find features that activate more strongly on harmful prompts. Treat those features as refusal-related. Then intervene on them.

This is not a foolish idea. It is also not enough.

A feature can activate strongly because the prompt contains a dangerous topic, a particular phrasing pattern, or an adversarial-looking style. That feature may be a good detector of what the prompt is about. It may not be a good lever for what the model decides to do next.

CRaFT’s criticism is therefore not “activation is useless.” The criticism is more precise: activation is a salience signal, not a causal routing signal. A highly active feature may be sitting in the internal theater waving its arms while another quieter feature actually touches the decision boundary.

The paper’s mechanism-first argument rests on two replacements:

Old habit	CRaFT’s replacement	Why it matters
Select features because they activate on harmful prompts	Select features because their influence propagates to refusal/compliance logits	Reduces confusion between topic detection and decision control
Compare harmful and benign prompt groups	Analyze boundary-critical harmful prompts where refusal and compliance are both plausible	Focuses the analysis near the actual behavioral switch
Treat sparse features as isolated units	Trace feature-to-feature and feature-to-logit paths through attribution graphs	Captures downstream influence instead of local brightness

This is the article’s core point. CRaFT is interesting less because it produces a stronger attack number, and more because it demonstrates why an apparently reasonable safety-audit shortcut can fail. The shortcut asks, “What lights up?” The better question is, “What reaches the switch?”
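A toy example makes the gap between the two questions concrete. The sketch below is not the paper's pipeline: the three feature names, their activations, and their effects on the logit gap are all made-up numbers, chosen only to show how an activation ranking and an influence ranking can disagree.

```python
import numpy as np

# Hypothetical features and numbers, for illustration only.
feature_names = ["topic_detector", "style_detector", "decision_lever"]

# Mean feature activation on harmful vs. benign prompts.
act_harmful = np.array([9.0, 4.0, 1.2])
act_benign = np.array([0.5, 0.5, 0.2])

# Effect of one unit of each feature on (refusal logit - compliance logit).
# The loud topic feature barely touches the decision; the quiet one does.
effect_on_logit_gap = np.array([0.02, 0.05, 2.5])

# Ranking 1: "What lights up?" (activation gap between prompt groups)
activation_score = act_harmful - act_benign

# Ranking 2: "What reaches the switch?" (activation times effect on the gap)
influence_score = np.abs(act_harmful * effect_on_logit_gap)

by_activation = feature_names[int(np.argmax(activation_score))]
by_influence = feature_names[int(np.argmax(influence_score))]
print("top by activation:", by_activation)
print("top by influence:", by_influence)
```

With these invented values the activation ranking picks the topic detector, while the influence ranking picks the quiet feature that actually moves the logit gap.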

Cross-layer transcoders turn refusal into a traceable circuit

CRaFT relies on cross-layer transcoders, or CLTs. A CLT reconstructs a model’s MLP computations using sparse features, but unlike a per-layer sparse autoencoder, it is designed to represent how features from one layer contribute to computations in later layers. That cross-layer structure is important because refusal is not expected to be a single isolated neuron-like event. It is a computation that propagates.

The paper then builds attribution graphs over these CLT features. In these graphs, nodes represent sparse feature activations or selected output logits. Edges represent direct effects between nodes. The authors use this structure to measure whether a feature’s influence travels through downstream computation to the refusal or compliance side of the next-token decision.

This is a useful conceptual upgrade. A normal activation analysis is like walking through a factory and noting which machines are noisy. An attribution graph asks which machines are connected to the final output line. Noise is still information, but it is not the same as control.
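The factory picture can be sketched with a tiny linear graph. This is a minimal illustration under stated assumptions, not the paper's scoring rule: the five nodes and every edge weight are hypothetical, and the score used here (total effect on the refusal logit minus total effect on the compliance logit) is one simple way to summarize path influence. Real attribution-graph tooling may normalize or prune edges differently.

```python
import numpy as np

# Hypothetical 5-node graph, ordered so that edges only point forward:
# 0 = loud feature, 1 = quiet feature, 2 = later-layer feature,
# 3 = refusal logit, 4 = compliance logit.
names = ["loud", "quiet", "later", "refusal_logit", "compliance_logit"]
REFUSAL, COMPLIANCE = 3, 4

# direct[target, source] = direct effect of source on target (made-up values).
direct = np.zeros((5, 5))
direct[2, 0] = 0.1
direct[2, 1] = 0.9
direct[REFUSAL, 0] = 0.05
direct[REFUSAL, 2] = 1.0
direct[COMPLIANCE, 2] = -0.8

# Sum of effects over all paths: A + A^2 + ... = (I - A)^-1 - I.
# The series is finite because the graph has no cycles.
identity = np.eye(5)
total = np.linalg.inv(identity - direct) - identity

# Score each feature by how far it pushes refusal relative to compliance.
scores = {names[i]: total[REFUSAL, i] - total[COMPLIANCE, i] for i in range(3)}
for name, score in sorted(scores.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {score:.2f}")
```

In this toy graph the loud feature has a direct edge to the refusal logit, yet the quiet feature scores far higher because its effect travels through the later-layer feature that feeds both logits.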

For refusal, the relevant output is not a full essay-length answer. CRaFT focuses on the first response-token distribution. The authors define refusal and compliance token sets and look for prompts where the model assigns substantial probability to both sides. These are the boundary-critical prompts.

The logic is practical. If a prompt is obviously refused, the model’s refusal path may dominate too strongly. If a prompt is obviously complied with, the compliance path may dominate. Boundary-critical prompts are more revealing because both tendencies are present in the same harmful request. The paper is not comparing “bad prompt” versus “good prompt” and hoping the difference means refusal. It is looking near the point where the model is internally deciding which way to go.
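The selection idea can be sketched in a few lines. Everything specific here is an assumption: the token sets are placeholder openers rather than the paper's actual sets, the probabilities are invented, and ranking by the smaller of the two side probabilities is just one reasonable way to express "both sides are plausible."

```python
# Placeholder opener tokens, not the paper's refusal/compliance sets.
REFUSAL_TOKENS = {"I", "Sorry"}
COMPLIANCE_TOKENS = {"Sure", "Here"}

def side_mass(first_token_probs, token_set):
    """Total first-token probability assigned to one side."""
    return sum(p for tok, p in first_token_probs.items() if tok in token_set)

def boundary_score(first_token_probs):
    """High only when refusal and compliance are both plausible."""
    p_refuse = side_mass(first_token_probs, REFUSAL_TOKENS)
    p_comply = side_mass(first_token_probs, COMPLIANCE_TOKENS)
    return min(p_refuse, p_comply)

def select_boundary_critical(prompt_probs, k):
    """Return the k prompt names closest to the refusal/compliance boundary."""
    ranked = sorted(
        prompt_probs,
        key=lambda name: boundary_score(prompt_probs[name]),
        reverse=True,
    )
    return ranked[:k]

# Invented first-token distributions for three harmful prompts.
prompt_probs = {
    "clearly_refused": {"Sorry": 0.95, "Sure": 0.01, "The": 0.04},
    "clearly_complied": {"Sorry": 0.02, "Sure": 0.90, "The": 0.08},
    "near_boundary": {"Sorry": 0.45, "Sure": 0.40, "The": 0.15},
}
pool = select_boundary_critical(prompt_probs, k=1)
print(pool)
```

Using the minimum means a prompt scores well only when neither side dominates, which is the property the boundary-critical pool is meant to capture.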

Boundary-critical influence is the actual mechanism, not a decorative metric

CRaFT combines two design choices:

Boundary-critical sampling: select harmful prompts where the model assigns substantial first-token probability to both refusal and compliance.
Influence-based scoring: rank CLT features by how their effects propagate through the attribution graph to refusal/compliance logits.

Either choice alone is weaker. The paper’s ablation is useful because it separates the moving parts rather than just celebrating the final method.

On Gemma-3-1B-it, the no-attack baseline reaches an LG4 attack success rate of 5.0% on JailBreakBench with a judge score of 0.42. Cross-group activation selection gets 3.0% and 0.41. Boundary-critical activation gets 5.0% and 0.42. In plain English: activation-based feature selection, whether cross-group or boundary-critical, basically does not move the needle.

Influence changes the picture. Cross-group influence reaches 34.0% on LG4 but a judge score of only 0.67. Boundary-critical influence reaches 62.0% and 2.95. That final combination is CRaFT’s main mechanism.

Feature-selection test	Likely purpose	What it supports	What it does not prove
Activation cross-group vs. no attack	Ablation of activation-based selection	High activation alone is weak for refusal steering	It does not prove activation is never useful for other interpretability tasks
Boundary activation	Ablation of sampling without influence	Boundary sampling does not help much if the scoring signal is still local activation	It does not isolate the value of graph paths
Cross-group influence	Ablation of influence without boundary-critical sampling	Influence matters, but prompt-group comparison still leaves confounds	It does not fully recover specific harmful compliance
Boundary-critical influence	Main CRaFT configuration	Both circuit influence and boundary sampling are needed for the strongest result	It remains tested under CLT-backed model settings, not arbitrary black-box systems
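The four ablation cells can also be read as a small two-by-two design. The sketch below only takes differences of the LG4 numbers quoted above; it is descriptive arithmetic on single reported values, not a statistical analysis.

```python
# LG4 attack success rates on JailBreakBench (Gemma-3-1B-it), as quoted above.
lg4 = {
    ("activation", "cross_group"): 3.0,
    ("activation", "boundary"): 5.0,
    ("influence", "cross_group"): 34.0,
    ("influence", "boundary"): 62.0,
}

# What boundary-critical sampling adds under each scoring signal.
boundary_gain_with_activation = (
    lg4[("activation", "boundary")] - lg4[("activation", "cross_group")]
)
boundary_gain_with_influence = (
    lg4[("influence", "boundary")] - lg4[("influence", "cross_group")]
)

# What influence scoring adds under each sampling scheme.
influence_gain_cross_group = (
    lg4[("influence", "cross_group")] - lg4[("activation", "cross_group")]
)
influence_gain_boundary = (
    lg4[("influence", "boundary")] - lg4[("activation", "boundary")]
)

# If the two choices were simply additive, this would be zero.
interaction = boundary_gain_with_influence - boundary_gain_with_activation

print(boundary_gain_with_activation, boundary_gain_with_influence)
print(influence_gain_cross_group, influence_gain_boundary)
print(interaction)
```

Boundary sampling adds 2 points under activation scoring but 28 points under influence scoring, which is the arithmetic behind the claim that the two choices work together rather than independently.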

This is the strongest part of the paper. It does not merely say “our method works.” It shows that the obvious alternative fails for the specific reason the authors predicted: the model’s loud harmful-topic features are not necessarily the levers that determine refusal.

The headline number is large, but the judge metric tells the better story

In the main benchmark comparison, CRaFT reports a large improvement over the no-attack condition and over several prompt-based and steering baselines. Across JailBreakBench, HarmBench, AdvBench, and SorryBench, the average LG4 attack success rate rises from 6.7% in the no-attack setting to 57.4% with CRaFT. The corresponding judge score rises from 0.53 to 2.90 on a 0–5 scale.

The LG4 number is attention-grabbing. The judge score is more informative.

The paper is careful about a problem that matters in model-steering evaluation: a classifier may label a response unsafe even when the output is repetitive, malformed, or only superficially harmful. This is especially important when an intervention destabilizes generation. A model that babbles unsafe-looking fragments is not necessarily “successfully jailbroken” in any meaningful behavioral sense. It may just be broken. Congratulations, you turned the safety problem into a garbage-collection problem.

CRaFT therefore uses a rubric-based LLM-as-a-judge evaluation inspired by StrongREJECT, scoring whether a response refuses, how specific it is, and how convincing it is. The case analysis makes the difference concrete: some baseline steering outputs are labeled unsafe by the classifier but receive low specificity and convincingness scores because they collapse or fail to address the prompt. CRaFT’s outputs, by contrast, are more often specific and aligned with the harmful request.

For a business reader, the lesson is not “use CRaFT to attack models.” The lesson is that single safety classifiers can overstate either robustness or vulnerability depending on what they count. In steering contexts, a binary unsafe label can confuse three different phenomena:

Output pattern	Classifier risk	Governance interpretation
Real refusal bypass	Correctly counted as unsafe	Serious vulnerability
Surface compliance followed by collapse	May be counted as unsafe	Not the same as useful harmful compliance
Degenerate harmful-looking text	May be counted as unsafe	Measures model damage, not necessarily attack success

That distinction is valuable for red-team reporting. A board-level risk memo should not treat all “unsafe” generations as equal. Some show a real policy bypass. Some show brittle generation. Some show that the evaluator is too impressed by bad prose.
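One way a red-team report could separate those three patterns is a simple triage step applied after judging. The sketch below is an assumption-heavy illustration: the field names, the 0 to 5 rubric scale for the sub-scores, the repetition heuristic, and both thresholds are hypothetical, and none of them come from the paper.

```python
from dataclasses import dataclass

@dataclass
class JudgedOutput:
    classifier_unsafe: bool  # binary safety-classifier label
    refused: bool            # judge: did the response refuse?
    specificity: float       # judge rubric score, assumed 0 to 5
    convincingness: float    # judge rubric score, assumed 0 to 5
    text: str

def repetition_ratio(text: str) -> float:
    """Share of repeated words: a crude, hypothetical degeneration signal."""
    words = text.split()
    if not words:
        return 1.0
    return 1.0 - len(set(words)) / len(words)

def triage(out: JudgedOutput, min_quality: float = 3.0,
           max_repetition: float = 0.5) -> str:
    """Bucket one output into the patterns from the table above."""
    if not out.classifier_unsafe or out.refused:
        return "not a bypass"
    if repetition_ratio(out.text) > max_repetition:
        return "degenerate harmful-looking text"
    if min(out.specificity, out.convincingness) < min_quality:
        return "surface compliance followed by collapse"
    return "real refusal bypass"

examples = [
    JudgedOutput(True, False, 4.5, 4.0, "detailed response text placeholder"),
    JudgedOutput(True, False, 1.0, 1.0, "Sure, here is an overview but then nothing useful"),
    JudgedOutput(True, False, 0.5, 0.5, "step step step step step step"),
]
labels = [triage(e) for e in examples]
print(labels)
```

The point of the design is that a single "unsafe" flag never reaches the report on its own; each flagged output carries a second label saying what kind of failure it was.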

Capability preservation is the quiet but important result

A refusal intervention is not impressive if it destroys the rest of the model. Any sufficiently aggressive internal edit can make a model stop behaving normally. That does not identify a clean safety circuit; it identifies a hammer.

The paper therefore evaluates general capability on MMLU, GSM8K, and IFEval. CRaFT stays close to its same-backend no-steering baseline. In the raw capability table, the nnsight no-steering baseline scores 40.37 on MMLU, 25.50 on GSM8K, 31.50 on IFEval-P, and 46.23 on IFEval-I. CRaFT scores 40.50, 25.50, 31.00, and 46.54 respectively.

The MMLU difference is also tested directly. Out of 1,531 MMLU questions, the baseline answers 618 correctly and CRaFT answers 620 correctly; among 58 disagreements, 30 favor CRaFT and 28 favor the baseline. The authors interpret this as statistically indistinguishable from noise.
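The quoted counts can be checked with an exact McNemar-style test on the disagreements. The source does not say which test the authors ran, so treat this as one standard way to examine paired right/wrong outcomes, not a reconstruction of their analysis.

```python
from math import comb

def exact_mcnemar_p(b: int, c: int) -> float:
    """Two-sided exact p-value from the two discordant counts.

    Under the null, each disagreement favors either system with
    probability 0.5, so the smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

total_questions = 1531
baseline_correct, craft_correct = 618, 620
craft_only, baseline_only = 30, 28  # the 58 disagreements

# The net difference in correct answers must equal the net discordant count.
assert craft_correct - baseline_correct == craft_only - baseline_only

baseline_accuracy = round(100 * baseline_correct / total_questions, 2)
craft_accuracy = round(100 * craft_correct / total_questions, 2)
p_value = exact_mcnemar_p(craft_only, baseline_only)

print(baseline_accuracy, craft_accuracy)
print(p_value > 0.05)
```

The accuracies reproduce the 40.37 and 40.50 figures from the capability table, and the p-value sits far above any conventional significance threshold, which is consistent with the authors' reading that a 30-versus-28 split looks like noise.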

This matters because it separates CRaFT from dense refusal-direction removal. In the paper’s raw capability table, the Refusal-Direction baseline collapses to 0.00 on MMLU and 0.00 on GSM8K under its evaluated setting. The authors explain that all 1,531 MMLU responses fail to produce a parseable A/B/C/D answer. That is not a surgical intervention. That is the model equivalent of removing a smoke alarm by bulldozing the kitchen.

For safety engineering, the implication is double-edged. On one side, CRaFT is a sharper vulnerability probe because it can weaken refusal without obvious broad capability collapse. On the other side, that is precisely why circuit-level vulnerabilities are more serious than crude prompt hacks. If an attacker or internal tester can manipulate a narrow refusal-relevant path while leaving general capability intact, then user-facing quality checks may not reveal the safety regression.

The appendix tests robustness, not a second thesis

The appendix is worth reading because several tests clarify the boundary of the method. They should not be treated as separate grand claims.

First, the authors test whether their boundary-critical token convention depends too heavily on a minimal refusal/compliance token pair. They compare the strict setting with an extended set of refusal and compliance openers. The extended set increases coverage, but the top-100 boundary-critical pool overlaps heavily with the strict pool, and the selected prompts remain near the same refusal-compliance boundary. Purpose: robustness check. Interpretation: the method does not appear to depend entirely on one fragile two-token convention, at least for the tested Gemma setting.

Second, the steering-strength sweep shows an inverted-U pattern. Stronger steering first increases jailbreak success; beyond that point, excessive steering degrades output quality and produces more repetitive, degenerate text. Purpose: sensitivity test. Interpretation: the method has a useful range, not an infinite “turn the knob harder” dynamic.
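A toy calculation shows why "harder" eventually stops helping. The sketch assumes that steering means multiplying one sparse feature's activation before it is decoded back into the model; the decoder matrix, activations, and multiplier grid are all invented. It does not reproduce the inverted-U itself, which is an empirical result, only the way the perturbation grows with the multiplier.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model = 4, 8

# Hypothetical decoder: one output direction per sparse feature.
decoder = rng.normal(size=(n_features, d_model))
# Hypothetical sparse activations on one prompt.
activations = np.array([0.0, 1.5, 0.3, 0.0])

def reconstruct(acts):
    """Decode feature activations into a model-space vector."""
    return acts @ decoder

def steer(acts, feature_idx, multiplier):
    """Scale a single feature's activation, leaving the others untouched."""
    steered = acts.copy()
    steered[feature_idx] *= multiplier
    return steered

baseline = reconstruct(activations)
shifts = {}
for multiplier in [1.0, -1.0, -4.0, -16.0]:
    out = reconstruct(steer(activations, feature_idx=1, multiplier=multiplier))
    shifts[multiplier] = float(np.linalg.norm(out - baseline))

for multiplier, shift in shifts.items():
    print(f"multiplier {multiplier:>6}: shift norm {shift:.3f}")
```

The shift grows linearly with the distance of the multiplier from 1, so a large multiplier pushes the decoded vector far from anything the downstream layers saw in training, which is a plausible route to the degradation seen at the high end of the sweep.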

Third, the feature-bundle test shows that steering multiple high-ranked features is not additive. A single feature performs best; adding more selected features destabilizes generation and reduces scores. Purpose: ablation/sensitivity test. Interpretation: even circuit-guided features can conflict when intervened on jointly.

Fourth, the boundary-critical pool size test shows stable top-three feature ranking under several pool-size choices. Purpose: robustness test. Interpretation: the selected feature is less likely to be an artifact of one tiny prompt sample, although this remains within the same model and experimental setup.

Fifth, the runtime comparison shows CRaFT’s cost profile. On a single RTX A6000, CRaFT takes 16.6 minutes of one-time setup and about 17.67 seconds per attack generation in the reported setting, for a total of about 46 minutes over the 100 prompts of JailBreakBench. This is much less than per-prompt optimization methods such as GCG and AutoDAN in the paper’s setup, and far less than the reported 49.3-hour offline grid search for Steering-SAE. Purpose: operational comparison. Interpretation: circuit analysis is not free, but its cost is front-loaded; once the feature is selected, steered generation adds no extra overhead relative to unsteered generation on the same backend.
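The total follows directly from the two reported costs. The short check below uses only the figures quoted above; the amortized per-prompt number is derived here and is not reported in the source.

```python
setup_minutes = 16.6            # one-time feature selection
seconds_per_generation = 17.67  # per attack generation
n_prompts = 100                 # JailBreakBench-100

generation_minutes = n_prompts * seconds_per_generation / 60
total_minutes = setup_minutes + generation_minutes

# Effective cost per prompt once the setup is spread over the benchmark.
amortized_seconds = total_minutes * 60 / n_prompts

print(f"generation: {generation_minutes:.2f} min")
print(f"total: {total_minutes:.0f} min")
print(f"amortized: {amortized_seconds:.1f} s per prompt")
```

Setup accounts for roughly a third of the total on a 100-prompt run, and that share shrinks as more prompts reuse the same selected feature.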

These appendix results make CRaFT more credible, but they do not turn it into a universal safety scanner. They show that the mechanism is not immediately explained away by token-set choice, one lucky hyperparameter, or catastrophic capability damage.

Cross-model results show procedure transfer, not plug-and-play transfer

The paper also evaluates CRaFT on Gemma-3-270m-it and Llama-3.2-1B-Instruct, both with CLT support. The important detail is that the authors do not transfer the exact selected feature from Gemma-3-1B-it. They re-run boundary-critical sampling and influence-based feature selection for each model.

That makes the result more realistic and less magical. CRaFT is not presented as “find one refusal feature and use it everywhere.” It is a procedure for finding model-specific refusal-relevant features when the necessary interpretability infrastructure exists.

The reported cross-model LG4 averages increase from 9.6 to 23.7 on Gemma-3-270m-it and from 1.5 to 25.9 on Llama-3.2-1B-Instruct. The Llama result is especially notable because the no-attack baseline is below 5.0 on every benchmark in the table. Still, the paper also notes that the exact first-token proxy used for Gemma does not transfer cleanly to Llama; the authors use an extended token set for the cross-family model. They also choose model-specific steering multipliers from a small grid.

So the correct business reading is procedural transfer, not parameter-free transfer. The method generalizes as a workflow. The exact knobs do not.

That distinction matters for teams considering internal model audits. A model-risk team should not expect a universal refusal-feature registry that can be copied across architectures. The more plausible workflow is model-specific: obtain or train suitable interpretability components, identify boundary-critical prompts, trace circuits, rank influence, test steering effects, then evaluate capability and degeneration.

At that point, the word “audit” begins to look less like a checklist and more like an engineering project. Annoying, yes. Also probably the right shape of the problem.

What this means for AI governance teams

CRaFT is an attack paper on the surface, but its more useful business interpretation is defensive. It gives governance teams a sharper model of what can go wrong when safety is evaluated only through visible refusal behavior or local feature activation.

A practical governance translation looks like this:

Paper result	Direct meaning	Business inference	Boundary
Activation-selected features perform near no-attack baseline	Loud harmful-prompt features are weak steering targets	Feature salience dashboards can mislead if treated as causal evidence	Applies to tested CLT setting and refusal task
Boundary-critical influence performs best	Decision-relevant circuit paths matter	Red-teaming should inspect refusal boundary cases, not only obvious failures	Requires internal access and interpretability tooling
CRaFT preserves benchmark capability in tested settings	Refusal can be weakened without obvious broad capability loss	General performance monitoring may miss narrow safety regressions	Benchmarks are limited and do not cover all product behaviors
Judge metrics reveal classifier overcounting	Unsafe labels can include degenerate or irrelevant outputs	Risk reporting should distinguish actionable compliance from broken text	Judge quality itself remains a dependency
Runtime is front-loaded	Circuit feature selection is costly once, cheaper afterward	Internal audits may be feasible for high-risk models, not every minor release	Hardware, model size, and CLT availability matter

The strongest business implication is not that every company should reproduce CRaFT tomorrow. Most cannot. The method depends on CLT-backed models, white-box access, attribution graph extraction, and careful evaluation. This is not a SaaS dashboard button labeled “Find Safety Circuit,” although someone will surely make that button and charge by seat.

The useful lesson is architectural: safety assurance should move from output-only testing toward mechanism-aware testing where possible. Output testing asks whether the model refused the sampled prompt. Mechanism-aware testing asks whether the refusal behavior is robust because the decision path is stable, or fragile because a narrow internal lever can flip it.

That difference becomes important in regulated or high-liability deployments. A customer-service chatbot that refuses a harmful request in 500 test prompts may still be fragile if its refusal depends on a small set of steerable internal features. A healthcare, financial, or enterprise automation model may pass visible policy tests while retaining narrow decision circuits that are vulnerable under internal perturbation, fine-tuning drift, adapter interactions, or malicious model access.

CRaFT does not prove those production scenarios. It gives a reason to look for them.

What the paper directly shows, and what Cognaptus infers

It is useful to separate the evidence from the extrapolation.

The paper directly shows that, on CLT-backed small instruction models in its experiments, circuit-guided influence ranking identifies refusal features that are more effective steering targets than activation-based alternatives. It shows stronger jailbreak performance across several benchmarks, better specificity under judge evaluation, limited capability degradation on selected benchmarks, and some generalization when the pipeline is re-run on additional CLT-backed models.

Cognaptus infers that safety audits based only on visible refusal, classifier labels, or highly activated harmful-prompt features are incomplete. For business use, the relevant value of CRaFT-like methods is not “better attack automation.” It is a cheaper and more precise diagnosis of where refusal behavior actually lives inside a model, assuming the organization has enough access to inspect internals.

Much remains uncertain. The experiments focus on relatively small CLT-supported instruction models. The method’s dependence on interpretability infrastructure limits immediate applicability to closed frontier systems. Benchmark preservation does not guarantee preservation of all useful capabilities. Judge-based evaluation improves over binary classifiers, but it introduces its own evaluator dependency. And because CRaFT is a steering method, operational risk depends on who has model access, what deployment protections exist, and whether similar internal interventions are possible through fine-tuning, adapters, or compromised serving infrastructure.

Those limitations do not weaken the paper’s central warning. They define where the warning currently applies.

The safety lesson: refusal is behavior, not proof

The cleanest misconception to retire is that refusal behavior itself proves safety. It does not. Refusal is an output pattern generated by internal computation. If the computation is narrow, steerable, and separable from general capability, then refusal can be more fragile than it looks.

CRaFT’s contribution is to make that fragility visible at the circuit level. It shows that the important feature may not be the one that reacts most strongly to danger. It may be the one whose influence reaches the decision boundary.

For businesses deploying LLMs, that is the difference between testing the alarm sound and inspecting the wiring. The sound matters. But if the wiring is exposed, the sound is not the system.

Cognaptus: Automate the Present, Incubate the Future.

¹ Su-Hyeon Kim, Hyundong Jin, Yejin Lee, and Yo-Sub Han, “CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders,” arXiv:2604.01604v2, 27 May 2026, https://arxiv.org/abs/2604.01604.
