Heads You Lose: Why Ablation-Reversible Interpretability Doesn’t Transfer

TL;DR for operators

The paper is a useful slap on the wrist for anyone tempted to turn an interpretability result into an operational control too quickly.¹ It asks a simple question: when an attention head looks important, contains readable information, and can restore model behaviour after ablation, does that mean it carries a transferable representation of the computation?

Mostly, no.

Across Qwen2.5-7B-Instruct, Llama-3-8B-Instruct, and Mistral-7B-Instruct-v0.2, the paper tests attention heads on arithmetic, comparison, digit-property, date, and time tasks, using factual recall as a control class. It finds heads and small head sets that are behaviourally important. It finds linear readouts that classify computation families with high accuracy. It finds cases where restoring the head’s clean activations repairs behaviour after ablation. Then it performs the test that actually matters: patch the head’s activation from one prompt into a matched prompt that asks for a different computation, and see whether the model’s behaviour shifts toward the source computation under controls.

Every tested candidate fails that last test cleanly. Some heads are doing answer-side logit bias. Some are prompt-side stabilisers. Some transfer broad context rather than computation-specific state. One SVD-discovered subspace gives the closest positive: a soft ordered-selection signal that is not strong enough to flip answers on adversarial low-margin cases.

For business use, the message is not “mechanistic interpretability is useless”. That would be the lazy version, and therefore popular. The better message is stricter: interpretability evidence is only useful for audits, routing, intervention, safety assurance, or debugging when the claim being made matches the test being run. “This head matters” is not “this head represents the task”. “This direction is decodable” is not “this direction is causally controlling behaviour”. “Patching repairs the same prompt” is not “this state transfers across contexts”.

The paper gives operators a more disciplined audit ladder:

Evidence level	What it supports	What it does not support
Selective necessity	This component matters for a measured behaviour	The component represents the behaviour
Linear decodability	Information is readable in the activation	The information is causally used there
Ablation reversibility	Restoring activations can repair damaged behaviour	The restored state is portable or semantic
Activation transduction	A stronger test of transferable causal role	Still bounded by task, model, and intervention design

The practical conclusion: do not buy a circuit story unless the intervention travels.

The comfortable mistake: a head matters, therefore it means something

A familiar interpretability workflow goes like this. First, find a component whose ablation hurts a capability. Second, show that the component’s activations linearly encode something interpretable. Third, patch the component’s original activations back in and recover the behaviour. Then, with enough plots and confidence, label the component.

“This is an addition head.”

“This is a comparison head.”

“This is the place where the model stores intent.”

Lovely. Also possibly wrong.

The paper’s central contribution is to break that workflow into separate claims. An attention head can be necessary without being semantically specific. It can contain linearly decodable information without being the causal locus of that information. It can restore behaviour in the same prompt without carrying a computation state that survives being moved into another prompt.

That distinction matters because modern AI governance is full of almost-causal language. Teams want to know whether a model “uses” a feature, “understands” a policy, “tracks” an instruction, or “represents” a risk condition. Those verbs sound precise until the evidence underneath them turns out to be “we found a direction that correlates nicely with the label”. That is not an audit. That is a mood board with eigenvectors.

The paper’s mechanism-first framing is useful because it does not ask whether interpretability is good or bad in the abstract. It asks which part of the inferential chain breaks.

KID separates recognition, selection, and execution

The paper introduces KID: Knowing, Intent, Doing. The names are plain; the discipline behind them is the point.

“Knowing” means the model recognises what the prompt asks for. “Doing” means the model executes or scores the answer. “Intent” is the hypothesised middle state: after recognising the requested computation, before generating the answer, the model has committed to a computation-selection state that could, in principle, be transferred.

The authors are careful not to claim that these are cleanly separated architectural modules. They are role labels for testing attention heads, not a diagram of the transformer’s soul. Sensible restraint. A rare mineral.

The KID frame converts vague role claims into evidence requirements:

KID role	Operational evidence required
Knowing	Prompt-side activations contain decodable requested-computation information and matter for prompt-level processing
Intent	Knowing-like evidence plus interventional generalisability: the state redirects another prompt toward the source computation
Doing	Answer-position activations are necessary and restoring answer-side activations repairs answer production

That middle row is the trap. Many heads look like they might be “intent” candidates until the test becomes cross-context. If an activation only restores its own prompt, it may be maintaining the local trajectory rather than carrying a portable computation instruction.

The business translation is simple: a control is only as strong as the distribution over which it remains causal. A fraud rule that works only on the example used to derive it is not a fraud rule. A model circuit that restores only its original prompt is not necessarily a computation switch.

The pipeline is a ladder, not a buffet

The paper’s empirical pipeline has three main stages: capability-selective screening, SVD/readout analysis, and activation transduction. Each stage answers a different question. Treating them as interchangeable is where the trouble starts.

Test	Likely purpose in the paper	What it supports	What it does not prove
Capability-selective screening (CSS)	Main evidence for finding behaviourally important head sets	Some heads selectively affect a measured capability	The heads represent or implement that capability
Random-mask and subset-lattice diagnostics	Robustness and sensitivity checks on CSS	Selected heads are not arbitrary and some effects are sparse/additive	Every selected top-5 set is a clean circuit
SVD/readout analysis	Representation audit and exploratory structure finding	Computation-family or subtype information is linearly present	The readable information is causally controlling behaviour
Counterbalanced SVD controls	Robustness against surface artefacts	Some directions are semantic rather than answer-string or format artefacts	Semantic readout equals portable mechanism
Full-trajectory restore	Role localisation	A head’s necessary effect is prompt-side, answer-side, or mixed	Prompt-side recovery equals transferable intent
Activation transduction	Main evidence for interventional generalisability	Whether a state can redirect a matched target prompt	Broader generalisation beyond these tasks and models
Same-computation, same-answer, same-prompt controls	Critical controls for semantic specificity and patch stability	Whether movement is computation-specific rather than context or answer familiarity	That all possible intervention sites have been ruled out

This table is the heart of the paper. The result is not merely that a few candidate heads failed a benchmark. The result is that evidence types that often get rhetorically stacked in interpretability work are not substitutes for one another.

CSS tells you where to look. SVD tells you what is readable. Restore tells you where the intervention repairs the trajectory. Transduction tells you whether the state travels.

Most operational failures in AI evaluation come from confusing “where to look” with “what to trust”. This paper is a tidy example.

Stage 1: the heads are real behavioural objects

The first stage is not a straw man. CSS does find compact behavioural targets.

The authors zero attention head outputs and measure damage to reference-answer log probability. A head’s selectivity is the target-family damage minus the maximum damage to non-target families. They screen across three models and six families: the five simple-computation families plus factual recall as a control. The paper’s Table 4 resolves the screen into 13 cleanly selective model-family cells, 2 broader cells with collateral damage, and 3 non-selective cells.
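The selectivity score described above is simple enough to write down. The following is a minimal sketch, assuming damage is already available as one number per task family for a single ablated head; the family names and damage values are invented for illustration and are not figures from the paper.

```python
from typing import Dict

def selectivity(damage: Dict[str, float], target: str) -> float:
    """Target-family damage minus the largest damage to any non-target family."""
    others = [d for family, d in damage.items() if family != target]
    if not others:
        raise ValueError("need at least one non-target family")
    return damage[target] - max(others)

# Hypothetical per-family damage (drop in reference-answer log probability)
# after zeroing one head. These numbers are illustrative only.
head_damage = {
    "arithmetic": 0.40,
    "comparison": 0.06,
    "digits": 0.03,
    "dates": 0.01,
    "times": 0.02,
    "facts": 0.04,  # factual recall, the control family
}

score = selectivity(head_damage, "arithmetic")
print(f"arithmetic selectivity: {score:.2f}")
```

A positive score says the head hurts the target family more than any other family it was screened against. It says nothing yet about what the head represents, which is the point of the later stages.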

That matters. The paper is not saying “ablation is noisy, therefore ignore everything”. It is saying ablation is useful for locating behavioural pressure points, but insufficient for assigning semantic roles.

The distinction is especially important for enterprise AI. A monitoring team may discover that suppressing a module, token pattern, retrieval feature, or hidden state changes the model’s output. That discovery is valuable. It may help locate fragile dependencies. But if the team immediately names the object—“this is the compliance concept”, “this is the medical reasoning feature”, “this is the instruction-following head”—it has crossed from evidence into theatre.

CSS earns the right to investigate. It does not earn the right to narrate.

Stage 2: linear decodability is impressive and still not causal

The representation audit looks strong at first glance. CSS top-5 concatenated residual readouts achieve family-level classification accuracies of 0.77–0.91 in Qwen, 0.79–0.93 in Llama, and 0.76–0.93 in Mistral. Individual heads can carry fine-grained subtype information: Llama digits rank-1 reaches subtype accuracy 0.920, and Mistral digits rank-1 reaches 0.840.

That is not trivial. The selected heads are not opaque junk drawers. They contain structure aligned with the task families.

But the paper makes the right move: it does not confuse readable structure with causal role. A linear probe can find information that the model does not rely on at that site. A direction can encode an answer string, a prompt format, or a broader contextual trajectory. Counterbalanced controls are therefore essential: the authors use same-answer and same-format controls to detect when dominant SVD directions are surface-sensitive rather than semantically primary.
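To make the "readable is not causal" point concrete, here is a toy readout with a shuffled-label baseline. It is a sketch under loud assumptions: the activations are synthetic Gaussians, and a nearest-centroid probe stands in for the paper's SVD/readout analysis, which this article does not specify in enough detail to reproduce. All function names are hypothetical.

```python
import random
from statistics import mean

def make_data(rng, n_per_class=100, dim=4, shift=3.0):
    """Synthetic 'activations': two task families separated along axis 0."""
    xs, ys = [], []
    for label, sign in ((0, -1.0), (1, 1.0)):
        for _ in range(n_per_class):
            x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            x[0] += sign * shift
            xs.append(x)
            ys.append(label)
    return xs, ys

def centroid_probe_accuracy(xs, ys, rng, train_frac=0.5):
    """Fit a nearest-centroid readout on a random split, score on the rest."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    cut = int(len(idx) * train_frac)
    train, test = idx[:cut], idx[cut:]
    centroids = {}
    for label in set(ys):
        rows = [xs[i] for i in train if ys[i] == label]
        centroids[label] = [mean(col) for col in zip(*rows)]

    def predict(x):
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )

    return mean(1.0 if predict(xs[i]) == ys[i] else 0.0 for i in test)

rng = random.Random(0)
xs, ys = make_data(rng)
real_acc = centroid_probe_accuracy(xs, ys, rng)

# Shuffled-label baseline: permute all labels so they carry no information
# about the activations, then rerun the same probe many times.
null_accs = []
for _ in range(200):
    shuffled = ys[:]
    rng.shuffle(shuffled)
    null_accs.append(centroid_probe_accuracy(xs, shuffled, rng))
null_accs.sort()
p95 = null_accs[int(0.95 * len(null_accs)) - 1]

print(f"readout accuracy {real_acc:.2f}, shuffled-label p95 {p95:.2f}")
```

Beating the shuffled-label baseline shows the information is present in the activation. Nothing in this code touches the model's behaviour, so it cannot show that the model uses that information at that site.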

For business readers, this is the “dashboard fallacy” in neural form. A metric can be predictive and still not be the lever. A feature can correlate with churn and still not be the reason customers leave. A hidden activation can decode “addition” and still not be where the model chooses to add.

Linear decodability is a map pin. It is not a steering wheel.

Full-trajectory restore reveals the prompt-answer split

Ablation reversibility sounds stronger than decodability. If zeroing a head damages behaviour and restoring its clean activations repairs the output, surely the head is doing the thing?

Only if “the thing” has been localised.

The paper’s full-trajectory restore separates token slices: prompt-all, prompt-second-half, answer-all, answer-first, and answer-rest. This matters because a head that repairs behaviour only when restored at answer positions is probably involved in scoring or formatting the answer. That is Doing, not Intent.
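The slice logic can be sketched in a few lines. The recovery formula below (fraction of ablation damage repaired) is an assumption that fits the reported values near 0 and 1 but is not quoted from the paper; the slice boundaries, the 0.5 threshold, and the function names are likewise hypothetical.

```python
def recovery(clean, ablated, restored):
    """Fraction of ablation damage repaired by a restore (assumed definition).

    All three arguments are reference-answer log probabilities.
    """
    damage = clean - ablated
    if damage <= 0:
        raise ValueError("ablation did not hurt; recovery is undefined")
    return (restored - ablated) / damage

def token_slices(prompt_len, answer_len):
    """Token-position ranges for the five restore slices named in the text."""
    total = prompt_len + answer_len
    first_answer_end = min(prompt_len + 1, total)
    return {
        "prompt-all": range(0, prompt_len),
        "prompt-second-half": range(prompt_len // 2, prompt_len),
        "answer-all": range(prompt_len, total),
        "answer-first": range(prompt_len, first_answer_end),
        "answer-rest": range(first_answer_end, total),
    }

def localise(prompt_recovery, answer_recovery, threshold=0.5):
    """Coarse role label from prompt-all and answer-all recovery.

    The threshold is an illustrative choice, not a value from the paper.
    """
    prompt_side = prompt_recovery >= threshold
    answer_side = answer_recovery >= threshold
    if prompt_side and answer_side:
        return "mixed"
    if answer_side:
        return "answer-side"
    if prompt_side:
        return "prompt-side"
    return "unlocalised"

# Recovery values reported in this article for Qwen maths head L23H12.
print(localise(prompt_recovery=0.022, answer_recovery=1.000))
```

The design choice worth copying is that the label comes from per-slice recovery, never from the aggregate. A head set whose combined recovery looks healthy can still decompose into ranks with opposite roles.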

The results show why aggregates are dangerous. CSS top-5 sets often mix roles. One rank may be answer-side; another may be prompt-side; the aggregate looks capability-related because it bundles several different causal contributions. This is not a small bookkeeping issue. If a team patches a whole aggregate and calls it “the task circuit”, it may be averaging together a prompt stabiliser, an answer logit-bias head, and a miscellaneous context carrier. Voilà: mechanistic soup.

The Qwen maths case is clean. Rank-1 head L23H12 has zero-all damage of 0.259, prompt-all recovery of only 0.022, and answer-all recovery of 1.000. That is an answer-side role. It supports the correct answer during generation; it is not a prompt-side intent state.

The Llama digits case is mixed. It has the attractive ingredients of an intent story: selective necessity, a highly subtype-decodable rank-1 head, and meaningful prompt-side recovery in the aggregate. But rank-level restore separates answer-side heads from a prompt-side rank. When the authors test the prompt-side rank directly, it does not redirect behaviour. The pretty story falls apart, as pretty stories often do when asked to do work.

Activation transduction is where the claim pays rent

The decisive test is activation transduction.

The setup is straightforward. Take clean activations from a source prompt asking for computation A. Patch them into a matched target prompt asking for computation B. The source and target are controlled for template, candidate set, answer format, and token position. Then ask: does the patched target behave more like the source computation?

A clean intent-like state should do more than move log probabilities. It should move the model specifically toward the source computation, more than same-computation or same-answer controls, and ideally change answer ordering in hard cases. Otherwise the patch may simply be transferring broad context, answer familiarity, or some other nuisance signal.
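The pass criteria in the two paragraphs above can be read as a decision rule. This is a sketch of one such rule, not the paper's scoring code: the field names, the `min_effect` and `margin` thresholds, and the verdict strings are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class TransductionResult:
    # Change in source-computation log probability on the patched target.
    source_delta: float
    # The same quantity when the patch comes from each control donor.
    same_computation_delta: float
    same_answer_delta: float
    # Did the patch change which candidate answer ranks first on hard cases?
    ordering_flipped: bool

def verdict(r, min_effect=0.05, margin=0.05):
    """Classify one transduction row. Thresholds are illustrative only."""
    if r.source_delta < min_effect:
        return "inert or negative"
    strongest_control = max(r.same_computation_delta, r.same_answer_delta)
    if r.source_delta - strongest_control < margin:
        return "contaminated by control movement"
    if not r.ordering_flipped:
        return "soft signal, no answer-ordering change"
    return "candidate transferable state"

# Source delta from the article's llama digits rank 3 row; control values
# are placeholders because the article does not report them.
row = TransductionResult(
    source_delta=0.001,
    same_computation_delta=0.0,
    same_answer_delta=0.0,
    ordering_flipped=False,
)
print(verdict(row))
```

Ordering the checks this way means a large source delta is never enough by itself: it has to clear the strongest control before anyone looks at answer ordering.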

The paper’s evidence matrix is blunt: every row tested for interventional generalisability fails cleanly. The eight prompt-side rows in the paper’s Table 1 all fail property (4), the interventional-generalisability requirement, whether by being inert, negative, or contaminated by control movement.

A few cases show the failure modes clearly:

Case	What looked promising	What transduction showed	Role interpretation
qwen maths L23H12	Selective, necessary, restores behaviour	Restore is answer-side, not prompt-side	Answer-side logit-bias head
llama digits rank 3	Prompt-side recovery around 1.003	Source delta about +0.001; no meaningful redirection	Prompt-side stabiliser
qwen times L0H0	Strong prompt-side recovery, source logprob movement	Same-computation controls move comparably	Prompt-primary but context-broad
llama compare ranks 2 and 3	Clean prompt-side candidates	Source deltas are negative or near zero; answer ordering unchanged	Prompt-side stabilisers
mistral dates L25H12	Prompt-side recovery around 1.016	Source patch essentially inert	Prompt-side stabiliser
mistral compare	Prompt-primary movement	Same-answer or same-relation controls explain much of it	Context-broad transfer

The important word is not “negative”. It is “diagnostic”. Each failed transduction test tells us what the earlier evidence was actually measuring.

Some heads help preserve a prompt trajectory. They are necessary because removing them destabilises the computation path, not because their activation is a portable computation instruction. Some heads bias answer logits. They are necessary because they support the correct answer at generation time, not because they represent the requested task. Some patches move probability mass, but controls move too, which means the patch is dragging broad context rather than semantic intent.

This is exactly the kind of distinction audit teams need. The failure is not that the model has no structure. The failure is that the structure is not the one the convenient label advertised.

The closest positive is useful because it is not quite positive

The paper’s closest positive is outside the CSS-localised head sets. An SVD-identified subspace, SV2:SV5, of the Qwen compare_anchor head L14H15 carries ordered-selection information across numbers, digits, letters, dates, and times. When injected downstream at layer 18, it produces a small source-logprob lift of about +0.222 and scales monotonically with injection strength.

That is real signal. It also fails the stricter bar. On adversarial low-margin ordered rows, the patch produces no answer-ordering change at $\alpha = 0.80$, and same-answer controls move comparably.
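For readers who want a picture of what "injecting a subspace at strength α" could mean, here is one plausible form. It is an assumption, not the paper's stated rule: the target activation is moved toward the source only inside the span of a few orthonormal directions, scaled by `alpha`. The toy 3-dimensional vectors and the function name are hypothetical.

```python
def inject_subspace(target, source, basis, alpha):
    """Move `target` toward `source` inside span(basis) only, scaled by alpha.

    `basis` is assumed to be a list of orthonormal vectors, for example a
    few singular directions of a head's output. Components of the
    source-target difference outside that span are left untouched.
    """
    diff = [s - t for s, t in zip(source, target)]
    out = list(target)
    for b in basis:
        coeff = sum(d * x for d, x in zip(diff, b))  # projection onto b
        out = [o + alpha * coeff * x for o, x in zip(out, b)]
    return out

target = [0.0, 0.0, 0.0]
source = [1.0, 2.0, 3.0]
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # toy two-direction subspace

patched = inject_subspace(target, source, basis, alpha=0.80)
print(patched)  # third coordinate is unchanged: it lies outside the subspace
```

Under this form the displacement grows linearly with `alpha`, so the size of the edit is monotone in injection strength by construction. Whether the model's behaviour responds monotonically is the empirical question, and whether it changes answer ordering is a stricter one again.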

This result strengthens the paper rather than weakening it. If every transduction test were flat, one could worry that the assay was simply too blunt. The compare-SV2 result shows the method can detect soft computation-pattern transfer when it exists. It just does not find a clean computation switch in the CSS-selected heads.

Operationally, this is the difference between “we found a useful diagnostic signal” and “we found a safe intervention handle”. The former can guide analysis. The latter can support control. Many AI programmes would save time by not confusing the two.

Same-answer controls are the quiet killer feature

The same-answer control is one of the most practically important pieces of the paper.

In activation transduction, a same-answer control shares the answer string with the source but differs in the requested computation. If the patch still moves the model similarly, then the intervention may be transferring answer familiarity rather than computation selection.

This is not a niche technical annoyance. It is the generic failure mode of semantic evaluation: the test accidentally rewards the surface correlate. In enterprise settings, this appears everywhere. A customer-support classifier “understands escalation” because it keys on angry phrasing. A retrieval system “finds policy relevance” because it keys on department names. A model “tracks risk” because the word “urgent” appears in the prompt. The label looks semantic. The mechanism is often a cheaper proxy wearing a nice jacket.

Same-answer controls are therefore not just a patching detail. They are a pattern for better AI audits: when testing whether a model has learned a distinction, construct controls that preserve the obvious answer or surface cue while changing the underlying operation.
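The construction pattern is easy to prototype. Below is a minimal sketch, assuming two toy computations and invented prompt templates; unlike the paper's matched prompts, it makes no attempt to control token position or candidate sets, and every name in it is hypothetical.

```python
COMPUTATIONS = {
    "add": lambda a, b: a + b,
    "larger": lambda a, b: max(a, b),
}

# Illustrative templates only, not the paper's prompts.
TEMPLATES = {
    "add": "What is {a} plus {b}? Answer:",
    "larger": "Which is larger, {a} or {b}? Answer:",
}

def make_prompt(kind, a, b):
    """Build a prompt record with its computation label and answer string."""
    return {
        "computation": kind,
        "text": TEMPLATES[kind].format(a=a, b=b),
        "answer": str(COMPUTATIONS[kind](a, b)),
    }

def same_answer_control(source, control_kind, operands=range(0, 20)):
    """Find a prompt for a different computation with the source's answer string."""
    if control_kind == source["computation"]:
        raise ValueError("control must request a different computation")
    for a in operands:
        for b in operands:
            if a == b:
                continue  # skip degenerate comparisons
            candidate = make_prompt(control_kind, a, b)
            if candidate["answer"] == source["answer"]:
                return candidate
    return None

source = make_prompt("add", 3, 4)
control = same_answer_control(source, "larger")
print(source["text"], "->", source["answer"])
print(control["text"], "->", control["answer"])
```

If a patch taken from `control` moves the target as much as a patch taken from `source`, the shared answer string is a sufficient explanation and the computation-selection story is not needed.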

Without that, the model may pass because it recognises the costume, not the character.

What the paper directly shows

The direct findings are narrower and stronger than a generic “interpretability is hard” conclusion.

First, capability-selective screening can find real behavioural objects. The selected heads are often statistically and behaviourally meaningful, and subset diagnostics show that some effects are concentrated enough to warrant mechanistic investigation.

Second, behaviourally important heads often contain linearly decodable information. The representation audit finds high family-level readout accuracy and some strong subtype readouts.

Third, ablation reversibility must be localised by token position. Prompt-side and answer-side restoration imply different roles. Without full-trajectory restore, aggregate ablation can conflate stabilisers with answer-generation components.

Fourth, prompt-side ablation reversibility does not imply interventional generalisability. This is the paper’s central evidence claim. Across the tested candidates, heads that pass the first three checks fail to transport the computation into matched target prompts under controls.

Fifth, the observed role taxonomy is heterogeneous: answer-side logit-bias heads, prompt-side stabilisers, mixed components, prompt-primary context-broad heads, and a soft computation-pattern carrier. No tested head satisfies the full KID criteria for an intent state.

What Cognaptus infers for business use

The paper is foundational research, not a plug-and-play enterprise auditing package. Still, the business interpretation is clear.

AI audit teams should treat interpretability role claims as causal-control claims. If a vendor, lab, or internal team says a component “represents” safety policy, tool intent, compliance risk, medical reasoning, or task selection, the next question should be: under what cross-context intervention does that claim survive?

A useful governance checklist would ask:

Audit question	Minimum evidence to request
Does this component matter?	Capability-selective ablation with collateral-damage checks
Is relevant information present?	Readout or representation audit with surface controls
Where does the causal repair occur?	Full-trajectory restore by token slice or processing stage
Does the state transfer?	Cross-context activation transduction with same-computation and same-answer controls
Can it support intervention?	Answer-ordering or decision changes on hard, low-margin cases
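A team that wants to enforce this checklist could encode it as data and report the gaps. The following is a sketch under assumptions: the evidence keys are shorthand invented here for the table's right-hand column, and the gap report is one possible use, not a procedure the paper prescribes.

```python
# Audit question -> shorthand key for the minimum evidence to request.
# The keys are hypothetical labels for the rows of the checklist above.
CHECKLIST = {
    "Does this component matter?": "selective_ablation",
    "Is relevant information present?": "readout_with_surface_controls",
    "Where does the causal repair occur?": "localised_restore",
    "Does the state transfer?": "transduction_with_controls",
    "Can it support intervention?": "low_margin_decision_change",
}

def audit_gaps(evidence_on_file):
    """Return the audit questions that the supplied evidence cannot answer."""
    return [
        question
        for question, needed in CHECKLIST.items()
        if needed not in evidence_on_file
    ]

# A typical "circuit story": ablation, a probe, and a same-prompt restore.
vendor_evidence = {
    "selective_ablation",
    "readout_with_surface_controls",
    "localised_restore",
}

gaps = audit_gaps(vendor_evidence)
for question in gaps:
    print("unanswered:", question)
```

Each question is checked against its own evidence type, so a strong probe result cannot quietly stand in for a missing transduction test.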

This is especially relevant for companies considering mechanistic interpretability as part of model assurance. A circuit explanation should not be treated as a compliance artefact unless the evidence matches the operational claim. A static description of an internal feature may be useful for research. It is not automatically a control surface.

There is also a cost implication. The paper reports roughly 60–80 A100-GPU-hours across all experiments, with no model training, only inference-time forward passes. That is not free, but it is within the budget of serious model evaluation work. The expensive part is not the compute. The expensive part is designing controls that prevent the team from fooling itself. Annoying, but cheaper than deploying superstition.

What remains uncertain

The boundaries are material.

The models are three publicly available instruction-tuned systems in the 7–8B range: Qwen2.5-7B-Instruct, Llama-3-8B-Instruct, and Mistral-7B-Instruct-v0.2. The findings may not generalise to larger models, different architectures, base models, or systems trained with different post-training methods.

The prompt families are synthetic investigative probes. They are useful because they have anti-correlated subtypes, controlled answer formats, and verifiable labels. They are not a complete ontology of model capabilities. Results on arithmetic, comparison, digit properties, dates, and times do not automatically transfer to open-ended writing, tool use, legal reasoning, medical advice, long-horizon planning, or multi-step chain-of-thought.

The study focuses on attention heads. It does not rule out transferable computation-selection states in MLPs, residual stream directions, larger distributed circuits, or downstream non-attention sites. In fact, the soft compare-SV2 result hints that SVD-first or downstream searches may find different loci.

The role taxonomy is preliminary. It is a useful map of failure modes observed here, not a final periodic table of transformer internals.

The sceptical appendices matter too. CSS-selected top-5 sets are statistically unusual versus random additive head sets in audited cells, but that does not mean every capability is truly sparse. SVD/readout structure is real, with 441 of 468 audited SVD alignment rows exceeding shuffled-label p95, but meaningful representation is still not mechanism. The authors do not prove broad polysemantic packing pressure. They leave it as a live conjecture, which is what one does when the evidence is not enough. Some papers still remember how.

The operator’s replacement belief

The misconception to discard is:

If a head is necessary, linearly decodable, and restores behaviour after ablation, it must represent or implement the computation.

The replacement belief is stricter:

A mechanistic role claim is only as strong as the most specific intervention it survives under matched controls.

That replacement is less satisfying. It gives fewer dramatic labels. It makes interpretability dashboards messier. It forces teams to distinguish behavioural importance, readable information, same-context repair, and cross-context causal transfer.

Good. The old comfort was premature.

For businesses, the relevant question is not whether interpretability can produce elegant stories. It can. The relevant question is whether those stories can support decisions: auditing, debugging, routing, safety assurance, model selection, incident analysis, or intervention design. This paper says the evidence bar must be higher when the claim becomes operational.

A head that matters is worth investigating. A head that decodes is worth inspecting. A head that restores is worth localising. But a head that transfers under controls is the one that starts to look like a control surface.

Until then, keep the label in pencil.

Cognaptus: Automate the Present, Incubate the Future.

¹ Philip Quirke, “Ablation-Reversible Heads Don’t Transfer: A Stress Test for Mechanistic Role Claims in Transformers,” arXiv:2606.08292v1, 6 June 2026, https://arxiv.org/abs/2606.08292.
