The “important head” was never the whole story
Audit.
That is where many discussions about mechanistic interpretability become less romantic. It is pleasant to say that an AI model has “reasoning circuits.” It is less pleasant to ask which exact parts of the model must be preserved before a behavior survives, which parts are merely along for the ride, and which parts were called important only because our tools were too blunt to see inside them.
The paper “Multi-Granular Node Pruning for Circuit Discovery” takes that bluntness problem seriously [1]. Its starting point is simple: most automated circuit discovery methods still work at relatively coarse units, especially attention heads and MLP blocks. That is understandable. Transformers are large; search spaces explode; GPUs are not charitable institutions. But the convenience has a cost. When a method keeps an attention head, it often keeps the whole head. When it keeps an MLP block, it treats the block as though its internal neurons rise and fall together.
The paper’s main correction is not “pruning is useful.” We already knew that. The sharper claim is that the unit of interpretability matters. If the method can only see heads and blocks, then every discovered circuit inherits the eyesight of the method. A circuit may look large not because the model needs all those components, but because the method cannot ask a smaller question.
This paper asks the smaller question: can we discover task-specific circuits by pruning nodes across several granularities at once, from transformer blocks down to individual neurons, while preserving task behavior?
The answer, at least for GPT-2 small on three controlled circuit-discovery tasks, is yes. And the business lesson is more specific than “models are sparse.” The useful lesson is that internal model diagnosis may become cheaper, more granular, and more operationally repeatable when interpretability tools stop treating architectural components as indivisible atoms.
Not quite microscope-level truth. But at least we have stopped using a broom as a scalpel.
The mechanism: clean behavior fights corrupted behavior, and masks decide what survives
The method is easiest to understand as a controlled contest between two versions of the same task.
One input is clean. It should produce the original task-relevant behavior.
The other input is corrupted. It is perturbed in a task-specific way while preserving the broad task format. For the Greater-Than task, the paper gives the kind of example where a prompt about a year range is altered so the correct continuation changes. The point is not random noise. The point is targeted contrast: same general task, different answer.
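The framework then runs a two-stream forward pass. At each candidate node, the model can mix clean and corrupted activations according to a learned mask $m$, in the form of an interpolation:

$$\tilde{h} = m \cdot h_{\text{clean}} + (1 - m) \cdot h_{\text{corrupted}}$$

Here $h_{\text{clean}}$ and $h_{\text{corrupted}}$ are the node's activations on the clean and corrupted inputs, and $\tilde{h}$ is the activation passed forward.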
When $m$ is close to 1, the clean activation is preserved. When $m$ is close to 0, the corrupted activation replaces it. During training, the method learns which nodes must remain clean for the model to preserve task performance.
This is the key interpretability move. The method is not merely zeroing out components and watching accuracy fall. It is asking: if this internal component is forced to behave like it came from the corrupted run, does the task behavior break?
If yes, keep it.
If no, prune it.
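The keep-or-prune logic can be made concrete with a toy sketch. This is a minimal illustration, assuming each node's activation is a single number and the mask is already binary; the function name `mix_activations` and the sample values are hypothetical and do not come from the paper.

```python
from typing import List


def mix_activations(clean: List[float], corrupted: List[float], mask: List[float]) -> List[float]:
    """Blend per-node activations: mask 1 keeps clean, mask 0 swaps in corrupted."""
    if not (len(clean) == len(corrupted) == len(mask)):
        raise ValueError("clean, corrupted, and mask must have the same length")
    return [m * c + (1.0 - m) * x for c, x, m in zip(clean, corrupted, mask)]


# Toy values (hypothetical): three nodes, the middle one is pruned.
clean = [0.9, -1.2, 0.4]
corrupted = [0.1, 0.7, -0.3]
mask = [1.0, 0.0, 1.0]

mixed = mix_activations(clean, corrupted, mask)
print(mixed)
```

If the task metric computed from `mixed` stays close to the clean run, the middle node was not needed for this behavior. If it collapses, that node belongs in the circuit.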
The masks are parameterized with a Hard-Concrete distribution, which lets the model optimize approximately binary gates with gradient descent. The training objective combines task preservation with sparsity penalties, so the model is pushed to keep only what it needs. After training, the masks are binarized, and the method enforces hierarchical consistency: if a parent unit such as an MLP block is pruned, its child neuron masks are also deactivated.
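For readers who have not met Hard-Concrete gates before, here is a rough sketch of how one such gate can be sampled, scored for sparsity, and binarized. It follows the standard formulation from the L0-regularization literature. The stretch interval, temperature, and 0.5 threshold are common defaults assumed for illustration, not values confirmed by the paper, and all function names are hypothetical.

```python
import math
import random

# Stretch interval (GAMMA, ZETA) and temperature BETA: assumed defaults,
# not taken from the paper.
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_gate(log_alpha: float, rng: random.Random) -> float:
    """Draw one noisy gate in [0, 1], as used while training the masks."""
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)  # keep the logs finite
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    stretched = s * (ZETA - GAMMA) + GAMMA  # stretch beyond [0, 1]
    return min(1.0, max(0.0, stretched))    # clip back, giving exact 0s and 1s


def deterministic_gate(log_alpha: float) -> float:
    """Noise-free gate value, typically used once training is done."""
    return min(1.0, max(0.0, sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA))


def prob_active(log_alpha: float) -> float:
    """Probability that the gate is nonzero; summing it over gates gives a sparsity penalty."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))


def binarize(log_alpha: float, threshold: float = 0.5) -> int:
    """Turn a learned gate into a hard keep (1) or prune (0) decision."""
    return 1 if deterministic_gate(log_alpha) > threshold else 0


rng = random.Random(0)
samples = [sample_gate(log_alpha, rng) for log_alpha in (-5.0, 0.0, 5.0)]
decisions = [binarize(log_alpha) for log_alpha in (-5.0, 0.0, 5.0)]
```

The useful property is the clipping step: because the stretched value can land outside the unit interval, gates can reach exactly 0 or 1 while the learnable parameter `log_alpha` still receives gradients.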
That hierarchy matters. A transformer has natural levels of structure: blocks, attention heads, MLP blocks, attention neurons, hidden neurons, output neurons. Existing edge-pruning methods often operate around connections among larger components. This paper instead applies learnable masks over multiple node types in one optimization process.
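The hierarchical consistency rule is simple enough to sketch in a few lines. This assumes binarized masks stored in plain dictionaries keyed by a parent name; the data layout, the names `mlp.0` and `mlp.1`, and the function `enforce_hierarchy` are all hypothetical illustrations rather than the paper's implementation.

```python
from typing import Dict, List


def enforce_hierarchy(parent_mask: Dict[str, int], child_masks: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Deactivate every child mask whose parent unit was pruned."""
    return {
        parent: [bit * parent_mask[parent] for bit in children]
        for parent, children in child_masks.items()
    }


# Toy example: the second MLP block is pruned, so its neurons go with it.
parent_mask = {"mlp.0": 1, "mlp.1": 0}
child_masks = {"mlp.0": [1, 0, 1, 1], "mlp.1": [1, 1, 0, 1]}

consistent = enforce_hierarchy(parent_mask, child_masks)
print(consistent)
```

Note the asymmetry: a retained parent does not revive pruned children. In `mlp.0`, the second neuron stays off even though the block survives, which is exactly the kind of internal sparsity a head-level or block-level method cannot express.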
That is why this article reads the paper mechanism-first. The result is not just a smaller circuit table. The mechanism changes the question from:
Which coarse components should we keep?
to:
At which level of the model hierarchy does this behavior actually live?
That is a better question. Annoyingly, better questions usually make older dashboards look less impressive.
Why coarse components can exaggerate circuit size
The misconception here is natural: if a method identifies an attention head or MLP block as important, we are tempted to treat the whole component as functionally important.
The paper pushes against that assumption.
A transformer component is an architectural boundary, not necessarily a functional boundary. An attention head can mix several behaviors. An MLP block can contain many neurons irrelevant to the specific task being studied. Treating those components as atomic is convenient, but it can overstate how much of the model is actually needed for a behavior.
The paper’s formulation directly addresses this mismatch. It does not force the circuit to choose one fixed granularity. Instead, the method lets sparsity appear where the task permits it. Sometimes a whole block can be removed. Sometimes a block remains active but most of its internal neurons disappear. Sometimes attention is heavily pruned while MLPs stay alive.
That flexibility is the main conceptual contribution.
A simple way to read the paper is this:
| Old habit | Paper’s correction | Why it matters |
|---|---|---|
| Treat heads and MLP blocks as indivisible circuit units | Learn masks over blocks, heads, MLPs, and neurons together | Important components may contain many irrelevant internal units |
| Search over edge structures among coarse components | Prune nodes at multiple granularities in one training loop | Avoids some scalability pressure from edge-level search |
| Report circuit size mainly at the component level | Report sparsity across several model levels | Makes circuit anatomy more task-specific and interpretable |
| Assume compression-like pruning and interpretability pruning are similar | Preserve behavior while identifying necessary internal units | The goal is explanation, not deployment speed |
The last row is worth pausing on. The paper borrows from pruning ideas, but its goal is not ordinary model compression. A compressed model is useful if it runs faster or cheaper. A circuit-discovery method is useful if it reveals which internal parts support a behavior.
Those are related, but not identical. Confusing them is how interpretability turns into procurement theater.
What the experiments actually test
The paper evaluates on GPT-2 small using three tasks that are common in circuit-discovery work.
The first is Indirect Object Identification, or IOI. The model must identify the correct indirect object in a sentence involving two names, suppressing the repeated name and predicting the other one. This task is a classic test of syntactic and entity-tracking circuitry.
The second is Gendered Pronouns. The model must assign higher probability to the gender-consistent pronoun for a name. This is not a moral endorsement of the association; it is a probe of whether the model encodes name-pronoun associations learned from data.
The third is Greater-Than. The model sees prompts involving year ranges and must assign more probability to valid completions greater than the starting year. This probes numerical and temporal comparison behavior.
The paper uses task-specific performance metrics for each: logit difference for IOI and Gendered Pronouns, probability difference for Greater-Than. It also uses KL divergence between the full model’s output distribution and the pruned circuit’s output distribution. That second metric matters because a circuit can preserve the answer while distorting the rest of the output distribution. KL divergence asks whether the pruned circuit remains faithful to the broader predictive behavior of the original model.
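The three metrics are easy to state in code. The sketch below is a plain-Python illustration over a toy four-token vocabulary. The direction of the KL term (full model relative to pruned circuit), the definition of probability difference as valid mass minus invalid mass, and every logit value are assumptions made for the example, not details confirmed by the paper.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert logits to probabilities, subtracting the max for stability."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_difference(logits: Sequence[float], correct: int, wrong: int) -> float:
    """Task metric style used for IOI and Gendered Pronouns."""
    return logits[correct] - logits[wrong]


def probability_difference(probs: Sequence[float], valid: Sequence[int], invalid: Sequence[int]) -> float:
    """Task metric style used for Greater-Than: valid mass minus invalid mass."""
    return sum(probs[i] for i in valid) - sum(probs[i] for i in invalid)


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q), with p as the full model and q as the pruned circuit (assumed direction)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical logits over a toy vocabulary of four tokens.
full_logits = [2.0, 0.5, 0.1, -1.0]
pruned_logits = [2.1, 0.4, 0.1, -1.0]

full_probs = softmax(full_logits)
pruned_probs = softmax(pruned_logits)

ld_full = logit_difference(full_logits, correct=0, wrong=1)
ld_pruned = logit_difference(pruned_logits, correct=0, wrong=1)
pd_full = probability_difference(full_probs, valid=[0, 1], invalid=[2, 3])
kl = kl_divergence(full_probs, pruned_probs)
```

In this toy case the pruned logit difference goes up while KL stays small but nonzero, which mirrors why the paper reports both numbers: the headline metric and the distributional match answer different questions.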
The likely purpose of the experimental components is as follows:
| Evidence item | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Task metrics on IOI, GP, and GT | Main evidence | The pruned circuits preserve task-specific behavior | That the method works for all model families or all tasks |
| KL divergence against full-model outputs | Main evidence / faithfulness check | The circuits approximate the original model’s output distribution | That the selected nodes reveal all causal interactions |
| Sparsity by blocks, heads, and neurons | Main evidence | The method discovers smaller multi-level circuits | That sparsity itself equals explanation |
| Comparison with EAP and EP | Comparison with prior work | Node pruning can retain fewer attention heads while staying competitive | Direct superiority on every metric, since node and edge methods are not perfectly comparable |
| Appendix circuit summaries by layer | Implementation detail plus interpretive support | Sparsity has structured layer patterns, not random deletion | A universal map of where every model stores these behaviors |
| Memory comparison | Operational evidence | The method is much lighter than baselines in this setup | That production-scale interpretability is solved |
This distinction matters because the paper includes several kinds of evidence. The main results establish that multi-granular node pruning can preserve task behavior while removing many components. The appendix summaries help interpret where circuits concentrate by task. The baseline comparison helps position the method against EAP and EP, but the paper itself notes that node pruning and edge pruning are not perfectly equivalent because they prune different objects.
That honesty is useful. The comparison is informative, not a courtroom verdict.
The numbers: smaller circuits without obvious task collapse
The headline result is that substantial pruning happens across multiple granularities while task performance is preserved.
For the three tasks, the paper reports the following active component counts and sparsity:
| Granularity | IOI active / sparsity | GP active / sparsity | GT active / sparsity |
|---|---|---|---|
| Attention blocks | 4 / 66.7% | 5 / 58.3% | 5 / 58.3% |
| MLP blocks | 12 / 0.0% | 3 / 75.0% | 5 / 58.3% |
| Attention heads | 21 / 85.4% | 37 / 74.3% | 28 / 80.6% |
| Attention neurons | 907 / 90.2% | 1,702 / 81.5% | 1,701 / 81.5% |
| MLP hidden neurons | 12,300 / 33.4% | 1,333 / 96.4% | 4,570 / 87.6% |
| MLP output neurons | 1,329 / 14.4% | 1,411 / 84.7% | 3,520 / 61.8% |
| Edge compression | 96.74% | 93.74% | 95.95% |
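The attention rows of the table can be checked with simple arithmetic, since GPT-2 small has 12 layers and 144 attention heads. The helper below is a small sketch (the function name is hypothetical) that recomputes sparsity as the share of units removed, using the active counts reported in the table.

```python
def sparsity_percent(active: int, total: int) -> float:
    """Share of units pruned, as a percentage rounded to one decimal."""
    if total <= 0 or not 0 <= active <= total:
        raise ValueError("active must be between 0 and total, and total must be positive")
    return round(100.0 * (1.0 - active / total), 1)


TOTAL_HEADS = 144   # 12 layers x 12 heads in GPT-2 small
TOTAL_BLOCKS = 12   # one attention block per layer

active_heads = {"IOI": 21, "GP": 37, "GT": 28}
active_attention_blocks = {"IOI": 4, "GP": 5, "GT": 5}

head_sparsity = {task: sparsity_percent(n, TOTAL_HEADS) for task, n in active_heads.items()}
block_sparsity = {task: sparsity_percent(n, TOTAL_BLOCKS) for task, n in active_attention_blocks.items()}

print(head_sparsity)   # {'IOI': 85.4, 'GP': 74.3, 'GT': 80.6}
print(block_sparsity)  # {'IOI': 66.7, 'GP': 58.3, 'GT': 58.3}
```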
The task-performance table is also important. The task metric is logit difference for IOI and GP, and probability difference for GT:
| Task | Full-model task metric | KL divergence | Pruned-circuit task metric |
|---|---|---|---|
| IOI | 3.1791 | 0.6080 | 3.2030 |
| GP | 2.6198 | 0.4909 | 2.6150 |
| GT | 0.3711 | 0.0059 | 0.3912 |
The numbers should be read carefully.
For IOI, the pruned circuit’s logit difference is slightly higher than the base value (3.2030 versus 3.1791), while KL divergence is 0.6080. For Gendered Pronouns, the logit difference is nearly unchanged (2.6150 versus 2.6198). For Greater-Than, the probability difference improves from 0.3711 to 0.3912, with very low KL divergence at 0.0059.
This does not mean pruning magically improves the model in general. A safer interpretation is that removing redundant or interfering components can sharpen the task-specific signal in some settings. The paper itself suggests this possibility for Greater-Than. But the business reader should resist the urge to convert one controlled result into a general law. That way lies LinkedIn thought leadership, and nobody needs more of that.
The more reliable conclusion is narrower: on these tasks, the method finds much smaller circuits that preserve the targeted behavior.
The task anatomy is the interesting part
The most useful evidence is not just the aggregate sparsity. It is how differently the three tasks carve GPT-2.
IOI keeps all 12 MLP blocks active. That is already a warning against lazy generalization. The method is aggressive, but it does not blindly delete everything. IOI appears MLP-heavy in this experiment, while attention is mostly pruned except in selected late layers. The paper notes that only 21 of 144 attention heads remain active, with most attention activity concentrated near the final layers. This aligns with prior observations that name-mover heads tend to appear late.
Gendered Pronouns looks different. Only three MLP blocks remain active, and several middle layers are fully pruned. Attention activity appears in bursts, including later layers. The task has a more restricted output space, choosing between pronouns such as “he” and “she,” which may partly explain the greater compressibility.
Greater-Than is different again. It retains 28 attention heads and five MLP blocks, with attention concentrated around layers 7 to 10 and several early or late layers pruned. In the appendix summary, Greater-Than shows activity peaking around layers 9 and 10, while some layers are entirely disabled.
This is the interpretability lesson: there is no generic “small circuit shape.” Different behaviors occupy different model structures. If an interpretability tool forces every behavior into the same granularity, it may be simplifying the method rather than the model.
For business use, that matters because model audits are usually behavior-specific. A bank does not ask whether a model is interpretable in the abstract. It asks whether the model’s credit explanation, refusal behavior, fraud flag, or compliance-sensitive classification can be examined with enough specificity to support governance decisions.
The paper does not solve that full enterprise problem. But it points in the right direction: behavior-specific audits need behavior-specific internal evidence.
Baseline comparison: fewer heads, but not the same object
The paper compares its method with Edge Attribution Patching and Edge Pruning. This part needs careful reading because edge-pruning and node-pruning methods do not prune the same units. Edge methods focus on connections among components; this paper’s method prunes nodes across several granularities, including neurons and blocks. So the comparison is useful, but not perfectly apples-to-apples.
Still, the results are informative.
| Dataset | Method | Sparsity | KL divergence | Logit / probability diff | Attention heads retained |
|---|---|---|---|---|---|
| IOI | EAP | 96.74% | 2.447 | -0.181 | 116 |
| IOI | EP | 96.16% | 0.360 | 3.210 | 41 |
| IOI | Ours | 96.74% | 0.608 | 3.203 | 21 |
| GP | EAP | 93.74% | 0.148 | 2.564 | 106 |
| GP | Ours | 93.74% | 0.490 | 2.615 | 37 |
| GT | EAP | 95.95% | 0.086 | 0.374 | 113 |
| GT | EP | 96.69% | 0.039 | 0.389 | 74 |
| GT | Ours | 95.95% | 0.006 | 0.391 | 28 |
The cleanest win is Greater-Than: the node-pruning method reports the lowest KL divergence, the highest task metric, and far fewer retained attention heads.
IOI is more nuanced. Edge Pruning has lower KL divergence and slightly higher logit difference, but the proposed method keeps only 21 attention heads compared with EP’s 41 and EAP’s 116. That is a compactness-faithfulness trade-off, not a total domination story.
Gendered Pronouns is also nuanced. EAP has lower KL divergence, while the proposed method has a slightly higher task metric and far fewer retained heads.
The honest summary is this: multi-granular node pruning is competitive on task behavior and much more compact in retained attention heads, but edge-pruning methods can still win on some faithfulness metrics. The paper is not saying edges are obsolete. It is saying node-level multi-granularity exposes a different kind of sparsity that edge-centric methods miss.
That is a more interesting claim anyway.
The operational result: memory drops from expensive to plausible
The compute comparison is where the business reader should pay attention, but not overreact.
The paper reports that its method trains only 55,465 additional parameters for GPT-2 small, a model with about 124.5 million parameters. Because the method uses a base model and a prunable model for KL-loss training, it loads two model instances. For batch size 32, it reports 6,270 MB of memory use.
The baselines are much heavier in the paper’s setup: EAP requires 72,794 MB and EP requires 33,354 MB, because they store internal representations in memory.
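To put those figures in proportion, here is a short worked calculation using only the numbers reported above. The variable names are illustrative, and the ratios apply to the paper's setup rather than to any other hardware or batch size.

```python
# Reported memory use in MB at batch size 32 (figures from the paper's comparison).
memory_mb = {"node pruning": 6270, "EP": 33354, "EAP": 72794}

# How many times more memory each baseline needs than the node-pruning method.
ratios = {
    name: mb / memory_mb["node pruning"]
    for name, mb in memory_mb.items()
    if name != "node pruning"
}

# Trainable mask parameters relative to the size of GPT-2 small.
mask_params = 55_465
base_params = 124.5e6
mask_fraction = mask_params / base_params

for name, ratio in ratios.items():
    print(f"{name}: about {ratio:.1f}x the memory")
print(f"mask parameters: about {100 * mask_fraction:.3f}% of the base model")
```

Roughly, EAP needs more than eleven times the memory and EP more than five times, while the learned masks add well under a tenth of a percent to the parameter count.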
This is the path from research result to business relevance. Not “now every company can fully understand its LLM.” That would be adorable, and false.
The practical inference is narrower:
| Paper result | Business interpretation | Boundary |
|---|---|---|
| Multi-granular masks can discover compact circuits in one optimization run | Internal behavior audits may become less bespoke and less infrastructure-heavy | Demonstrated on GPT-2 small, not frontier-scale proprietary models |
| Neuron-level pruning reveals irrelevant units inside retained components | Governance teams should be skeptical of coarse “important head” explanations | Node selection does not reveal full interaction paths |
| Memory use is much lower than EAP/EP in the reported setup | More teams may be able to run diagnostic interpretability experiments | Hardware, implementation, and model scale still matter |
| Task circuits differ structurally | Audit workflows should be behavior-specific, not model-general | Results from IOI, GP, and GT may not transfer to business tasks directly |
In enterprise terms, the value is cheaper diagnosis. A compliance, safety, or model-risk team may want to know whether a suspicious behavior depends on a small cluster of components, whether two behaviors share internal machinery, or whether a mitigation accidentally removes useful task behavior. A method like this could help generate evidence for those questions.
But it is evidence for diagnosis, not a production control system by itself.
What this does not yet give us
The paper’s limitation section is short but important: node-level masks identify which nodes are necessary, but they do not explicitly recover the interaction structure among those nodes.
That is a serious boundary.
A circuit is not merely a shopping list of parts. It is also a pattern of information flow. If the method says that certain heads, MLP blocks, and neurons remain active, we still do not automatically know which active nodes send information to which others, in what direction, or through which dependencies. Edge-pruning methods are more directly concerned with those interaction paths, even if they are more expensive and coarser.
So the best future direction may not be node pruning versus edge pruning. It may be hierarchical integration: use node pruning to identify a compact set of candidate components, then use edge-level analysis to map interactions among them.
The paper itself gestures toward this hybrid possibility. That is probably the right instinct. First find the smaller room. Then map the wiring inside it.
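A toy sketch shows why the hybrid order is attractive: once node pruning has shrunk the set of components, the number of candidate connections to examine shrinks with it. The representation below is a deliberate simplification and an assumption of this article, not the paper's method. Each retained node is tagged with its layer and a sublayer order (attention before MLP), and a node may feed any node that comes strictly later.

```python
from itertools import combinations
from typing import List, Tuple

# (layer, sublayer order, name); sublayer order 0 = attention, 1 = MLP.
Node = Tuple[int, int, str]


def candidate_edges(nodes: List[Node]) -> List[Tuple[str, str]]:
    """List the earlier-to-later pairs that an edge-level analysis would still need to test."""
    ordered = sorted(nodes)
    return [
        (src[2], dst[2])
        for src, dst in combinations(ordered, 2)
        if src[:2] < dst[:2]  # nodes in the same sublayer do not feed each other
    ]


# Hypothetical retained nodes after node pruning.
retained = [
    (0, 0, "a0.h1"),
    (0, 1, "m0"),
    (2, 0, "a2.h3"),
    (2, 0, "a2.h5"),
]

edges = candidate_edges(retained)
print(len(edges), edges)
```

Four retained nodes leave only five candidate edges here. The same enumeration over every head and MLP in the model would be far larger, which is the practical case for finding the smaller room first.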
There is also the scale boundary. GPT-2 small remains a useful interpretability testbed, but it is not the same as auditing a deployed frontier model, a multimodal model, or an enterprise fine-tuned system with retrieval, tools, and policy layers wrapped around it. The method’s memory profile is promising, but scaling interpretability from controlled tasks to operational systems remains a separate challenge.
Finally, the paper flags a risk that should not be treated as decorative ethics. If a method can identify and remove task-relevant nodes, it might also be used to disable safety, refusal, or moderation-related pathways. Surgical interpretability can support governance. It can also support circumvention. The knife does not become moral because the paper has an appendix.
How Cognaptus would translate this into a model-risk workflow
The paper directly shows that multi-granular node pruning can find sparse, task-preserving circuits in GPT-2 small across three controlled tasks. Cognaptus would not infer from that result that enterprise LLM auditing is solved.
The more reasonable workflow implication is this:
- Define the behavior narrowly. Do not ask whether “the model is biased” or “the model reasons.” Ask whether a specific input-output behavior survives targeted corruption.
- Construct clean and corrupted examples. The corruption must preserve the task format while changing the relevant answer. This is where sloppy audit design will quietly ruin everything.
- Run multi-granular circuit discovery. Learn masks across blocks, heads, MLPs, and neurons to identify the smallest behavior-preserving internal subset.
- Compare task performance and distributional faithfulness. A circuit that preserves the headline answer but distorts the output distribution may not be a faithful behavioral explanation.
- Use node results as diagnostic evidence, not final truth. If the behavior matters enough, follow up with interaction mapping, ablations, and human review.
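The comparison step in that workflow could be encoded as a simple acceptance check. The sketch below is illustrative only: the function name, the 95% retention floor, and both KL ceilings are hypothetical choices that an audit team would have to set and justify for itself. The input values are the task metrics and KL divergences reported in the paper's results table.

```python
from typing import Dict, Tuple


def circuit_passes(
    full_metric: float,
    pruned_metric: float,
    kl: float,
    min_retention: float = 0.95,  # hypothetical floor on retained task metric
    max_kl: float = 1.0,          # hypothetical ceiling on distribution drift
) -> bool:
    """Accept a circuit only if it keeps the task metric and stays close to the full model."""
    if full_metric <= 0:
        raise ValueError("full_metric must be positive for a retention ratio to make sense")
    retention = pruned_metric / full_metric
    return retention >= min_retention and kl <= max_kl


# (full-model metric, pruned-circuit metric, KL divergence) as reported.
reported: Dict[str, Tuple[float, float, float]] = {
    "IOI": (3.1791, 3.2030, 0.6080),
    "GP": (2.6198, 2.6150, 0.4909),
    "GT": (0.3711, 0.3912, 0.0059),
}

lenient = {task: circuit_passes(*vals) for task, vals in reported.items()}
strict = {task: circuit_passes(*vals, max_kl=0.5) for task, vals in reported.items()}

print(lenient)
print(strict)
```

All three circuits pass the lenient check, but tightening the KL ceiling to 0.5 rejects the IOI circuit while the other two still pass. The thresholds are a governance decision, not a property of the method.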
That workflow is less glamorous than “transparent AI.” It is also more useful.
Transparency is not a property you sprinkle over a model after procurement. It is an investigation process. This paper improves one part of that process: finding a smaller and more precise set of internal components worth investigating.
The real contribution is changing the resolution of the question
The phrase “circuit discovery” can make the field sound as if the circuit is sitting inside the model, waiting politely to be extracted. The reality is messier. Every discovery method defines what it is capable of seeing. If the method sees only edges among coarse components, the discovered circuit will be shaped by that resolution.
This paper’s contribution is to change the resolution.
By learning masks across multiple granularities at once, it shows that task circuits can be sparse at several levels simultaneously. Whole blocks may disappear. Attention heads may vanish. Neurons inside retained components may still be unnecessary. And the surviving structure changes across IOI, Gendered Pronouns, and Greater-Than.
That is why the paper is best read mechanistically rather than as a leaderboard entry. The important move is not that one table has smaller numbers. The important move is that the method stops pretending that architectural units are functional atoms.
For businesses, the near-term payoff is not faster inference or a magic explanation button. It is better diagnostic economics: smaller candidate circuits, lower memory requirements, and more precise evidence about which internal components support a behavior. For governance teams, that can mean less guesswork. For model developers, it can mean sharper debugging. For everyone selling “AI transparency” as a dashboard, it may mean a slightly less comfortable Monday.
Good.
Interpretability should make vague claims uncomfortable.
[1] Muhammad Umair Haider, Hammad Rizwan, Hassan Sajjad, and A.B. Siddique, “Multi-Granular Node Pruning for Circuit Discovery,” arXiv:2512.10903, 2025. https://arxiv.org/abs/2512.10903