When Circuits Go Atomic: Pruning Transformers One Neuron at a Time

The “important head” was never the whole story

Audit.

That is where many discussions about mechanistic interpretability become less romantic. It is pleasant to say that an AI model has “reasoning circuits.” It is less pleasant to ask which exact parts of the model must be preserved before a behavior survives, which parts are merely along for the ride, and which parts were called important only because our tools were too blunt to see inside them.

The paper Multi-Granular Node Pruning for Circuit Discovery takes that bluntness problem seriously.¹ Its starting point is simple: most automated circuit discovery methods still work at relatively coarse units, especially attention heads and MLP blocks. That is understandable. Transformers are large; search spaces explode; GPUs are not charitable institutions. But the convenience has a cost. When a method keeps an attention head, it often keeps the whole head. When it keeps an MLP block, it treats the block as though its internal neurons rise and fall together.

The paper’s main correction is not “pruning is useful.” We already knew that. The sharper claim is that the unit of interpretability matters. If the method can only see heads and blocks, then every discovered circuit inherits the eyesight of the method. A circuit may look large not because the model needs all those components, but because the method cannot ask a smaller question.

This paper asks the smaller question: can we discover task-specific circuits by pruning nodes across several granularities at once, from transformer blocks down to individual neurons, while preserving task behavior?

The answer, at least for GPT-2 small on three controlled circuit-discovery tasks, is yes. And the business lesson is more specific than “models are sparse.” The useful lesson is that internal model diagnosis may become cheaper, more granular, and more operationally repeatable when interpretability tools stop treating architectural components as indivisible atoms.

Not quite microscope-level truth. But at least we have stopped using a broom as a scalpel.

The mechanism: clean behavior fights corrupted behavior, and masks decide what survives

The method is easiest to understand as a controlled contest between two versions of the same task.

One input is clean. It should produce the original task-relevant behavior.

The other input is corrupted. It is perturbed in a task-specific way while preserving the broad task format. For the Greater-Than task, the paper gives the kind of example where a prompt about a year range is altered so the correct continuation changes. The point is not random noise. The point is targeted contrast: same general task, different answer.

The framework then runs a two-stream forward pass. At each candidate node, the model can mix clean and corrupted activations according to a learned mask:

$$ h = m \cdot h_{\text{clean}} + (1-m) \cdot h_{\text{corrupted}} $$

When $m$ is close to 1, the clean activation is preserved. When $m$ is close to 0, the corrupted activation replaces it. During training, the method learns which nodes must remain clean for the model to preserve task performance.
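The mixing rule above can be sketched in a few lines of plain Python. This is a toy illustration, assuming activations are flat lists of floats and one mask value gates one node; the function name `mix_activations` and the numbers are hypothetical, not taken from the paper.

```python
def mix_activations(h_clean, h_corrupted, mask):
    """Blend activations element-wise: m * clean + (1 - m) * corrupted."""
    if not (len(h_clean) == len(h_corrupted) == len(mask)):
        raise ValueError("all inputs must have the same length")
    return [m * c + (1.0 - m) * x for c, x, m in zip(h_clean, h_corrupted, mask)]


# Toy example: three nodes with made-up activations.
h_clean = [1.0, 2.0, 3.0]
h_corrupted = [-1.0, 0.0, 5.0]
mask = [1.0, 0.0, 0.5]  # keep node 0, replace node 1, blend node 2

mixed = mix_activations(h_clean, h_corrupted, mask)
print(mixed)  # [1.0, 0.0, 4.0]
```

A mask of 1 passes the clean value through untouched, a mask of 0 swaps in the corrupted value, and anything in between is an interpolation that only exists while the masks are still being trained.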

This is the key interpretability move. The method is not merely zeroing out components and watching accuracy fall. It is asking: if this internal component is forced to behave like it came from the corrupted run, does the task behavior break?

If yes, keep it.

If no, prune it.

The masks are parameterized with a Hard-Concrete distribution, which lets the model optimize approximately binary gates with gradient descent. The training objective combines task preservation with sparsity penalties, so the model is pushed to keep only what it needs. After training, the masks are binarized, and the method enforces hierarchical consistency: if a parent unit such as an MLP block is pruned, its child neuron masks are also deactivated.
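For readers who have not met Hard-Concrete gates, here is a minimal sketch of the standard formulation from the L0-regularization literature. The stretch limits, the temperature, the 0.5 binarization threshold, and all function names are assumptions for illustration; the paper's exact settings may differ.

```python
import math
import random

# Commonly used Hard-Concrete constants (assumed, not taken from the paper).
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_gate(log_alpha, rng):
    """Draw one stochastic gate in [0, 1], as used during training."""
    u = rng.uniform(1e-6, 1.0 - 1e-6)  # stay away from log(0)
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    stretched = s * (ZETA - GAMMA) + GAMMA
    return min(1.0, max(0.0, stretched))


def deterministic_gate(log_alpha):
    """Noise-free gate value used once training is over."""
    stretched = sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA
    return min(1.0, max(0.0, stretched))


def prob_active(log_alpha):
    """Probability that the gate is non-zero; summed over gates this is the sparsity penalty."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))


def binarize(log_alpha, threshold=0.5):
    """Turn a trained gate into a hard keep (1) or prune (0) decision."""
    return 1 if deterministic_gate(log_alpha) > threshold else 0


rng = random.Random(0)
log_alphas = [4.0, 1.0, -4.0]  # hypothetical learned parameters for three nodes
samples = [sample_gate(a, rng) for a in log_alphas]
kept = [binarize(a) for a in log_alphas]
sparsity_penalty = sum(prob_active(a) for a in log_alphas)
print(kept)  # [1, 1, 0]
```

In a training loop the total loss would add a weighted `sparsity_penalty` to the task-preservation term, so a gate stays open only if closing it costs more in behavior than it saves in sparsity.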

That hierarchy matters. A transformer has natural levels of structure: blocks, attention heads, MLP blocks, attention neurons, hidden neurons, output neurons. Existing edge-pruning methods often operate around connections among larger components. This paper instead applies learnable masks over multiple node types in one optimization process.
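The hierarchical consistency rule can be illustrated with a small sketch. It assumes binarized masks stored in plain dictionaries keyed by hypothetical block names; the real implementation presumably works on tensors.

```python
def enforce_hierarchy(block_mask, child_masks):
    """Deactivate every child mask whose parent block has been pruned."""
    return {
        block: [bit if block_mask[block] else 0 for bit in bits]
        for block, bits in child_masks.items()
    }


# Hypothetical masks after binarization: mlp_1 was pruned at the block level.
block_mask = {"mlp_0": 1, "mlp_1": 0}
neuron_masks = {"mlp_0": [1, 0, 1], "mlp_1": [1, 1, 0]}

consistent = enforce_hierarchy(block_mask, neuron_masks)
print(consistent)  # {'mlp_0': [1, 0, 1], 'mlp_1': [0, 0, 0]}
```

Without this step a neuron could be reported as active inside a block that no longer contributes anything, which would inflate the neuron-level counts.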

That is why this article reads the paper mechanism-first. The result is not just a smaller circuit table. The mechanism changes the question from:

Which coarse components should we keep?

to:

At which level of the model hierarchy does this behavior actually live?

That is a better question. Annoyingly, better questions usually make older dashboards look less impressive.

Why coarse components can exaggerate circuit size

The misconception here is a natural one: if a method identifies an attention head or MLP block as important, we are tempted to treat the whole component as functionally important.

The paper pushes against that assumption.

A transformer component is an architectural boundary, not necessarily a functional boundary. An attention head can mix several behaviors. An MLP block can contain many neurons irrelevant to the specific task being studied. Treating those components as atomic is convenient, but it can overstate how much of the model is actually needed for a behavior.

The paper’s formulation directly addresses this mismatch. It does not force the circuit to choose one fixed granularity. Instead, the method lets sparsity appear where the task permits it. Sometimes a whole block can be removed. Sometimes a block remains active but most of its internal neurons disappear. Sometimes attention is heavily pruned while MLPs stay alive.

That flexibility is the main conceptual contribution.

A simple way to read the paper is this:

| Old habit | Paper’s correction | Why it matters |
| --- | --- | --- |
| Treat heads and MLP blocks as indivisible circuit units | Learn masks over blocks, heads, MLPs, and neurons together | Important components may contain many irrelevant internal units |
| Search over edge structures among coarse components | Prune nodes at multiple granularities in one training loop | Avoids some scalability pressure from edge-level search |
| Report circuit size mainly at the component level | Report sparsity across several model levels | Makes circuit anatomy more task-specific and interpretable |
| Assume compression-like pruning and interpretability pruning are similar | Preserve behavior while identifying necessary internal units | The goal is explanation, not deployment speed |

The last row is worth pausing on. The paper borrows from pruning ideas, but its goal is not ordinary model compression. A compressed model is useful if it runs faster or cheaper. A circuit-discovery method is useful if it reveals which internal parts support a behavior.

Those are related, but not identical. Confusing them is how interpretability turns into procurement theater.

What the experiments actually test

The paper evaluates on GPT-2 small using three tasks that are common in circuit-discovery work.

The first is Indirect Object Identification, or IOI. The model must identify the correct indirect object in a sentence involving two names, suppressing the repeated name and predicting the other one. This task is a classic test of syntactic and entity-tracking circuitry.

The second is Gendered Pronouns. The model must assign higher probability to the gender-consistent pronoun for a name. This is not a moral endorsement of the association; it is a probe of whether the model encodes name-pronoun associations learned from data.

The third is Greater-Than. The model sees prompts involving year ranges and must assign more probability to valid completions greater than the starting year. This probes numerical and temporal comparison behavior.

The paper uses task-specific performance metrics for each: logit difference for IOI and Gendered Pronouns, probability difference for Greater-Than. It also uses KL divergence between the full model’s output distribution and the pruned circuit’s output distribution. That second metric matters because a circuit can preserve the answer while distorting the rest of the output distribution. KL divergence asks whether the pruned circuit remains faithful to the broader predictive behavior of the original model.
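The difference between the task metric and the faithfulness metric is easy to show on toy numbers. The sketch below assumes a four-token vocabulary, computes KL in the direction full-model to pruned-circuit, and defines probability difference as mass on valid minus mass on invalid completions; those choices and all names are illustrative assumptions rather than the paper's exact definitions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def logit_difference(logits, correct_idx, incorrect_idx):
    """Task metric of the kind used for IOI and Gendered Pronouns."""
    return logits[correct_idx] - logits[incorrect_idx]


def probability_difference(logits, valid_idx, invalid_idx):
    """Greater-Than style metric: mass on valid minus mass on invalid completions."""
    probs = [math.exp(lp) for lp in log_softmax(logits)]
    return sum(probs[i] for i in valid_idx) - sum(probs[i] for i in invalid_idx)


def kl_divergence(full_logits, pruned_logits):
    """KL(full || pruned) between the two next-token distributions."""
    log_p = log_softmax(full_logits)
    log_q = log_softmax(pruned_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Index 0 is the correct answer, index 1 the distractor (made-up logits).
full = [2.0, 0.5, -1.0, 0.0]
pruned = [2.0, 0.5, 1.5, 0.0]  # same answer margin, different mass elsewhere

same_margin = logit_difference(full, 0, 1) == logit_difference(pruned, 0, 1)
drift = kl_divergence(full, pruned)
print(same_margin, drift > 0.0)  # True True
```

Here the pruned logits keep the answer margin exactly, yet the KL term is positive because probability mass moved onto an unrelated token. That is the failure mode the second metric is meant to catch.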

The likely purpose of the experimental components is as follows:

| Evidence item | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Task metrics on IOI, GP, and GT | Main evidence | The pruned circuits preserve task-specific behavior | That the method works for all model families or all tasks |
| KL divergence against full-model outputs | Main evidence / faithfulness check | The circuits approximate the original model’s output distribution | That the selected nodes reveal all causal interactions |
| Sparsity by blocks, heads, and neurons | Main evidence | The method discovers smaller multi-level circuits | That sparsity itself equals explanation |
| Comparison with EAP and EP | Comparison with prior work | Node pruning can retain fewer attention heads while staying competitive | Direct superiority on every metric, since node and edge methods are not perfectly comparable |
| Appendix circuit summaries by layer | Implementation detail plus interpretive support | Sparsity has structured layer patterns, not random deletion | A universal map of where every model stores these behaviors |
| Memory comparison | Operational evidence | The method is much lighter than baselines in this setup | That production-scale interpretability is solved |

This distinction matters because the paper includes several kinds of evidence. The main results establish that multi-granular node pruning can preserve task behavior while removing many components. The appendix summaries help interpret where circuits concentrate by task. The baseline comparison helps position the method against EAP and EP, but the paper itself notes that node pruning and edge pruning are not perfectly equivalent because they prune different objects.

That honesty is useful. The comparison is informative, not a courtroom verdict.

The numbers: smaller circuits without obvious task collapse

The headline result is that substantial pruning happens across multiple granularities while task performance is preserved.

For the three tasks, the paper reports the following active component counts and sparsity:

| Granularity | IOI active / sparsity | GP active / sparsity | GT active / sparsity |
| --- | --- | --- | --- |
| Attention blocks | 4 / 66.7% | 5 / 58.3% | 5 / 58.3% |
| MLP blocks | 12 / 0.0% | 3 / 75.0% | 5 / 58.3% |
| Attention heads | 21 / 85.4% | 37 / 74.3% | 28 / 80.6% |
| Attention neurons | 907 / 90.2% | 1,702 / 81.5% | 1,701 / 81.5% |
| MLP hidden neurons | 12,300 / 66.6% | 1,333 / 96.4% | 4,570 / 87.6% |
| MLP output neurons | 1,329 / 85.6% | 1,411 / 84.7% | 3,520 / 61.8% |
| Edge compression | 96.74% | 93.74% | 95.95% |
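The block and head rows of this table can be checked with simple counting. The sketch assumes GPT-2 small's 12 layers with 12 heads each, which gives the 144-head total cited elsewhere in this article, and defines sparsity as the pruned fraction; the helper name `sparsity_pct` is illustrative.

```python
N_LAYERS = 12          # GPT-2 small
HEADS_PER_LAYER = 12   # 12 x 12 = 144 attention heads in total


def sparsity_pct(active, total):
    """Percentage of units pruned, rounded to one decimal place."""
    return round(100.0 * (1.0 - active / total), 1)


total_heads = N_LAYERS * HEADS_PER_LAYER
active_heads = {"IOI": 21, "GP": 37, "GT": 28}
active_mlp_blocks = {"IOI": 12, "GP": 3, "GT": 5}

head_sparsity = {t: sparsity_pct(n, total_heads) for t, n in active_heads.items()}
mlp_block_sparsity = {t: sparsity_pct(n, N_LAYERS) for t, n in active_mlp_blocks.items()}

print(head_sparsity)       # {'IOI': 85.4, 'GP': 74.3, 'GT': 80.6}
print(mlp_block_sparsity)  # {'IOI': 0.0, 'GP': 75.0, 'GT': 58.3}
```

The same function applies to the neuron rows once the per-layer widths are plugged in.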

The task-performance table is also important:

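| Task | Full model logit / probability difference | KL divergence | Pruned circuit logit / probability difference |
| --- | --- | --- | --- |
| IOI | 3.1791 | 0.6080 | 3.2030 |
| GP | 2.6198 | 0.4909 | 2.6150 |
| GT | 0.3711 | 0.0059 | 0.3912 |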

The numbers should be read carefully.

For IOI, the pruned circuit’s logit difference is slightly higher than the full model’s (3.2030 versus 3.1791), while KL divergence is 0.6080. For Gendered Pronouns, the logit difference is nearly unchanged (2.6150 versus 2.6198). For Greater-Than, the probability difference improves from 0.3711 to 0.3912, with very low KL divergence at 0.0059.

This does not mean pruning magically improves the model in general. A safer interpretation is that removing redundant or interfering components can sharpen the task-specific signal in some settings. The paper itself suggests this possibility for Greater-Than. But the business reader should resist the urge to convert one controlled result into a general law. That way lies LinkedIn thought leadership, and nobody needs more of that.

The more reliable conclusion is narrower: on these tasks, the method finds much smaller circuits that preserve the targeted behavior.

The task anatomy is the interesting part

The most useful evidence is not just the aggregate sparsity. It is how differently the three tasks carve GPT-2.

IOI keeps all 12 MLP blocks active. That is already a warning against lazy generalization. The method is aggressive, but it does not blindly delete everything. IOI appears MLP-heavy in this experiment, while attention is mostly pruned except in selected late layers. The paper notes that only 21 of 144 attention heads remain active, with most attention activity concentrated near the final layers. This aligns with prior observations that name-mover heads tend to appear late.

Gendered Pronouns looks different. Only three MLP blocks remain active, and several middle layers are fully pruned. Attention activity appears in bursts, including later layers. The task has a more restricted output space, choosing between pronouns such as “he” and “she,” which may partly explain the greater compressibility.

Greater-Than is different again. It retains 28 attention heads and five MLP blocks, with attention concentrated around layers 7 to 10 and several early or late layers pruned. In the appendix summary, Greater-Than shows activity peaking around layers 9 and 10, while some layers are entirely disabled.

This is the interpretability lesson: there is no generic “small circuit shape.” Different behaviors occupy different model structures. If an interpretability tool forces every behavior into the same granularity, it may be simplifying the method rather than the model.

For business use, that matters because model audits are usually behavior-specific. A bank does not ask whether a model is interpretable in the abstract. It asks whether the model’s credit explanation, refusal behavior, fraud flag, or compliance-sensitive classification can be examined with enough specificity to support governance decisions.

The paper does not solve that full enterprise problem. But it points in the right direction: behavior-specific audits need behavior-specific internal evidence.

Baseline comparison: fewer heads, but not the same object

The paper compares its method with Edge Attribution Patching and Edge Pruning. This part needs careful reading because edge-pruning and node-pruning methods do not prune the same units. Edge methods focus on connections among components; this paper’s method prunes nodes across several granularities, including neurons and blocks. So the comparison is useful, but not perfectly apples-to-apples.

Still, the results are informative.

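| Dataset | Method | Sparsity | KL divergence | Logit / probability diff | Attention heads retained |
| --- | --- | --- | --- | --- | --- |
| IOI | EAP | 96.74% | 2.447 | -0.181 | 116 |
| IOI | EP | 96.16% | 0.360 | 3.210 | 41 |
| IOI | Ours | 96.74% | 0.600 | 3.203 | 21 |
| GP | EAP | 93.74% | 0.148 | 2.564 | 106 |
| GP | Ours | 93.74% | 0.490 | 2.615 | 37 |
| GT | EAP | 95.95% | 0.086 | 0.374 | 113 |
| GT | EP | 96.69% | 0.039 | 0.389 | 74 |
| GT | Ours | 95.95% | 0.006 | 0.391 | 28 |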

The cleanest win is Greater-Than: the node-pruning method reports the lowest KL divergence, the highest task metric, and far fewer retained attention heads.

IOI is more nuanced. Edge Pruning has lower KL divergence and slightly higher logit difference, but the proposed method keeps only 21 attention heads compared with EP’s 41 and EAP’s 116. That is a compactness-faithfulness trade-off, not a total domination story.

Gendered Pronouns is also nuanced. EAP has lower KL divergence, while the proposed method has a slightly higher task metric and far fewer retained heads.

The honest summary is this: multi-granular node pruning is competitive on task behavior and much more compact in retained attention heads, but edge-pruning methods can still win on some faithfulness metrics. The paper is not saying edges are obsolete. It is saying node-level multi-granularity exposes a different kind of sparsity that edge-centric methods miss.

That is a more interesting claim anyway.

The operational result: memory drops from expensive to plausible

The compute comparison is where the business reader should pay attention, but not overreact.

The paper reports that its method trains only 55,465 additional parameters for GPT-2 small, a model with about 124.5 million parameters. Because the method uses a base model and a prunable model for KL-loss training, it loads two model instances. For batch size 32, it reports 6,270 MB of memory use.

The baselines are much heavier in the paper’s setup: EAP requires 72,794 MB and EP requires 33,354 MB, because they store internal representations in memory.
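To put the gap in proportion, the reported figures can be turned into ratios. This is plain arithmetic on the numbers quoted above, for batch size 32 in the paper's setup; the dictionary keys are just labels.

```python
# Reported memory use in MB, as quoted in this article.
memory_mb = {"Ours": 6270, "EP": 33354, "EAP": 72794}

# How many times more memory each method needs relative to node pruning.
ratios = {name: mb / memory_mb["Ours"] for name, mb in memory_mb.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
# Ours: 1.0x
# EP: 5.3x
# EAP: 11.6x
```

Roughly a fivefold and an elevenfold difference, which is the gap between a single commodity GPU and a multi-GPU or high-memory setup.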

This is the path from research result to business relevance. Not “now every company can fully understand its LLM.” That would be adorable, and false.

The practical inference is narrower:

| Paper result | Business interpretation | Boundary |
| --- | --- | --- |
| Multi-granular masks can discover compact circuits in one optimization run | Internal behavior audits may become less bespoke and less infrastructure-heavy | Demonstrated on GPT-2 small, not frontier-scale proprietary models |
| Neuron-level pruning reveals irrelevant units inside retained components | Governance teams should be skeptical of coarse “important head” explanations | Node selection does not reveal full interaction paths |
| Memory use is much lower than EAP/EP in the reported setup | More teams may be able to run diagnostic interpretability experiments | Hardware, implementation, and model scale still matter |
| Task circuits differ structurally | Audit workflows should be behavior-specific, not model-general | Results from IOI, GP, and GT may not transfer to business tasks directly |

In enterprise terms, the value is cheaper diagnosis. A compliance, safety, or model-risk team may want to know whether a suspicious behavior depends on a small cluster of components, whether two behaviors share internal machinery, or whether a mitigation accidentally removes useful task behavior. A method like this could help generate evidence for those questions.

But it is evidence for diagnosis, not a production control system by itself.

What this does not yet give us

The paper’s limitation section is short but important: node-level masks identify which nodes are necessary, but they do not explicitly recover the interaction structure among those nodes.

That is a serious boundary.

A circuit is not merely a shopping list of parts. It is also a pattern of information flow. If the method says that certain heads, MLP blocks, and neurons remain active, we still do not automatically know which active nodes send information to which others, in what direction, or through which dependencies. Edge-pruning methods are more directly concerned with those interaction paths, even if they are more expensive and coarser.

So the best future direction may not be node pruning versus edge pruning. It may be hierarchical integration: use node pruning to identify a compact set of candidate components, then use edge-level analysis to map interactions among them.

The paper itself gestures toward this hybrid possibility. That is probably the right instinct. First find the smaller room. Then map the wiring inside it.

There is also the scale boundary. GPT-2 small remains a useful interpretability testbed, but it is not the same as auditing a deployed frontier model, a multimodal model, or an enterprise fine-tuned system with retrieval, tools, and policy layers wrapped around it. The method’s memory profile is promising, but scaling interpretability from controlled tasks to operational systems remains a separate challenge.

Finally, the paper flags a risk that should not be treated as decorative ethics. If a method can identify and remove task-relevant nodes, it might also be used to disable safety, refusal, or moderation-related pathways. Surgical interpretability can support governance. It can also support circumvention. The knife does not become moral because the paper has an appendix.

How Cognaptus would translate this into a model-risk workflow

The paper directly shows that multi-granular node pruning can find sparse, task-preserving circuits in GPT-2 small across three controlled tasks. Cognaptus would not infer from that result that enterprise LLM auditing is solved.

The more reasonable workflow implication is this:

1. Define the behavior narrowly. Do not ask whether “the model is biased” or “the model reasons.” Ask whether a specific input-output behavior survives targeted corruption.
2. Construct clean and corrupted examples. The corruption must preserve the task format while changing the relevant answer. This is where sloppy audit design will quietly ruin everything.
3. Run multi-granular circuit discovery. Learn masks across blocks, heads, MLPs, and neurons to identify the smallest behavior-preserving internal subset.
4. Compare task performance and distributional faithfulness. A circuit that preserves the headline answer but distorts the output distribution may not be a faithful behavioral explanation.
5. Use node results as diagnostic evidence, not final truth. If the behavior matters enough, follow up with interaction mapping, ablations, and human review.
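The second step, constructing clean and corrupted examples, is the one most easily done badly, so a sketch may help. It builds Greater-Than-style prompt pairs in which the corrupted prompt keeps the same template but moves the start year, so the set of valid completions changes. The template wording, the choice of "01" as the corrupted start year, and the function name are hypothetical illustrations, not the paper's dataset.

```python
import random

# Hypothetical prompt template; the model would be asked to continue the final year.
TEMPLATE = "The war lasted from the year {start} to the year {century}"


def make_pair(rng, century=17):
    """Return one clean/corrupted pair with the valid two-digit completions for each."""
    yy = rng.randint(2, 98)
    clean_start = century * 100 + yy
    corrupted_start = century * 100 + 1  # same format, different answer set
    return {
        "clean": TEMPLATE.format(start=clean_start, century=century),
        "corrupted": TEMPLATE.format(start=corrupted_start, century=century),
        "valid_clean": list(range(yy + 1, 100)),
        "valid_corrupted": list(range(2, 100)),
    }


rng = random.Random(0)
pairs = [make_pair(rng) for _ in range(5)]

# Sanity checks an audit should run before any circuit discovery.
for pair in pairs:
    same_format = len(pair["clean"].split()) == len(pair["corrupted"].split())
    answer_changed = pair["valid_clean"] != pair["valid_corrupted"]
    print(same_format, answer_changed)  # True True for every pair
```

The two checks at the end are the point: if the corruption changes the prompt shape, or fails to change the answer, the masks will learn something other than the behavior under audit.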

That workflow is less glamorous than “transparent AI.” It is also more useful.

Transparency is not a property you sprinkle over a model after procurement. It is an investigation process. This paper improves one part of that process: finding a smaller and more precise set of internal components worth investigating.

The real contribution is changing the resolution of the question

The phrase “circuit discovery” can make the field sound as if the circuit is sitting inside the model, waiting politely to be extracted. The reality is messier. Every discovery method defines what it is capable of seeing. If the method sees only edges among coarse components, the discovered circuit will be shaped by that resolution.

This paper’s contribution is to change the resolution.

By learning masks across multiple granularities at once, it shows that task circuits can be sparse at several levels simultaneously. Whole blocks may disappear. Attention heads may vanish. Neurons inside retained components may still be unnecessary. And the surviving structure changes across IOI, Gendered Pronouns, and Greater-Than.

That is why the paper is best read mechanistically rather than as a leaderboard entry. The important move is not that one table has smaller numbers. The important move is that the method stops pretending that architectural units are functional atoms.

For businesses, the near-term payoff is not faster inference or a magic explanation button. It is better diagnostic economics: smaller candidate circuits, lower memory requirements, and more precise evidence about which internal components support a behavior. For governance teams, that can mean less guesswork. For model developers, it can mean sharper debugging. For everyone selling “AI transparency” as a dashboard, it may mean a slightly less comfortable Monday.

Good.

Interpretability should make vague claims uncomfortable.

Cognaptus: Automate the Present, Incubate the Future.

¹ Muhammad Umair Haider, Hammad Rizwan, Hassan Sajjad, and A.B. Siddique, “Multi-Granular Node Pruning for Circuit Discovery,” arXiv:2512.10903, 2025. https://arxiv.org/abs/2512.10903
