Opening — Why this matters now
Mechanistic interpretability has a scaling problem. As language models grow larger and more embedded in high‑stakes workflows, the old habit of waving at “important attention heads” is starting to look quaint. If we want to understand how models reason — not just where something lights up — we need circuit discovery methods that scale without drowning GPUs in activations or collapsing everything into blunt architectural units.
This paper lands squarely on that fault line. Its claim is simple but uncomfortable for existing tooling: circuits don’t live neatly at the level of heads or MLP blocks. They fragment. They thin out. And in many cases, they reduce to a surprisingly small set of neurons.
Background — From edge surgery to node anatomy
Most automated circuit discovery methods today are variations of controlled demolition. You start with a full model, then iteratively remove edges between components while watching task performance flinch. ACDC, EAP, and related approaches made this practical — but at a cost.
Two costs, specifically:
- Quadratic scaling: edges grow far faster than components, making fine‑grained analysis computationally brutal (see the quick arithmetic after this list).
- Coarse assumptions: attention heads and MLP blocks are treated as indivisible atoms, even though prior interpretability work has repeatedly shown they aren’t.
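To see why the edge count bites, a quick back‑of‑envelope comparison helps (the numbers below are our illustration, not figures from the paper). A GPT‑2‑small‑scale model has roughly $12 \times (12 + 1) \approx 156$ head and MLP components, and an edge‑level method must weigh on the order of every pair among them:

$$ \underbrace{\binom{156}{2} \approx 1.2 \times 10^{4}}_{\text{candidate edges}} \quad \text{vs.} \quad \underbrace{156}_{\text{candidate nodes}} $$

Drop to neuron granularity (tens of thousands of units) and the edge count heads toward $10^{8}$–$10^{9}$, while the number of node masks still grows linearly.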
In parallel, classic pruning research has focused on compression, not understanding: it optimizes for deployment efficiency, not explanatory clarity. Circuit discovery, by contrast, wants minimal functional subnetworks, even if they're useless for inference speed.
This paper bridges that gap by reframing circuit discovery as a node‑level problem rather than an edge‑level one.
Analysis — What the paper actually does
The core idea is deceptively simple: apply learnable masks to nodes at multiple granularities (not just heads, but whole transformer blocks, MLP blocks, and individual neurons) and learn them all jointly in a single optimization run.
Multi‑granular masking
Instead of committing to a single level of abstraction, the framework exposes five simultaneously:
- Transformer blocks
- Attention heads
- MLP blocks
- Attention neurons
- MLP neurons
Each node gets a mask parameterized via a Hard‑Concrete distribution, encouraging near‑binary behavior while remaining differentiable.
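A minimal sketch of such a mask in PyTorch may help; it follows the standard Hard‑Concrete parameterization of Louizos et al. (2018), with the class name and hyperparameter defaults chosen by us rather than taken from the paper:

```python
import math
import torch
import torch.nn as nn

class HardConcreteMask(nn.Module):
    """Learnable near-binary masks via the Hard-Concrete distribution
    (Louizos et al., 2018). One mask value per node; hyperparameter
    defaults are common choices, not necessarily the paper's."""

    def __init__(self, n_nodes: int, beta: float = 2 / 3,
                 gamma: float = -0.1, zeta: float = 1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(n_nodes))  # per-node location
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self) -> torch.Tensor:
        if self.training:
            # reparameterized sample: noisy sigmoid with a learnable location
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        else:
            s = torch.sigmoid(self.log_alpha)  # deterministic at eval time
        # stretch to (gamma, zeta), then clamp: exact 0s and 1s become possible
        return (s * (self.zeta - self.gamma) + self.gamma).clamp(0.0, 1.0)

    def expected_l0(self) -> torch.Tensor:
        # expected number of non-zero masks, used as the sparsity penalty
        return torch.sigmoid(
            self.log_alpha - self.beta * math.log(-self.gamma / self.zeta)
        ).sum()
```

The stretch‑and‑clamp step is what lets masks hit exactly zero, so the resulting sparsity is real pruning rather than merely small weights.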
Two‑stream training: clean vs. corrupted
The framework uses a clever two‑stream forward pass:
- A clean stream, run on the original task input and gated by the masks
- A corrupted stream, generated by task‑specific perturbations
At each node, activations are mixed as:
$$ h = m \cdot h_{\text{clean}} + (1 - m) \cdot h_{\text{corrupted}} $$
If corrupting a node wrecks task performance, the optimizer learns to keep its mask near 1. If nothing changes, that node quietly disappears.
Crucially, this avoids iterative pruning entirely: no repeated ablate‑and‑rerun sweeps over candidate components, no massive activation caches, just one fine‑tuning loop with sparsity regularization.
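As a concrete sketch of how that single loop can be wired up with forward hooks (assuming a HuggingFace‑style causal LM, one mask per hooked module, and the HardConcreteMask above; the function and the module‑to‑mask mapping are illustrative, not the paper's API):

```python
import torch

def two_stream_forward(model, clean_ids, corrupt_ids, masks):
    """Interpolate clean and corrupted activations at every masked node.

    `masks` maps module names (from model.named_modules()) to HardConcreteMask
    instances whose size matches that module's output width. Illustrative only.
    """
    cached, handles = {}, []

    # Pass 1: run the corrupted prompt and cache activations at masked nodes.
    def make_cache(name):
        def hook(mod, inp, out):
            cached[name] = out.detach()
        return hook

    for name, mod in model.named_modules():
        if name in masks:
            handles.append(mod.register_forward_hook(make_cache(name)))
    with torch.no_grad():
        model(corrupt_ids)
    for h in handles:
        h.remove()

    # Pass 2: run the clean prompt; each hook returns m*h_clean + (1-m)*h_corrupted.
    def make_mix(name):
        def hook(mod, inp, out):
            m = masks[name]().view(1, 1, -1)  # broadcast over (batch, seq)
            return m * out + (1 - m) * cached[name]
        return hook

    handles = [mod.register_forward_hook(make_mix(n))
               for n, mod in model.named_modules() if n in masks]
    logits = model(clean_ids).logits  # HF-style causal LM assumed
    for h in handles:
        h.remove()
    return logits
```

The training objective is then a task loss on these masked logits plus $\lambda$ times the summed expected $L_0$ of all masks; that penalty is what drives unneeded nodes toward zero.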
Findings — Smaller circuits, sharper structure
The empirical results are where things get interesting.
Circuit size collapses
Across three canonical tasks — Indirect Object Identification (IOI), Gendered Pronouns (GP), and Greater‑Than (GT) — the discovered circuits are dramatically smaller than those found by edge‑pruning baselines.
| Task | Attention Heads Retained | Edge Compression |
|---|---|---|
| IOI | 21 | 96.7% |
| GP | 37 | 93.7% |
| GT | 28 | 96.0% |
But the real pruning happens inside components. Large fractions of neurons within “important” heads and MLPs are simply irrelevant.
Task‑dependent anatomy
Different tasks carve the model differently:
- IOI keeps most MLP blocks but aggressively prunes attention heads until only late‑layer specialists remain.
- GT relies on mid‑layer attention while discarding both early and late heads.
- GP collapses much of the mid‑stack entirely, retaining sparse late‑layer circuitry.
This reinforces a key interpretability lesson: there is no universal circuit template. Structure follows task.
Faithfulness without fragility
Despite aggressive pruning, task performance barely budges — and sometimes improves. KL divergence between full‑model and circuit outputs remains low across all tasks, indicating the circuits reproduce not just answers, but distributions.
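As a sketch of that metric (ours, not the paper's evaluation code), the KL term compares the two next‑token distributions position by position:

```python
import torch.nn.functional as F

def circuit_kl(full_logits, circuit_logits):
    """KL(full model || circuit) over next-token distributions,
    summed over positions and vocabulary, averaged over the batch."""
    log_p = F.log_softmax(full_logits, dim=-1)     # reference: full model
    log_q = F.log_softmax(circuit_logits, dim=-1)  # pruned circuit
    return F.kl_div(log_q, log_p, reduction="batchmean", log_target=True)
```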
Implications — Why this matters beyond interpretability
This work quietly shifts the center of gravity for circuit discovery:
- Scalability: Node pruning reduces memory use by 5–10× compared to edge methods.
- Precision: Circuits become neuron‑level objects, not architectural slogans.
- Flexibility: Different granularities compete naturally instead of being pre‑selected.
For practitioners, this suggests a future where interpretability tools can be applied routinely — not just as bespoke research projects.
For safety and governance, it’s a double‑edged sword. If you can surgically remove task circuitry, you can also remove safeguards. The paper flags this risk explicitly, and it deserves attention.
Conclusion — Circuits, now with a microscope
Edge pruning taught us where computation flows. Node pruning shows us how thin that flow really is.
By treating neurons, heads, and blocks as peers in a single optimization objective, this framework exposes circuits as sparse, task‑specific, and often surprisingly fragile structures. It’s a reminder that interpretability doesn’t always require more probes — sometimes it just needs a sharper knife.
Cognaptus: Automate the Present, Incubate the Future.