
Attention, But Make It Optional

Opening — When more layers stop meaning more intelligence
The scaling era taught us a simple mantra: stack more layers, get better models. Then deployment happened. Suddenly, latency, energy bills, and GPU scarcity started asking uncomfortable questions—like whether every layer in a 40-layer Transformer is actually doing any work. This paper answers that question with unsettling clarity: many attention layers aren’t lazy—they’re deliberately silent. And once you notice that, pruning them becomes less of an optimization trick and more of a design correction. ...

December 27, 2025 · 4 min · Zelina

When Circuits Go Atomic: Pruning Transformers One Neuron at a Time

Opening — Why this matters now
Mechanistic interpretability has a scaling problem. As language models grow larger and more embedded in high‑stakes workflows, the old habit of waving at “important attention heads” is starting to look quaint. If we want to understand how models reason — not just where something lights up — we need circuit discovery methods that scale without drowning GPUs in activations or collapsing everything into blunt architectural units. ...

December 12, 2025 · 4 min · Zelina