Opening — When more layers stop meaning more intelligence
The scaling era taught us a simple mantra: stack more layers, get better models. Then deployment happened. Suddenly, latency, energy bills, and GPU scarcity started asking uncomfortable questions—like whether every layer in a 40-layer Transformer is actually doing any work.
This paper answers that question with unsettling clarity: many attention layers aren’t lazy—they’re deliberately silent. And once you notice that, pruning them becomes less of an optimization trick and more of a design correction.
Background — Depth is the last unchallenged axis
Most efficiency work in large language models attacks two dimensions:
| Axis | Typical method | Hidden cost |
|---|---|---|
| Precision | Quantization (INT8/4/2) | Hardware-specific kernels, calibration |
| Width | Weight pruning / sparsity | Sparse kernels, retraining |
Depth—the number of layers—has largely escaped serious scrutiny. Dropping full Transformer blocks does save time, but tends to destroy accuracy. Prior work (e.g. cosine-similarity pruning) showed that attention sublayers are often more redundant than MLPs, yet relied on calibration data and multiple forward passes.
Which raises the obvious question: why is attention redundant so often—and can we detect it without touching any data at all?
Analysis — The Attention Suppression Hypothesis
The paper proposes a clean, slightly heretical idea: during pretraining, deep attention layers learn to mute themselves.
Structurally, this makes sense. In a Transformer block:
X → [Attention] → + → [MLP] → +
If the attention update shrinks toward zero, the residual path simply feeds the MLP unchanged. The model keeps working. Nothing breaks. But attention quietly steps aside.
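To see how gracefully this degrades, here is a minimal PyTorch sketch of a pre-norm block (illustrative dimensions and module names, not the paper's code): zero the attention output projection and the block still computes a perfectly valid MLP-only update through the residual path.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Simplified pre-norm Transformer block: x = x + Attn(LN(x)); x = x + MLP(LN(x))."""
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out                  # if attn_out ≈ 0, x passes through untouched
        x = x + self.mlp(self.ln2(x))
        return x

block = Block()
x = torch.randn(1, 8, 64)
# Silence the attention sublayer by zeroing its output projection: the update vanishes,
# but the block keeps working because the residual path feeds the MLP unchanged.
nn.init.zeros_(block.attn.out_proj.weight)
nn.init.zeros_(block.attn.out_proj.bias)
with torch.no_grad():
    y = block(x)
    mlp_only = x + block.mlp(block.ln2(x))
print(torch.allclose(y, mlp_only))  # True
```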
Empirical fingerprints of silence
Two signals confirm this behavior:
- Cosine similarity collapse: in late layers, the cosine similarity between a layer’s input and its post-attention output approaches 1; together with the norm evidence below, this means the attention update has become vanishingly small.
- Attention/Input norm ratio: the ratio \( \|\mathrm{AttnOut}\| / \|X\| \) falls from >1 in early layers to <0.1 in deep layers of LLaMA-13B. Over 90% of the signal flows through the residual.
At that point, attention isn’t mixing tokens anymore. It’s just… there.
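Both fingerprints are cheap to measure if you can capture a layer's input and post-attention hidden states. The helper below is a hypothetical sketch of that measurement, not the paper's evaluation code:

```python
import torch
import torch.nn.functional as F

def attention_silence_metrics(x: torch.Tensor, post_attn: torch.Tensor):
    """Diagnostics for one layer, given captured hidden states.

    x         : layer input, shape (tokens, d_model)
    post_attn : x + attention update, same shape
    Returns (mean cosine similarity between x and post_attn,
             mean norm ratio ||AttnOut|| / ||X||).
    """
    attn_out = post_attn - x
    cos = F.cosine_similarity(x, post_attn, dim=-1).mean().item()
    ratio = (attn_out.norm(dim=-1) / x.norm(dim=-1)).mean().item()
    return cos, ratio

# A "silent" deep layer: a tiny attention update riding on the residual stream.
x = torch.randn(16, 4096)
post_attn = x + 0.01 * torch.randn(16, 4096)
print(attention_silence_metrics(x, post_attn))  # cosine ≈ 1.0, ratio << 0.1
```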
Implementation — Gate-Norm: pruning without data, passes, or guilt
Instead of measuring behavior with data, the authors inspect weights alone.
Core idea
If attention works via query–key interaction, then weak query–key coupling implies weak token mixing.
So for each attention layer \( \ell \):
\[ M_\ell = W_{q,\ell} W_{k,\ell}^\top, \quad m_\ell = \| M_\ell \|_F \]
That’s it.
- Small \( m_\ell \) ⇒ nearly uniform attention ⇒ vanishing updates
- Large \( m_\ell \) ⇒ meaningful token interaction
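In code, the score is one matrix product and a Frobenius norm per layer. A minimal sketch, assuming PyTorch-style (out_features, hidden_size) projection weights (shapes below are illustrative):

```python
import torch

def gate_norm_score(w_q: torch.Tensor, w_k: torch.Tensor) -> float:
    """Data-free attention score for one layer: ||W_q W_k^T||_F.

    w_q, w_k: query/key projection weights, each shaped (out_features, hidden_size)
    as in a PyTorch nn.Linear, so w_q @ w_k.T couples the query and key spaces.
    """
    return torch.linalg.matrix_norm(w_q.float() @ w_k.float().T, ord="fro").item()

# Illustrative shapes only (LLaMA-13B: hidden size 5120 = 40 heads x 128 dims).
w_q, w_k = torch.randn(5120, 5120) * 0.02, torch.randn(5120, 5120) * 0.02
print(gate_norm_score(w_q, w_k))
```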
One-shot pruning algorithm
- Compute \( m_\ell \) for every attention layer
- Sort layers by \( m_\ell \) in ascending order
- Disable attention in the \( N \) layers with the smallest scores
No calibration data. No forward passes. No fine-tuning. No special kernels.
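Strung together, the whole pipeline is a loop over the checkpoint's weights. The sketch below assumes Hugging Face LLaMA-style parameter names and "disables" attention by zeroing the output projection so the update vanishes; the paper's exact pruning mechanism (and how it skips the computation to realize the speedup) may differ.

```python
import torch

def select_attention_layers_to_prune(state_dict: dict, n_prune: int) -> list[int]:
    """One-shot, data-free selection: rank layers by ||W_q W_k^T||_F, take the weakest.

    Assumes Hugging Face LLaMA-style parameter names
    (model.layers.{i}.self_attn.{q,k,o}_proj.weight); adapt the keys for other models.
    """
    scores, i = {}, 0
    while f"model.layers.{i}.self_attn.q_proj.weight" in state_dict:
        w_q = state_dict[f"model.layers.{i}.self_attn.q_proj.weight"].float()
        w_k = state_dict[f"model.layers.{i}.self_attn.k_proj.weight"].float()
        scores[i] = torch.linalg.matrix_norm(w_q @ w_k.T, ord="fro").item()
        i += 1
    return sorted(scores, key=scores.get)[:n_prune]

def silence_attention(state_dict: dict, layer_ids: list[int]) -> None:
    """Zero the output projection so AttnOut = 0 and only the residual path remains.

    This realizes the prune functionally; the reported throughput gains additionally
    require skipping the attention computation itself (e.g. by patching the forward).
    """
    for i in layer_ids:
        state_dict[f"model.layers.{i}.self_attn.o_proj.weight"].zero_()

# Usage sketch (model name and loading are illustrative, not from the paper):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf")
# sd = model.state_dict()
# silence_attention(sd, select_attention_layers_to_prune(sd, n_prune=8))
```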
On a 13B model:
| Environment | Time |
|---|---|
| GPU | ~300 ms |
| CPU only | <30 s |
That’s roughly 1000× faster than data-driven pruning pipelines.
Findings — Speed without accuracy collapse
Perplexity
Across LLaMA-13B (v1 & v2), Vicuna-7B/13B, and LLaMA-3.1-8B:
- Pruning 8–16 attention layers keeps perplexity close to baseline
- Gate-Norm matches or slightly beats data-driven attention pruning
- Random pruning is, predictably, catastrophic
Zero-shot accuracy
On BoolQ, RTE, HellaSwag, WinoGrande, ARC, and OpenBookQA:
| Pruned attention layers | Avg accuracy drop | Throughput gain |
|---|---|---|
| 4 | <1% | ~6% |
| 8 | ~1–2% | ~12% |
| 16 | ~2–3% | ~30% |
Block-level pruning cannot match this trade-off. Random removal doesn’t even try.
Where pruning happens
Both Gate-Norm and data-driven methods overwhelmingly remove late-stage attention layers—exactly where attention suppression is strongest. The agreement is so tight it borders on uncomfortable for calibration-heavy methods.
Implications — What this means for real systems
- Attention is not uniformly valuable: late-layer attention often contributes less than residual noise.
- Data-free optimization is viable: weight geometry alone can reveal functional redundancy.
- On-device pruning becomes realistic: no data leakage, no GPUs, no retraining loops.
- Architecture design is implicated: if attention self-suppresses, maybe late-stage token mixing is simply unnecessary.
This also reframes dynamic routing and early-exit research: instead of learning when to skip attention, we can sometimes decide once, ahead of time.
Conclusion — Less attention, more signal
This paper doesn’t argue that attention is overrated. It argues something more precise—and more unsettling: attention already knows when it’s no longer needed.
Gate-Norm merely listens.
By turning silent layers into removable structure, the work offers a rare combination in LLM optimization: theoretical clarity, empirical consistency, and immediate deployability.
Cognaptus: Automate the Present, Incubate the Future.