Attention, But Make It Optional
Opening — When more layers stop meaning more intelligence

The scaling era taught us a simple mantra: stack more layers, get better models. Then deployment happened. Suddenly, latency, energy bills, and GPU scarcity started asking uncomfortable questions, like whether every layer in a 40-layer Transformer is actually doing any work. This paper answers that question with unsettling clarity: many attention layers aren’t lazy, they’re deliberately silent. And once you notice that, pruning them becomes less of an optimization trick and more of a design correction.

...
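To make "silent" concrete, here is a minimal sketch (not the paper's code) of one way to probe the claim: compare the norm of each attention sublayer's output against the norm of the residual stream it is added to. In a trained model, a near-zero ratio means the attention layer is effectively passing its input through untouched. The pre-norm blocks below are a toy, untrained stack that exists only to show the measurement; the exact diagnostic the paper uses may differ.

```python
# Illustrative sketch: per-layer attention contribution to the residual stream.
# Toy, untrained model; only the measurement itself is the point.
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, n_heads, n_layers, seq_len = 64, 4, 8, 16

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        # How large is the attention update relative to the stream it updates?
        ratio = (attn_out.norm() / x.norm()).item()
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        return x, ratio

blocks = nn.ModuleList([Block() for _ in range(n_layers)])
x = torch.randn(1, seq_len, d_model)

with torch.no_grad():
    for i, block in enumerate(blocks):
        x, ratio = block(x)
        print(f"layer {i}: ||attn(x)|| / ||x|| = {ratio:.3f}")
```

Run on a trained checkpoint instead of random weights, a per-layer table like this is what turns "some layers do nothing" from an intuition into a pruning criterion.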