Opening — Why this matters now
Large language models appear smooth from the outside: prompts go in, coherent text comes out. But internally, their numerical dynamics are anything but calm. In fact, inside many modern Transformers, certain tokens briefly explode into extreme values thousands of times larger than their neighbors.
At the same time, a small set of tokens—often the very first token in a sequence—attracts an overwhelming share of attention from many heads. These are known as attention sinks.
Both phenomena have been observed for years, yet their relationship remained unclear. Were these behaviors meaningful computational strategies, or merely odd artifacts of deep neural networks?
A recent study provides one of the clearest mechanistic explanations yet. The answer is unexpectedly simple: the two phenomena frequently appear together not because they are functionally required, but because the standard Transformer architecture accidentally encourages them to interact.
Understanding this interaction matters. These internal dynamics affect model efficiency, quantization, pruning, and long‑context inference—areas where businesses deploying large models increasingly care about cost and reliability.
Background — Context and prior art
Modern large language models rely on the Transformer architecture, first introduced in 2017. Over time, several design choices became nearly universal:
- Decoder‑only architectures
- Pre‑normalization (Pre‑Norm) residual blocks
- Multi‑head attention
- Feed‑forward layers using SwiGLU or similar gating mechanisms
These models learn by predicting the next token in a sequence. During inference, each token’s representation is repeatedly transformed through attention layers and feed‑forward networks.
However, researchers analyzing hidden activations discovered two puzzling behaviors:
| Phenomenon | Description | Typical Location |
|---|---|---|
| Massive activations | Extremely large hidden values appearing in a few channels | Intermediate layers |
| Attention sinks | Certain tokens attract disproportionate attention | Across many heads |
These effects frequently occur on the same tokens—especially the first token of a sequence or punctuation tokens such as periods or newlines.
For years, the connection between them remained largely descriptive rather than mechanistic.
Analysis — What the paper actually reveals
The study provides a step‑by‑step explanation of how both behaviors arise from standard Transformer design.
1. Massive activations originate in early feed‑forward blocks
A small number of early feed‑forward layers act as directional amplifiers. When a token representation aligns with a particular direction in high‑dimensional space, the network amplifies it dramatically.
The lifecycle of these spikes follows a consistent pattern:
| Stage | Description |
|---|---|
| Step‑up | Early layers inject extremely large values |
| Plateau | Residual connections preserve them across layers |
| Step‑down | Late layers cancel them out |
This explains why spikes appear only in intermediate layers.
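The three-stage lifecycle can be mimicked with a toy residual stream. This is only a sketch: the layer indices, channel count, and the 1000-unit injection are invented for illustration, not taken from the paper.

```python
import numpy as np

def residual_stream(n_layers=12, step_up=2, step_down=9):
    """Toy residual stream: one channel receives a massive injection
    early on, rides the residual connections, and is cancelled late."""
    h = np.zeros(4)          # hidden state with 4 channels, for illustration
    trace = []
    for layer in range(n_layers):
        if layer == step_up:
            h[0] += 1000.0   # step-up: early FFN injects a huge value
        elif layer == step_down:
            h[0] -= 1000.0   # step-down: a late layer cancels it
        else:
            h += 0.1         # plateau: ordinary small residual updates
        trace.append(h[0])
    return trace

trace = residual_stream()
# The channel is huge only between the step-up and step-down layers.
print([round(v, 1) for v in trace])
```

Because the residual connection simply adds each layer's output to the running state, nothing between the step-up and step-down layers needs to do anything special for the spike to persist.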
Interestingly, the amplification behaves approximately like a quadratic function of the input vector alignment:
$$ F(h) \approx \lambda (s^T h)^2 $$
where $s$ is the spike direction and $\lambda$ is a high‑gain eigenvalue.
In plain terms: if a token points in a particular direction in representation space, the model dramatically magnifies it.
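A minimal numeric sketch of this quadratic amplifier (the direction `s`, the gain `lam`, and the dimensionality are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
s = rng.normal(size=d)
s /= np.linalg.norm(s)       # unit "spike direction" (hypothetical)
lam = 500.0                  # high-gain eigenvalue (illustrative value)

def amplifier(h):
    # F(h) ≈ λ (sᵀh)²: output grows quadratically with alignment to s
    return lam * (s @ h) ** 2

aligned = 2.0 * s                       # token pointing along s
orthogonal = rng.normal(size=d)
orthogonal -= (s @ orthogonal) * s      # strip any component along s

print(amplifier(aligned))     # ≈ 2000: strongly amplified
print(amplifier(orthogonal))  # ≈ 0: effectively ignored
```

The quadratic form makes the amplification highly selective: doubling the alignment quadruples the output, while tokens orthogonal to the spike direction pass through untouched.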
2. Certain tokens reliably trigger spikes
Spike tokens are rarely random. Two categories dominate:
| Token Type | Reason |
|---|---|
| First token | Attention collapses to a simple self‑mapping |
| Delimiters (., \n) | Strong self‑attention behavior |
These tokens naturally align with the amplification direction generated by early layers.
Once aligned, the feed‑forward block amplifies them across several channels simultaneously.
3. Normalization converts spikes into stable vectors
The key bridge between spikes and attention sinks is RMS normalization.
Normalization performs three transformations:
| Property | Effect |
|---|---|
| Bounded magnitude | Prevents numerical instability |
| Sparsification | Suppresses non‑spike channels |
| Near‑constant representation | Different spike tokens become almost identical |
After normalization, spike tokens become sparse vectors with nearly identical directions.
This has a surprising consequence: many tokens that originally differed collapse into a similar representation before entering attention layers.
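The collapse can be sketched with a plain RMSNorm (omitting the learned per-channel gain for simplicity; the token values below are invented):

```python
import numpy as np

def rms_norm(h, eps=1e-6):
    # RMSNorm without the learned per-channel gain, for simplicity
    return h / np.sqrt(np.mean(h ** 2) + eps)

# Two different tokens that share one massive channel (values illustrative)
tok_a = np.array([3000.0, 1.5, -0.7, 2.1])
tok_b = np.array([2500.0, -0.9, 1.2, 0.3])

a, b = rms_norm(tok_a), rms_norm(tok_b)
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(a.round(3))        # spike channel ≈ 2, others ≈ 0: sparse, near-constant
print(round(cosine, 4))  # ≈ 1.0: the two tokens have nearly collapsed together
```

The spike channel dominates the root-mean-square, so after division every other channel shrinks toward zero and any two spike tokens end up pointing in almost the same direction, regardless of how they differed beforehand.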
4. This collapse creates attention sinks
Once normalized, the spike tokens generate key vectors confined to a tiny subspace.
Attention heads then naturally fall into two categories:
| Head Type | Behavior |
|---|---|
| Sink heads | Queries align with spike‑token keys |
| Normal heads | Queries align with contextual tokens |
In sink heads, the fixed spike token becomes a convenient place to “dump” excess attention probability.
In other words, attention sinks function like a default parking spot for unused attention mass.
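A toy illustration of a sink head (the key magnitudes, head dimension, and sequence length are made up): once one key is large and fixed, the softmax concentrates nearly all probability on it.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 8
rng = np.random.default_rng(1)

# Key from a normalized spike token: large, fixed direction (hypothetical)
k_sink = 3.0 * np.ones(d)
keys = np.vstack([k_sink, *(rng.normal(size=d) for _ in range(4))])

# A "sink head" query that happens to align with the spike-token key
q = np.ones(d)
attn = softmax(keys @ q / np.sqrt(d))
print(attn.round(3))  # position 0 soaks up nearly all the probability mass
```

Because softmax probabilities must sum to one, a head that has nothing useful to attend to can park almost all of its mass on the sink position without disturbing the residual stream much.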
Findings — What the experiments show
The authors tested these ideas with large‑scale ablation experiments, retraining Llama‑style models with modified architectures.
Key experimental results
| Intervention | Primary effect | Secondary observation |
|---|---|---|
| Sandwich normalization | Strongly reduces spikes | Sinks remain |
| DynamicTanh normalization | Eliminates spikes | Sinks persist |
| Conditional gating | Eliminates sinks | Performance unchanged |
| Larger attention head dimension | Increases sinks | Slightly improves perplexity |
Two conclusions stand out.
First, spikes and sinks can be independently removed.
Second, removing either phenomenon does not significantly degrade model performance.
This suggests their coexistence in modern models is mostly incidental rather than functionally essential.
Context length also matters
Another experiment altered the training distribution of sequence lengths.
| Training regime | Sink ratio |
|---|---|
| Mixed short/long contexts | High |
| Long‑context only | Very low |
This indicates attention sinks primarily help the model emphasize short‑range dependencies inside a global attention mechanism.
Implications — Why businesses should care
From an engineering perspective, these findings matter more than they might initially appear.
1. Quantization and efficiency
Massive activations create extreme outlier values that complicate low‑precision inference. Removing them can simplify model compression pipelines.
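A quick sketch of why outliers hurt, using naive symmetric per-tensor int8 quantization (the activation values are invented):

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor int8 quantization: one scale shared by all channels
    scale = np.abs(x).max() / 127.0
    return np.clip(np.round(x / scale), -127, 127) * scale

normal = np.array([0.5, -1.2, 0.8, 1.1])
spiked = np.append(normal, 3000.0)  # add one massive-activation channel

err_normal = np.abs(quantize_int8(normal) - normal).max()
err_spiked = np.abs(quantize_int8(spiked)[:4] - normal).max()

print(err_normal)  # tiny: the step size fits the data
print(err_spiked)  # large: the outlier stretches the step size so far
                   # that the ordinary channels all round to zero
```

This is why outlier channels force practitioners toward per-channel scales, mixed precision, or other workarounds; a model without massive activations would not need them.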
2. Architectural simplification
If spikes and sinks are architectural artifacts rather than necessities, future models can avoid them entirely through design changes such as:
- alternative normalization schemes
- explicit attention gating
- modified feed‑forward structures
3. Better long‑context behavior
Attention sinks bias models toward local dependencies. Reducing them may help models better utilize long contexts—an increasingly important capability in enterprise AI applications.
Conclusion — A reminder about neural networks
Large language models often appear mysterious, but many of their behaviors stem from straightforward interactions between architectural components.
Massive activations and attention sinks are not magical emergent reasoning structures. They are mostly the byproducts of three simple ingredients:
- residual accumulation
- normalization
- high‑gain directions in feed‑forward layers
In other words, the Transformer occasionally creates numerical quirks, and then learns to accommodate them rather than depend on them.
Understanding those quirks is how model architecture continues to improve.
Cognaptus: Automate the Present, Incubate the Future.