When Tokens Explode: The Hidden Geometry Behind Attention Sinks

Serving an LLM is usually discussed in pleasantly managerial language: latency, throughput, context windows, GPU memory, quantization, cache eviction. Nice clean nouns. Then the model ruins the spreadsheet by producing internal activations that are thousands of times larger than ordinary values, while some tokens quietly become attention magnets for reasons that are not exactly semantic. Very professional behavior from a trillion-dollar technology stack.

The paper behind today’s article, The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks, studies two of these internal habits: massive activations, where a few tokens explode in a few hidden channels, and attention sinks, where some tokens attract disproportionate attention mass regardless of whether they are meaningfully relevant to the current query.¹

The tempting explanation is simple: the same tokens often show both behaviors, so perhaps massive activations cause attention sinks. Or perhaps attention sinks require those huge outliers. That explanation is attractive because it is short. Unfortunately, short explanations are often where engineering mistakes go to retire.

The paper’s more useful claim is subtler. Massive activations and attention sinks often co-occur in modern pre-norm Transformer models, but their overlap is largely an architectural byproduct. The connection runs through a mechanism:

early blocks create spike-token outliers;
the residual stream preserves them;
RMSNorm compresses these huge values into sparse, near-constant vectors;
some attention heads learn geometry that turns these vectors into privileged sink keys;
targeted ablations show that spikes and sinks can be separated without obvious perplexity damage.

This matters because infrastructure teams often treat outliers, sinks, quantization failures, and long-context behavior as one unpleasant serving problem. The paper suggests a cleaner diagnosis: activation spikes and attention-sink routing are related but separable. That distinction is not academic tidiness. It affects how we think about low-precision inference, KV-cache policies, architecture selection, and long-context serving.

The paper is not just saying “some activations are large”

Large activations in LLMs are not new. The practical pain is familiar: when a small number of channels produce extreme values, low-bit quantization becomes harder. Scaling ranges are distorted. A few numerical monsters force the rest of the representation to live with poor resolution. Inference engineers then do what inference engineers do: add clever scaling, clipping, mixed precision, special channels, or another paper with a colon in the title.
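To make the resolution problem concrete, here is a minimal sketch of symmetric absmax int8 quantization applied to a handful of values, with and without one outlier. The numbers and the helper names (quantize_absmax_int8, mean_abs_error) are made up for illustration and are not taken from the paper; real inference stacks use more elaborate per-channel or per-group schemes.

```python
def quantize_absmax_int8(values):
    """Symmetric per-tensor int8: one scale set by the largest magnitude."""
    scale = max(abs(v) for v in values) / 127.0
    codes = [round(v / scale) for v in values]      # integer codes in [-127, 127]
    return [c * scale for c in codes], scale        # dequantized values, step size


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


ordinary = [0.8, -1.2, 0.3, 1.5, -0.6]     # made-up "normal" activations
with_spike = ordinary + [3000.0]           # same values plus one made-up outlier

restored_plain, step_plain = quantize_absmax_int8(ordinary)
restored_spike, step_spike = quantize_absmax_int8(with_spike)

err_plain = mean_abs_error(ordinary, restored_plain)
err_spike = mean_abs_error(ordinary, restored_spike[:len(ordinary)])

print(f"no spike:   step={step_plain:.4f}  mean error on ordinary values={err_plain:.4f}")
print(f"with spike: step={step_spike:.4f}  mean error on ordinary values={err_spike:.4f}")
```

In this toy case the step size with the outlier is larger than every ordinary value, so all five of them round to zero. That is the "poor resolution" problem in its bluntest form.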

Attention sinks are a different irritation. In a Transformer, a token should attend to earlier tokens according to learned relevance. But some positions, especially the first token or delimiter-like tokens, can attract attention from many heads even when their semantic content is not particularly helpful. These sink tokens behave like numerical dumping grounds or routing anchors.

The paper asks a specific question: why do these two phenomena often involve the same tokens?

It does not stop at “the first token is special” or “softmax likes a place to put attention.” Those are descriptions. The authors try to trace the path from architectural design to learned geometry.

That is why the best way to read this paper is mechanism-first. The evidence matters, but the main contribution is the causal chain.

Step one: spike tokens are manufactured early, then carried through the model

The paper begins by tracking the largest hidden-coordinate magnitudes across depth in Llama and Qwen-family models. The pattern is not random turbulence. It has a life cycle:

Stage	What happens	Why it matters
Step-up	One or a few early blocks inject extremely large values into specific channels	Massive activations are created abruptly, not slowly everywhere
Plateau	The residual stream carries these values through intermediate layers	Pre-norm residual accumulation lets outliers persist
Step-down	Late blocks inject values of opposite sign and neutralize the spikes	Spikes disappear before the final output layer

The important detail is the residual stream. In a pre-norm Transformer block, each block adds its output to the existing hidden state. Once a block inserts a huge value into a channel, later blocks do not need to keep recreating it. The value simply rides along unless another block cancels it.
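The life cycle can be sketched with a single channel of a toy residual stream. This is not the paper's code: the block indices, the spike size, and the small ordinary update are all invented, and real blocks compute their outputs from the normalized hidden state rather than from a lookup.

```python
NUM_BLOCKS = 12
STEP_UP_BLOCK = 1      # hypothetical block that injects the spike
STEP_DOWN_BLOCK = 10   # hypothetical block that cancels it
SPIKE = 3000.0         # illustrative magnitude only


def block_output(block_index):
    """Toy contribution of one block to a single spike channel."""
    if block_index == STEP_UP_BLOCK:
        return SPIKE
    if block_index == STEP_DOWN_BLOCK:
        return -SPIKE          # opposite sign neutralizes the spike
    return 0.1                 # small ordinary update


residual = 0.5
trajectory = []
for i in range(NUM_BLOCKS):
    residual = residual + block_output(i)   # pre-norm residual add
    trajectory.append(residual)

for i, value in enumerate(trajectory):
    print(f"after block {i:2d}: {value:9.1f}")
```

Only two blocks do anything unusual. Every block in between leaves the large value in place simply by adding a small update on top of it, which is the plateau.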

The paper identifies this “rise–plateau–fall” pattern across several open models. In the reported table, for example, Llama 2 7B has a step-up block near the beginning and a step-down block near the end. Qwen variants show the same general pattern, although the exact block indices differ. This is not a tiny quirk of one checkpoint.

The mechanism is also not merely “the model has large weights.” The authors argue that the feed-forward block, especially the SwiGLU-style block used in modern Llama-like architectures, can act as a directional quadratic amplifier. Under the near-identity regime of the activation function for spike tokens, the feed-forward coordinate behaves like a quadratic form. For certain spike channels, that quadratic form is dominated by one very large eigenvalue. When a token representation aligns with the corresponding direction, the result is amplified sharply.

A simpler version: the model has a few hidden tripwires. Most token representations miss them. First tokens and delimiter-like tokens often step directly on them.

This explains several properties at once. The spikes appear only in a few channels because only a few output coordinates have these high-gain quadratic forms. The affected channels spike together because they share nearly the same triggering direction. The magnitude ratios stay stable because the same underlying direction and eigenvalue pattern governs the amplification. And only some tokens spike because only some token states align with that trigger direction.
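A small sketch can show why a SwiGLU-style output coordinate looks like a quadratic form when the activation is close to the identity. It assumes the common formulation in which each hidden unit multiplies an activated gate projection by an up projection before the down projection; the weights below are invented, and only one output channel is modeled.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def silu(z):
    return z / (1.0 + math.exp(-z))


# Made-up weights for ONE output channel with two hidden units.
gate_rows = [[4.0, 0.0, 0.0], [0.0, 0.5, 0.0]]
up_rows = [[4.0, 0.0, 0.0], [0.0, 0.0, 0.5]]
down_row = [10.0, 1.0]


def swiglu_channel(x, act):
    """sum_j down_j * act(gate_j . x) * (up_j . x)"""
    return sum(d * act(dot(g, x)) * dot(u, x)
               for d, g, u in zip(down_row, gate_rows, up_rows))


def quadratic_form(x):
    """x^T M x with M = sum_j down_j * outer(gate_j, up_j)."""
    n = len(x)
    M = [[sum(d * g[r] * u[c] for d, g, u in zip(down_row, gate_rows, up_rows))
          for c in range(n)] for r in range(n)]
    return sum(x[r] * M[r][c] * x[c] for r in range(n) for c in range(n))


identity = lambda z: z
s = 1.0 / math.sqrt(2.0)
aligned = [1.0, 0.0, 0.0]      # unit vector along the high-gain direction
misaligned = [0.0, s, s]       # unit vector orthogonal to it

exact_aligned = swiglu_channel(aligned, identity)
silu_aligned = swiglu_channel(aligned, silu)
silu_misaligned = swiglu_channel(misaligned, silu)

print("identity activation:", exact_aligned, "quadratic form:", quadratic_form(aligned))
print("SiLU, aligned input:   ", round(silu_aligned, 2))
print("SiLU, misaligned input:", round(silu_misaligned, 4))
```

With the identity activation the two computations agree exactly, and with SiLU the aligned input lands close to the same value because SiLU is nearly linear for large positive inputs. Two inputs of equal norm differ by orders of magnitude in output, which is the tripwire behavior in miniature.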

That is a real mechanism, not just a scatterplot with vibes.

First tokens and delimiters are not magical; they are mechanically convenient

The first token has an architectural privilege: at position one, the causal mask leaves it nothing to attend to but itself. Its attention weight on itself is therefore exactly 1, so the attention computation collapses into a stable transformation that is independent of preceding context, because there is no preceding context. This gives the first token a predictable trajectory through early layers.
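The first-position collapse is easy to check with a toy causal softmax. The function name and logits are illustrative; the only real content is that a softmax over a single visible key always returns a weight of 1.

```python
import math


def causal_attention_weights(logits_row, position):
    """Softmax over keys 0..position only, which is what a causal mask allows."""
    visible = logits_row[:position + 1]
    peak = max(visible)
    exps = [math.exp(v - peak) for v in visible]
    total = sum(exps)
    return [e / total for e in exps]


# Whatever the self-logit is, the first position puts all its weight on itself.
first_position_weights = [
    causal_attention_weights([self_logit, 1.0, 2.0], position=0)
    for self_logit in (-5.0, 0.0, 12.3)
]
print(first_position_weights)

# A later position, by contrast, depends on the actual logits.
print(causal_attention_weights([0.0, 1.0, 2.0], position=2))
```

Because the weight is fixed at 1, the attention output at the first position is just that token's own value vector passed through the output projection, whatever the query and key say.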

The paper reports a vocabulary-wide probe: almost all vocabulary items become spike tokens when placed in the first position, with ratios above 98% across the evaluated Llama and Qwen models. That is a strong clue that position, not token meaning, is driving the effect.

Delimiter tokens, such as periods and newlines, follow a related route. They can develop strong self-sinking behavior in early attention blocks. Once a delimiter token behaves as if it is isolated from the rest of the context, it can be pushed along a similar stable trajectory toward the high-gain spike direction.

This is a useful correction to a common reader belief. The tokens are not important because they are semantically profound. A period is not secretly the CEO of the prompt. These tokens become useful because the architecture gives them stable computational roles. The model then learns to exploit those roles.

RMSNorm turns explosion into sparsity

Now comes the bridge between massive activations and attention sinks.

A spike token has extreme values in a few channels. But attention blocks in pre-norm Transformers do not consume the raw residual stream directly. They consume normalized hidden states.

RMSNorm changes the interpretation of the spike. It does three things:

RMSNorm effect	Mechanism	Consequence
Bounded range	The enormous coordinates are scaled down by the vector norm	The model avoids feeding raw huge values into attention
Sparsification	The norm is dominated by the spike channels	Non-spike channels are suppressed
Near-constant collapse	Spike channels have stable ratios across spike tokens	Different spike tokens become almost the same vector after normalization

This is the paper’s central hinge. The raw spike is large, but after normalization it becomes a sparse, bounded, nearly token-invariant vector. In other words, the spike token stops being a normal contextual representation and starts behaving like a small learned constant inserted into the forward pass.
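The three effects in the table can be reproduced with a toy RMSNorm. The vectors and channel indices below are invented, and the learned gain is omitted for simplicity; the sketch only assumes that two spike tokens share the same ratio between their spike channels.

```python
import math


def rmsnorm(x, eps=1e-6):
    """RMSNorm without the learned gain: divide by the root mean square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


SPIKE_CHANNELS = (2, 5)   # hypothetical spike channel indices

# Two made-up spike tokens: different ordinary content, different spike size,
# but the same 2 : -1 ratio between the two spike channels.
token_a = [0.3, -0.7, 3000.0, 0.5, 0.1, -1500.0, 0.9, -0.2]
token_b = [-0.6, 0.4, 2400.0, -0.1, 0.8, -1200.0, -0.3, 0.7]

norm_a, norm_b = rmsnorm(token_a), rmsnorm(token_b)

raw_gap = max(abs(a - b) for a, b in zip(token_a, token_b))
norm_gap = max(abs(a - b) for a, b in zip(norm_a, norm_b))
largest_non_spike = max(abs(v) for i, v in enumerate(norm_a)
                        if i not in SPIKE_CHANNELS)

print("normalized a:", [round(v, 3) for v in norm_a])
print("normalized b:", [round(v, 3) for v in norm_b])
print("largest gap before / after norm:", raw_gap, round(norm_gap, 4))
print("largest non-spike entry after norm:", round(largest_non_spike, 4))
```

After normalization the largest entry cannot exceed the square root of the vector length, the non-spike channels shrink to almost nothing, and the two tokens become nearly indistinguishable even though their raw values differed by hundreds.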

One phrase deserves attention here: implicit parameter. The authors argue that massive activations can function like internal parameters that are computed from the input yet come out nearly constant. They are not learned parameters in the ordinary weight-matrix sense, but once created inside a forward pass, they provide a stable vector that later attention blocks can use.

This is where a naive reading goes wrong. The model is not necessarily using “large numbers” directly because large numbers are inherently useful. It is using architecture and normalization to turn a large residual artifact into a clean geometric object.

That is both elegant and slightly annoying, which is usually how deep learning works.

Sink heads are created by geometry, not by the spike alone

After RMSNorm, the spike token projects into query, key, and value spaces. For a sink token, the key vector is constrained by the sparse normalized representation. Since only a few spike channels dominate, the resulting key lies in a low-dimensional subspace determined by a few rows of the key projection matrix.

The paper’s geometric claim is that sink formation depends on how this sink-key subspace sits relative to ordinary query and key subspaces.

A head becomes a sink head when the sink key is consistently better aligned with many non-sink queries than ordinary non-sink keys are. This creates a stable logit advantage for the sink token. A non-sink head lacks that geometry; its queries and keys remain governed more by token-specific content.

So the spike is not the entire story. The spike creates a sparse, stable candidate object. Attention-head geometry determines whether that object becomes a sink.
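The geometric point can be illustrated with a three-dimensional toy head. Everything here is hypothetical: the queries share a common component along the first axis, and the same kind of sink token is given two different keys to mimic two heads with different key projections. The causal mask and the usual scaling by head dimension are left out to keep the sketch short.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_sink_mass(sink_key, queries, ordinary_keys):
    """Average attention weight the sink key receives across queries."""
    masses = []
    for q in queries:
        logits = [dot(q, sink_key)] + [dot(q, k) for k in ordinary_keys]
        masses.append(softmax(logits)[0])
    return sum(masses) / len(masses)


# Made-up queries: a shared component on axis 0 plus token-specific content.
queries = [[2.0, 0.5, -0.5], [2.0, -0.8, 0.2], [2.0, 0.1, 0.9]]
# Made-up ordinary keys that live only in the content axes.
ordinary_keys = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0],
                 [0.0, -1.0, 0.5], [0.0, 0.6, -0.6]]

sink_key_in_sink_head = [3.0, 0.0, 0.0]     # aligned with the shared query axis
sink_key_in_plain_head = [0.0, 0.2, -0.1]   # no special alignment

sink_head_mass = mean_sink_mass(sink_key_in_sink_head, queries, ordinary_keys)
plain_head_mass = mean_sink_mass(sink_key_in_plain_head, queries, ordinary_keys)

print(f"sink head:  {sink_head_mass:.3f} of attention goes to the sink token")
print(f"plain head: {plain_head_mass:.3f} of attention goes to the sink token")
```

The token is the same in both cases. Only the alignment between its key and the typical query changes, and that alone decides whether it collects nearly all the attention or roughly a uniform share.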

This distinction matters for interpreting the ablations later. If spikes directly caused sinks in a simple one-way chain, removing spikes should remove sinks. But the paper finds several cases where spikes can be suppressed while sinks remain. That means attention sinks are not just “large activations seen through attention.” They are learned routing behavior that can use spike-derived vectors when available, but can also find other routes.

What the experiments are actually testing

The paper’s ablation section is not a random tour of architecture knobs. It has a logic. Each experiment attacks one possible explanation.

Test family	Likely purpose	What it supports	What it does not prove
Optimization hyperparameters	Sensitivity / diagnostic test	Sink ratio tracks optimization health more than spike magnitude does	That sink ratio alone is a full quality metric
Feed-forward design	Mechanism ablation	SwiGLU and GeLU efficiently amplify spikes, but the exact FFN design is not required for sinks	That all architectures will behave identically at larger scales
Normalization configuration	Main causal ablation	Massive activations can be suppressed while preserving sink behavior	That all normalization changes are production-safe
Attention head settings	Geometric mechanism test	Larger per-head dimensions strengthen sink formation	That increasing head dimension is always desirable
Gated attention	Functional replacement test	Dynamic conditional gates can reduce the need for sink routing	That gating is automatically better for deployed models
Training context length	Training-distribution test	Sinks are strongly tied to short-context prediction needs	That long-context-only training is generally better

This classification is important because not every table in a paper carries the same burden. Some tests provide main evidence. Some are robustness checks. Some probe mechanisms. Some show implementation boundaries. Mixing those together produces the usual LinkedIn-grade conclusion: “Architecture matters.” Thank you, very enlightening.

The paper’s stronger conclusion is more precise: spikes and sinks respond differently to interventions, so their co-occurrence is not a functional necessity.

Normalization is the cleanest decoupling lever

The normalization ablations are the most business-relevant part of the paper.

The baseline pre-norm setup reports perplexity around 10.1, a sink ratio of 46.0%, and spike magnitude of 3818 in the authors’ 7B ablation setting. When the authors apply sandwich normalization, the spike drops to 520 while the sink ratio remains 44.7%. With the QK-focused sandwich variant, the spike falls further to 92 while the sink ratio remains 42.0%. DynamicTanh gives a spike of 153 and a sink ratio of 61.0%.

The exact numbers should not be overgeneralized. These are controlled 7B models trained on 100B tokens, not a universal law for every model family. But the direction is hard to ignore: large residual spikes can be suppressed while attention sinks survive.
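Putting the reported figures side by side makes the asymmetry easier to see. The snippet below only does arithmetic on the numbers quoted above; the dictionary layout and variable names are ours.

```python
baseline = {"spike": 3818.0, "sink_ratio": 46.0}

# Spike magnitude and sink ratio (%) as quoted in this article.
variants = {
    "sandwich norm": {"spike": 520.0, "sink_ratio": 44.7},
    "QK-focused sandwich": {"spike": 92.0, "sink_ratio": 42.0},
    "DynamicTanh": {"spike": 153.0, "sink_ratio": 61.0},
}

spike_reduction = {}
sink_change = {}
for name, stats in variants.items():
    spike_reduction[name] = baseline["spike"] / stats["spike"]
    sink_change[name] = stats["sink_ratio"] - baseline["sink_ratio"]
    print(f"{name:20s} spike reduced {spike_reduction[name]:5.1f}x, "
          f"sink ratio change {sink_change[name]:+5.1f} points")
```

Spike magnitude falls by roughly 7x, 41x, and 25x across the three variants, while the sink ratio moves by only a few points for the sandwich variants and actually rises by 15 points under DynamicTanh.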

That is the decoupling result. It means the model does not require massive activations in order to form sink-like routing. Standard pre-norm RMSNorm creates a convenient bridge from spike to sink, but it is not the only way for the model to produce sink behavior.

For inference teams, this is the practical insight. If you are trying to make quantization easier by suppressing activation outliers, you should not automatically assume that you are destroying useful attention-routing behavior. The paper suggests these can be engineered separately.

That does not mean every normalization replacement is a free lunch. It means the old fear — “remove the outliers and maybe the model loses an important attention mechanism” — is too blunt.

Gated attention makes sinks look like a workaround

The gated attention experiments sharpen the interpretation. Conditional gating, where the gate depends on the current hidden representation, drastically suppresses sink ratios and spikes with little perplexity movement in the reported setting. Per-channel conditional gating gives a sink ratio of 4.5% and spike magnitude of 202. Per-head conditional gating gives a sink ratio of 6.4% and spike magnitude of 186. Static or unconditional gates do not have the same effect.

The difference is the word conditional. A static gate cannot adapt to the evolving prompt history. It is just another fixed knob. But a representation-conditioned gate can modulate routing dynamically. When the model has this explicit routing tool, it no longer needs to rely as heavily on attention sinks as an implicit workaround.
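The contrast between the two gate types can be sketched as follows. This is an assumed per-head formulation (a sigmoid gate that scales one head's output), not the paper's exact design, and every weight and hidden state is invented.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


head_output = [0.9, -0.4, 0.7]            # toy output of one attention head
gate_weights = [2.0, -3.0, 1.0, 0.5]      # hypothetical learned gate projection
STATIC_GATE_LOGIT = 0.3                   # hypothetical learned constant


def static_gate(hidden):
    """Same value for every token: it cannot react to context."""
    return sigmoid(STATIC_GATE_LOGIT)


def conditional_gate(hidden):
    """Depends on the current hidden state, so it can open or close per token."""
    return sigmoid(dot(gate_weights, hidden))


hidden_that_needs_head = [1.5, -1.0, 0.5, 0.0]
hidden_that_skips_head = [-1.5, 1.0, -0.5, 0.0]

for label, hidden in (("needs head", hidden_that_needs_head),
                      ("skips head", hidden_that_skips_head)):
    for gate in (static_gate, conditional_gate):
        g = gate(hidden)
        gated = [round(g * v, 3) for v in head_output]
        print(f"{label:10s} {gate.__name__:17s} gate={g:.3f} output={gated}")
```

The conditional gate can switch the head almost fully off for one token and on for another. The static gate applies the same scaling to both, so a head that wants to contribute nothing for some tokens still needs another outlet, such as a sink.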

This gives attention sinks a functional interpretation. They are not merely numerical garbage. They can act like learned routing devices, especially for heads that need to reduce or redirect attention under certain contexts.

The unflattering version is that attention sinks are a hack. The more polite version is that they are an emergent architectural adaptation. Both descriptions fit; one just wears a nicer jacket.

Context length explains why sinks are useful in the first place

The training-context ablation adds another layer. When training includes short contexts, sink ratios remain high across maximum context-length settings. But when the training distribution excludes short contexts and only optimizes over long-range positions, the sink ratio collapses.

In the paper’s table, context ranges beginning at position 1 show sink ratios of roughly 42–46%. When the minimum context length is moved to 1024 or 2048, sink ratios drop sharply, to as low as 1.2% in one setting.

The authors interpret this as evidence that sinks support short-range prediction inside a global attention mechanism. A sink token gives certain heads a way to reduce attention to irrelevant distant tokens and bias computation toward local structure. If the model is trained mostly in regimes where long-range context is always relevant, that shortcut becomes less useful.

This is a useful reminder for long-context product teams. Long context is not just “more tokens.” It changes the incentives under which attention patterns are learned. A model trained with abundant short-context objectives may develop routing habits that are convenient for local prediction but awkward for deep retrieval across the middle of a long prompt.

That does not mean attention sinks are bad. It means they are a serving variable, not a philosophical flaw.

The business meaning is diagnosis, not model mysticism

For AI infrastructure and product teams, the paper’s contribution is not that every company should redesign its Transformer stack tomorrow morning. Please do not send this article to procurement with “replace RMSNorm” highlighted.

The practical value is diagnostic separation.

Paper result	Operational interpretation	Business relevance	Boundary
Massive activations are generated by step-up mechanisms and preserved through residuals	Outlier problems may originate from specific architectural locations, not uniformly across the model	Better targeting for quantization and monitoring	Exact block behavior varies by model family
RMSNorm converts spikes into sparse near-constant vectors	Normalization can transform numerical outliers into usable routing objects	Architecture choices affect both numerical stability and attention behavior	Suppressing spikes may change other internal strategies
Sink heads depend on attention geometry	Sinks are head-level routing phenomena, not just token-level oddities	KV-cache and long-context policies should preserve or handle relevant sink behavior intentionally	Sink ratio is not a complete quality metric
Conditional gating suppresses sinks	Explicit dynamic routing can replace implicit sink routing	Future architectures may reduce reliance on numerical workarounds	Gating adds design complexity and needs separate validation
Short-context training encourages sinks	Attention sinks partly reflect training distribution	Long-context model evaluation should inspect learned routing habits	Long-context-only training is not automatically superior

The ROI implication is not glamorous, but it is real: cheaper failure diagnosis. When a deployed model behaves badly under quantization, long-context retrieval, or cache compression, teams need to know whether they are facing an activation-outlier problem, an attention-routing problem, or a policy mismatch between training and serving context.
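As one example of what cheaper diagnosis might look like, the sketch below flags the span of layers where the peak activation dwarfs the typical one. The helper find_spike_span, the threshold of 100, and the per-layer statistics are all hypothetical; in practice the inputs would come from hooks on a real model and the threshold would need calibration.

```python
def find_spike_span(per_layer_peak, per_layer_median, ratio_threshold=100.0):
    """Return (first, last) layer index whose peak/median ratio meets the
    threshold, or None if no layer does. The threshold is an assumption."""
    flagged = [
        i for i, (peak, median) in enumerate(zip(per_layer_peak, per_layer_median))
        if median > 0 and peak / median >= ratio_threshold
    ]
    if not flagged:
        return None
    return flagged[0], flagged[-1]


# Made-up per-layer statistics: largest and median absolute hidden value.
peaks = [2.1, 3.0, 2900.0, 3100.0, 3050.0, 3000.0, 4.2, 3.8]
medians = [0.5, 0.6, 0.7, 0.7, 0.8, 0.8, 0.9, 0.9]

span = find_spike_span(peaks, medians)
print("layers carrying a spike:", span)
print("healthy model:", find_spike_span([1.0, 2.0], [1.0, 1.0]))
```

A check like this answers only the activation-outlier question. Whether attention is also being routed to a sink token has to be measured separately from the attention weights, which is the separation the paper argues for.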

Bundling all of that under “the model has attention weirdness” is not a strategy. It is a shrug with GPU invoices attached.

What the paper directly shows, and what Cognaptus infers

It is worth separating the layers of claim.

The paper directly shows that in the evaluated Llama and Qwen-style models, massive activations follow a step-up, plateau, and step-down life cycle; that feed-forward blocks can act as directional amplifiers; that RMSNorm maps spike tokens into sparse, near-constant vectors; that attention sinks depend on head-space geometry; and that controlled ablations can suppress spikes and sinks independently without clear perplexity degradation in the authors’ experimental setup.

Cognaptus infers that production teams should treat activation outliers and attention sinks as separable engineering targets. Quantization work should not automatically assume that spike suppression destroys sink utility. Long-context serving should not treat first-token attention as merely a nuisance. Architecture evaluation should inspect whether a model is using numerical artifacts as implicit routing devices.

What remains uncertain is how far these findings transfer across model scales, training recipes, modalities, and commercial closed-weight architectures. The paper studies open Llama/Qwen-style models and trains Llama-style 7B ablation models on 100B tokens. That is serious evidence, but not a universal production guarantee. Also, perplexity is not the same as downstream instruction-following quality, tool-use reliability, retrieval robustness, or business-task performance.

This boundary matters. A model can keep perplexity stable and still change behavior in ways that matter for enterprise workflows. The paper does not claim otherwise. Neither should we.

The uncomfortable lesson: models learn around architecture

The deeper lesson is that LLMs do not merely use the architecture we think we gave them. They also learn around its constraints.

Pre-norm residual streams allow large values to persist. RMSNorm turns those values into sparse constants. Attention heads learn geometry that makes those constants useful. Training distributions reward routing shortcuts. Remove one pathway, and the model may discover another.

This is not a reason to be cynical about architecture research. It is the reason architecture research matters. If models are going to invent internal workarounds anyway, the engineering question is whether we give them clean, stable, inspectable tools — or whether we let them build numerical plumbing behind the wall and then act surprised when the basement floods.

The paper’s best contribution is therefore not a single recipe. It is a better map of the plumbing.

For business readers, the takeaway is simple: the next generation of efficient LLM systems will not be won only by better compression tricks or larger context windows. It will be won by teams that understand which internal behaviors are functional, which are accidental, and which are accidental but useful.

Attention sinks and massive activations sit exactly in that uncomfortable middle category. They are not pure bugs. They are not sacred features. They are learned adaptations to architectural incentives.

And as usual, the model did not ask permission before becoming complicated.

Cognaptus: Automate the Present, Incubate the Future.

¹ Shangwen Sun, Alfredo Canziani, Yann LeCun, and Jiachen Zhu, “The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks,” arXiv:2603.05498v1, 5 March 2026, https://arxiv.org/abs/2603.05498.
