When Tokens Explode: The Hidden Geometry Behind Attention Sinks
Opening — Why this matters now

Large language models appear smooth from the outside: prompts go in, coherent text comes out. But internally, their numerical dynamics are anything but calm. Inside many modern Transformers, certain tokens briefly explode into extreme values, thousands of times larger than their neighbors. At the same time, a small set of tokens—often the very first token in a sequence—attracts an overwhelming share of attention from many heads. These tokens are known as attention sinks. ...
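To make "an overwhelming share of attention" concrete, here is a toy sketch of how one might quantify a sink: average the softmax attention mass that each query assigns to the first position. The logits below are synthetic and the function name `sink_mass` is illustrative, not taken from any real model or library.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of logits.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sink_mass(logit_rows):
    """Average attention mass each query assigns to position 0.

    logit_rows[i] holds the raw attention logits of query i over
    keys 0..i (causal masking). Values near 1.0 indicate that the
    first token is acting as an attention sink.
    """
    masses = [softmax(row)[0] for row in logit_rows]
    return sum(masses) / len(masses)

# Synthetic logits: the first key gets a much larger score from
# every query, mimicking the sink pattern described above.
rows = [
    [5.0, 0.1],
    [5.0, 0.2, 0.1],
    [5.0, 0.1, 0.3, 0.2],
]
print(round(sink_mass(rows), 3))
```

In a real model one would read these logits (or the post-softmax weights) out of each attention head and average over heads and layers; the toy matrix simply shows that a modest logit gap of a few units is enough to pull nearly all of the softmax mass onto one token.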