Gated Sparse Attention: Speed Without the Sink
Opening: Why this matters now

Long-context language models have crossed an uncomfortable threshold. Context windows now stretch to 128K tokens and beyond, yet the core attention mechanism still scales quadratically. The result is a growing mismatch between what models can theoretically ingest and what is economically and operationally feasible. At the same time, training instability (loss spikes, attention sinks, brittle gradients) continues to haunt large-scale runs. ...
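To make the quadratic term concrete, here is a back-of-envelope sketch (not from the article; the sequence lengths are chosen purely for illustration): the dense attention score matrix alone has seq_len² entries per head per layer, so quadrupling the context multiplies that part of the cost by sixteen.

```python
# Illustrative arithmetic only: size of the dense attention score matrix
# for a single head in a single layer, at a few context lengths.
# Assumes standard softmax attention over the full sequence.

def attention_score_entries(seq_len: int) -> int:
    """Entries in the (seq_len x seq_len) attention score matrix."""
    return seq_len * seq_len

for seq_len in (4_096, 32_768, 131_072):
    entries = attention_score_entries(seq_len)
    print(f"{seq_len:>7} tokens -> {entries:,} score entries per head per layer")
```

At 131,072 tokens that is roughly 1.7e10 score entries per head per layer, which is the mismatch the opening paragraph gestures at: the window grows linearly while this term grows quadratically.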