Flash Before the First Token: How FlashPrefill Rewrites the Economics of Long Context

Opening — Why this matters now

Large Language Models are steadily marching toward million‑token contexts. The promise is seductive: entire codebases, legal archives, or research libraries available inside a single prompt. The reality, however, is less glamorous. Before a model generates its first token, it must run the entire prompt through the Transformer in a prefill pass. For long documents, this stage alone can dominate inference latency. Because attention scales quadratically with sequence length, doubling the context can quadruple the compute. ...
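
A back-of-envelope sketch makes that scaling concrete. Assume standard dense self-attention over a prompt of length $n$ with per-head dimension $d$ (both symbols are introduced here for illustration; they are not notation from the post). The two quadratic matrix products in each attention layer cost roughly

$$
\mathrm{FLOPs}_{\text{attn}}(n) \;\approx\; \underbrace{2\,n^{2} d}_{QK^{\top}\ \text{scores}} \;+\; \underbrace{2\,n^{2} d}_{\mathrm{softmax}(QK^{\top})\,V} \;=\; 4\,n^{2} d,
\qquad
\frac{\mathrm{FLOPs}_{\text{attn}}(2n)}{\mathrm{FLOPs}_{\text{attn}}(n)} \;=\; \frac{4\,(2n)^{2} d}{4\,n^{2} d} \;=\; 4 .
$$

The projection and MLP terms grow only linearly in $n$, so as the context stretches toward a million tokens, this quadratic attention term is what comes to dominate the prefill bill.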

March 10, 2026 · 5 min · Zelina