State of Delay: KVBuffer and the Memory Tax of Linear Attention
Latency has a habit of hiding inside words that sound efficient. “Constant decoding cost” is one of those phrases. It suggests a clean engineering promise: linear attention avoids the context-length explosion of softmax attention, so long-context inference should become simpler, cheaper, and less melodramatic. Very nice. The GPU accountants, however, have not retired. ...