The KV Cache Is Not a Detail: Why LLM Compression Needs a Control Plane

Bandwidth is one of those infrastructure costs that looks boring until it becomes the product bottleneck. A retrieval-augmented assistant gets a long document. An agentic workflow accumulates tool traces. A support chatbot reuses a large system prompt and a customer-history prefix. The model may be fast enough, the GPUs may be expensive enough, and yet the user still waits. Not because the model is thinking harder. Because the system is moving state. ...

May 27, 2026 · 15 min · Zelina