LLM Serving

State of Delay: KVBuffer and the Memory Tax of Linear Attention

Latency has a habit of hiding inside words that sound efficient. “Constant decoding cost” is one of those phrases. It suggests a clean engineering promise: linear attention avoids the context-length explosion of softmax attention, so long-context inference should become simpler, cheaper, and less melodramatic. Very nice. The GPU accountants, however, have not retired. ...

The KV Cache Is Not a Detail: Why LLM Compression Needs a Control Plane

Bandwidth is one of those infrastructure costs that looks boring until it becomes the product bottleneck. A retrieval-augmented assistant gets a long document. An agentic workflow accumulates tool traces. A support chatbot reuses a large system prompt and a customer-history prefix. The model may be fast enough, the GPUs may be expensive enough, and yet the user still waits. Not because the model is thinking harder. Because the system is moving state. ...

Queue Who’s Optimizing: Why LLM Serving Needs Math, Not More Vibes

Opening: Why this matters now. The first wave of enterprise AI adoption was obsessed with model choice. Which model is smarter? Which model writes better? Which model can reason, code, browse, call tools, summarize contracts, and politely pretend it enjoys quarterly planning? That was the easy part. The less glamorous question is now becoming more expensive: how do we serve all these model calls reliably, cheaply, and at scale? ...

The Tail That Wags the Model: Why p99 Latency Should Run Your LLM

A demo can survive a slow answer. A production service cannot survive the slow answer that arrives just often enough to make users stop trusting the product. That is the quiet problem behind p99 latency. The average response time tells you how the service feels on a normal day. p99 tells you what happens to the unlucky one percent: the support agent waiting in front of a customer, the analyst refreshing a dashboard, the employee whose workflow now includes watching a spinner and reconsidering their life choices. ...

Kernel Kombat: How Multi‑Agent LLMs Squeeze 1.32× More From Your GPUs

GPU bills have a charming way of turning “just one more model deployment” into a finance meeting. For companies running large language model serving stacks, the problem is rarely that nobody knows GPUs matter. Everyone knows. The harder problem is that performance bottlenecks often live inside kernels most executives will never see: attention merges, normalization fusions, activation multiplications, tiny pieces of code called millions or billions of times until “small inefficiency” becomes “why is the infrastructure budget wearing a crown?” ...