Inference Optimization

Measure Twice, Quantize Once

TL;DR for operators: Compression is usually sold as a tidy pipeline: pick a smaller architecture, prune some layers, quantize the result, then call procurement and explain why the GPU bill is still rude. This paper argues that the pipeline itself is the problem.[1] The authors propose a joint compression framework for Llama-3.1-8B that searches architectural choices and quantization choices together. That means the system does not first decide “how much model” it wants and only afterward decide “how many bits” each part deserves. It treats width, depth, layer importance, weight precision, activation precision, and latency as interacting deployment variables. ...

Context Is Not Free, So Stop Feeding the Whole Table

TL;DR for operators: Many tabular foundation models behave like very competent consultants with a mildly expensive habit: they want the entire labelled training set placed in front of them at inference time. That works neatly on small datasets. It becomes rather less charming when the table grows to tens or hundreds of thousands of rows and the model’s attention cost starts behaving like it has discovered compound interest. ...

Full Stack, Not Full Panic: Why Agentic AI Needs Safety Above and KV Discipline Below

Enterprise AI has entered its awkward teenage years. It wants to be autonomous, helpful, context-aware, cheap, safe, fast, auditable, and preferably not the reason the legal department starts drinking before lunch. That is a lot to ask from “just use a bigger model.” ...

State of Delay: KVBuffer and the Memory Tax of Linear Attention

Latency has a habit of hiding inside words that sound efficient. “Constant decoding cost” is one of those phrases. It suggests a clean engineering promise: linear attention avoids the context-length explosion of softmax attention, so long-context inference should become simpler, cheaper, and less melodramatic. Very nice. The GPU accountants, however, have not retired. ...

Beam Me Less, Scotty: MoE Models Learn When Not to Call Every Expert

Latency has a way of turning elegant model architecture into an invoice. Mixture-of-Experts models were supposed to soften that invoice. Instead of sending every token through the same dense feed-forward machinery, an MoE layer sends each token to only a few experts. In theory, this gives us scale without paying for all parameters on every token. In practice, many deployed MoE models still behave like a restaurant that insists every guest order the same number of dishes. The experts differ, but the billable count is fixed. ...

Place Your Experts, Not Your Bets

Opening: Why this matters now. The fashionable version of AI strategy still sounds suspiciously like a gym membership pitch: bigger model, more parameters, more GPUs, more everything. The operational version is less glamorous and much more important: where does the computation happen, which parts of the model are actually used, how predictable is demand, and whether the system can turn those facts into lower latency, lower cost, or better decisions. ...

Queue Who’s Optimizing: Why LLM Serving Needs Math, Not More Vibes

Opening: Why this matters now. The first wave of enterprise AI adoption was obsessed with model choice. Which model is smarter? Which model writes better? Which model can reason, code, browse, call tools, summarize contracts, and politely pretend it enjoys quarterly planning? That was the easy part. The less glamorous question is now becoming more expensive: how do we serve all these model calls reliably, cheaply, and at scale? ...

Claw and Order: Why AI Agents Need a Precision Budget

Opening: Why this matters now. AI agents are leaving the demo cage. They are no longer just politely completing prompts; they are planning workflows, calling tools, reading files, coordinating intermediate steps, and accumulating context like a bureaucrat hoarding PDFs. This is useful. It is also expensive. The paper “QuantClaw: Precision Where It Matters for OpenClaw” studies a problem that sounds technical but is really managerial: agent systems often run every task at a fixed numerical precision, even though not every task deserves the same computational budget.[1] A safety-critical terminal command and a lightweight retrieval summary are not the same species of work. Treating them identically is the infrastructure equivalent of sending a limousine to deliver printer paper. ...

Squeeze Evolve: When AI Stops Thinking Alone and Starts Allocating Intelligence

Budget is where many impressive AI demos go to become ordinary software. A model can reason longer. It can sample more. It can revise itself, compare candidates, aggregate outputs, and repeat the whole ritual until the invoice starts looking like a small infrastructure project. The obvious response is to ask whether the strongest model should simply do all of this work. Obvious, yes. Economically elegant, not quite. ...

Whispers Against the Noise: How Contrastive Decoding Tames Long‑Form ASR Hallucinations

A transcript is usually treated as boring infrastructure. It sits underneath meeting summaries, call-center analytics, podcast search, earnings-call review, legal discovery, medical documentation, and the cheerful dashboard that tells managers everything is now “AI-powered.” Then the transcript invents a sentence. Not a typo. Not a small mishearing. A fluent, confident, context-shaped sentence that nobody said. In short clips, this is irritating. In long recordings, it becomes structural. One bad segment can become context for the next segment; the next segment inherits the mistake; and soon the system is not transcribing a recording so much as continuing a badly seeded story. ...