LLM Inference

Measure for Measure: Why AI Evaluation Must Follow the Failure

TL;DR for operators
A lower model bit width is not automatically a speedup. A lower training loss is not automatically a reliable policy. GRINQH evaluates quantization through the mechanism it is meant to change: decoding-stage memory traffic, kernel throughput, end-to-end generation speed, and retained task accuracy [1]. Kolmogorov regression evaluates diffusion policies through trajectory geometry, a PDE-based inference residual, rollout behavior, anomaly detection, and an external safety filter [2]. The shared lesson is not that the two forms of “precision” are technically equivalent. They are not. The lesson is that fidelity and evidence should be allocated according to the actual failure structure of the system. A production evaluation should connect four things explicitly: the intervention, the mechanism it changes, the diagnostic that observes that change, and the operational outcome that justifies deployment. Composite scores are useful only when their weights reflect real business priorities and their components remain separately visible. Otherwise, they are merely spreadsheets wearing authority.
The dashboard is not the system
AI evaluation has developed an awkward habit: optimize a convenient number, improve that number, and declare the system improved. ...

FLARE Without Fireworks: Diffusion Speed Needs an Autoregressive Anchor

TL;DR for operators
FLARE is not a “diffusion models are faster, therefore rejoice” paper. That would be convenient. Also wrong. The paper shows a practical conversion recipe for taking strong hybrid-attention autoregressive LLM checkpoints and giving them a diffusion-style parallel generation path without throwing away the original causal behavior [1]. The important move is not one trick. It is a coupled mechanism: a clean autoregressive stream anchors the model’s inherited capability, a noisy diffusion stream learns block-level denoising, document-packed masking prevents examples from leaking into one another, recurrent-state scheduling makes hybrid attention behave under non-causal visibility, and a unified serving stack lets one checkpoint run in two decoding modes. ...

Mixed Feelings: When LLM Batching Stops Being Obviously Better

Queues are where infrastructure theories go to become invoices. In LLM serving, the popular theory has been simple enough: mix the work. During inference, a model first reads the prompt in the prefill phase, then generates tokens one by one in the decode phase. Prefill wants compute. Decode wants memory bandwidth. So the obvious move is to combine them in the same batch, letting one part of the GPU do prefill while another part handles decode. This is mixed batching, and it has become the default posture in modern inference engines. ...
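To make the idea of mixing prefill and decode work in one batch concrete, here is a minimal sketch of a single scheduling step under a shared token budget. It is an illustration only, not how any particular engine does it: the `Request` class, the `build_mixed_batch` function, the decode-first ordering, and the `token_budget` parameter are all assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    name: str
    prompt_left: int  # prompt tokens not yet prefilled; 0 means the request is decoding


def build_mixed_batch(requests: List[Request], token_budget: int) -> List[Tuple[str, str, int]]:
    """Plan one step as a list of (request name, phase, tokens) under a shared budget."""
    batch = []
    remaining = token_budget

    # Assumption: decode steps cost one token each and are scheduled first.
    for r in requests:
        if r.prompt_left == 0 and remaining > 0:
            batch.append((r.name, "decode", 1))
            remaining -= 1

    # Fill whatever budget is left with prefill work, splitting long prompts.
    for r in requests:
        if r.prompt_left > 0 and remaining > 0:
            take = min(r.prompt_left, remaining)
            batch.append((r.name, "prefill", take))
            remaining -= take

    return batch


# Two requests are already decoding; one still has a 10-token prompt to read.
requests = [Request("a", 0), Request("b", 0), Request("c", 10)]
batch = build_mixed_batch(requests, token_budget=8)
print(batch)
```

In this toy step the two decoding requests take one token each and the remaining budget goes to a partial prefill of the third request. Whether that mix actually helps depends on the workload, which is the question the article raises.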

The Experts Are Sparse Inside: Why MoE Cost Cuts Stop at 1.2x

Cost has a way of making architecture fashionable. Mixture-of-Experts models became attractive because they promise a pleasant bargain: keep a large total parameter count, but activate only a small part of the model for each token. In business language, that sounds like capacity without the full compute bill. In engineering language, it means routing each token to a few expert feed-forward networks instead of running every expert all the time. ...
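The routing described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the design of any specific model: the experts are stand-in scaling functions rather than real feed-forward networks, and the top-k selection with renormalized softmax weights is one common variant chosen for the example. The names `route_token`, `moe_layer`, and `make_expert` are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k):
    """Pick the top_k experts and renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]


def moe_layer(token, router_logits, experts, top_k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    out = [0.0] * len(token)
    for idx, weight in route_token(router_logits, top_k):
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out


def make_expert(scale):
    # Stand-in for an expert feed-forward network: just scales the input.
    return lambda token: [scale * x for x in token]


experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
token = [1.0, -1.0]
router_logits = [0.1, 2.0, 0.3, 2.0]

routes = route_token(router_logits, top_k=2)
output = moe_layer(token, router_logits, experts, top_k=2)
print(routes, output)
```

Only two of the four toy experts run for this token, which is the "capacity without the full compute bill" bargain in miniature. The sketch says nothing about memory traffic or load balance, where real savings can shrink.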

No Free Tokens: The New Economics of LLM Inference

Opening: Why this matters now
For the last few years, AI strategy has been narrated as a model-quality story: bigger models, better benchmarks, longer context windows, more agents, more demos, more adjectives. That story was useful. It was also incomplete. The less glamorous reality is now arriving with the invoice attached. LLM systems are not merely models. They are production services that consume GPU memory, scheduling capacity, engineering attention, and operational patience. Once a business moves from a prototype to repeated daily use, the question changes from “Can the model answer?” to “Can the system answer reliably, cheaply, and repeatedly when real users arrive at inconvenient times?” ...

Flash Before the First Token: How FlashPrefill Rewrites the Economics of Long Context

Waiting is the least glamorous part of AI. A user uploads a contract, a codebase, a board pack, or a pile of research notes. The model does not answer immediately. First, it reads. Technically, it prefills: it processes the prompt, builds the internal key-value cache, and prepares the first generated token. In short prompts this feels invisible. In long-context systems, it becomes the awkward pause where the “agent” looks suspiciously like a very expensive loading spinner. ...

Tokens, Watts, and Waste: The Hidden Energy Bill of LLM Inference

Tokens are small. That is why they are dangerous. A developer asks an assistant to generate a function, explain a repository, or reason through a failing test. The screen fills with text. Some of it is useful. Some of it is decoration. Some of it is a polite little parade of examples, test cases, alternative implementations, or whitespace that will be thrown away by the next parser in the pipeline. ...

Memory Over Matter: How MemAgent Redefines Long-Context Reasoning with Reinforcement Learning

TL;DR for operators
MemAgent is not another “look, we made the context window enormous” paper. Thank goodness; the context-window arms race was starting to look like cloud billing cosplay. The paper’s core move is simpler and more interesting: take a standard dense transformer, let it read a long document in chunks, and force it to maintain a fixed 1024-token working memory. After each chunk, the model overwrites that memory. At the end, it answers using the problem and the memory, not the whole document. The authors then train this behaviour with reinforcement learning, so the model learns what to retain, what to discard, and when a piece of information is merely shiny garbage. ...
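The read-in-chunks, overwrite-the-memory loop described above can be sketched as plain control flow. This is a minimal sketch, not the paper's implementation: the `update` and `answer` callables stand in for the trained model, the toy versions below use simple string matching instead of a learned policy, and all function names and the chunk size are assumptions. Only the 1024-token memory budget comes from the excerpt.

```python
from typing import Callable, List

MEMORY_BUDGET = 1024  # fixed working-memory size in tokens, as stated in the excerpt


def chunk(tokens: List[str], size: int) -> List[List[str]]:
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def read_with_memory(
    problem: str,
    document_tokens: List[str],
    update: Callable[[str, List[str], List[str]], List[str]],
    answer: Callable[[str, List[str]], str],
    chunk_size: int = 3,
    memory_budget: int = MEMORY_BUDGET,
) -> str:
    memory: List[str] = []
    for piece in chunk(document_tokens, chunk_size):
        # The memory is overwritten after every chunk and capped at the budget.
        memory = update(problem, memory, piece)[:memory_budget]
    # The answer sees only the problem and the memory, never the full document.
    return answer(problem, memory)


def toy_update(problem: str, memory: List[str], piece: List[str]) -> List[str]:
    # Hypothetical stand-in for the learned policy: keep tokens that mention the problem key.
    return memory + [t for t in piece if t.startswith(problem + "=")]


def toy_answer(problem: str, memory: List[str]) -> str:
    return " ".join(memory) if memory else "unknown"


document = "noise noise code=42 noise filler code=77 noise".split()
result = read_with_memory("code", document, toy_update, toy_answer)
tight = read_with_memory("code", document, toy_update, toy_answer, memory_budget=1)
print(result, "|", tight)
```

With a tight budget the toy policy simply truncates, which loses the later value. Deciding what to keep under that cap is the part the excerpt says is trained with reinforcement learning.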