Quantization

Measure for Measure: Why AI Evaluation Must Follow the Failure

TL;DR for operators: A lower model bit width is not automatically a speedup. A lower training loss is not automatically a reliable policy. GRINQH evaluates quantization through the mechanism it is meant to change: decoding-stage memory traffic, kernel throughput, end-to-end generation speed, and retained task accuracy.[1] Kolmogorov regression evaluates diffusion policies through trajectory geometry, a PDE-based inference residual, rollout behavior, anomaly detection, and an external safety filter.[2] The shared lesson is not that the two forms of “precision” are technically equivalent. They are not. The lesson is that fidelity and evidence should be allocated according to the actual failure structure of the system. A production evaluation should connect four things explicitly: the intervention, the mechanism it changes, the diagnostic that observes that change, and the operational outcome that justifies deployment. Composite scores are useful only when their weights reflect real business priorities and their components remain separately visible. Otherwise, they are merely spreadsheets wearing authority. The dashboard is not the system: AI evaluation has developed an awkward habit: optimize a convenient number, improve that number, and declare the system improved. ...

Measure Twice, Quantize Once

TL;DR for operators: Compression is usually sold as a tidy pipeline: pick a smaller architecture, prune some layers, quantize the result, then call procurement and explain why the GPU bill is still rude. This paper argues that the pipeline itself is the problem.[1] The authors propose a joint compression framework for Llama-3.1-8B that searches architectural choices and quantization choices together. That means the system does not first decide “how much model” it wants and only afterward decide “how many bits” each part deserves. It treats width, depth, layer importance, weight precision, activation precision, and latency as interacting deployment variables. ...

The Model Got Smaller. The Risk Got Wider.

TL;DR for operators: Compression is usually sold as a clean engineering bargain: smaller model, lower memory, cheaper inference, acceptable accuracy loss. This paper asks the more operationally annoying question: after compression, does the model still know when it should hedge? The answer is: not reliably. Tong et al. benchmark compressed LLMs using conformal prediction, a framework that converts model probabilities into prediction sets with target coverage.[1] In this setup, the important uncertainty metric is prediction set size: if the model needs to include more answer options to maintain coverage, it is less certain, even if its top-1 accuracy still looks respectable. ...
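To make "prediction set size" concrete, here is a minimal sketch of split conformal prediction for a multiple-choice task. It is an illustration under stated assumptions, not the benchmark's actual code: the nonconformity score (one minus the probability of the true answer), the function names, and the toy inputs are all hypothetical choices made for this example.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Score threshold from a held-out calibration set (split conformal).

    Assumption for this sketch: the nonconformity score is
    1 - (probability the model assigned to the true answer).
    """
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: include every option.
        return 1.0
    return scores[rank - 1]

def prediction_set(probs, threshold):
    """Indices of all answer options whose score is within the threshold."""
    return [i for i, p in enumerate(probs) if 1.0 - p <= threshold]
```

Run the same calibration separately for the full-precision model and the compressed one, then compare the average length of `prediction_set(...)` on test questions. A model that spreads probability more thinly needs a larger threshold to hit the same coverage, so its sets grow even when the top-1 answer has not changed.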

LoRA Was Supposed to Fit on the Edge. The Activations Disagreed.

TL;DR for operators: LoRA does not magically make LLM fine-tuning fit on phones, laptops, or small edge boxes. It reduces the number of trainable parameters. The paper’s useful contribution is showing that this is only the opening move. The real memory bill arrives from activations, checkpoint boundaries, vocabulary-sized output computations, and tokens that are being processed even though they do not contribute to the loss. Apparently the memory allocator did not attend the product strategy meeting. ...
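A back-of-the-envelope sketch shows why the trainable-parameter count and the memory bill are different numbers. The shapes below (a 4096 x 4096 projection, rank 8, a 128,256-entry vocabulary, a 2,048-token sequence, fp32 logits) are hypothetical values picked for illustration and are not taken from the paper.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA trains two low-rank factors: A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out)

def logits_bytes(batch, seq_len, vocab_size, bytes_per_value=4):
    """Memory for one full logits tensor of shape (batch, seq_len, vocab_size)."""
    return batch * seq_len * vocab_size * bytes_per_value

# Hypothetical configuration, for illustration only.
adapter_params = lora_trainable_params(d_in=4096, d_out=4096, rank=8)
output_bytes = logits_bytes(batch=1, seq_len=2048, vocab_size=128256)

print(f"LoRA params for one projection: {adapter_params:,}")
print(f"One fp32 logits tensor: {output_bytes / 2**30:.2f} GiB")
```

Under these assumptions the adapter for one projection is tens of thousands of parameters, while a single vocabulary-sized output tensor is on the order of a gibibyte before any gradients or intermediate activations are counted.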

Four Bits, One Identity Crisis: What W4A4 Video Quantization Actually Breaks

TL;DR for operators: The useful surprise in Tail-Aware HiFloat4 is not that a 4-bit video model gets worse. That part is not exactly a Nobel-level plot twist. The useful surprise is where it gets worse. The paper reports a W4A4 HiFloat4 post-training quantization pipeline for Wan2.2-I2V-A14B, and under matched generation settings the unweighted mean score drops from 0.6800 to 0.5880. But the collapse is concentrated: subject consistency falls from 0.9331 to 0.5324, while aesthetic quality is effectively unchanged, overall consistency is comparable, and motion smoothness drops only slightly from 0.9923 to 0.9803.[1] ...
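The concentration is easier to see as relative drops. The short sketch below uses only the three score pairs quoted in the excerpt; the dictionary layout and the `relative_drop` helper are illustrative names, not anything from the paper.

```python
# (baseline, W4A4 HiFloat4) scores as quoted in the excerpt above.
reported = {
    "unweighted mean": (0.6800, 0.5880),
    "subject consistency": (0.9331, 0.5324),
    "motion smoothness": (0.9923, 0.9803),
}

def relative_drop(before, after):
    """Fraction of the baseline score that was lost after quantization."""
    return (before - after) / before

drops = {name: relative_drop(b, a) for name, (b, a) in reported.items()}

# Print the metrics from largest to smallest relative loss.
for name, drop in sorted(drops.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {drop:.1%} relative drop")
```

Subject consistency loses a far larger share of its baseline than the unweighted mean does, while motion smoothness barely moves, which is why the averaged score understates where the damage sits.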

Pocket Experts: MobileMoE and the Memory Math of On-Device AI

Phones have memory. They also have batteries, thermal limits, app sandboxes, operating-system overhead, impatient users, and the charming habit of becoming hand warmers when developers pretend they are cloud GPUs with a smaller logo. That is the business problem behind MobileMoE, a paper that studies whether Mixture-of-Experts language models can work in the sub-billion-active-parameter regime for on-device deployment.[1] The usual MoE story belongs to giant models: add many experts, activate a few, keep per-token compute low, and let the cloud hardware worry about the rest. MobileMoE asks a less fashionable but more commercially useful question: can the same sparse principle survive inside the memory and latency budget of a smartphone? ...

Claw and Order: Why AI Agents Need a Precision Budget

Opening (why this matters now): AI agents are leaving the demo cage. They are no longer just politely completing prompts; they are planning workflows, calling tools, reading files, coordinating intermediate steps, and accumulating context like a bureaucrat hoarding PDFs. This is useful. It is also expensive. The paper “QuantClaw: Precision Where It Matters for OpenClaw” studies a problem that sounds technical but is really managerial: agent systems often run every task at a fixed numerical precision, even though not every task deserves the same computational budget.[1] A safety-critical terminal command and a lightweight retrieval summary are not the same species of work. Treating them identically is the infrastructure equivalent of sending a limousine to deliver printer paper. ...

Compress, Then Confess: Why Order Beats Method in AI Model Efficiency

A deployment team has a large model, a smaller device, and a familiar problem: the model is too heavy for the place where the business actually wants to use it. So the team reaches for the standard efficiency drawer. Prune some weights. Quantize the remaining values. Maybe add a light adapter to recover accuracy. Push the result to edge hardware, a mobile app, or a cheaper inference server. Then explain to management why the model became faster but also slightly less intelligent. The usual ritual. ...

When Tokens Explode: The Hidden Geometry Behind Attention Sinks

Serving an LLM is usually discussed in pleasantly managerial language: latency, throughput, context windows, GPU memory, quantization, cache eviction. Nice clean nouns. Then the model ruins the spreadsheet by producing internal activations that are thousands of times larger than ordinary values, while some tokens quietly become attention magnets for reasons that are not exactly semantic. Very professional behavior from a trillion-dollar technology stack. ...

Small Models, Big Mouths: Why Game AI Doesn’t Need Giant Brains

Game AI has a very ordinary problem: it has to work while the player is waiting. Not eventually. Not after a cloud round trip. Not after an impressive model has finished contemplating the metaphysics of medieval tavern gossip. In a game, intelligence has to fit inside latency budgets, memory budgets, design constraints, and the deeply unromantic fact that many players expect single-player games to work offline. ...