Model Compression

Four Bits, One Identity Crisis: What W4A4 Video Quantization Actually Breaks

TL;DR for operators: The useful surprise in Tail-Aware HiFloat4 is not that a 4-bit video model gets worse. That part is not exactly a Nobel-level plot twist. The useful surprise is where it gets worse. The paper reports a W4A4 HiFloat4 post-training quantization pipeline for Wan2.2-I2V-A14B, and under matched generation settings the unweighted mean score drops from 0.6800 to 0.5880. But the collapse is concentrated: subject consistency falls from 0.9331 to 0.5324, while aesthetic quality is effectively unchanged, overall consistency is comparable, and motion smoothness drops only slightly from 0.9923 to 0.9803.[1] ...
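To make "concentrated" concrete, here is a minimal sketch that turns the scores quoted in the excerpt above into absolute and relative drops. Only the six numbers come from the excerpt; the dictionary layout and the `drops` helper are illustrative assumptions, not anything from the paper.

```python
# Scores quoted in the excerpt: (full-precision baseline, W4A4 HiFloat4).
# The metric labels and this dict layout are illustrative assumptions.
scores = {
    "unweighted mean": (0.6800, 0.5880),
    "subject consistency": (0.9331, 0.5324),
    "motion smoothness": (0.9923, 0.9803),
}


def drops(baseline: float, quantized: float) -> tuple[float, float]:
    """Return (absolute drop, drop as a fraction of the baseline)."""
    absolute = baseline - quantized
    return absolute, absolute / baseline


report = {name: drops(*pair) for name, pair in scores.items()}

for name, (absolute, relative) in report.items():
    print(f"{name}: -{absolute:.4f} absolute, -{relative:.1%} relative")
```

On these figures, subject consistency loses roughly 43% of its baseline value, the unweighted mean about 13.5%, and motion smoothness about 1.2%, which is the sense in which one metric carries most of the damage.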

Expert Witness: How MoE Translation Models Can Lose Weight Without Losing the Plot

Translation is one of those AI workloads where scale is both a blessing and a tax. A large language model can translate with impressive robustness, follow instructions, preserve formatting, and handle messy inputs better than many older systems. Then the bill arrives. The model is not only carrying translation ability; it is also carrying mathematical reasoning, factual memory, coding patterns, roleplay habits, tool-use affordances, and several other things that are not exactly required to turn German into English. ...

No Free Tokens: The New Economics of LLM Inference

Opening: Why this matters now. For the last few years, AI strategy has been narrated as a model-quality story: bigger models, better benchmarks, longer context windows, more agents, more demos, more adjectives. That story was useful. It was also incomplete. The less glamorous reality is now arriving with the invoice attached. LLM systems are not merely models. They are production services that consume GPU memory, scheduling capacity, engineering attention, and operational patience. Once a business moves from a prototype to repeated daily use, the question changes from “Can the model answer?” to “Can the system answer reliably, cheaply, and repeatedly when real users arrive at inconvenient times?” ...

Rank and File: Why LoRA Adapters May Be Bigger Than They Need to Be

Opening: Why this matters now. Fine-tuning large models used to sound like a research luxury. Now it is a line item in the infrastructure budget. Enterprises do not want one general-purpose model behaving vaguely usefully for everyone. They want domain-specific behavior: a support adapter for insurance claims, a compliance adapter for legal review, a financial-document adapter for analyst workflows, perhaps a dozen regional variants, and then another dozen because someone discovered “brand tone” during a steering committee meeting. Naturally. ...

Compress, Then Confess: Why Order Beats Method in AI Model Efficiency

A deployment team has a large model, a smaller device, and a familiar problem: the model is too heavy for the place where the business actually wants to use it. So the team reaches for the standard efficiency drawer. Prune some weights. Quantize the remaining values. Maybe add a light adapter to recover accuracy. Push the result to edge hardware, a mobile app, or a cheaper inference server. Then explain to management why the model became faster but also slightly less intelligent. The usual ritual. ...

When 256 Dimensions Pretend to Be 16: The Quiet Overengineering of Vision-Language Segmentation

A prompt is usually a small thing. “White dog.” “Person in a blue jacket.” “Cup on the table.” Nobody hears these phrases and thinks: excellent, time to deploy a large general-purpose language encoder. Yet that is often what modern vision-language segmentation systems do. The visual model may be carefully optimized. The deployment team may obsess over image encoder latency, GPU memory, and batch size. Then the text side sits there, inherited from a larger foundation model stack, quietly burning capacity to understand what is often a noun phrase with a color adjective attached. Very sophisticated machinery, bravely parsing “red car.” Heroic. ...

Routing the Lottery: When Pruning Learns to Choose

A model can be small and still be badly organized. That is the quiet problem behind a lot of model compression work. We often ask whether a neural network can be pruned without losing too much accuracy. Fair enough. Budgets are real. Memory is not decorative. But the question hides a stronger assumption: that one sparse structure should serve every input equally well. ...

FAQ It Till You Make It: Fixing LLM Quantization by Teaching Models Their Own Family History

Compression sounds simple until the model starts forgetting how to think. A deployment team takes a large language model, squeezes its weights into lower precision, saves memory, improves serving economics, and expects the model to behave like a slightly thinner version of itself. Then INT4 arrives with a polite smile and removes just enough reasoning ability to make the business case awkward. The model still answers. It still looks fluent. It just becomes less reliable exactly where the product needed it to stay sharp. ...

Pruning Is a Game, and Most Weights Lose

Pruning usually sounds like housekeeping. Train the model. Rank the weights. Remove the small ones. Fine-tune the survivor. Pretend the whole exercise was more scientific than it looked in the notebook. That workflow has worked well enough to become familiar. But familiarity is not explanation. It tells us how to remove model components after training; it says less about why some components become removable in the first place. The paper Pruning as a Game: Equilibrium-Driven Sparsification of Neural Networks asks a sharper question: what if pruning is not merely an external compression operation, but the outcome of competition inside the model?[1] ...

Graft and Go: How Knowledge Grafting Shrinks AI Without Shrinking Its Brain

TL;DR for operators: A field robot does not care that your neural network is elegant. It cares whether the model fits on the device, runs without draining the battery, and still recognises the weed before the sprayer makes an expensive little mistake. The paper introduces knowledge grafting, a mechanism for taking selected intermediate features from a larger donor model and attaching them to a smaller deployable model, called the rootstock.[1] In the reported DeepWeeds experiment, the authors reduce a VGG16-derived model from 64.39 MB to 7.38 MB, cutting parameters from 16,880,201 to 1,934,665, while reporting 90.45% test accuracy on unseen images. ...
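As a quick sanity check on the figures in the excerpt above, the sketch below computes the implied compression ratios. The sizes and parameter counts are the ones quoted; the reading of "MB" as 2^20 bytes and the resulting bytes-per-parameter estimate are assumptions made here for illustration, not statements from the paper.

```python
# Figures quoted in the excerpt: (donor-derived model, grafted rootstock).
size_mb = (64.39, 7.38)
params = (16_880_201, 1_934_665)

# How many times smaller the grafted model is, by file size and by parameters.
size_ratio = size_mb[0] / size_mb[1]
param_ratio = params[0] / params[1]

# Fraction of the original parameters that were removed.
param_reduction = 1 - params[1] / params[0]

# ASSUMPTION: "MB" means 2**20 bytes. Under that reading, the storage cost
# per parameter can be estimated from the quoted size and parameter count.
BYTES_PER_MB = 2**20
bytes_per_param = size_mb[0] * BYTES_PER_MB / params[0]

print(f"size ratio:       {size_ratio:.2f}x")
print(f"parameter ratio:  {param_ratio:.2f}x")
print(f"parameters cut:   {param_reduction:.1%}")
print(f"bytes per param:  {bytes_per_param:.2f}")
```

The two ratios agree closely (about 8.7x each), and under the stated unit assumption the storage works out to about four bytes per parameter. That would be consistent with the saving coming from fewer parameters rather than lower numeric precision, though the excerpt itself does not say so.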