Expert Witness: How MoE Translation Models Can Lose Weight Without Losing the Plot

Translation is one of those AI workloads where scale is both a blessing and a tax. A large language model can translate with impressive robustness, follow instructions, preserve formatting, and handle messy inputs better than many older systems. Then the bill arrives. The model is not only carrying translation ability; it is also carrying mathematical reasoning, factual memory, coding patterns, roleplay habits, tool-use affordances, and several other things that are not exactly required to turn German into English. ...

June 4, 2026 · 17 min · Zelina

No Free Tokens: The New Economics of LLM Inference

Opening: why this matters now. For the last few years, AI strategy has been narrated as a model-quality story: bigger models, better benchmarks, longer context windows, more agents, more demos, more adjectives. That story was useful. It was also incomplete. The less glamorous reality is now arriving with the invoice attached. LLM systems are not merely models. They are production services that consume GPU memory, scheduling capacity, engineering attention, and operational patience. Once a business moves from a prototype to repeated daily use, the question changes from “Can the model answer?” to “Can the system answer reliably, cheaply, and repeatedly when real users arrive at inconvenient times?” ...

May 7, 2026 · 16 min · Zelina

Rank and File: Why LoRA Adapters May Be Bigger Than They Need to Be

Opening: why this matters now. Fine-tuning large models used to sound like a research luxury. Now it is a line item in the infrastructure budget. Enterprises do not want one general-purpose model behaving vaguely usefully for everyone. They want domain-specific behavior: a support adapter for insurance claims, a compliance adapter for legal review, a financial-document adapter for analyst workflows, perhaps a dozen regional variants, and then another dozen because someone discovered “brand tone” during a steering committee meeting. Naturally. ...

May 4, 2026 · 12 min · Zelina

Compress, Then Confess: Why Order Beats Method in AI Model Efficiency

A deployment team has a large model, a smaller device, and a familiar problem: the model is too heavy for the place where the business actually wants to use it. So the team reaches for the standard efficiency drawer. Prune some weights. Quantize the remaining values. Maybe add a light adapter to recover accuracy. Push the result to edge hardware, a mobile app, or a cheaper inference server. Then explain to management why the model became faster but also slightly less intelligent. The usual ritual. ...

March 21, 2026 · 20 min · Zelina

When 256 Dimensions Pretend to Be 16: The Quiet Overengineering of Vision-Language Segmentation

A prompt is usually a small thing. “White dog.” “Person in a blue jacket.” “Cup on the table.” Nobody hears these phrases and thinks: excellent, time to deploy a large general-purpose language encoder. Yet that is often what modern vision-language segmentation systems do. The visual model may be carefully optimized. The deployment team may obsess over image encoder latency, GPU memory, and batch size. Then the text side sits there, inherited from a larger foundation model stack, quietly burning capacity to understand what is often a noun phrase with a color adjective attached. Very sophisticated machinery, bravely parsing “red car.” Heroic. ...

February 13, 2026 · 15 min · Zelina

Routing the Lottery: When Pruning Learns to Choose

A model can be small and still be badly organized. That is the quiet problem behind a lot of model compression work. We often ask whether a neural network can be pruned without losing too much accuracy. Fair enough. Budgets are real. Memory is not decorative. But the question hides a stronger assumption: that one sparse structure should serve every input equally well. ...

January 30, 2026 · 18 min · Zelina

FAQ It Till You Make It: Fixing LLM Quantization by Teaching Models Their Own Family History

Compression sounds simple until the model starts forgetting how to think. A deployment team takes a large language model, squeezes its weights into lower precision, saves memory, improves serving economics, and expects the model to behave like a slightly thinner version of itself. Then INT4 arrives with a polite smile and removes just enough reasoning ability to make the business case awkward. The model still answers. It still looks fluent. It just becomes less reliable exactly where the product needed it to stay sharp. ...

January 20, 2026 · 17 min · Zelina

Pruning Is a Game, and Most Weights Lose

Pruning usually sounds like housekeeping. Train the model. Rank the weights. Remove the small ones. Fine-tune the survivor. Pretend the whole exercise was more scientific than it looked in the notebook. That workflow has worked well enough to become familiar. But familiarity is not explanation. It tells us how to remove model components after training; it says less about why some components become removable in the first place. The paper “Pruning as a Game: Equilibrium-Driven Sparsification of Neural Networks” asks a sharper question: what if pruning is not merely an external compression operation, but the outcome of competition inside the model? ...

December 29, 2025 · 15 min · Zelina

Graft and Go: How Knowledge Grafting Shrinks AI Without Shrinking Its Brain

TL;DR for operators: A field robot does not care that your neural network is elegant. It cares whether the model fits on the device, runs without draining the battery, and still recognises the weed before the sprayer makes an expensive little mistake. The paper introduces knowledge grafting, a mechanism for taking selected intermediate features from a larger donor model and attaching them to a smaller deployable model, called the rootstock. In the reported DeepWeeds experiment, the authors reduce a VGG16-derived model from 64.39 MB to 7.38 MB, cutting parameters from 16,880,201 to 1,934,665, while reporting 90.45% test accuracy on unseen images. ...

July 28, 2025 · 15 min · Zelina

Unsafe at Any Bit: Patching the Safety Gaps in Quantized LLMs

TL;DR for operators: Quantizing an LLM is not a harmless cost-saving step. It changes the model, and the paper analysed here shows that those changes can weaken safety even when familiar utility scores still look respectable. That is the uncomfortable part: the dashboard can say “performance preserved” while the model has become more willing to comply with harmful requests. Very efficient. Very modern. Very easy to miss. ...

June 26, 2025 · 20 min · Zelina