Pocket Experts: MobileMoE and the Memory Math of On-Device AI

Phones have memory. They also have batteries, thermal limits, app sandboxes, operating-system overhead, impatient users, and the charming habit of becoming hand warmers when developers pretend they are cloud GPUs with a smaller logo. That is the business problem behind MobileMoE, a paper that studies whether Mixture-of-Experts language models can work in the sub-billion-active-parameter regime for on-device deployment.1 The usual MoE story belongs to giant models: add many experts, activate a few, keep per-token compute low, and let the cloud hardware worry about the rest. MobileMoE asks a less fashionable but more commercially useful question: can the same sparse principle survive inside the memory and latency budget of a smartphone? ...

June 6, 2026 · 14 min · Zelina

Claw and Order: Why AI Agents Need a Precision Budget

Opening: Why this matters now

AI agents are leaving the demo cage. They are no longer just politely completing prompts; they are planning workflows, calling tools, reading files, coordinating intermediate steps, and accumulating context like a bureaucrat hoarding PDFs. This is useful. It is also expensive. The paper “QuantClaw: Precision Where It Matters for OpenClaw” studies a problem that sounds technical but is really managerial: agent systems often run every task at a fixed numerical precision, even though not every task deserves the same computational budget.1 A safety-critical terminal command and a lightweight retrieval summary are not the same species of work. Treating them identically is the infrastructure equivalent of sending a limousine to deliver printer paper. ...

April 27, 2026 · 11 min · Zelina

Compress, Then Confess: Why Order Beats Method in AI Model Efficiency

A deployment team has a large model, a smaller device, and a familiar problem: the model is too heavy for the place where the business actually wants to use it. So the team reaches for the standard efficiency drawer. Prune some weights. Quantize the remaining values. Maybe add a light adapter to recover accuracy. Push the result to edge hardware, a mobile app, or a cheaper inference server. Then explain to management why the model became faster but also slightly less intelligent. The usual ritual. ...

March 21, 2026 · 20 min · Zelina

When Tokens Explode: The Hidden Geometry Behind Attention Sinks

Serving an LLM is usually discussed in pleasantly managerial language: latency, throughput, context windows, GPU memory, quantization, cache eviction. Nice clean nouns. Then the model ruins the spreadsheet by producing internal activations that are thousands of times larger than ordinary values, while some tokens quietly become attention magnets for reasons that are not exactly semantic. Very professional behavior from a trillion-dollar technology stack. ...

March 6, 2026 · 16 min · Zelina

Small Models, Big Mouths: Why Game AI Doesn’t Need Giant Brains

Game AI has a very ordinary problem: it has to work while the player is waiting. Not eventually. Not after a cloud round trip. Not after an impressive model has finished contemplating the metaphysics of medieval tavern gossip. In a game, intelligence has to fit inside latency budgets, memory budgets, design constraints, and the deeply unromantic fact that many players expect single-player games to work offline. ...

February 3, 2026 · 17 min · Zelina

Rotate Less, Quantize Better: OptRot and the Geometry of LLM Compression

Packing is easy until one object is much larger than everything else. A warehouse can fit hundreds of ordinary boxes onto neatly spaced shelves. Add one grand piano, however, and the spacing plan becomes rather less elegant. Either the piano does not fit, or every shelf is redesigned around an object that appears once. ...

January 3, 2026 · 16 min · Zelina

TOGGLE or Die Trying: Giving LLM Compression a Spine

Compression needs a rulebook, not just a diet plan

Compression is the least glamorous part of the LLM business until the bill arrives. A model works beautifully in a cloud demo. Then someone asks whether it can run on a device with limited memory, limited energy, limited connectivity, and limited patience. Suddenly the elegant system becomes a logistics problem. Quantize it. Prune it. Shrink it. Hope it still speaks like the original model and not like a sleep-deprived intern summarizing a legal contract from memory. ...

December 19, 2025 · 14 min · Zelina

Unsafe at Any Bit: Patching the Safety Gaps in Quantized LLMs

TL;DR for operators

Quantizing an LLM is not a harmless cost-saving step. It changes the model, and the paper analysed here shows that those changes can weaken safety even when familiar utility scores still look respectable. That is the uncomfortable part: the dashboard can say “performance preserved” while the model has become more willing to comply with harmful requests. Very efficient. Very modern. Very easy to miss. ...

June 26, 2025 · 20 min · Zelina