Mixed Feelings: When LLM Batching Stops Being Obviously Better

Queues are where infrastructure theories go to become invoices. In LLM serving, the popular theory has been simple enough: mix the work. During inference, a model first reads the prompt in the prefill phase, then generates tokens one by one in the decode phase. Prefill wants compute. Decode wants memory bandwidth. So the obvious move is to combine them in the same batch, letting one part of the GPU do prefill while another part handles decode. This is mixed batching, and it has become the default posture in modern inference engines. ...

June 13, 2026 · 19 min · Zelina

Stale Gradients, Fresh Economics: CoCD’s Lightweight Route to Zeroth-Order AI

Memory is usually treated as a luxury in machine learning. More parameters, more activations, more optimiser state, more logs, more everything. Then the invoice arrives, the device overheats, and someone rediscovers the ancient corporate virtue of not wasting things. The paper Turning Stale Gradients into Stable Gradients makes a modest but interesting proposal: perhaps an optimiser should not throw away old gradient information just because it is old.[1] In the right setting, yesterday’s partial derivative is not spoiled milk. It is a slightly outdated map. If the terrain has not shifted too violently, it may still point in a useful direction. ...

June 13, 2026 · 16 min · Zelina

Copy Less, Catch More: The Minimal Surface Rule for Production AI

Production AI has a slightly embarrassing habit: the more intelligent the system becomes, the more basic the bottleneck starts to look. A coding agent may reason beautifully, then spend its useful life waiting for a sandbox to roll back after one bad command. A model marketplace may offer thousands of “ready-to-deploy” neural networks, then make security review so expensive that nobody checks enough of them. Apparently the future of AI can be blocked by file copies and audit queues. Very glamorous. ...

June 11, 2026 · 17 min · Zelina

Mind the Representation Gap: Why Enterprise AI Fails Before It Thinks

Enterprise AI has developed a charming habit: whenever a system fails, someone suggests using a larger model. The chatbot misread a customer complaint? Bigger model. The autonomous system struggled with a new sensor configuration? Bigger model. The video classifier understood the objects but missed the actual message? Bigger model, possibly with a more expensive logo. ...

June 11, 2026 · 14 min · Zelina

Full Stack, Not Full Panic: Why Agentic AI Needs Safety Above and KV Discipline Below

Enterprise AI has entered its awkward teenage years. It wants to be autonomous, helpful, context-aware, cheap, safe, fast, auditable, and preferably not the reason the legal department starts drinking before lunch. That is a lot to ask from “just use a bigger model.” ...

June 9, 2026 · 15 min · Zelina

MoE Than a Cost Trick: How Sparse Experts Became an Architecture Stack

The old business pitch for Mixture-of-Experts was satisfyingly simple: activate fewer parameters, spend less compute, keep more capacity on the shelf. It sounded like cloud cost optimization with a PhD. Useful, but not exactly poetic. The newer story is more interesting. Three recent arXiv papers (DOT-MoE, DAG-MoE, and LoopMoE) suggest that MoE is no longer just a sparsity trick. It is becoming an architecture stack for conditional computation: first decide how experts are formed, then how selected experts interact, and finally how sparse expert systems can be reused over iterative depth.[1][2][3] ...

June 7, 2026 · 13 min · Zelina

Pocket Experts: MobileMoE and the Memory Math of On-Device AI

Phones have memory. They also have batteries, thermal limits, app sandboxes, operating-system overhead, impatient users, and the charming habit of becoming hand warmers when developers pretend they are cloud GPUs with a smaller logo. That is the business problem behind MobileMoE, a paper that studies whether Mixture-of-Experts language models can work in the sub-billion-active-parameter regime for on-device deployment.[1] The usual MoE story belongs to giant models: add many experts, activate a few, keep per-token compute low, and let the cloud hardware worry about the rest. MobileMoE asks a less fashionable but more commercially useful question: can the same sparse principle survive inside the memory and latency budget of a smartphone? ...

June 6, 2026 · 14 min · Zelina

State of Delay: KVBuffer and the Memory Tax of Linear Attention

Latency has a habit of hiding inside words that sound efficient. “Constant decoding cost” is one of those phrases. It suggests a clean engineering promise: linear attention avoids the context-length explosion of softmax attention, so long-context inference should become simpler, cheaper, and less melodramatic. Very nice. The GPU accountants, however, have not retired. ...

June 6, 2026 · 15 min · Zelina

No Cluster Is an Island: ScaleAcross Explorer and the Geography Tax of AI Training

GPUs used to have a simple business story: buy more, wire them well, train bigger models. That story is not false. It is just starting to resemble a children’s book. The adult version has buildings, regions, power constraints, optical links, oversubscribed networks, packet loss, pipeline bubbles, model chunks, microbatches, and a quiet question with a very expensive answer: when the GPUs no longer fit comfortably inside one data center building, how should the training job be split? ...

June 5, 2026 · 18 min · Zelina

One Pass to Forecast Them All: Toto 2.0 and the Scaling Recipe for Time-Series AI

Forecasting is where machine learning often learns humility. A language model can sound clever while being wrong. A forecasting model has fewer hiding places. Revenue arrives or it does not. CPU saturation happens or it does not. Demand spikes, latency drifts, inventories rot, turbines fail, and the spreadsheet smiles politely before punishing everyone involved. This is why time-series foundation models have been treated with a particular kind of suspicion: useful, interesting, sometimes impressive, but not yet comfortably scalable in the way large language models became scalable. ...

June 5, 2026 · 18 min · Zelina