AI Infrastructure

Full Stack, Not Full Panic: Why Agentic AI Needs Safety Above and KV Discipline Below

Enterprise AI has entered its awkward teenage years. It wants to be autonomous, helpful, context-aware, cheap, safe, fast, auditable, and preferably not the reason the legal department starts drinking before lunch. That is a lot to ask from “just use a bigger model.” ...

MoE Than a Cost Trick: How Sparse Experts Became an Architecture Stack

The old business pitch for Mixture-of-Experts was satisfyingly simple: activate fewer parameters, spend less compute, keep more capacity on the shelf. It sounded like cloud cost optimization with a PhD. Useful, but not exactly poetic. The newer story is more interesting. Three recent arXiv papers (DOT-MoE, DAG-MoE, and LoopMoE) suggest that MoE is no longer just a sparsity trick. It is becoming an architecture stack for conditional computation: first decide how experts are formed, then how selected experts interact, and finally how sparse expert systems can be reused over iterative depth [1][2][3]. ...

Pocket Experts: MobileMoE and the Memory Math of On-Device AI

Phones have memory. They also have batteries, thermal limits, app sandboxes, operating-system overhead, impatient users, and the charming habit of becoming hand warmers when developers pretend they are cloud GPUs with a smaller logo. That is the business problem behind MobileMoE, a paper that studies whether Mixture-of-Experts language models can work in the sub-billion-active-parameter regime for on-device deployment [1]. The usual MoE story belongs to giant models: add many experts, activate a few, keep per-token compute low, and let the cloud hardware worry about the rest. MobileMoE asks a less fashionable but more commercially useful question: can the same sparse principle survive inside the memory and latency budget of a smartphone? ...

State of Delay: KVBuffer and the Memory Tax of Linear Attention

Latency has a habit of hiding inside words that sound efficient. “Constant decoding cost” is one of those phrases. It suggests a clean engineering promise: linear attention avoids the context-length explosion of softmax attention, so long-context inference should become simpler, cheaper, and less melodramatic. Very nice. The GPU accountants, however, have not retired. ...

No Cluster Is an Island: ScaleAcross Explorer and the Geography Tax of AI Training

GPUs used to have a simple business story: buy more, wire them well, train bigger models. That story is not false. It is just starting to resemble a children’s book. The adult version has buildings, regions, power constraints, optical links, oversubscribed networks, packet loss, pipeline bubbles, model chunks, microbatches, and a quiet question with a very expensive answer: when the GPUs no longer fit comfortably inside one data center building, how should the training job be split? ...

One Pass to Forecast Them All: Toto 2.0 and the Scaling Recipe for Time-Series AI

Forecasting is where machine learning often learns humility. A language model can sound clever while being wrong. A forecasting model has fewer hiding places. Revenue arrives or it does not. CPU saturation happens or it does not. Demand spikes, latency drifts, inventories rot, turbines fail, and the spreadsheet smiles politely before punishing everyone involved. This is why time-series foundation models have been treated with a particular kind of suspicion: useful, interesting, sometimes impressive, but not yet comfortably scalable in the way large language models became scalable. ...

Beam Me Less, Scotty: MoE Models Learn When Not to Call Every Expert

Latency has a way of turning elegant model architecture into an invoice. Mixture-of-Experts models were supposed to soften that invoice. Instead of sending every token through the same dense feed-forward machinery, an MoE layer sends each token to only a few experts. In theory, this gives us scale without paying for all parameters on every token. In practice, many deployed MoE models still behave like a restaurant that insists every guest order the same number of dishes. The experts differ, but the billable count is fixed. ...

Expert Witness: How MoE Translation Models Can Lose Weight Without Losing the Plot

Translation is one of those AI workloads where scale is both a blessing and a tax. A large language model can translate with impressive robustness, follow instructions, preserve formatting, and handle messy inputs better than many older systems. Then the bill arrives. The model is not only carrying translation ability; it is also carrying mathematical reasoning, factual memory, coding patterns, roleplay habits, tool-use affordances, and several other things that are not exactly required to turn German into English. ...

Filter Bubble Bursts: When Common Crawl Beats Clean Data

Cleaning is comforting. Every serious AI team has some version of the same ritual. Remove spam. Remove repetition. Remove bad language detection. Remove low-quality pages. Remove documents that look too weird, too short, too duplicated, too uneducational, too internet. Then hope the model learns from the respectable leftovers. That instinct is not foolish. In small or compute-constrained training runs, filtering often helps. The expensive mistake is treating that local truth as a permanent law. ...

Cache Me If You Can: Why LLM Benchmarks Need Contamination-Resistant Data

The benchmark score is not the product. The test pipeline is. Benchmarks used to feel like neutral scoreboards. A model sat down, answered questions, received a number, and everyone pretended the number meant generalization. That story became less charming once benchmark questions started appearing in the same public data oceans used to train the models being tested. ...