Beam Me Less, Scotty: MoE Models Learn When Not to Call Every Expert

Latency has a way of turning elegant model architecture into an invoice. Mixture-of-Experts models were supposed to soften that invoice. Instead of sending every token through the same dense feed-forward machinery, an MoE layer sends each token to only a few experts. In theory, this gives us scale without paying for all parameters on every token. In practice, many deployed MoE models still behave like a restaurant that insists every guest order the same number of dishes. The experts differ, but the billable count is fixed. ...

June 4, 2026 · 15 min · Zelina

Expert Witness: How MoE Translation Models Can Lose Weight Without Losing the Plot

Translation is one of those AI workloads where scale is both a blessing and a tax. A large language model can translate with impressive robustness, follow instructions, preserve formatting, and handle messy inputs better than many older systems. Then the bill arrives. The model is not only carrying translation ability; it is also carrying mathematical reasoning, factual memory, coding patterns, roleplay habits, tool-use affordances, and several other things that are not exactly required to turn German into English. ...

June 4, 2026 · 17 min · Zelina

Filter Bubble Bursts: When Common Crawl Beats Clean Data

Cleaning is comforting. Every serious AI team has some version of the same ritual. Remove spam. Remove repetition. Remove pages that fail language detection. Remove low-quality pages. Remove documents that look too weird, too short, too duplicated, too uneducational, too internet. Then hope the model learns from the respectable leftovers. That instinct is not foolish. In small or compute-constrained training runs, filtering often helps. The expensive mistake is treating that local truth as a permanent law. ...

June 4, 2026 · 14 min · Zelina

Cache Me If You Can: Why LLM Benchmarks Need Contamination-Resistant Data

The benchmark score is not the product. The test pipeline is. Benchmarks used to feel like neutral scoreboards. A model sat down, answered questions, received a number, and everyone pretended the number meant generalization. That story became less charming once benchmark questions started appearing in the same public data oceans used to train the models being tested. ...

June 3, 2026 · 20 min · Zelina

K-Means, K-Gone: Sparse Coding and the Retrieval Bottleneck

Indexing is where many retrieval systems quietly become expensive. The demo looks harmless: upload documents, create embeddings, ask questions, receive answers with citations. Then the corpus starts behaving like a real business corpus. Policies change. Product pages are rewritten. Compliance documents are replaced. Support tickets arrive every hour. The retrieval layer must keep up, and suddenly the glamorous RAG stack is waiting for the plumbing to rebuild itself. As usual, the least photogenic component is the one holding the invoice. ...

June 2, 2026 · 21 min · Zelina

Don’t Average the Needle: Spectral Retrieval and the RAG Evidence Problem

Enterprise search has a very old habit wearing a very modern jacket: it averages. A policy document becomes one vector. A runbook becomes one vector. A postmortem full of operational detail becomes one vector. Then a RAG system asks that one vector whether the document is relevant. This is convenient, fast, and usually defensible — until the relevant answer is a narrow paragraph hiding inside a large document. At that point, the retrieval system is no longer searching for evidence. It is asking a crowd to speak for the witness. ...

May 30, 2026 · 16 min · Zelina

Energy Bills for Transformers: CEM Makes Layer Design Less Empirical

Weights are expensive twice. First, they cost money to train. Then they cost money every time a model is served, copied, quantized, tuned, monitored, and occasionally blamed for a cloud bill that no one wants to read twice. This is why every architecture paper with the words “efficient,” “low-rank,” “shared,” or “recursive” immediately attracts attention. Some of that attention is deserved. Some of it is merely the industry’s permanent hunger for a cheaper miracle with a nicer benchmark table. ...

May 27, 2026 · 14 min · Zelina

The Edge Case for LLM Routing: Why Cheap Local Inference Needs a Risk Gate

Phone. That is the simplest way to understand the problem. Not “AI infrastructure,” not “distributed inference,” not the usual diagram where a cloud box smiles down upon a client device. A phone receives a query. It must decide whether to answer locally or send the request to an edge server. Once it answers locally, the decision is done. There is no elegant after-the-fact escalation. The stronger model it did not call remains unused, quietly judging from the rack. ...

May 27, 2026 · 15 min · Zelina

The Experts Are Sparse Inside: Why MoE Cost Cuts Stop at 1.2x

Cost has a way of making architecture fashionable. Mixture-of-Experts models became attractive because they promise a pleasant bargain: keep a large total parameter count, but activate only a small part of the model for each token. In business language, that sounds like capacity without the full compute bill. In engineering language, it means routing each token to a few expert feed-forward networks instead of running every expert all the time. ...

May 27, 2026 · 16 min · Zelina

The KV Cache Is Not a Detail: Why LLM Compression Needs a Control Plane

Bandwidth is one of those infrastructure costs that looks boring until it becomes the product bottleneck. A retrieval-augmented assistant gets a long document. An agentic workflow accumulates tool traces. A support chatbot reuses a large system prompt and a customer-history prefix. The model may be fast enough, the GPUs may be expensive enough, and yet the user still waits. Not because the model is thinking harder. Because the system is moving state. ...

May 27, 2026 · 15 min · Zelina