AI Infrastructure

K-Means, K-Gone: Sparse Coding and the Retrieval Bottleneck

Indexing is where many retrieval systems quietly become expensive. The demo looks harmless: upload documents, create embeddings, ask questions, receive answers with citations. Then the corpus starts behaving like a real business corpus. Policies change. Product pages are rewritten. Compliance documents are replaced. Support tickets arrive every hour. The retrieval layer must keep up, and suddenly the glamorous RAG stack is waiting for the plumbing to rebuild itself. As usual, the least photogenic component is the one holding the invoice. ...

Don’t Average the Needle: Spectral Retrieval and the RAG Evidence Problem

Enterprise search has a very old habit wearing a very modern jacket: it averages. A policy document becomes one vector. A runbook becomes one vector. A postmortem full of operational detail becomes one vector. Then a RAG system asks that one vector whether the document is relevant. This is convenient, fast, and usually defensible — until the relevant answer is a narrow paragraph hiding inside a large document. At that point, the retrieval system is no longer searching for evidence. It is asking a crowd to speak for the witness. ...

Energy Bills for Transformers: CEM Makes Layer Design Less Empirical

Weights are expensive twice. First, they cost money to train. Then they cost money every time a model is served, copied, quantized, tuned, monitored, and occasionally blamed for a cloud bill that no one wants to read twice. This is why every architecture paper with the words “efficient,” “low-rank,” “shared,” or “recursive” immediately attracts attention. Some of that attention is deserved. Some of it is merely the industry’s permanent hunger for a cheaper miracle with a nicer benchmark table. ...

The Edge Case for LLM Routing: Why Cheap Local Inference Needs a Risk Gate

Phone. That is the simplest way to understand the problem. Not “AI infrastructure,” not “distributed inference,” not the usual diagram where a cloud box smiles down upon a client device. A phone receives a query. It must decide whether to answer locally or send the request to an edge server. Once it answers locally, the decision is done. There is no elegant after-the-fact escalation. The stronger model it did not call remains unused, quietly judging from the rack. ...

The Experts Are Sparse Inside: Why MoE Cost Cuts Stop at 1.2x

Cost has a way of making architecture fashionable. Mixture-of-Experts models became attractive because they promise a pleasant bargain: keep a large total parameter count, but activate only a small part of the model for each token. In business language, that sounds like capacity without the full compute bill. In engineering language, it means routing each token to a few expert feed-forward networks instead of running every expert all the time. ...

The KV Cache Is Not a Detail: Why LLM Compression Needs a Control Plane

Bandwidth is one of those infrastructure costs that looks boring until it becomes the product bottleneck. A retrieval-augmented assistant gets a long document. An agentic workflow accumulates tool traces. A support chatbot reuses a large system prompt and a customer-history prefix. The model may be fast enough, the GPUs may be expensive enough, and yet the user still waits. Not because the model is thinking harder. Because the system is moving state. ...

AdamW and the Cost of Being Reasonable: Choosing LLM Optimizers Without Leaderboard Theater

GPU memory is the part of AI strategy that does not care about adjectives. A team can say it is building a domain LLM, a private copilot, a long-context research assistant, or a fine-tuned enterprise model. The budget spreadsheet eventually asks a colder question: what actually fits on the available hardware? Model weights need memory. Gradients need memory. Activations need memory. Checkpoints need memory. And the optimizer — the quiet machinery that decides how parameters move during training — can require multiple additional copies of the model itself. ...

Place Your Experts, Not Your Bets

The fashionable version of AI strategy still sounds suspiciously like a gym membership pitch: bigger model, more parameters, more GPUs, more everything. The operational version is less glamorous and much more important: where the computation happens, which parts of the model are actually used, how predictable demand is, and whether the system can turn those facts into lower latency, lower cost, or better decisions. ...

Queue Who’s Optimizing: Why LLM Serving Needs Math, Not More Vibes

The first wave of enterprise AI adoption was obsessed with model choice. Which model is smarter? Which model writes better? Which model can reason, code, browse, call tools, summarize contracts, and politely pretend it enjoys quarterly planning? That was the easy part. The less glamorous question is now becoming more expensive: how do we serve all these model calls reliably, cheaply, and at scale? ...

Rank and File: Why LoRA Adapters May Be Bigger Than They Need to Be

Fine-tuning large models used to sound like a research luxury. Now it is a line item in the infrastructure budget. Enterprises do not want one general-purpose model behaving vaguely usefully for everyone. They want domain-specific behavior: a support adapter for insurance claims, a compliance adapter for legal review, a financial-document adapter for analyst workflows, perhaps a dozen regional variants, and then another dozen because someone discovered “brand tone” during a steering committee meeting. Naturally. ...