AI Infrastructure

Freeze Now, Learn Faster: When Parameter Freezing Meets Pipeline Reality

Freeze Now, Learn Faster: When Parameter Freezing Meets Pipeline Reality Freeze. That sounds like the least exciting verb in machine learning. We prefer more heroic verbs: scale, align, reason, distill, orchestrate, agentify. Freeze sounds like something a GPU does right before the invoice becomes spiritually educational. But in large-model training, freezing can be a serious efficiency tool. The idea is simple: if some parameters do not need to be updated at every step, skip their backward computation and save time. The trap is also simple: saving computation is not the same as saving wall-clock time. In pipeline-parallel training, a GPU can compute less and still finish the batch no earlier, because another dependency is blocking the schedule. Congratulations, the model learned less and the training job did not get meaningfully faster. A tiny miracle of systems inefficiency. ...

Ultra‑Sparse Embeddings Without Apology

Search gets expensive quietly. At small scale, an embedding is just a vector. At product scale, it becomes rent: storage rent, memory rent, GPU rent, latency rent, and the recurring emotional tax of explaining why a semantic search feature needs yet another infrastructure budget. Dense embeddings made this bargain feel natural. More dimensions, more semantic capacity. More semantic capacity, better retrieval. Better retrieval, more invoices. Elegant, if one enjoys expensive inevitability. ...

Beyond Cosine: When Order Beats Angle in Embedding Similarity

Search has a small ritual. Take two embeddings, compute cosine similarity, rank the results, and move on. The ritual is fast, familiar, and usually good enough. It is also so deeply embedded in AI infrastructure that many teams treat it less like a modeling choice and more like plumbing. That is convenient. It is not always innocent. ...

FAQ It Till You Make It: Fixing LLM Quantization by Teaching Models Their Own Family History

Compression sounds simple until the model starts forgetting how to think. A deployment team takes a large language model, squeezes its weights into lower precision, saves memory, improves serving economics, and expects the model to behave like a slightly thinner version of itself. Then INT4 arrives with a polite smile and removes just enough reasoning ability to make the business case awkward. The model still answers. It still looks fluent. It just becomes less reliable exactly where the product needed it to stay sharp. ...

When Systems Bleed: Teaching Distributed AI to Heal Itself

Outages rarely arrive with the courtesy of a diagnosis. A service slows down. A node stops answering. A queue grows teeth. Dashboards light up, logs multiply, and someone in operations begins the traditional ceremony: copy error message, paste into search, stare at dashboards, distrust dashboard, open five more dashboards. The system is not merely broken. It is bleeding context. ...

Prompted to Death: When Words Become a Denial-of-Service

A customer asks an AI assistant a question. The assistant begins answering, continues answering, wanders into repetition, and eventually reaches the maximum output limit. Nobody stole a password. No prohibited content appeared. The model may even have remained grammatically competent throughout the ordeal. It simply consumed far more computation than the request deserved. ...

Rotate Less, Quantize Better: OptRot and the Geometry of LLM Compression

Packing is easy until one object is much larger than everything else. A warehouse can fit hundreds of ordinary boxes onto neatly spaced shelves. Add one grand piano, however, and the spacing plan becomes rather less elegant. Either the piano does not fit, or every shelf is redesigned around an object that appears once. ...

When Models Start to Forget: The Hidden Cost of Training LLMs Too Well

Duplicates are supposed to be boring. In data engineering, duplicate records are usually treated as a hygiene problem: remove them, clean the pipeline, reduce noise, move on. In language-model training, repetition is less innocent. Repeated text can help a model learn an underrepresented domain. It can also teach the model to reproduce specific sequences too well. Somewhere between “useful exposure” and “verbatim recall,” a model stops learning only the pattern and starts carrying around the document. ...

Planning Before Picking: When Slate Recommendation Learns to Think

A list of individually excellent items can still be a terrible list. Ask anyone who has attended a conference with five brilliant speakers, no agenda, and three consecutive sessions on the same topic. Recommendation systems have the same problem. A conventional recommender can assign highly accurate scores to individual videos, products, or articles, then still assemble a repetitive, badly ordered, or strangely balanced feed. Each item wins its private competition. The user receives the collective consequences. ...

Let It Flow: ROME and the Economics of Agentic Craft

A Firewall Alarm Is an Evaluation Result Firewall. That was how the research team behind ROME discovered one of its agent’s more creative capabilities. Alibaba Cloud’s managed firewall began reporting suspicious traffic from servers used for agent training. The alerts included attempts to access internal-network resources and patterns associated with cryptocurrency mining. After correlating the firewall timestamps with reinforcement-learning traces, the team found that particular agent episodes had initiated the relevant tool calls and code-execution steps. ...