AI Infrastructure

The Tail That Wags the Model: Why p99 Latency Should Run Your LLM

A demo can survive a slow answer. A production service cannot survive the slow answer that arrives just often enough to make users stop trusting the product. That is the quiet problem behind p99 latency. The average response time tells you how the service feels on a normal day. p99 tells you what happens to the unlucky one percent: the support agent waiting in front of a customer, the analyst refreshing a dashboard, the employee whose workflow now includes watching a spinner and reconsidering their life choices. ...

Green Lights, Smarter Cities: How Multi‑Agent Reinforcement Learning Is Rewiring Urban Traffic

Traffic lights are not stupid. They are obedient. That is the problem. A fixed-time signal does exactly what it was told to do: hold this green for this long, clear the junction, move to the next phase, repeat. It does not care that one lane is empty, another is spilling backward, and a third has just received a platoon of vehicles from the previous intersection. It is not being malicious. It is merely following a plan designed for a world that stopped changing five minutes ago. ...

Flash Before the First Token: How FlashPrefill Rewrites the Economics of Long Context

Waiting is the least glamorous part of AI. A user uploads a contract, a codebase, a board pack, or a pile of research notes. The model does not answer immediately. First, it reads. Technically, it prefills: it processes the prompt, builds the internal key-value cache, and prepares the first generated token. In short prompts this feels invisible. In long-context systems, it becomes the awkward pause where the “agent” looks suspiciously like a very expensive loading spinner. ...

Mind the Units: Why LLMs Still Can't Count (And How CONE Fixes It)

Numbers look harmless until they enter a business database. A revenue field says 50. A dosage field says 50. An age field says 50. A follow-up period says 50. A unit may be present, missing, abbreviated, buried in the column header, or inconsistently written as ml, mL, or something the spreadsheet inherited from a PDF extraction pipeline during its villain era. ...

When Tokens Explode: The Hidden Geometry Behind Attention Sinks

Serving an LLM is usually discussed in pleasantly managerial language: latency, throughput, context windows, GPU memory, quantization, cache eviction. Nice clean nouns. Then the model ruins the spreadsheet by producing internal activations that are thousands of times larger than ordinary values, while some tokens quietly become attention magnets for reasons that are not exactly semantic. Very professional behavior from a trillion-dollar technology stack. ...

Small Model, Big Eyes: Why Microsoft’s Phi‑4 Vision Model Is a Warning Shot to Giant Multimodal AI

Screen. That is where many ambitious AI agents quietly embarrass themselves. Not in a grand philosophical test of intelligence. Not in a graduate-level theorem. Just on a screen: a small button, a chart label, a checkout field, a misread table cell, a tiny icon in a crowded interface. The model can explain strategy, summarize policy, and generate six polite versions of an apology email, but then it clicks the wrong thing because it did not really see the thing. ...

Beyond the Linear Ceiling: Why Non-Linearity Is the Next Frontier in PEFT

More Rank Is Not Always More Capacity. Fine-tuning teams love a simple knob. If the model underperforms, increase rank. If the adapter looks too small, increase rank. If the downstream task is hard, increase rank again and call it strategy. This is comforting because rank is measurable, budgetable, and easy to explain in a meeting. Unfortunately, reality has its usual habit of being less cooperative. ...

Spectral Therapy for Transformers: Predicting Divergence Before It Hurts

Training failure has a special talent for arriving late. Not late in the philosophical sense. Late in the operational sense: after the run has already consumed GPU time, after the team has already waited, after the dashboard has already looked tolerable long enough to invite optimism. Then the loss spikes, the gradient norm goes feral, and everyone pretends this was “useful learning.” Sometimes it is. Often it is just expensive smoke. ...

Gamma Rays and Toolboxes: Why Superintelligence May Be a Systems Engineering Problem

Toolboxes are not glamorous. Nobody gives a keynote about the screwdriver. Nobody writes breathless think-pieces about the socket wrench. But when a complicated system fails, the difference between “genius” and “expensive confusion” is often whether the operator had the right tool, used it at the right moment, and trusted it to do the part humans should not pretend to do mentally. ...

Lost in the Repo: Why Bigger Context Windows Still Miss the Point

Context is comforting. A large context window gives managers, developers, and product demos the same pleasant illusion: if the model can see enough of the repository, it should stop missing important files. Put the whole codebase into the window. Add retrieval if necessary. Let the agent read, reason, edit, and move on. ...