AI Systems

Skill Issue or System Design? How LLMs Actually Follow Instructions

The checklist problem that exposes the model
Checklist tasks look boring. That is exactly why they are useful. Ask an LLM to write a formal email under 50 words, include one required term, avoid another term, and return the result as JSON. None of this sounds intellectually difficult. No theorem proving. No multimodal reasoning. No dramatic benchmark leaderboard screenshot. Just instructions. ...
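Part of what makes checklist tasks useful is that every instruction in them can be checked mechanically. Below is a minimal Python sketch of such a checker. It assumes the model was told to return an object with a single `email` field (a hypothetical format) and treats "under 50 words" as fewer than 50 whitespace-separated tokens. The function name `check_checklist` is illustrative and not taken from the article.

```python
import json
import re


def check_checklist(output: str, required: str, forbidden: str, max_words: int = 50) -> dict[str, bool]:
    """Score one model response against the four checklist items.

    Assumes (hypothetically) the model was asked to return JSON shaped
    like {"email": "..."}; this format is an illustration only.
    """
    results = {
        "json_with_email_field": False,
        "under_word_limit": False,
        "includes_required": False,
        "avoids_forbidden": False,
    }
    try:
        payload = json.loads(output)
    except json.JSONDecodeError:
        return results  # if the output does not parse, nothing else can be checked

    # Valid JSON is not enough: the hypothetical "email" field must hold a string.
    if not isinstance(payload, dict) or not isinstance(payload.get("email"), str):
        return results
    results["json_with_email_field"] = True
    body = payload["email"]

    # "Under N words": count whitespace-separated tokens, strictly fewer than the limit.
    results["under_word_limit"] = len(body.split()) < max_words

    def contains(term: str) -> bool:
        # Whole-word, case-insensitive match so "invoice" does not match "invoiced".
        return re.search(rf"\b{re.escape(term)}\b", body, re.IGNORECASE) is not None

    results["includes_required"] = contains(required)
    results["avoids_forbidden"] = not contains(forbidden)
    return results


if __name__ == "__main__":
    response = '{"email": "Dear Ms. Lee, please find the attached invoice. Kind regards, Sam"}'
    print(check_checklist(response, required="invoice", forbidden="urgent"))
```

Reporting each item separately, instead of returning a single pass/fail flag, shows which instruction a model dropped. A model can produce fluent, correct-sounding prose and still fail one item.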

Gamma Rays and Toolboxes: Why Superintelligence May Be a Systems Engineering Problem

Toolboxes are not glamorous. Nobody gives a keynote about the screwdriver. Nobody writes breathless think-pieces about the socket wrench. But when a complicated system fails, the difference between “genius” and “expensive confusion” is often whether the operator had the right tool, used it at the right moment, and trusted it to do the part humans should not pretend to do mentally. ...

Speculation, But With Standards: Training Draft Models That Actually Get Accepted

Queue. That is still the least glamorous word in AI infrastructure, and probably the most honest one. A user asks a model to write code, summarize a filing, inspect an image, or reason through a customer ticket. The model knows what to do, more or less. The bottleneck is not ambition. It is waiting: one token after another, one expensive forward pass after another, while the GPU performs a very sophisticated version of typing slowly. ...

Ask Once, Query Right: Why Enterprise AI Still Gets Databases Wrong

Database. That is where many enterprise AI demos quietly go to die. The user asks one clean natural-language question: “How many customers are in California?” The AI assistant smiles politely, searches something, finds a table that looks relevant, and returns a confident answer. The problem is not that the model cannot understand English. The problem is that five internal databases may all contain customers, states, locations, stores, loans, accounts, or sales regions. Some can answer the question. Some can almost answer it. Some merely smell like they can answer it. ...

From Talking to Living: Why AI Needs Human Simulation Computation

The chatbot that cannot check the door
A useful AI assistant can write an email, summarize a meeting, explain a regulation, or generate a plan for fixing a server problem. Then something inconvenient happens: the real world disagrees. The meeting transcript missed one speaker. The regulation changed in one jurisdiction. The server error was not caused by the code but by two services fighting over the same port. The customer sounded satisfied in the chat log but cancelled the contract two days later. The model can still talk. Beautifully, even. But it cannot always live inside the situation long enough to notice that its first answer has become stale, incomplete, or simply wrong. ...

When Models Read Too Much: Context Windows, Capacity, and the Illusion of Infinite Attention

The demo is familiar now. Someone drops a whole contract, a whole policy manual, a whole code repository, or a month of chat history into a model and asks one neat question. The model answers fluently. The room relaxes. The slide says “1M-token context.” Procurement starts smiling. This is where the trouble begins. ...

MI-ZO: Teaching Vision-Language Models Where to Look

Camera placement is an unglamorous way to lose an AI project. A vision-language model may recognize doors, ladders, rocks, chairs, and surface textures perfectly well in ordinary images. Point the camera at the wrong side of an object, however, and the relevant feature disappears. Show the model eight similarly unhelpful views and it has received more data without receiving more evidence. ...

Replace, Don’t Expand: When RAG Learns to Throw Things Away

The inbox problem hiding inside RAG
Inbox. That is the easiest way to understand what goes wrong in many retrieval-augmented generation systems. A query arrives. The system retrieves a few documents. The answer is not obvious. So the system retrieves more. Then more. Then perhaps a web search result. Then a rewritten query. Then another bundle of passages. ...