Enterprise AI

Cache Me If You Can: Why Enterprise AI Needs Latent Working Memory

A codebase is not a paragraph. Neither is a litigation folder, a clinical case file, a customer-support history, a policy archive, or the slow-motion disaster known as “all meeting notes since March.” Yet many enterprise AI systems still treat long context as a heroic prompt-engineering problem: push more text into the model, pray the key detail survives attention, and call the bill “innovation.” ...

Rewarding Behavior: Why Enterprise AI Needs More Than Bigger Models

Enterprise AI teams have developed a familiar reflex. When the model behaves unreliably, they try a better prompt. When that fails, they try a larger model. When that becomes expensive, they invent a workflow diagram with many arrows and call it an operating model. Very dignified. Very scalable, in the same way that adding more sticky notes to a broken process is scalable. ...

Trust Issues, Benchmarked: Why Hallucination Detection Is a Portfolio Problem

Trust is a bad deployment strategy. That is not a moral statement. It is an operations statement. In most enterprise AI workflows, the uncomfortable question is not “Can the model answer?” The model will answer. Models are generous like that. The question is whether the organization has a reliable way to notice when the answer is unsupported, fabricated, overconfident, or merely polished nonsense wearing a tie. ...

Trust Me, I’m Benchmarked: Why Enterprise AI Needs Two Audits

Enterprise AI has developed two favorite comfort blankets: the model’s confident explanation and the benchmark score. The first says, “Relax, I reasoned through this.” The second says, “Relax, I scored well on a public test.” Both are useful. Neither is a warranty. And when business teams treat either as proof of reliability, the result is not governance. It is theatre with better typography. ...

Edit, Actually: Why Visual AI Needs Evidence, Not Eye Candy

A dashboard is rarely confusing because the pixels are ugly. More often, the problem is that the important part is small, crowded, rotated, hidden in a chart corner, split across spatial relations, or buried inside a scene that needs to be mentally transformed before the answer becomes obvious. A human analyst zooms, marks, traces, rearranges, or imagines a new angle. A multimodal model, by contrast, is often asked to stare at the original image and talk harder. ...

Full Stack, Not Full Panic: Why Agentic AI Needs Safety Above and KV Discipline Below

Enterprise AI has entered its awkward teenage years. It wants to be autonomous, helpful, context-aware, cheap, safe, fast, auditable, and preferably not the reason the legal department starts drinking before lunch. That is a lot to ask from “just use a bigger model.” ...

Blink and You Miss It: The Two-Stage Reality Check for Multimodal AI

Multimodal AI has reached the point where it can describe videos, summarize documents with images, answer visual questions, and generate outputs that look satisfyingly complete. This is exactly why evaluation is becoming more dangerous. A system that looks competent is not necessarily reliable. It may miss the one-second event that determines the answer. Or it may notice enough evidence but then produce a fluent, attractive, visually decorated summary that quietly distorts the facts. The first failure is upstream: the model did not capture the decisive evidence. The second is downstream: the output did not preserve and present the evidence in a human-useful way. ...

Roll the Tape, Call the Tools: ReTool-Video and the Evidence-Routing Problem

Video is where AI demos go to become expensive. A model can describe a short clip. It can answer a question about a few sampled frames. It can even sound confident while doing so, which is apparently a product feature now. But business video work is rarely “what is happening in this five-second clip?” It is usually messier: find the exact moment in a two-hour training recording, count repeated actions without double-counting adjacent clips, verify whether an event appears in audio, subtitles, and frames, or decide whether a safety incident is real rather than just visually similar to one. ...

Search, Critique, Repeat: Critic-R Turns RAG Complaints into Retriever Training

Search failure is boring until it becomes expensive. A research agent asks for evidence. The retriever returns documents. The reasoning model reads them, continues writing, and eventually produces a confident answer. Somewhere in the middle, the evidence was slightly wrong: not irrelevant enough to trigger an obvious failure, not useful enough to support the next reasoning step. The agent proceeds anyway, because that is what agents do when we dress up uncertainty as workflow automation. ...

The Policy Has to Work Somewhere: RL for Scale, Trust, and Other Inconveniences

Deployment is where elegant AI systems go to meet bandwidth caps, slow devices, noisy user preferences, and privacy policies written by committees with very strong coffee. That is the useful lens for reading Guangchen Lan’s dissertation, Reinforcement Learning for Scalable and Trustworthy Intelligent Systems. It is tempting to describe the work as a collection of four reinforcement-learning methods: one for synchronous federated RL, one for asynchronous federated RL, one for preference optimization, and one for contextual privacy. Technically, that is true. Editorially, it is the least interesting way to read it. ...