Business Automation

DRIFT-BENCH: When Agents Stop Asking and Start Breaking

A user says, “Update the record with a sensible value.” That sentence is small. The damage may not be. For a normal chatbot, the worst outcome might be a vague answer wearing a confident expression. Annoying, yes, but usually recoverable. For an agent connected to a database, file system, workflow platform, or API service, the same ambiguity becomes operational. The model may update the wrong row, call the wrong endpoint, overwrite a file, or politely explain its mistake after making it. Charming, in the same way a self-driving forklift is charming. ...

Small Models, Big Mouths: Why Game AI Doesn’t Need Giant Brains

Game AI has a very ordinary problem: it has to work while the player is waiting. Not eventually. Not after a cloud round trip. Not after an impressive model has finished contemplating the metaphysics of medieval tavern gossip. In a game, intelligence has to fit inside latency budgets, memory budgets, design constraints, and the deeply unromantic fact that many players expect single-player games to work offline. ...

Guardrails Over Gigabytes: Making LLM Coding Agents Behave

The coding agent did not fail quietly. That was the point. A coding agent writes a patch. The patch looks plausible. The imports are clean enough. The function names sound like they belong in the repository. The explanation is fluent, naturally. Fluency is what these systems do best. Then the build breaks. ...

When One Clip Isn’t Enough: Teaching LLMs to Watch Long Videos Like Adults

Video is a terrible place to hide evidence. Not because the evidence is invisible. Because it is usually obvious only after someone has already found the right minute, the right scene, and the right visual detail. A person reviewing a long customer-support screen recording, a training video, a compliance recording, or a surveillance clip rarely watches everything with equal attention. They skim, localize, zoom in, check the detail, and then answer. Primitive, yes. Effective, also yes. ...

Policy Gradients Grow Up: Teaching RL to Think in Domains

The problem is not that RL cannot plan. It is that it keeps learning the wrong object. A warehouse robot can learn to pick up box A from shelf B and move it to station C. Very impressive, until tomorrow’s warehouse has different boxes, different shelves, and a new station name. The action label changed. The task structure did not. ...

Darwin, But Make It Neural: When Networks Learn to Mutate Themselves

A system breaks after a rule changes. The recommendation model suddenly faces a new product catalog. The warehouse routing policy meets a new constraint. A trading bot trained in one market regime walks into another and immediately discovers that yesterday’s “smart behavior” is today’s elegant way to lose money. The usual engineering instinct is to retrain, retune, or ask a human to adjust the knobs. Very modern. Very expensive. Very Tuesday. ...

Teach Me Once: How One‑Shot LLM Guidance Reshapes Hierarchical Planning

Teach Me Once, Then Please Stop Calling the API

A familiar enterprise automation story starts with a competent but expensive expert in the loop. At first, the expert is useful. They interpret messy instructions, break tasks into sensible stages, and recover when something goes wrong. Then the workflow scales. Suddenly the expert is being called for every transaction, every exception, every tiny decision that could probably have been handled by a trained local process. What began as intelligence becomes latency, cost, and operational dependency. Very elegant. Very billable. Not always very deployable. ...

Bench to the Future: Why E-commerce Is the Real Final Boss for Foundation Agents

Shopping looks easy until someone has to calculate the customs duty. That is roughly the lesson of EcomBench, a new benchmark designed to evaluate foundation agents on realistic e-commerce tasks.[1] The paper’s most useful finding is not that one model ranks above another. Leaderboards are entertaining, in the same way airport departure boards are entertaining when your flight is already delayed. The useful finding is the shape of failure. ...

Scientific Reasoning Under the Microscope: How PRiSM Stress-Tests the New Generation of Multimodal Models

Grades are comforting. A model solves 80% of the benchmark, the leaderboard smiles, the demo team relaxes, and someone in procurement quietly starts asking whether the engineering team still needs that many humans. This is usually the part where reality coughs politely. ...

Graph Minds & Gaussian Time: Why SHRIKE Rewrites Audio‑Visual Reasoning

Sound is messy. Video is messy. Put them together in a real business environment—a factory floor, a training room, a retail aisle, a vehicle cabin—and the usual fantasy of clean perception quietly dies in a corner. A camera can see a person holding a tool. A microphone can hear a machine alarm. But the useful question is rarely “what objects exist?” or “what sound is present?” It is more awkward: which thing made the sound first? Where is the loudest source? Was the visible action actually producing the audio event, or merely happening near it? ...