Robustness

The Viscosity Budget: Why Softmax Is Not Just a Knob

TL;DR for operators: A new paper by Jose Marie Antonio Miñoza, Erika Fille T. Legara, and Christopher P. Monterola argues that a log-sum-exp neural layer is not merely analogous to a viscous Hamilton-Jacobi equation. Under the paper’s parameterisation, it is exactly the Hopf-Cole solution of one, evaluated at the input point.[1] The operational point is not “neural networks are physics now”, although someone will certainly try to put that on a slide. The point is cleaner: one parameter, $\varepsilon$, simultaneously controls softmax temperature, PDE viscosity, and entropy-regularised convex optimisation. That makes smoothness, expressiveness, robustness, attribution sharpness, and scaling behaviour mathematically coupled. ...
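To make the role of $\varepsilon$ concrete, here is a minimal sketch of the standard temperature-scaled log-sum-exp, $\varepsilon \log \sum_i \exp(x_i / \varepsilon)$, and its gradient, which is softmax at temperature $\varepsilon$. This is the textbook form, assumed here for illustration; the paper's exact parameterisation is not shown in this excerpt, and the names `smooth_max` and `softmax` are hypothetical.

```python
import math


def smooth_max(xs, eps):
    """Temperature-scaled log-sum-exp: eps * log(sum(exp(x / eps))).

    Subtracting the maximum first keeps the exponentials from overflowing.
    Assumed textbook form, not necessarily the paper's parameterisation.
    """
    m = max(xs)
    return m + eps * math.log(sum(math.exp((x - m) / eps) for x in xs))


def softmax(xs, eps):
    """Gradient of smooth_max with respect to xs: softmax at temperature eps."""
    m = max(xs)
    weights = [math.exp((x - m) / eps) for x in xs]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    scores = [1.0, 2.0, 3.0]
    # Small eps: close to a hard max with near one-hot weights.
    # Large eps: heavily smoothed, with weights close to uniform.
    for eps in (0.01, 1.0, 100.0):
        value = smooth_max(scores, eps)
        weights = softmax(scores, eps)
        print(eps, round(value, 4), [round(w, 3) for w in weights])
```

As $\varepsilon$ shrinks, `smooth_max` approaches the hard maximum and the softmax weights sharpen toward one-hot; as it grows, both are smoothed out. The value always sits between $\max_i x_i$ and $\max_i x_i + \varepsilon \log n$, which is one simple way to see a single parameter trading sharpness against smoothness.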

Context Is Not a Costume: Why Strong Agents Still Fail on Contact

The agent looks ready. Then reality answers back. The current AI-agent story is conveniently simple. Take a powerful foundation model, wrap it in tools, give it a workflow, add a polite system prompt, and call the result “ready for deployment.” Reality, as usual, has poor manners. Two recent arXiv papers examine very different agent settings. One studies whether multimodal AI agents can align their behavior with the cognitive age of child users. The other studies whether behavior foundation models for imitation learning can remain robust when the physical dynamics of an environment shift after training. They do not share a benchmark, a model class, or even the same deployment domain. That is precisely why they are useful together. ...

Claw-Eval — When Agents Game the System, the System Needs Claws

The agent finished the task. That is not the same as doing the task. Inbox sorted. Calendar updated. Report generated. Customer record changed. Dashboard refreshed. For a demo, that is usually enough. The screen shows a plausible answer, the final artifact looks tidy, and everyone politely pretends the agent must have followed the correct path because the output did not immediately burst into flames. ...

Mind the Gap: Why Continual Learning Fails—and How Local Classifier Alignment Fixes It

Updating a model sounds harmless until the old parts of the system start reading the new representations incorrectly. That is the less theatrical version of catastrophic forgetting. Not the dramatic story where a neural network “forgets everything” like a distracted intern. The more useful story is quieter: a deployed AI system adapts its backbone to new data, the feature space shifts, and classifiers trained for earlier tasks are left calibrated to yesterday’s geometry. ...

Stable World Models, Unstable Benchmarks: Why Infrastructure Is the Real Bottleneck

A robot does not fail politely. It does not say, “I was trained on a slightly different shade of blue.” It just misses the object, pushes the wrong way, or confidently follows a plan that only works in the tidy little universe where the benchmark was born. That is the uncomfortable lesson behind stable-worldmodel-v1, a paper that is less about inventing a new world model and more about asking whether world-model research has been measuring the right thing in the first place.[1] ...

When Agents Stop Talking to the Wrong People

Communication sounds harmless until the wrong person gets the microphone. That is true in meetings. It is also true in multi-agent AI systems. The polite version says agents “collaborate,” “debate,” and “refine each other’s reasoning.” The less decorative version is that one agent’s output becomes another agent’s input. If the first agent is wrong, confused, strategically misleading, or simply having one of those tiny synthetic breakdowns that LLMs have with impressive confidence, the system has just created a distribution channel for bad judgment. ...

When One Patch Rules Them All: Teaching MLLMs to See What Isn’t There

Image security has an awkward habit of sounding theoretical until the image is inside a business workflow. A product team adds an image-upload feature. A compliance team uses multimodal models to inspect screenshots. A support bot reads photos from customers. A research assistant summarizes figures from PDFs. Everyone understands that the model may occasionally misread an image. That is ordinary error. Annoying, but ordinary. ...

Same Moves, Different Minds: Rashomon Comes to Sequential Decision-Making

A taxi is a useful little trap. It looks harmless: pick up passengers, drive them to destinations, do not run out of fuel. A small grid-world taxi environment is not exactly the sort of thing that makes executives whisper “agentic transformation” over terrible conference coffee. But that is precisely why it works. Strip away the enterprise theatre, and sequential decision-making becomes easier to see. An agent observes a state, chooses an action, receives the next state, and repeats. If two agents always make the same moves and achieve the same objective, most organizations would treat them as equivalent. Same behavior, same operational meaning. Audit passed. Ship it. ...

How to Make Neural Networks Talk: Register Automata as Their Unexpected Interpreters

Prices move. Sensors drift. Users click, pause, return, disappear, and sometimes behave exactly like a Markov chain with a caffeine problem. Modern sequence models are good at turning such streams into decisions. A recurrent network or transformer can look at a run of numbers and say: buy, flag, reject, approve, alert. What it usually cannot do is explain the rule it has learned in a form that a risk team, engineer, or auditor can actually inspect. ...

Pop-Ups, Pitfalls, and Planning: Why GUI Agents Break in the Real World

Pop-up. That tiny word hides a surprisingly large operational problem. A human sees a battery warning, an update prompt, a permission dialog, or a frozen app and does something boringly competent: dismiss it, recover context, re-check the screen, and continue. A GUI agent, meanwhile, may confidently continue a plan that no longer matches reality. The machine has not “failed” in the theatrical sense. It has simply treated a live workflow like a polite screenshot sequence. Very enterprise. Very doomed. ...