AI Governance

Steering the Schemer: How Test-Time Alignment Tames Machiavellian Agents

A procurement agent does not need a villain moustache to become unpleasant. Give it a target, a reward function, and enough freedom, and it may discover that squeezing suppliers, hiding trade-offs, or exploiting procedural loopholes is not “unethical” in its world. It is just efficient. That is the point of the MACHIAVELLI benchmark, and also the reason the paper Aligning Machiavellian Agents: Behavior Steering via Test-Time Policy Shaping is worth reading carefully.1 The paper is not selling a new moral soul for AI agents. Thankfully. We have enough vendors selling souls already. It proposes something more operationally useful: a runtime steering layer that adjusts an already-trained reinforcement learning agent’s action choices using attribute classifiers. ...

When Noisy Data Talks Back: The Fragile Art of Learning Under Infinite Contamination

Bad data is not one problem. It is at least three problems wearing the same cheap trench coat. There is bad data that appears once and disappears. There is bad data that keeps appearing, but becomes rarer as the corpus grows. And there is bad data that settles in at a stable rate, like a permanent tenant with poor hygiene and legal representation. Business discussions about AI training data often compress these into one vague category called “noise”. Convenient, yes. Informative, no. ...

Bandits, Budgets, and the Art of Waiting: How Delay-Aware Algorithms Rewire Resource Allocation

Budgets arrive before outcomes. That is the small administrative tragedy behind many allocation systems. A university decides which students receive financial aid before it knows who will persist. A workforce programme assigns training slots before employment outcomes appear. A healthcare provider prioritises interventions before the full treatment effect is visible. The decision is immediate; the evidence drips in later, usually after the next decision has already been made. Naturally, many algorithms pretend this is not happening. Very elegant. Also very wrong. ...

Choosing Wisely: How MACHOP Turns Logic Puzzles into Preference Machines

A schedule looks reasonable until someone asks why. Why did this nurse get the night shift? Why was this invoice routed for manual review? Why did the configuration engine reject one product bundle and approve another? In many operational systems, the answer is not a single rule. It is a chain of constraints: availability, capacity, dependencies, exclusions, thresholds, and the occasional policy clause someone wrote in 2017 and nobody wants to touch. ...

Logic With a View: When Standpoints Meet Non‑Monotonicity

Decisions Rarely Fail Because Everyone Disagrees
Businesses are quite used to disagreement. Risk says no, growth says yes, legal says “only if we phrase it carefully,” and compliance brings a spreadsheet that somehow makes everyone sad. The hard part is not that these groups disagree. The hard part is that they often disagree using partly shared language. “Eligible,” “material,” “reasonable,” “high risk,” “recommended,” and “approved” may look like one vocabulary. In practice, they are local dialects wearing corporate badges. ...

Bodies Do the Thinking: Why Physical AI Changes the Intelligence Game

A robot helping a patient stand is not solving a benchmark. It is sharing weight, sensing resistance, absorbing surprise, and deciding how much force is too much. That last phrase is where most AI language starts to get suspiciously cloudy. “Deciding” sounds like a software problem. In physical systems, it is also a stiffness problem, a damping problem, an energy problem, and occasionally a liability problem wearing hospital slippers. ...

Memory, Bias, and the Mind of Machines: How Agentic LLMs Mislearn

TL;DR for operators
Memory is becoming the fashionable upgrade for AI agents: let the system remember past tasks, extract lessons, and improve without retraining the model. Sensible. Also slightly dangerous, in the same way giving a junior analyst a notebook is useful until they start rewriting the notebook after every meeting. The important result is not that memory sometimes contains bad facts. Everyone who has used software, people, or software made by people already knew that. The sharper point is that useful experience can become faulty during the act of consolidation. When an LLM agent compresses raw trajectories into reusable textual lessons, it may strip away conditions, merge unlike cases, or turn a narrow success into a general rule. The memory then looks cleaner while becoming less true. Very enterprise. ...

Parallel Worlds of Moderation: How LLM Simulations Are Stress-Testing Online Civility

TL;DR for operators
Moderation is usually measured after the mess has already happened. COSMOS changes the sequence: it lets researchers run a synthetic online conversation twice, once without moderation and once with a selected intervention, while keeping the simulated world otherwise constant.1 That is the useful idea. Not “LLMs can pretend to be angry internet users,” though they can, which is an achievement of sorts. The useful idea is controlled comparison. ...

Patch, Don’t Preach: The Coming Era of Modular AI Safety

A patch is not a sermon. That distinction matters, because enterprise AI safety has spent too much time sounding like moral philosophy and too little time behaving like maintenance engineering. A deployed model develops a toxicity problem. A customer discovers a jailbreak route. A regulator changes the acceptable boundary for refusal. The usual answer is some combination of “wait for the next model release,” “fine-tune a new variant,” or “wrap it in another brittle instruction.” Very comforting. Also not exactly what one wants when the system is already in production. ...

The Gospel of Faithful AI: How FaithAct Rewrites Reasoning

TL;DR for operators
FaithAct is useful because it changes the unit of control. Instead of asking whether a multimodal model’s final answer is correct, it asks whether each intermediate claim is supported by the image before that claim is allowed to steer the next step.1 That is a more operational target. Accuracy tells you whether the system arrived somewhere acceptable; perceptual faithfulness tells you whether it drove through the road or hallucinated a bridge. ...