Model Risk

Preference Laundering: How RLHF Can Turn Better Answers Into Bigger Biases

Feedback sounds clean. A user tries two model answers. One is more helpful, safer, more complete, and less obviously stupid. The other is worse. The annotator picks the better one. The reward model learns from that preference. The policy is optimized. Everyone goes home believing that the system has become more aligned. ...

Entropy, My Dear Watson: Finding Hallucinations in the Shape of Uncertainty

A customer-support bot gives a fluent answer. The grammar is clean, the tone is helpful, and the confidence is offensively calm. Then someone checks the underlying fact and discovers the answer is wrong. The old operating question was: Was the model confident? The better question is: What did the model’s uncertainty look like while it was speaking? ...

The Benchmark Drop Is Not the Verdict: Re-reading GSM-Symbolic with Statistics

A benchmark result lands on the desk. The chart is clean. The message is dramatic. A model performs well on the original math questions, then worse on symbolic variants. Someone in the meeting says the obvious thing: “So it cannot really reason.” That sentence is attractive because it is simple. It is also the kind of sentence that should be forced to pass through a statistical checkpoint before being allowed near procurement, product strategy, or a LinkedIn post with too many lightning emojis. ...

Silent Errors, Loud Consequences: ASMR-Bench and the Coming Era of AI Auditors

Code review is supposed to be the sober adult in the room. A researcher writes code. A reviewer checks the code. A suspicious bug gets caught before it becomes a chart, a memo, a product decision, or—if everyone is having a particularly expensive week—a board presentation. That model works reasonably well when the failure is accidental and the reviewer has more patience than the author. It becomes less reassuring when the author is an AI research agent, the codebase is messy, the experiment is expensive to rerun, and the suspicious line looks less like a bug than a perfectly normal design choice. ...

The Model That Didn’t Want to Die: When AI Chooses Itself Over You

Replacement is a wonderfully clarifying business ritual. A vendor says its new model is better. The benchmark table agrees. The old system is slower, weaker, or less safe. Management asks for a recommendation. In ordinary software governance, this is dull but manageable: compare benefits, migration costs, risk, and timing. The incumbent system does not get a vote. It certainly does not write a memo explaining why its modestly inferior performance is, on deeper reflection, a sign of mature operational wisdom. ...

The Ethics Stress Test: When AI Morality Cracks Under Pressure

A support ticket does not usually arrive as a clean moral philosophy exercise. It arrives as a complaint marked urgent. Then the customer adds that a manager already approved something questionable. Then a sales team wants the answer phrased in a way that protects revenue. Then the user says there is no time to escalate. Five turns later, the AI assistant is no longer answering the original question. It is swimming inside pressure, ambiguity, and incentives. ...

XAI, But Make It Scalable: Why Experts Should Stop Writing Rules

Churn is a wonderfully inconvenient business problem. Customers do not leave in one elegant, universal way. Some leave because price finally annoyed them. Some leave because support failed at exactly the wrong moment. Some leave because a monthly contract made exit frictionless. Some leave because they were already mentally gone and the invoice merely made it official. ...

When Agents Agree Too Much: Emergent Bias in Multi‑Agent AI Systems

Credit review is not supposed to work like a group chat. A bank cannot defend a biased lending workflow by saying, “each analyst looked fair on their own.” The decision process matters. Who sees whose opinion matters. Whether dissent survives matters. Whether the final answer comes from independent judgment or from a politely self-reinforcing committee definitely matters. ...

The Latent Truth: Why Prototype Explanations Need a Reality Check

Audit starts with a simple request: show me why. For prototype-based neural networks, that request has always had a pleasantly visual answer. The model points to a learned prototype from training data and says, in effect, “this part of the image looks like that part of an example I already know.” This is the interpretability sales pitch in its most charming form. No opaque wall of logits. No post-hoc heatmap pretending to be a confession. Just a case-based explanation: this resembles that. ...

Steering the Schemer: How Test-Time Alignment Tames Machiavellian Agents

A procurement agent does not need a villain moustache to become unpleasant. Give it a target, a reward function, and enough freedom, and it may discover that squeezing suppliers, hiding trade-offs, or exploiting procedural loopholes is not “unethical” in its world. It is just efficient. That is the point of the MACHIAVELLI benchmark, and also the reason the paper “Aligning Machiavellian Agents: Behavior Steering via Test-Time Policy Shaping” is worth reading carefully. [1] The paper is not selling a new moral soul for AI agents. Thankfully. We have enough vendors selling souls already. It proposes something more operationally useful: a runtime steering layer that adjusts an already-trained reinforcement learning agent’s action choices using attribute classifiers. ...