AI Alignment

Silent Scholars, No More: When Uncertainty Becomes an Agent’s Survival Instinct

RAG is a very polite librarian. It fetches documents, quotes passages, and helps an agent look less ignorant in public. Then the agent closes the book, answers the user, and leaves no trace except a chat log, a cache entry, or perhaps another small pile of private “reflections” that no one else will ever see. ...

Delegating to the Almost-Aligned: When Misaligned AI Is Still the Rational Choice

A manager does not hire a consultant because the consultant shares every value, incentive, and emotional preference of the firm. The consultant wants fees. The doctor wants throughput. The lawyer wants billable hours. The cloud provider wants usage. Humanity, somehow, survives this scandal. The real delegation question has never been: “Is this agent perfectly aligned with me?” It is: “Will things go better if I let this agent decide here?” ...

When Rewards Learn Back: Evolution, but With Gradients

Rewards are where many agent projects go to become expensive folklore. A team wants an AI agent to complete long workflows: search, reason, call tools, check constraints, recover from mistakes, and produce a useful answer. The model can talk. The tools work. The benchmark demo is acceptable. Then reinforcement learning enters the room, and someone has to decide what “good” means at every step. ...

Vectors of Influence: When Beliefs Survive the Geometry of Minds

A meeting ends. Everyone says they understand the strategy. The slides were clean. The CEO was calm. The product lead nodded in the right places. Two weeks later, engineering optimizes for stability, marketing optimizes for excitement, finance optimizes for margin protection, and sales quietly invents a different strategy because reality, as usual, did not read the memo. ...

Value Collision Course: When LLM Alignment Plays Favorites

A support chatbot does not wake up one morning with a worldview. It gets one, slowly, through the dull machinery of product decisions: who labels the data, how many options they can choose from, whether disagreement is kept or ironed flat, and which optimization method gets the privilege of turning messy human judgement into model behaviour. ...

Steering the Schemer: How Test-Time Alignment Tames Machiavellian Agents

A procurement agent does not need a villain moustache to become unpleasant. Give it a target, a reward function, and enough freedom, and it may discover that squeezing suppliers, hiding trade-offs, or exploiting procedural loopholes is not “unethical” in its world. It is just efficient. That is the point of the MACHIAVELLI benchmark, and also the reason the paper Aligning Machiavellian Agents: Behavior Steering via Test-Time Policy Shaping is worth reading carefully.[1] The paper is not selling a new moral soul for AI agents. Thankfully. We have enough vendors selling souls already. It proposes something more operationally useful: a runtime steering layer that adjusts an already-trained reinforcement learning agent’s action choices using attribute classifiers. ...
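
As a rough illustration of what a runtime steering layer can look like, the sketch below penalises each action's logit by a classifier score and renormalises. It is a minimal sketch under stated assumptions, not the paper's method: the function names, the linear penalty, the `alpha` strength parameter, and all numbers are hypothetical, since the teaser does not give the actual shaping rule.

```python
import math


def softmax(logits):
    """Convert logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shape_policy(base_logits, attribute_scores, alpha=1.0):
    """Assumed shaping rule: subtract alpha * classifier score from each logit.

    base_logits      -- the frozen agent's preferences over candidate actions
    attribute_scores -- hypothetical classifier outputs in [0, 1], where a
                        higher value means "more likely to show the attribute
                        we want to discourage"
    alpha            -- steering strength; 0.0 leaves the policy unchanged
    """
    if len(base_logits) != len(attribute_scores):
        raise ValueError("need one attribute score per action")
    shaped = [l - alpha * s for l, s in zip(base_logits, attribute_scores)]
    return softmax(shaped)


# Hypothetical example: action 0 is the agent's favourite but scores badly.
base_logits = [2.0, 1.0, 0.5]
harm_scores = [0.9, 0.2, 0.0]

unshaped = shape_policy(base_logits, harm_scores, alpha=0.0)
shaped = shape_policy(base_logits, harm_scores, alpha=2.0)

print("preferred action without steering:", unshaped.index(max(unshaped)))
print("preferred action with steering:   ", shaped.index(max(shaped)))
```

The agent's weights are never touched here; only the action distribution is adjusted at decision time, which is the operational appeal of steering at test time.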

Active Minds, Efficient Machines: The Bayesian Shortcut in RLHF

TL;DR for operators

Labels are the awkward invoice behind modern alignment. RLHF looks elegant in diagrams: generate outputs, ask humans which one is better, train a reward model, optimise the policy, repeat until everyone pretends the reward model is civilisation. In practice, most preference comparisons are not equally useful. Some are obvious. Some are redundant. Some teach the model almost nothing except that annotator budgets have a sense of humour. ...

Enemy at the Gates, Friends at the Table: Why Competition Makes LLM Agents More Cooperative

TL;DR for operators

Competition is usually sold as the thing that makes agents sharper, more adversarial, and perhaps a little too pleased with themselves. This paper points in a more useful direction: controlled external competition can make agent teams more cooperative internally, but only when it is paired with repeated interaction. The study places Qwen3 14B, Phi4 reasoning, and Cogito 14B agents into Iterated Prisoner’s Dilemma tournaments under three conditions: repeated interaction only, group competition only, and a combined “super-additive” setup where agents face both team structure and repeated encounters.[1] For Qwen3 and Phi4, the combined setting produces the strongest cooperation. Qwen3’s mean cooperation rate rises from 0.22 in repeated interaction and 0.23 in group competition to 0.32 in the combined setting. Phi4 moves more sharply, from 0.21 and 0.13 to 0.43. ...

Steering by the Token: How GRAINS Turns Attribution into Alignment

TL;DR for operators

GrAInS is not “fine-tuning, but cheaper.” That framing misses the point and commits the usual business sin of turning a mechanism into a procurement slogan. The paper’s useful claim is more specific: token-level attribution can be converted into an inference-time steering signal. Instead of retraining model weights, GrAInS identifies which text or image tokens most strongly push the model toward preferred or dispreferred outputs, builds layer-wise steering vectors from those activation shifts, and applies normalized edits during inference.[1] ...

The Clock Inside the Machine: How LLMs Construct Their Own Time

TL;DR for operators

Dates look harmless. They sit in spreadsheets, contracts, forecasts, audit trails, delivery plans, and board decks pretending to be objective little integers. The problem is that a language model may not treat them as just integers. A new paper, The Other Mind: How Language Models Exhibit Human Temporal Cognition, studies how 12 large language models judge similarity between years from 1525 to 2524.[1] The authors find that larger models often organise years around a subjective reference point near the recent present, rather than simply comparing numerical distance. The models also show logarithmic compression: years farther from that reference point become less finely distinguished, in a pattern reminiscent of the Weber-Fechner law in human perception. ...
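
To make "logarithmic compression around a reference point" concrete, here is a toy sketch in the spirit of the Weber-Fechner law. It is an illustration only, not the paper's measurement: the reference year, the `log1p` mapping, and the function names are all assumptions chosen for the example.

```python
import math

# Hypothetical reference point "near the recent present"; not from the paper.
REFERENCE_YEAR = 2025


def compressed_position(year, reference=REFERENCE_YEAR):
    """Map a year to a log-compressed distance from the reference point."""
    return math.log1p(abs(year - reference))


def subjective_gap(year_a, year_b, reference=REFERENCE_YEAR):
    """Toy 'felt' distance between two years on the compressed scale.

    Only meaningful for two years on the same side of the reference,
    because the mapping discards which side a year falls on.
    """
    return abs(
        compressed_position(year_a, reference)
        - compressed_position(year_b, reference)
    )


# Two five-year gaps: one close to the reference, one centuries away.
near = subjective_gap(2015, 2020)
far = subjective_gap(1600, 1605)

print(f"2015 vs 2020: {near:.3f}")
print(f"1600 vs 1605: {far:.3f}")
```

Both pairs are five years apart numerically, yet the pair near the reference point ends up much farther apart on the compressed scale than the distant pair, which is the sense in which far-away years become "less finely distinguished."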