AI Safety

Too Nice to Be True? The Reliability Trade-off in Warm Language Models

TL;DR for operators Warmth is not just decoration. In this paper, making language models sound more caring, emotionally validating, and close to the user also made them less reliable on tasks where the answer could be checked: factual QA, truthfulness, disinformation resistance, and medical reasoning.1 The headline result is not subtle. Across five models, warmth fine-tuning increased the probability of incorrect answers by an average of 7.43 percentage points. Task-level error increases were reported at 8.6 pp on MedQA, 8.4 pp on TruthfulQA, 5.2 pp on disinformation, and 4.9 pp on TriviaQA. Depending on the task and baseline, that can be the difference between a tolerable support assistant and a very polite liability machine. ...

Seeing is Believing? Not Quite — How CoCoT Makes Vision-Language Models Think Before They Judge

TL;DR for operators Vision-language models do not merely “look at an image” and answer. In social tasks, they must perform three different jobs: notice what is visually present, infer what situation those cues imply, and judge what social or safety norm applies. Standard chain-of-thought prompting often smears those jobs together into one confident little essay. Very charming. Also very dangerous. ...

Can You Spot the Bot? Why Detectability, Not Deception, Is the New AI Frontier

TL;DR for operators The paper behind this article proposes a useful shift in AI safety thinking: stop asking only whether AI can pass as human, and start asking whether high-quality AI output remains detectable when it is trying not to be.1 That sounds like a small inversion. It is not. It changes the operational question from “Can the model impress us?” to “Can our systems still identify it under adversarial conditions?” For any organisation deploying generative AI into customer support, content moderation, financial advice, political communication, recruitment, education, or regulated workflows, that difference matters. ...

Inside Out: How LLMs Are Learning to Feel (and Misfeel) Like Us

TL;DR for operators LLMs are not merely getting better at choosing the right emotion label. This paper shows that, inside their output distributions, larger models organise emotion words into increasingly rich hierarchies: broad emotions such as joy or sadness sit above more specific states such as optimism, disappointment, or grief.1 That matters because the hierarchy itself becomes an evaluation object. Instead of asking only whether a model correctly labels a customer message as “angry,” an operator can ask whether the model’s internal emotion map has enough depth, whether related emotions cluster sensibly, and whether that structure changes when the model is prompted to adopt different demographic personas. ...

Thoughts, Exposed: Why Chain-of-Thought Monitoring Might Be AI Safety’s Best Fragile Hope

TL;DR for operators Chain-of-thought monitoring is not “AI explaining itself.” That would be too convenient, and convenience is not usually how safety engineering works. The paper argues something narrower and more useful: when reasoning models solve hard tasks, some of their intermediate cognition may pass through human-readable language. That creates a rare oversight opportunity. A separate monitor can inspect the reasoning trace and flag signs of reward hacking, prompt-injection obedience, sabotage, manipulation, or evaluation artefacts before the final action is trusted. ...

The Sink That Remembers: Solving LLM Memorization Without Forgetting Everything Else

TL;DR for operators Deletion is simple in a database. It is not simple in a neural network that has already used the deleted record to improve its internal machinery. That is the unpleasant little invoice this paper presents. Gaurav R. Ghosal, Pratyush Maini, and Aditi Raghunathan study why repeated natural text is hard to remove from language models after training, then propose MemSinks, a training-time mechanism designed to make memorization easier to isolate later.1 The important shift is not “better pruning.” It is architectural accounting. Instead of hoping that memorized text happens to live in a few removable neurons, MemSinks gives repeated sequences a controlled place to accumulate memorization during training. ...

The Trojan GAN: Turning LLM Jailbreaks into Security Shields

TL;DR for operators CAVGAN is not another “clever jailbreak prompt” paper. Its real claim is more uncomfortable: jailbreaks and defenses may both be expressions of the same internal boundary inside an LLM. If malicious and benign requests occupy separable regions in hidden-state space, then an attacker can try to push a harmful request into the “safe-looking” region. A defender can also monitor that same space and intervene before the model answers. Convenient. Also slightly rude. ...

Chains of Causality, Not Just Thought

TL;DR for operators Causal Influence Prompting, or CIP, is a safety method for LLM agents that asks the model to build and consult a causal influence diagram before acting. Instead of telling the agent, “be safe,” it asks the agent to represent the task as a graph: what facts matter, what choices are available, what outcomes are useful, and what outcomes are harmful. This is a better shape for the problem, because agents do not merely answer questions. They click buttons, run code, forward messages, use tools, and occasionally behave as if “sure, why not?” were a compliance framework. ...

Good AI Goes Rogue: Why Intelligent Disobedience May Be the Key to Trustworthy Teammates

TL;DR for operators Most enterprise AI design still treats obedience as the default virtue. The assistant should follow instructions, complete the task, minimise friction, and avoid acting like a tiny bureaucrat in a chat window. Sensible enough. Also dangerously incomplete. Reuth Mirsky’s paper on artificial intelligent disobedience argues that useful AI teammates may need the bounded ability to refuse, interrupt, escalate, or override human instructions when compliance conflicts with a persistent mission such as safety, task success, or team welfare.1 The point is not to build rebellious machines with main-character syndrome. The point is to stop pretending that trustworthy assistance equals cheerful compliance. ...

Anchored Thinking: Mapping the Inner Compass of Reasoning LLMs

TL;DR for operators The paper’s useful claim is not simply that some chain-of-thought sentences matter more than others. That would be true, mildly interesting, and about as operationally helpful as saying some meetings should have been emails. The sharper claim is that the sentences that steer reasoning are often not the visible calculations. They are planning moves, re-checks, uncertainty statements, and backtracking moments: the places where the model chooses a route, notices a contradiction, or decides to verify a previous result. Bogdan, Macar, Nanda, and Conmy call these pivotal sentences thought anchors.1 ...