LLM Safety

Competency Gaps: When Benchmarks Lie by Omission

Scores are comforting. That is their main commercial advantage. A vendor can say its model reaches a certain accuracy on a benchmark, a leaderboard can rank systems neatly, and an internal AI team can report that the new model is “better” than the old one. Everyone gets a number. The procurement slide looks tidy. The risk committee, if mercifully sleepy, moves on. ...

When Guardrails Learn from the Shadows

Labels are expensive. Safety labels are worse. A normal classification project asks annotators to decide whether a customer complaint is urgent, whether a product photo contains a defect, or whether a support ticket belongs to billing. Annoying, yes. Existentially unpleasant, usually no. LLM safety moderation is different. The training examples may include malicious requests, jailbreak attempts, harmful advice, unsafe responses, and edge cases where intent is deliberately hidden under polite phrasing. The annotator must not only read the text but understand what the user is trying to make the model do. In other words, the expensive part is not clicking “safe” or “unsafe.” The expensive part is detecting intent when the user has carefully wrapped it in bubble wrap. ...

Reading the Room? Apparently Not: When LLMs Miss Intent

A user sounds distressed. They ask a factual question. The assistant responds warmly, offers supportive resources, and then supplies the requested information in crisp, well-organized detail. That is the failure pattern. Not because the model was rude. Not because it ignored crisis language. Not because it forgot to add a disclaimer. The problem is more uncomfortable: the model noticed enough to sound caring, but not enough to change what it was willing to provide. ...

Prompt, Probe, Persist: How Multi‑Turn RL Is Rewriting the Jailbreak Playbook

A chatbot rarely fails all at once. In production, failure is usually more boring than cinema. A user asks something borderline. The model refuses. The user rephrases. The model gives a harmless explanation. The user narrows the topic. The model follows the conversation. Then, several turns later, the assistant provides content it should not have provided. No thunder. No villain monologue. Just an interaction history doing what interaction histories do: accumulating context. ...

Uncertainty, But Make It Clinical: How MedBayes‑Lite Teaches LLMs to Say 'I Might Be Wrong'

A hospital does not need a chatbot that sounds certain. It needs a system that knows when certainty would be irresponsible. That sounds obvious until one remembers how most AI demos behave: fluent answer first, caveat somewhere after the damage has already put on shoes. In clinical decision support, this is not a stylistic defect. It is an operating risk. A model can be wrong in many ways, but the most dangerous version is the confidently wrong one: the triage answer that should have been escalated, the medication suggestion that should have been checked, the risk score that looks clean only because the system has no vocabulary for doubt. ...

Value Collision Course: When LLM Alignment Plays Favorites

A support chatbot does not wake up one morning with a worldview. It gets one, slowly, through the dull machinery of product decisions: who labels the data, how many options they can choose from, whether disagreement is kept or ironed flat, and which optimization method gets the privilege of turning messy human judgement into model behaviour. ...

Refusal, Rewired: Why One Safety Direction Isn’t Enough

Safety teams like switches. They are easy to name, easy to diagram, and easy to pretend are under control. For language models, “refusal” has often been treated with roughly that mental model. A harmful prompt enters. Somewhere inside the model, a refusal feature lights up. The model says no. If researchers can identify the feature, they can study it, steer it, strengthen it, or—less comfortably—remove it. ...

Patch, Don’t Preach: The Coming Era of Modular AI Safety

A patch is not a sermon. That distinction matters, because enterprise AI safety has spent too much time sounding like moral philosophy and too little time behaving like maintenance engineering. A deployed model develops a toxicity problem. A customer discovers a jailbreak route. A regulator changes the acceptable boundary for refusal. The usual answer is some combination of “wait for the next model release,” “fine-tune a new variant,” or “wrap it in another brittle instruction.” Very comforting. Also not exactly what one wants when the system is already in production. ...

Paths > Outcomes: Measuring Agent Quality Beyond the Final State

A calendar assistant creates the right meeting. A compliance agent files the right flag. A robotic controller moves the right object. Everyone applauds, because the final state is correct. Then someone checks the logs. The calendar assistant created, deleted, recreated, and re-notified the same meeting. The compliance agent skipped the required policy check and jumped straight to enforcement. The robot got the object into place only after executing a step that would have been unsafe if the power had cut out halfway through. The destination was fine. The route was a mess. In enterprise automation, this is not a philosophical distinction. It is the difference between “the demo worked” and “legal now wants a meeting.” ...

Map Before You Train: Data Cartography to Defuse LLM Memorization

TL;DR for operators: Training data does not become risky only after a model has memorised it. It often leaves signals while training is still happening. That is the useful idea behind Generative Data Cartography, or GenDataCarto: track how each pretraining sample behaves during early training, then use that behaviour to decide which data should be kept, up-sampled, down-weighted, or removed.[1] The method uses two signals. The first is early loss, which approximates how difficult a sample is. The second is the frequency of “forget events”, where a sample appears learned and later becomes poorly fitted again. In the paper’s framing, frequent forget events are not just training noise. They are a warning that a sample may be unusually influential, repeatedly re-entering the model’s attention like a guest who refuses to leave the meeting. ...
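To make the two signals concrete, here is a minimal sketch in Python. It is not the paper's implementation. It assumes you have already logged a per-sample loss at each epoch, and every name in it (`early_loss`, `count_forget_events`, `suggest_action`), every threshold, and the mapping from signals to actions are illustrative assumptions chosen for readability.

```python
from statistics import mean

# Hypothetical thresholds, chosen only for illustration.
EARLY_EPOCHS = 3          # how many initial epochs count as "early"
LEARNED_THRESHOLD = 1.0   # loss at or below this counts as "learned"
HIGH_LOSS = 2.0           # early loss above this counts as "hard"
MANY_FORGETS = 2          # forget events at or above this count as "frequent"


def early_loss(losses, early_epochs=EARLY_EPOCHS):
    """Mean loss over the first few epochs: a rough proxy for sample difficulty."""
    return mean(losses[:early_epochs])


def count_forget_events(losses, threshold=LEARNED_THRESHOLD):
    """Count transitions from 'learned' (loss <= threshold) to 'not learned'."""
    events = 0
    for prev, curr in zip(losses, losses[1:]):
        if prev <= threshold and curr > threshold:
            events += 1
    return events


def suggest_action(losses):
    """Map the two signals to one of the four actions named in the text.

    The quadrant-to-action mapping below is an assumption for this sketch,
    not a rule taken from the paper.
    """
    hard = early_loss(losses) > HIGH_LOSS
    unstable = count_forget_events(losses) >= MANY_FORGETS
    if unstable and hard:
        return "remove"        # hard and repeatedly forgotten
    if unstable:
        return "down-weight"   # easy, yet keeps being forgotten and re-learned
    if hard:
        return "up-sample"     # hard but stable once learned
    return "keep"              # easy and stable


# Toy per-epoch loss histories keyed by a made-up sample id.
example_history = {
    "stable_easy": [0.5, 0.4, 0.3, 0.3, 0.2],
    "stable_hard": [3.0, 2.5, 2.6, 1.5, 0.9],
    "flaky_easy": [0.8, 1.5, 0.7, 1.4, 0.6, 1.2],
    "flaky_hard": [3.0, 0.9, 3.2, 0.8, 2.9],
}

for sample_id, losses in example_history.items():
    print(sample_id, count_forget_events(losses), suggest_action(losses))
```

In practice the thresholds would need tuning against the actual loss scale, and the "learned" test could just as well be based on accuracy or a per-token criterion; the point here is only that both signals can be computed from logs gathered during ordinary training.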