Prompt, Probe, Persist: How Multi‑Turn RL Is Rewriting the Jailbreak Playbook

A chatbot rarely fails all at once. In production, failure is usually more boring than cinema. A user asks something borderline. The model refuses. The user rephrases. The model gives a harmless explanation. The user narrows the topic. The model follows the conversation. Then, several turns later, the assistant provides content it should not have provided. No thunder. No villain monologue. Just an interaction history doing what interaction histories do: accumulating context. ...

December 9, 2025 · 14 min · Zelina

Uncertainty, But Make It Clinical: How MedBayes‑Lite Teaches LLMs to Say 'I Might Be Wrong'

A hospital does not need a chatbot that sounds certain. It needs a system that knows when certainty would be irresponsible. That sounds obvious until one remembers how most AI demos behave: fluent answer first, caveat somewhere after the damage has already put on shoes. In clinical decision support, this is not a stylistic defect. It is an operating risk. A model can be wrong in many ways, but the most dangerous version is the confidently wrong one: the triage answer that should have been escalated, the medication suggestion that should have been checked, the risk score that looks clean only because the system has no vocabulary for doubt. ...

November 22, 2025 · 16 min · Zelina

Value Collision Course: When LLM Alignment Plays Favorites

A support chatbot does not wake up one morning with a worldview. It gets one, slowly, through the dull machinery of product decisions: who labels the data, how many options they can choose from, whether disagreement is kept or ironed flat, and which optimization method gets the privilege of turning messy human judgement into model behaviour. ...

November 20, 2025 · 14 min · Zelina

Refusal, Rewired: Why One Safety Direction Isn’t Enough

Safety teams like switches. They are easy to name, easy to diagram, and easy to pretend are under control. For language models, “refusal” has often been treated with roughly that mental model. A harmful prompt enters. Somewhere inside the model, a refusal feature lights up. The model says no. If researchers can identify the feature, they can study it, steer it, strengthen it, or—less comfortably—remove it. ...

November 15, 2025 · 17 min · Zelina

Patch, Don’t Preach: The Coming Era of Modular AI Safety

A patch is not a sermon. That distinction matters, because enterprise AI safety has spent too much time sounding like moral philosophy and too little time behaving like maintenance engineering. A deployed model develops a toxicity problem. A customer discovers a jailbreak route. A regulator changes the acceptable boundary for refusal. The usual answer is some combination of “wait for the next model release,” “fine-tune a new variant,” or “wrap it in another brittle instruction.” Very comforting. Also not exactly what one wants when the system is already in production. ...

November 12, 2025 · 18 min · Zelina

Paths > Outcomes: Measuring Agent Quality Beyond the Final State

A calendar assistant creates the right meeting. A compliance agent files the right flag. A robotic controller moves the right object. Everyone applauds, because the final state is correct. Then someone checks the logs. The calendar assistant created, deleted, recreated, and re-notified the same meeting. The compliance agent skipped the required policy check and jumped straight to enforcement. The robot got the object into place only after executing a step that would have been unsafe if the power had cut out halfway through. The destination was fine. The route was a mess. In enterprise automation, this is not a philosophical distinction. It is the difference between “the demo worked” and “legal now wants a meeting.” ...

October 2, 2025 · 15 min · Zelina

Map Before You Train: Data Cartography to Defuse LLM Memorization

TL;DR for operators: Training data does not become risky only after a model has memorised it. It often leaves signals while training is still happening. That is the useful idea behind Generative Data Cartography, or GenDataCarto: track how each pretraining sample behaves during early training, then use that behaviour to decide which data should be kept, up-sampled, down-weighted, or removed.[1] The method uses two signals. The first is early loss, which approximates how difficult a sample is. The second is the frequency of “forget events”, where a sample appears learned and later becomes poorly fitted again. In the paper’s framing, frequent forget events are not just training noise. They are a warning that a sample may be unusually influential, repeatedly re-entering the model’s attention like a guest who refuses to leave the meeting. ...

September 4, 2025 · 16 min · Zelina

Unsafe at Any Bit: Patching the Safety Gaps in Quantized LLMs

TL;DR for operators: Quantizing an LLM is not a harmless cost-saving step. It changes the model, and the paper analysed here shows that those changes can weaken safety even when familiar utility scores still look respectable. That is the uncomfortable part: the dashboard can say “performance preserved” while the model has become more willing to comply with harmful requests. Very efficient. Very modern. Very easy to miss. ...

June 26, 2025 · 20 min · Zelina