Adversarial AI

Mind the Drift: Why Stateful AI Guardrails Beat Bigger Models

A chatbot rarely fails in one clean, dramatic explosion. More often, it is nudged. First, the user asks for a harmless explanation. Then a role-play frame. Then a historical analogy. Then a translation. Then a “purely fictional” operational detail. By the time the final request arrives, the model has already been walked across the room. The last prompt is not the attack. It is the receipt. ...

Judo, Not Armor: Strategic Deflection as a New Defense Against LLM Jailbreaks

TL;DR for operators: Most LLM safety systems still assume that, when a model sees a harmful request, the correct behaviour is refusal. That works until the attacker stops arguing with the prompt and starts interfering with generation itself. The paper behind this article, Strategic Deflection: Defending LLMs from Logit Manipulation, proposes SDeflection: a fine-tuning method that teaches a model to answer in a safe, topic-adjacent way rather than relying only on explicit refusal language.[1] The model does not provide harmful instructions. It redirects the subject toward harmless information that is close enough to the original topic to survive attacks that try to force compliance-style openings. ...

Outrun the Herd, Not the Lion: A Smarter AI Strategy for Business Games

TL;DR for operators: Search-contempt is not “AI plays worse so it learns more”. That would be the lazy interpretation, and business strategy already has enough lazy interpretations wearing expensive shoes. The paper introduces a hybrid MCTS method for AlphaZero-like self-play systems. It behaves like standard PUCT search for the player to move, but at opponent nodes it eventually freezes the opponent’s visit distribution after a threshold, $N_{scl}$, and samples from that frozen distribution rather than constantly updating it toward stronger play.[1] The effect is subtle but important: the system stops assuming the opponent will always improve its response with more search. ...
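
The selection rule described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the names `Node`, `puct_select`, `select_move`, and `c_puct` are hypothetical, the PUCT formula is one common variant, and child values are assumed to be stored from the perspective of whoever chooses at the parent node. The parameter `n_scl` stands in for the threshold $N_{scl}$.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """Hypothetical search-tree node; field names are illustrative only."""
    prior: float = 1.0
    visits: int = 0
    value_sum: float = 0.0
    children: Dict[str, "Node"] = field(default_factory=dict)
    # Snapshot of child visit counts, taken once at an opponent node.
    frozen: Optional[Dict[str, int]] = None

    def q(self) -> float:
        # Mean value of this node, or 0.0 if it has never been visited.
        return self.value_sum / self.visits if self.visits else 0.0


def puct_select(node: Node, c_puct: float = 1.5) -> str:
    """Pick the child move with the highest Q + U score (one common PUCT form)."""
    total = sum(child.visits for child in node.children.values())

    def score(item):
        _, child = item
        u = c_puct * child.prior * math.sqrt(total + 1) / (1 + child.visits)
        return child.q() + u

    return max(node.children.items(), key=score)[0]


def select_move(node: Node, is_opponent: bool, n_scl: int,
                rng: random.Random, c_puct: float = 1.5) -> str:
    """Use PUCT for the player to move; at opponent nodes, switch to sampling
    from a frozen visit distribution once the child visits reach n_scl."""
    if n_scl < 1:
        raise ValueError("n_scl must be at least 1 in this sketch")
    total = sum(child.visits for child in node.children.values())
    if not is_opponent or total < n_scl:
        return puct_select(node, c_puct)
    if node.frozen is None:
        # Freeze once; later visits no longer reshape the opponent's choices.
        node.frozen = {m: child.visits for m, child in node.children.items()}
    moves = list(node.frozen)
    weights = [node.frozen[m] for m in moves]
    return rng.choices(moves, weights=weights, k=1)[0]
```

In this sketch, the snapshot is what encodes the idea that the opponent does not keep improving: once taken, additional search below the node changes the player's estimates but not the distribution the opponent's replies are drawn from.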