Adversarial Robustness

The Refusal Rate That Refuses to Reassure

TL;DR for operators: The reassuring headline is that both evaluated frontier models rejected most automated jailbreak attempts. The operationally useful headline is that they still produced 1,620 and 702 panel-confirmed harmful completions, respectively, across every top-level harm category in the benchmark.[1] The strongest adaptive attack succeeded on 11.51% of attempts against one model and 6.10% against the other. Static encodings and familiar jailbreak templates, by contrast, were almost entirely neutralised. ...

The Path of Least Assurance: Why AI Reliability Lives Between the Steps

TL;DR for operators: AI reliability is increasingly a process problem, not an answer-checking problem. Three recent arXiv papers make that point from very different angles. MoCo-EA shows that adversarial examples are not merely isolated malicious pixels lurking in the shrubbery; they can lie along continuous, optimisable paths.[1] ConceptAgent shows that erasing a concept from a diffusion model may disrupt the early text-to-image link while leaving later trajectory dynamics available for concept re-entry.[2] BlueFin shows that LLM agents doing finance spreadsheet work fail in ways that only appear when you inspect formulas, recalculation behaviour, workbook mutations, tool choices, and whether the output helps a human analyst do useful work.[3] ...

Reasoning Under Pressure: When Smart Models Second-Guess Themselves

A customer challenges the answer. Not with new evidence. Not with a better calculation. Just with one of those tiny conversational needles: “Are you sure?” Or worse: “Most people disagree with this.” Or the classic office-friendly version: “As an expert, I’m confident you are wrong.” A human analyst might pause, check the source, and decide whether the objection contains actual information. A large reasoning model may also pause. It may even produce several polished paragraphs of careful reconsideration. Then, occasionally, it abandons the correct answer. ...

SAFE Enough to Think: Federated Learning Comes for Your Brain

Hospitals do not usually wake up excited to pool brain data. Neither do device vendors, rehabilitation centers, or anyone with a lawyer who has read a privacy regulation without falling asleep halfway through. EEG data is useful precisely because it is personal. That is also why centralizing it is awkward. This is the practical tension behind SAFE, short for Secure and Accurate Federated Learning, a proposed framework for EEG-based brain-computer interfaces, or BCIs.[1] The paper is not interesting because it says “federated learning protects privacy.” That line has already been printed on enough PowerPoint slides to qualify as industrial wallpaper. The interesting part is that the authors treat federated learning as only one piece of the problem. ...

Scale Fail: How Downsampling Becomes an Adversarial Backdoor for VLMs

Resize. It is one of those engineering verbs that sounds too boring to threaten anyone. A user uploads a screenshot, invoice, inspection photo, interface capture, medical form, or product image. The system resizes it. The model reads it. The workflow moves on. ...