LLM Security

When the Paper Talks Back: Lost in Translation, Rejected by Design

A PDF is supposed to sit quietly. It may contain claims, equations, tables, and occasionally an appendix long enough to test a reviewer’s commitment to science. It is not supposed to negotiate with the system judging it. That assumption becomes unreliable once a document enters an LLM-based workflow. To the human reader, a sentence rendered in white text may be invisible. To a text-extraction pipeline, it can remain perfectly legible—and potentially indistinguishable from an instruction the model is expected to follow. ...

Judo, Not Armor: Strategic Deflection as a New Defense Against LLM Jailbreaks

TL;DR for operators Most LLM safety systems still assume that, when a model sees a harmful request, the correct behaviour is refusal. That works until the attacker stops arguing with the prompt and starts interfering with generation itself. The paper behind this article, Strategic Deflection: Defending LLMs from Logit Manipulation, proposes SDeflection: a fine-tuning method that teaches a model to answer in a safe, topic-adjacent way rather than relying only on explicit refusal language.1 The model does not provide harmful instructions. It redirects the subject toward harmless information that is close enough to the original topic to survive attacks that try to force compliance-style openings. ...

The Trojan GAN: Turning LLM Jailbreaks into Security Shields

TL;DR for operators CAVGAN is not another “clever jailbreak prompt” paper. Its real claim is more uncomfortable: jailbreaks and defenses may both be expressions of the same internal boundary inside an LLM. If malicious and benign requests occupy separable regions in hidden-state space, then an attacker can try to push a harmful request into the “safe-looking” region. A defender can also monitor that same space and intervene before the model answers. Convenient. Also slightly rude. ...

Agents Under Siege: How LLM Workflows Invite a New Breed of Cyber Threats

TL;DR for operators A support agent reads a customer email. It checks a CRM record. It calls a refund API. It writes a note into long-term memory. It asks another agent to verify policy. Somewhere in that chain, a malicious instruction hides inside a message, document, issue tracker entry, retrieved snippet, schema, or tool response. The model does not need to become “evil”. It only needs to be helpful in the wrong direction. ...

Swiss Cheese for Superintelligence: How STACK Reveals the Fragility of LLM Safeguards

TL;DR for operators Layered safeguards are useful. They are not magic. This paper shows both points, which is inconvenient because the industry prefers safety conclusions that fit on procurement slides. The authors build and evaluate an open-source defence-in-depth pipeline for LLMs: an input classifier screens the user query, a target model produces an answer, and an output classifier screens the answer before the user sees it. Against ordinary black-box jailbreaks, the best version of this pipeline looks strong. A few-shot-prompted Gemma 2 classifier reduces attack success to 0% on ClearHarm, a dataset focused on clearly harmful catastrophic-misuse queries. That is the good news.1 ...

Traces of War: Surviving the LLM Arms Race

TL;DR for operators Reasoning traces are useful. That is the problem. When a frontier reasoning model shows its work, it gives customers more confidence, gives developers more debuggability, and gives downstream applications a richer interface than a bare answer. It also gives competitors and opportunistic scrapers a training asset. The trace is not just an explanation; it is labelled behavioural data from an expensive model. Very polite leakage, in other words. ...