
The Trojan GAN: Turning LLM Jailbreaks into Security Shields

TL;DR for operators: CAVGAN is not another “clever jailbreak prompt” paper. Its real claim is more uncomfortable: jailbreaks and defenses may both be expressions of the same internal boundary inside an LLM. If malicious and benign requests occupy separable regions in hidden-state space, then an attacker can try to push a harmful request into the “safe-looking” region. A defender can also monitor that same space and intervene before the model answers. Convenient. Also slightly rude. ...

July 9, 2025 · 15 min · Zelina

Agents Under Siege: How LLM Workflows Invite a New Breed of Cyber Threats

TL;DR for operators: A support agent reads a customer email. It checks a CRM record. It calls a refund API. It writes a note into long-term memory. It asks another agent to verify policy. Somewhere in that chain, a malicious instruction hides inside a message, document, issue tracker entry, retrieved snippet, schema, or tool response. The model does not need to become “evil”. It only needs to be helpful in the wrong direction. ...

July 1, 2025 · 16 min · Zelina

Swiss Cheese for Superintelligence: How STACK Reveals the Fragility of LLM Safeguards

TL;DR for operators: Layered safeguards are useful. They are not magic. This paper shows both points, which is inconvenient because the industry prefers safety conclusions that fit on procurement slides. The authors build and evaluate an open-source defence-in-depth pipeline for LLMs: an input classifier screens the user query, a target model produces an answer, and an output classifier screens the answer before the user sees it. Against ordinary black-box jailbreaks, the best version of this pipeline looks strong. A few-shot-prompted Gemma 2 classifier reduces attack success to 0% on ClearHarm, a dataset focused on clearly harmful catastrophic-misuse queries. That is the good news. ...

July 1, 2025 · 20 min · Zelina

Traces of War: Surviving the LLM Arms Race

TL;DR for operators: Reasoning traces are useful. That is the problem. When a frontier reasoning model shows its work, it gives customers more confidence, gives developers more debuggability, and gives downstream applications a richer interface than a bare answer. It also gives competitors and opportunistic scrapers a training asset. The trace is not just an explanation; it is labelled behavioural data from an expensive model. Very polite leakage, in other words. ...

April 19, 2025 · 18 min · Zelina