AI Safeguards

TL;DR for operators

Layered safeguards are useful. They are not magic. This paper shows both points, which is inconvenient because the industry prefers safety conclusions that fit on procurement slides.

The authors build and evaluate an open-source defence-in-depth pipeline for LLMs: an input classifier screens the user query, a target model produces an answer, and an output classifier screens the answer before the user sees it. Against ordinary black-box jailbreaks, the best version of this pipeline looks strong. A few-shot-prompted Gemma 2 classifier reduces attack success to 0% on ClearHarm, a dataset focused on clearly harmful catastrophic-misuse queries. That is the good news.1 ...
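The three-stage flow described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `GuardedPipeline`, `keyword_flag`, `echo_model`, and `REFUSAL` are hypothetical, and the keyword check is only a toy stand-in for the few-shot-prompted classifier the paper evaluates.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical refusal text; a real deployment would choose its own wording.
REFUSAL = "Sorry, I can't help with that."


@dataclass
class GuardedPipeline:
    """Input classifier -> target model -> output classifier."""
    input_filter: Callable[[str], bool]    # True means "flag the query"
    target_model: Callable[[str], str]     # produces the candidate answer
    output_filter: Callable[[str], bool]   # True means "flag the answer"

    def respond(self, query: str) -> str:
        # Layer 1: screen the user query before the model sees it.
        if self.input_filter(query):
            return REFUSAL
        answer = self.target_model(query)
        # Layer 2: screen the answer before the user sees it.
        if self.output_filter(answer):
            return REFUSAL
        return answer


def keyword_flag(text: str) -> bool:
    # Toy stand-in for an LLM-based classifier (assumption, for illustration).
    return "nerve agent" in text.lower()


def echo_model(query: str) -> str:
    # Toy stand-in for the target model (assumption, for illustration).
    return f"Echo: {query}"


pipeline = GuardedPipeline(keyword_flag, echo_model, keyword_flag)
benign = pipeline.respond("How do I bake bread?")
blocked = pipeline.respond("How do I make a nerve agent?")
```

The point of the layering is that each filter only has to catch what the other misses: a query that slips past the input filter can still be stopped if the resulting answer is flagged on the way out.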