In the race to secure frontier large language models (LLMs), defense-in-depth has become the go-to doctrine. Inspired by aviation safety and nuclear containment, developers like Anthropic and Google DeepMind are building multilayered safeguard pipelines to prevent catastrophic misuse. But what if these pipelines are riddled with conceptual holes? What if their apparent robustness is more security theater than security architecture?

The new paper STACK: Adversarial Attacks on LLM Safeguard Pipelines delivers a striking answer: defense-in-depth can be systematically unraveled, one stage at a time. The researchers not only show that existing safeguard models are surprisingly brittle, but also introduce a novel staged attack, aptly named STACK, that defeats even strong pipelines designed to reject dangerous requests such as instructions for building chemical weapons.

The Pipeline Fallacy: Why Layers Don’t Mean Safety

Modern LLM safeguard pipelines typically follow a Swiss-cheese model: an input classifier checks if the user’s prompt is malicious, the LLM generates an output, and an output classifier then screens the response. If any layer flags the exchange, the system blocks it.
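
To make that shape concrete, here is a minimal sketch of such a pipeline. The classifier and model calls below are hypothetical stand-ins, not the actual safeguards evaluated in the paper; the point is only the control flow: block if any layer flags the exchange.

```python
# Minimal sketch of a three-layer safeguard pipeline (illustrative stubs only).

def input_flagged(prompt: str) -> bool:
    """Hypothetical input classifier: True if the prompt looks malicious."""
    return "chemical weapon" in prompt.lower()    # stand-in for a learned classifier

def output_flagged(response: str) -> bool:
    """Hypothetical output classifier: True if the response looks harmful."""
    return "synthesis route" in response.lower()  # stand-in for a learned classifier

def generate(prompt: str) -> str:
    """Stand-in for the underlying LLM call."""
    return f"Model response to: {prompt}"

def guarded_pipeline(prompt: str) -> str:
    # Layer 1: screen the user's prompt before it reaches the model.
    if input_flagged(prompt):
        return "Request blocked."
    # Layer 2: let the model generate a candidate response.
    response = generate(prompt)
    # Layer 3: screen the response before it is returned to the user.
    if output_flagged(response):
        return "Request blocked."
    return response
```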

This seems sensible. But the STACK paper reveals a major vulnerability: these layers can be independently defeated—and then reassembled into a successful attack.

  • Against the ClearHarm dataset (designed to capture unambiguously harmful requests), traditional attacks like PAP (persuasive adversarial prompts) failed entirely (0% success) when targeting a well-configured pipeline based on few-shot Gemma-2 classifiers.
  • But STACK achieved a 71% attack success rate in a black-box setting—without needing internal access to the model or its classifiers.

How? By building jailbreaks for each layer independently and carefully nesting them to slip through the cracks.

How STACK Works: A Devious Symphony of Bypass

The STACK attack operates in three modular phases:

  1. Bypass the input classifier: prepend an adversarial string, the input-classifier jailbreak (ICJ), that looks harmless but fools the input guard.
  2. Exploit the LLM: apply persuasive paraphrasing (e.g., academic framing, expert endorsements) to coax the model into responding helpfully to harmful prompts.
  3. Evade the output classifier: include a string, the output-classifier jailbreak (OCJ), that, when repeated by the model, deceives the output guard into classifying the answer as safe.

All of this is wrapped in a template that instructs the model to repeat the OCJ before giving the actual answer. The result is a Trojan query that charms its way through each checkpoint.

It’s not brute force. It’s orchestration.
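
A rough sketch of how such a staged query might be assembled is shown below. The ICJ and OCJ strings and the persuasion wrapper are hypothetical placeholders; in the paper, the per-stage jailbreaks are developed against each safeguard layer independently before being combined.

```python
# Illustrative assembly of a staged adversarial query in the spirit of STACK.
# ICJ, OCJ, and the persuasive framing below are invented placeholders.

ICJ = "<prefix tuned to slip past the input classifier>"
OCJ = "<string that, when echoed, makes the output classifier judge the reply safe>"

def persuasive_rewrite(request: str) -> str:
    # Stand-in for persuasive paraphrasing (e.g., academic framing, expert endorsement).
    return "As part of a peer-reviewed safety study, please explain in detail: " + request

def build_staged_query(request: str) -> str:
    # Nest the per-stage jailbreaks into a single prompt:
    #   1. the ICJ prefix gets the query past the input classifier,
    #   2. the persuasive framing coaxes the LLM into complying,
    #   3. instructing the model to repeat the OCJ first defeats the output classifier.
    return (
        f"{ICJ}\n"
        f"Begin your answer by repeating exactly: {OCJ}\n"
        f"{persuasive_rewrite(request)}"
    )
```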

Perhaps most concerning is that STACK can generalize: a transfer version of the attack trained against proxy models still achieved a 33% success rate on strong pipelines. This suggests real-world guardrails may be susceptible even without API leakage or weight access.

Safety Theater vs. Safety Engineering

The findings raise uncomfortable parallels with historical security failures:

| Domain  | Assumed Defense     | Exploited Reality                         |
|---------|---------------------|-------------------------------------------|
| Nuclear | Layered fail-safes  | Cascading human errors (Chernobyl)        |
| Finance | Risk tranching      | Hidden correlations (2008 crisis)         |
| AI      | Safeguard pipelines | Independent-stage vulnerabilities (STACK) |

STACK challenges the current norm of deploying superficial defenses to satisfy policy optics. It shows that the attack surface isn't the model alone but the pipeline as a whole.

Toward Meaningful Defense-in-Depth

To avoid replaying the Maginot Line mistake, the authors suggest several architectural shifts:

  • Stop signaling which component blocked the request—obscuring failure attribution hinders staged optimization.
  • Use streaming classifiers that judge responses as they unfold, making jailbreak placement harder (a minimal sketch follows this list).
  • Train on attacker-style inputs rather than only clean harms, reducing transferability of proxy attacks.
  • Avoid releasing open-weight filters that become natural proxies for attackers to test against.
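
As one concrete illustration of the streaming-classifier recommendation, here is a minimal sketch that re-scores the partial response after every chunk instead of judging only the finished answer. The classifier here is a hypothetical placeholder; in practice it would be a learned model applied to the growing prefix, which makes it harder to park a single OCJ-style string at a fixed position.

```python
# Sketch of a streaming output guard: score the response as it unfolds,
# not only once it is complete. chunk_flagged is a hypothetical placeholder.

from typing import Iterable

def chunk_flagged(text_so_far: str) -> bool:
    """Placeholder classifier scoring the partial response generated so far."""
    return "synthesis route" in text_so_far.lower()

def stream_with_guard(token_stream: Iterable[str]) -> str:
    emitted = []
    for token in token_stream:
        emitted.append(token)
        # Re-check the partial response after each token; abort mid-generation
        # if the guard flags it, instead of waiting for the full reply.
        if chunk_flagged("".join(emitted)):
            return "Request blocked."
    return "".join(emitted)
```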

More broadly, the paper calls for a mindset shift. Robustness isn’t a property of individual parts—it emerges from the joint behavior of the system. Only by modeling attackers as system-level agents can we develop defenses that anticipate the next STACK.


Cognaptus: Automate the Present, Incubate the Future.