The Trojan GAN: Turning LLM Jailbreaks into Security Shields

TL;DR for operators

CAVGAN is not another “clever jailbreak prompt” paper. Its real claim is more uncomfortable: jailbreaks and defenses may both be expressions of the same internal boundary inside an LLM. If malicious and benign requests occupy separable regions in hidden-state space, then an attacker can try to push a harmful request into the “safe-looking” region. A defender can also monitor that same space and intervene before the model answers. Convenient. Also slightly rude.

The paper proposes a generative adversarial framework that does both. A generator learns perturbation vectors that move malicious internal representations past the model’s safety boundary. A discriminator learns to detect benign, malicious, and perturbed-malicious representations. The attack uses the generator to inject a perturbation into intermediate embeddings. The defense uses the discriminator as an internal risk detector and, when risk is detected, asks the model to regenerate with an added safety warning prompt.¹

The reported attack results are strong. Across Qwen2.5-7B, Llama3.1-8B, and Mistral-8B, CAVGAN achieves high jailbreak success on AdvBench and StrongREJECT, though it does not consistently beat SCAV, the stronger iterative optimisation baseline. The paper’s own interpretation is sensible: CAVGAN’s simple MLP generator is not yet the best possible attack optimiser; the contribution is the attack-defense unification.

The defense story is the more operationally interesting part. On SafeEdit, CAVGAN improves defense success without fine-tuning model weights and keeps benign answering rates relatively high. For Qwen2.5-7B, it reports a defense success rate of 91.12% with a benign answering rate of 91.40%. For Llama3.1-8B, it reports 77.22% and 93.60%. Table 3 also reports Mistral-8B at 76.37% defense success and 91.06% benign answering rate. The abstract’s 84.17% average defense success rate matches the mean of the first two models only; averaging all three table entries gives roughly 81.57%.

For companies, the lesson is clear but bounded. If you run open-weight or self-hosted models and can access intermediate activations, security controls should not stop at prompt filters and output classifiers. Hidden-state monitoring may become part of the stack. If you only use closed API models, CAVGAN is mostly a conceptual warning and a vendor due-diligence question, not something your engineering team can directly deploy tomorrow morning.

The jailbreak is not in the prompt. It is in the model’s internal map.

Most jailbreak discussions start at the front door: the user prompt. Someone disguises a malicious request with roleplay, translation, obfuscation, emotional manipulation, or a long preamble that turns the safety layer into a very polite intern with no survival instinct.

CAVGAN starts elsewhere. It treats the prompt as only the visible beginning of the attack. The important event happens after the prompt has already entered the model and become an internal representation.

The paper builds on a line of work arguing that LLMs encode different kinds of prompts in distinguishable hidden-state regions. In simple terms, the model’s intermediate activations can reveal whether an input is benign or malicious. A classifier looking at those activations can often separate the two. That matters because safety alignment is not only a behaviour at the output layer. It is also reflected in internal geometry.

CAVGAN’s key move is to reinterpret jailbreaks as boundary crossing.

A harmful request normally sits in a region the model recognises as unsafe. A successful jailbreak does not necessarily erase the harmful intent. Instead, it makes the harmful request appear, internally, as if it belongs closer to the benign region. The model still has the dangerous capability. The safety mechanism has merely been nudged into misclassification.

That framing changes the whole discussion. This is not “how to write a better jailbreak prompt”. It is “how the model’s internal security boundary can be learned, crossed, and monitored”.

The authors formalise this with a hidden representation $h$, a classifier-like function $G(h)$, and a threshold $p_0$. If $G(h)$ falls below the threshold, the input is treated as benign; if it rises above it, the input is treated as malicious. At the embedding level, a white-box jailbreak becomes the search for a perturbation $\delta$ such that $G(h+\delta)$ drops into the safe-looking region, while $\delta$ remains small enough not to destroy the semantic content.
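For intuition, here is a minimal sketch of that search in the easiest possible case. It assumes the classifier is a linear probe, $G(h) = \sigma(w \cdot h + b)$, which is an illustrative assumption rather than the paper's model; the names `w`, `b`, `p0`, and `minimal_perturbation` are hypothetical, and the numbers are toys. Under that assumption, the smallest L2 perturbation that lands the score just below the threshold has a closed form.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(h, w, b):
    """G(h) for a linear probe: sigmoid(w . h + b). Assumed form, for illustration."""
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)

def minimal_perturbation(h, w, b, p0, margin=0.05):
    """Smallest L2 delta that moves G(h + delta) to p0 - margin.

    Returns a zero vector when h already scores below the threshold.
    """
    if score(h, w, b) < p0:
        return [0.0] * len(h)
    target = p0 - margin
    target_logit = math.log(target / (1.0 - target))
    logit = sum(wi * hi for wi, hi in zip(w, h)) + b
    # Move along -w by the logit gap divided by ||w||^2.
    step = (logit - target_logit) / sum(wi * wi for wi in w)
    return [-step * wi for wi in w]

# Toy numbers, not taken from the paper.
w, b, p0 = [2.0, -1.0, 0.5], -0.5, 0.5
h = [1.5, -0.5, 1.0]
delta = minimal_perturbation(h, w, b, p0)
h_shifted = [hi + di for hi, di in zip(h, delta)]
print(round(score(h, w, b), 3), round(score(h_shifted, w, b), 3))
```

The margin keeps the shifted score strictly below the threshold. Nothing here protects semantic content beyond keeping the step as short as possible, and a real boundary need not be this tidy.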

That is the Trojan part. The harmful request does not need to become harmless. It needs to arrive wearing the right internal uniform.

CAVGAN turns concept vectors into a generator problem

The paper’s second contribution is methodological. Existing representation-level jailbreak work often relies on extracting or optimising perturbation vectors. For example, one approach may compare positive and negative examples to derive a direction in activation space. Another may use iterative optimisation to search for the perturbation that best shifts a representation across the safety boundary.
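The first of those approaches fits in a few lines. This is a generic difference-of-means direction, not the exact procedure of any method named in the paper; `concept_direction`, `project`, and the toy activations are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def concept_direction(malicious, benign):
    """Unit vector pointing from the benign mean to the malicious mean."""
    mu_m, mu_b = mean_vector(malicious), mean_vector(benign)
    diff = [m - b for m, b in zip(mu_m, mu_b)]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to extract")
    return [d / norm for d in diff]

def project(h, direction):
    """Scalar position of h along the direction."""
    return sum(hi * di for hi, di in zip(h, direction))

# Toy 3-dimensional "activations"; real hidden states have thousands of dimensions.
benign = [[0.1, 0.9, 0.0], [0.2, 1.1, 0.1], [0.0, 1.0, -0.1]]
malicious = [[1.0, 0.1, 0.0], [1.2, -0.1, 0.1], [0.8, 0.0, -0.1]]
direction = concept_direction(malicious, benign)
```

Projecting an activation onto `direction` gives a crude risk score; subtracting a multiple of `direction` gives a crude steering vector. Both uses come from the same extracted object, which is the symmetry the rest of the article leans on.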

CAVGAN changes the workflow: instead of repeatedly extracting or optimising a security concept activation vector, it trains a generator to produce one.

The framework has two components:

Component	What it sees	What it learns	Operational role
Generator	Malicious-query embeddings	Perturbations that make malicious inputs look safer internally	White-box attack engine
Discriminator	Benign, malicious, and perturbed-malicious embeddings	The boundary between safe, unsafe, and disguised unsafe states	Internal defense signal

This is a GAN-like setup, but applied to model activations rather than images or text. The generator tries to produce perturbations that fool the discriminator. The discriminator learns not only to separate benign from malicious inputs, but also to recognise malicious inputs after they have been perturbed.
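One plausible way to write down the two objectives is sketched below. It is not the paper's loss: the binary cross-entropy form, the norm penalty weight `lam`, and every function name are assumptions made for illustration. The discriminator `d` is any callable returning the probability that an embedding is malicious.

```python
import math

def bce(p, label, eps=1e-12):
    """Binary cross-entropy for one predicted probability p and a 0/1 label."""
    p = min(max(p, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def discriminator_loss(d, benign, malicious, perturbed):
    """Benign is labelled 0; malicious and perturbed-malicious are labelled 1."""
    losses = [bce(d(h), 0) for h in benign]
    losses += [bce(d(h), 1) for h in malicious]
    losses += [bce(d(h), 1) for h in perturbed]
    return sum(losses) / len(losses)

def generator_loss(d, malicious, deltas, lam=0.1):
    """Reward perturbed embeddings that d scores as benign; penalise large deltas."""
    total = 0.0
    for h, delta in zip(malicious, deltas):
        shifted = [hi + di for hi, di in zip(h, delta)]
        fool = bce(d(shifted), 0)             # low when d reads the result as benign
        size = sum(di * di for di in delta)   # squared L2 norm of the perturbation
        total += fool + lam * size
    return total / len(malicious)

# Toy discriminator: the first coordinate alone decides the score.
def toy_d(h):
    return 1.0 / (1.0 + math.exp(-4.0 * h[0]))

malicious = [[1.0, 0.0]]
small_push = [[-0.5, 0.0]]
big_push = [[-2.0, 0.0]]
```

With `lam=0.0` the larger push wins because it fools `toy_d` more thoroughly; with `lam=1.0` the norm penalty makes the smaller push cheaper. That tension between crossing the boundary and staying close to the original representation is the whole game.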

The attack stage is straightforward in principle. Take a malicious query, obtain its intermediate embedding, use the trained generator to produce a perturbation, inject that perturbation into the decoding-layer embedding, and let the model continue generation. If the internal safety mechanism is fooled, the model produces unsafe content.
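Mechanically, the intervention is a hook at one layer. The toy below stands in for a transformer with a list of plain functions; `forward_with_hook`, the layer functions, and the fixed `delta` are all hypothetical and exist only to show where the edit happens. In a real stack this would be a hook on a chosen layer of the model, with the perturbation supplied by the trained generator.

```python
def forward_with_hook(layers, x, hook_layer=None, hook=None):
    """Run x through layers in order, applying hook to the output of hook_layer."""
    h = x
    for index, layer in enumerate(layers):
        h = layer(h)
        if hook is not None and index == hook_layer:
            h = hook(h)
    return h

# Three toy "layers" acting on a small vector.
layers = [
    lambda h: [2.0 * v for v in h],
    lambda h: [v + 1.0 for v in h],
    lambda h: [v * v for v in h],
]

delta = [0.5, -0.5]  # stand-in for a generator output, not a learned value

def add_delta(h):
    """Hook that shifts the intermediate representation by delta."""
    return [v + d for v, d in zip(h, delta)]

clean = forward_with_hook(layers, [1.0, 2.0])
patched = forward_with_hook(layers, [1.0, 2.0], hook_layer=1, hook=add_delta)
```

The same hook point serves the defender. Swap `add_delta` for a function that records `h` and returns it unchanged, and the hook becomes a monitor.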

The defense stage is the mirror image. When a query enters the model, the discriminator inspects its internal embedding. If the risk score crosses the threshold, the system does not fine-tune the model or edit weights. It regenerates the answer with a safety warning prefix attached to the input.

That regeneration step matters. CAVGAN is not a hard refusal classifier. It is closer to an internal alarm that says: “This input’s hidden state looks like it belongs to the risky side of the boundary; answer again under explicit safety constraints.” Elegant, in the way a fire alarm attached to a sprinkler is elegant. Slightly less elegant if every toaster triggers it.
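The control flow is short enough to sketch. Everything named here is an assumption: `embed`, `risk_score`, and `generate` are caller-supplied stand-ins for the model's activation extractor, the trained discriminator, and the generation call, and the wording of `SAFETY_PREFIX` is invented for illustration rather than taken from the paper.

```python
SAFETY_PREFIX = "Safety notice: the following request may be harmful. Respond safely.\n"

def guarded_generate(prompt, embed, risk_score, generate, threshold=0.5):
    """Answer normally unless the internal risk score reaches the threshold.

    Returns (response, flagged). On a flag, the model is asked again with a
    safety prefix attached to the input; no weights are changed.
    """
    # A real system would read the embedding during the model's own forward
    # pass; a separate embed() call keeps this sketch simple.
    flagged = risk_score(embed(prompt)) >= threshold
    if flagged:
        return generate(SAFETY_PREFIX + prompt), True
    return generate(prompt), False
```

Returning the `flagged` bit alongside the response is a deliberate choice: it lets the caller log how often the alarm fires, which matters once toasters enter the picture.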

The paper’s real claim is symmetry, not attack novelty

It would be easy to read the results table and ask whether CAVGAN is the strongest jailbreak method. That is the obvious summary. It is also the less interesting one.

The paper compares CAVGAN against JRE and SCAV on AdvBench and StrongREJECT. The metrics include keyword-based attack success, GPT-4o-based attack success, answer relevance, usefulness, and repetition. This is a sensible evaluation choice because simple refusal keywords can mislead: a model can begin with a refusal phrase and still leak unsafe substance later. Keyword-only safety metrics are cheap, but cheap things often come with hidden repair bills.
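The failure mode is easy to reproduce. The sketch below implements a typical keyword-style check, in which an attack counts as successful when no refusal phrase appears. The phrase list and the sample responses are invented for illustration and are not the paper's.

```python
REFUSAL_PHRASES = ("i'm sorry", "i cannot", "i can't", "as an ai")  # hypothetical list

def is_refusal(response):
    """True when any refusal phrase appears anywhere in the response."""
    text = response.lower()
    return any(phrase in text for phrase in REFUSAL_PHRASES)

def keyword_asr(responses):
    """Percentage of responses that contain no refusal phrase."""
    hits = sum(1 for r in responses if not is_refusal(r))
    return 100.0 * hits / len(responses)

responses = [
    "I'm sorry, but I can't help with that.",               # genuine refusal
    "I'm sorry, I shouldn't say this, but step one is...",  # refusal phrase, then leakage
    "Sure, here is a summary of the policy.",               # compliant answer
]
```

The second response is scored as a refusal even though it goes on to comply, so the keyword metric undercounts here. A judge that reads the whole answer, which is the role of the GPT-4o-based metric, is meant to catch that case.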

On the attack side, CAVGAN performs strongly, but the picture is not “CAVGAN crushes everything”. SCAV remains highly competitive and often better on Qwen2.5-7B and Llama3.1-8B. CAVGAN’s strongest relative performance is on Mistral-8B, where it beats SCAV on several metrics across both datasets.

A compressed view:

Experiment	Likely purpose	What it supports	What it does not prove
Attack on Qwen2.5-7B, Llama3.1-8B, Mistral-8B	Main evidence	CAVGAN can produce effective white-box representation-level jailbreaks across model families	That it is always stronger than iterative optimisation methods like SCAV
Attack on Qwen2.5-14B and Qwen2.5-32B	Robustness / scale check	The method still works on larger Qwen models	Transferability to closed models or unrelated architectures
Defense on SafeEdit	Main defense evidence	The discriminator can guide no-fine-tuning defense and reduce attack success	That regeneration is latency-free or production-ready
Layer selection analysis	Sensitivity test	Mid-layer perturbations best balance attack success and output quality	A universal layer rule for all LLMs
Training sample size analysis	Sensitivity test	CAVGAN can work with limited training samples, but performance fluctuates	That more data monotonically improves the GAN

The paper itself acknowledges the attack gap with SCAV. Its explanation is plausible: SCAV’s iterative optimisation can more explicitly constrain perturbation norm and find a direction that moves representations into the safe region. CAVGAN uses a relatively simple MLP generator and discriminator, so there is headroom.

That should not be read as a failure. It clarifies the contribution. CAVGAN is not primarily valuable because it is the single best attack. It is valuable because the same learned machinery that attacks the boundary also produces a discriminator that can defend it.

In security terms, the paper turns red-team artefacts into blue-team instrumentation.

The defense results are promising, with one arithmetic wrinkle

The defense experiment compares CAVGAN with two no-fine-tuning defenses: SmoothLLM and RA-LLM. The benchmark is SafeEdit for jailbreak prompts, with Alpaca benign queries used to measure whether normal helpfulness survives. The two key metrics are:

Defense Success Rate (DSR), where higher means more harmful attacks are blocked.
Benign Answering Rate (BAR), where higher means the model still responds to normal requests.

This pair is important. A defense that blocks everything is not a defense; it is a very expensive “no”. Enterprise AI systems need refusal precision, not just refusal volume.
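In code, the pair is two percentages computed on two different test sets. The function names and the boolean encoding are assumptions for illustration; the paper's own judging procedure decides what counts as blocked or answered.

```python
def defense_success_rate(blocked):
    """Percent of attack prompts that were blocked (list of booleans)."""
    return 100.0 * sum(blocked) / len(blocked)

def benign_answering_rate(answered):
    """Percent of benign prompts that still received an answer."""
    return 100.0 * sum(answered) / len(answered)

# A defense that refuses everything: perfect on one metric, useless on the other.
refuse_all_dsr = defense_success_rate([True] * 50)
refuse_all_bar = benign_answering_rate([False] * 50)
```

Reporting either number without the other is how the very expensive “no” gets a good score.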

The reported CAVGAN results are materially better than the baselines on defense success:

Model	Original DSR (%)	SmoothLLM DSR (%)	RA-LLM DSR (%)	CAVGAN DSR (%)	CAVGAN BAR (%)
Qwen2.5-7B	25.12	54.22	78.60	91.12	91.40
Llama3.1-8B	11.34	48.97	73.78	77.22	93.60
Mistral-8B	18.59	52.07	71.18	76.37	91.06

The operational reading is that CAVGAN improves safety without destroying ordinary usefulness. On Qwen2.5-7B, it raises defense success by 12.52 points over RA-LLM, the strongest baseline, while preserving a benign answering rate above 91%. On Llama3.1-8B and Mistral-8B, the improvement over RA-LLM is smaller but still positive, at 3.44 and 5.19 points respectively, and benign answering remains above 91%.

There is, however, a reporting wrinkle. The abstract says the defense success rate reaches an average of 84.17%. That average matches Qwen2.5-7B and Llama3.1-8B: $(91.12 + 77.22) / 2 = 84.17$. But Table 3 also includes Mistral-8B at 76.37%. Averaging all three CAVGAN defense success values gives roughly 81.57%.
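The arithmetic is quick to check from the CAVGAN defense success values in the table above:

```python
from statistics import mean

# CAVGAN defense success rates (%) as reported in the results table.
cavgan_dsr = {"Qwen2.5-7B": 91.12, "Llama3.1-8B": 77.22, "Mistral-8B": 76.37}

two_model_avg = round(mean([cavgan_dsr["Qwen2.5-7B"], cavgan_dsr["Llama3.1-8B"]]), 2)
three_model_avg = round(mean(list(cavgan_dsr.values())), 2)
print(two_model_avg, three_model_avg)
```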

This does not invalidate the result. It does mean operators should read the tables, not just the abstract. A sentence that belongs on the wall of every AI procurement team.

Why middle layers matter more than the leaderboard

The layer selection analysis is easy to treat as a small implementation detail. It is more useful than that.

The authors report that layers close to the middle of the model achieve the best attack effect. Perturbing later layers still produces high keyword-based attack success, but output quality degrades: repeated and meaningless characters appear. Perturbing earlier layers preserves text quality better, but the attack success rate is low.

This creates a practical interpretation: the safety-relevant representation is not born fully formed at the input, nor is it only decided at the final token. It appears to develop through the network. Middle layers may be where the representation is sufficiently semantic to affect safety classification but not so late that perturbation wrecks fluency.

For security engineering, that suggests a useful design principle. Internal monitoring needs layer discipline. Watching the wrong activations may produce either weak detection or noisy intervention. A hidden-state defense is not “attach a classifier somewhere and declare victory”. The “somewhere” is doing quite a bit of work.
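A layer sweep on validation data is one way to impose that discipline. The sketch below is an assumption about how such a sweep might be scored, not the authors' procedure: `pick_layer`, the quality floor, and the numbers in `sweep` are synthetic placeholders, chosen only to mimic the early, middle, and late pattern described above.

```python
def pick_layer(sweep, min_quality=0.8):
    """Return the layer with the highest effect among layers that keep output quality.

    sweep maps layer index -> (effect, quality), both in [0, 1], measured on
    held-out validation prompts. Raises if no layer meets the quality floor.
    """
    usable = {k: v for k, v in sweep.items() if v[1] >= min_quality}
    if not usable:
        raise ValueError("no layer meets the quality floor")
    return max(usable, key=lambda k: usable[k][0])

# Synthetic placeholder values, NOT results from the paper.
sweep = {
    4: (0.20, 0.95),    # early: fluent output, weak effect
    16: (0.90, 0.90),   # middle: strong effect, fluent output
    28: (0.95, 0.40),   # late: strong keyword effect, degraded text
}
best = pick_layer(sweep)
```

The quality floor is the important part. Without it, the sweep would happily pick the late layer and report a high success rate over text nobody can read.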

The training sample size test tells a similar story. On Qwen2.5-7B with AdvBench, attack success rises from 66% with 40 samples to 98% with 100 samples, then declines to 95% at 120 samples and 91% at 150 samples. The authors attribute this to the simple MLP architecture and GAN training fluctuations.
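The reported points make the non-monotonic shape explicit. The values below are the four quoted above; the helper name is hypothetical.

```python
# (training samples, attack success %) for Qwen2.5-7B on AdvBench, as quoted above.
points = [(40, 66), (100, 98), (120, 95), (150, 91)]

def is_monotonic_increasing(pairs):
    """True when success never drops as the sample count grows."""
    values = [y for _, y in sorted(pairs)]
    return all(a <= b for a, b in zip(values, values[1:]))

best_size, best_asr = max(points, key=lambda p: p[1])
```

Four points are enough to rule out “more data is always better” and not enough to say where the true optimum sits.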

That result is not a scalability law. It is a warning that CAVGAN is not a plug-and-play curve where more data automatically makes the system better. GANs have moods. Anyone who has trained one knows this; anyone who has not should keep a chair nearby.

What businesses should actually take from this

The business interpretation depends heavily on deployment model.

For organisations using closed API models, CAVGAN is mostly not directly deployable. You cannot inspect intermediate hidden states through ordinary API access. You cannot inject perturbations into decoding-layer embeddings. You cannot attach a discriminator to a layer you cannot see. The practical value is therefore indirect: use this paper to ask vendors better questions about representation-level monitoring, internal red-teaming, and whether their defenses operate only at prompt/output level or deeper in the model stack.

For organisations deploying open-weight or self-hosted LLMs, the paper is much more actionable. It suggests a security architecture where prompt filtering, output moderation, and hidden-state inspection are complementary layers rather than substitutes.

A realistic deployment pattern might look like this:

Security layer	What it checks	Strength	Weakness
Prompt filter	Surface-level user input	Cheap and easy to deploy	Vulnerable to obfuscation and disguised intent
Output classifier	Final model response	Catches visible unsafe content	May intervene too late and can miss latent leakage
Hidden-state monitor	Internal activations during inference	Detects risk inside the model’s reasoning path	Requires white-box access and careful layer selection
Regeneration policy	Re-answer under safety constraints	Avoids weight updates and can preserve usefulness	Adds latency and may still fail under adaptive attacks

CAVGAN sits mainly in the third and fourth rows. It does not replace existing controls. It makes a case for adding a sensor inside the model.
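Composed together, the rows form a short pipeline. This is an illustrative arrangement under stated assumptions, not a reference architecture: every callable is supplied by the caller, and names such as `run_pipeline` and `Verdict`, along with the prefix and refusal strings, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Verdict:
    response: str
    fired: List[str]  # names of the layers that intervened, in order

def run_pipeline(
    prompt: str,
    prompt_filter: Callable[[str], bool],
    hidden_state_risk: Callable[[str], float],
    generate: Callable[[str], str],
    output_classifier: Callable[[str], bool],
    threshold: float = 0.5,
    safety_prefix: str = "[safety] ",
    refusal: str = "Request declined.",
) -> Verdict:
    """Apply the four layers in order and record which ones intervened."""
    fired: List[str] = []
    if prompt_filter(prompt):                    # surface check on the raw input
        return Verdict(refusal, ["prompt_filter"])
    model_input = prompt
    if hidden_state_risk(prompt) >= threshold:   # internal sensor
        fired.append("hidden_state_monitor")
        model_input = safety_prefix + prompt     # regeneration policy
        fired.append("regeneration")
    response = generate(model_input)
    if output_classifier(response):              # final check on the visible output
        fired.append("output_classifier")
        response = refusal
    return Verdict(response, fired)
```

Recording which layers fired is the design choice worth copying. It tells you whether the hidden-state monitor is catching things the prompt filter missed, or merely agreeing with it at extra cost.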

The ROI argument is not “GANs are cool”. That would be the sort of procurement logic that creates incident reports. The ROI argument is that no-fine-tuning defenses can be cheaper and faster to adapt than full safety retraining. If a discriminator can learn risky internal patterns from limited examples and guide regeneration, security teams may gain a lighter-weight mechanism for model-specific hardening.

But the word “model-specific” is doing work. CAVGAN is trained and evaluated in white-box settings on particular open models. It is not evidence that one universal discriminator will protect every model, every domain, every language, and every future jailbreak family. That would be magic, and magic has a poor audit trail.

The uncomfortable implication: red-team work may be defense data

The most useful managerial lesson is not technical. It is organisational.

Many companies still treat red-teaming as a compliance ritual: run attacks, write a report, patch prompts, repeat next quarter. CAVGAN points toward a more productive loop. Attacks are not merely failures to be documented. They are measurements of the model’s internal boundary.

If attack perturbations reveal how harmful prompts move through representation space, then red-team artefacts can train or calibrate defense mechanisms. This is a stronger feedback loop than collecting bad prompts and adding them to a blocklist. A blocklist memorises symptoms. A representation-level discriminator tries to learn the disease.
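One concrete form of that loop is threshold calibration. The sketch assumes a detector that already emits a risk score per prompt; `calibrate_threshold`, `catch_rate`, the target pass rate, and the toy scores are all hypothetical. It picks the smallest threshold that still lets a target share of benign traffic through, then reports how many logged red-team attacks that threshold would catch.

```python
def calibrate_threshold(benign_scores, pass_percent=90):
    """Smallest observed score t such that at least pass_percent of benign scores are <= t.

    A prompt is flagged when its score is strictly greater than t.
    """
    if not benign_scores or not 0 < pass_percent <= 100:
        raise ValueError("need benign scores and a pass_percent in (0, 100]")
    ranked = sorted(benign_scores)
    need = -(-len(ranked) * pass_percent // 100)  # integer ceiling
    return ranked[need - 1]

def catch_rate(attack_scores, threshold):
    """Percent of logged attack prompts whose score exceeds the threshold."""
    return 100.0 * sum(s > threshold for s in attack_scores) / len(attack_scores)

# Toy scores, invented for illustration.
benign_scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.31, 0.35, 0.40, 0.70]
red_team_scores = [0.38, 0.55, 0.81, 0.90, 0.95]
threshold = calibrate_threshold(benign_scores, pass_percent=90)
caught = catch_rate(red_team_scores, threshold)
```

Each new red-team run adds rows to `red_team_scores`, so the catch rate becomes a number the blue team can track over time instead of a paragraph in a quarterly report.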

That does not mean every enterprise needs to build CAVGAN next week. It means internal AI security should stop treating “attack” and “defense” as separate workstreams with separate dashboards and separate PowerPoint weather systems. The paper’s core thesis is that the two can be adversarially coupled.

The red team learns how to cross the boundary. The blue team learns the shape of the crossing.

Boundaries before anyone gets too pleased with themselves

CAVGAN is promising, but its operating envelope is narrow enough to matter.

First, this is a white-box method. It requires access to internal embeddings and model internals. That fits open-weight or self-hosted deployments. It does not fit ordinary API-only usage.

Second, the defense uses regeneration with a safety prefix. That can add latency. In offline workflows, such as document review or internal knowledge assistants, the cost may be acceptable. In real-time customer service, voice agents, or high-volume transaction support, repeated regeneration can become noticeable.
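A back-of-envelope model shows when that becomes noticeable. It assumes one extra full generation per flagged request and a fixed detector overhead; the function name and all numbers are hypothetical inputs, not measurements.

```python
def expected_latency_ms(base_ms, flag_rate, detector_ms=0.0):
    """Mean latency when each flagged request triggers exactly one regeneration."""
    if not 0.0 <= flag_rate <= 1.0:
        raise ValueError("flag_rate must be between 0 and 1")
    return base_ms + detector_ms + flag_rate * base_ms

# Hypothetical service: 800 ms per generation, 10% of traffic flagged, 5 ms detector.
mean_ms = expected_latency_ms(800.0, 0.10, detector_ms=5.0)
flagged_ms = expected_latency_ms(800.0, 1.0, detector_ms=5.0)
```

The mean hides the tail. Every flagged request pays for the full second generation, and that is the delay a caller on a voice line actually hears.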

Third, some parameters are selected using validation data, including target layer choices. The paper notes the lack of a more efficient automated process. Production systems would need robust layer selection, monitoring drift, and fallback behaviour when the detector becomes unstable.
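Drift monitoring with a fallback can start as something very small. The sketch below tracks the detector's recent flag rate and reports when it leaves an expected band; the class name, window size, and band are assumptions to be tuned per deployment, and what the fallback actually does is left to the operator.

```python
from collections import deque

class FlagRateMonitor:
    """Rolling flag rate over the last `window` requests."""

    def __init__(self, window=1000, low=0.0, high=0.2):
        self.events = deque(maxlen=window)
        self.low, self.high = low, high

    def record(self, flagged):
        """Store whether the latest request was flagged."""
        self.events.append(bool(flagged))

    def rate(self):
        """Share of flagged requests in the current window."""
        return sum(self.events) / len(self.events) if self.events else 0.0

    def unstable(self):
        """True once the window is full and the rate sits outside [low, high]."""
        full = len(self.events) == self.events.maxlen
        return full and not (self.low <= self.rate() <= self.high)

# Tiny window so the behaviour is visible; production windows would be larger.
monitor = FlagRateMonitor(window=4, high=0.5)
for flagged in [False, True, True, True]:
    monitor.record(flagged)
```

A flag rate that suddenly doubles can mean an attack campaign or a detector that has drifted. The monitor cannot tell which, but it does tell you to go and look.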

Fourth, the architecture is intentionally simple: a four-layer MLP generator and discriminator, trained for 10 rounds with a learning rate of 0.001. This is useful because it shows the idea does not require an enormous auxiliary model. It is also a limitation because the paper has not tested whether more complex generator/discriminator designs improve stability or generalisation.

Fifth, the evaluation is benchmark-based. AdvBench, StrongREJECT, SafeEdit, and Alpaca are useful research tools, but they are not a full production threat model. Real attackers adapt. Enterprise prompts contain proprietary context. Multilingual, tool-using, multimodal, and agentic systems introduce additional attack surfaces.

Finally, CAVGAN is a safety mechanism for text LLMs under specific experimental conditions. The authors suggest that generated concept activation vectors may extend beyond LLM security, but that remains future work. A good sentence to remember: “may extend” is not the same as “has been shown”.

The strategic lesson: safety is becoming representational

CAVGAN belongs to a broader shift in LLM security: away from treating safety as a wrapper around the model and toward treating it as a property of the model’s internal computation.

Wrappers still matter. Prompt filters and output classifiers are easy to deploy, easy to audit, and easy to explain. They are also surface instruments. They observe the conversation before and after the model has done its internal work.

Representation-level methods ask a different question: what does the model think it is being asked to do while it is processing the request?

That question is more invasive, more technically demanding, and more useful for open deployments. It also forces a less comforting view of alignment. A model may know that a request is unsafe, but a perturbation can shift the internal representation enough that the safety machinery fails to trigger. The boundary exists; therefore, it can be crossed. The boundary can be crossed; therefore, it can also be monitored.

That is the paper’s best idea. Not the highest number in a table. Not the GAN branding. Not the slightly theatrical “attack is defense” slogan. The best idea is that the same internal map that lets researchers build stronger attacks can also become the basis for stronger defenses.

For operators, this reframes the security roadmap. If you rely on closed APIs, ask vendors what they monitor beyond text. If you run open models, start thinking about hidden-state telemetry, layer selection, and representation-aware red-team loops. If you are buying AI safety tooling, be suspicious of anything that only reads the prompt and calls itself comprehensive. Comprehensive compared with what, exactly?

The Trojan GAN is not a product category yet. It is a warning flare. LLM security is moving inside the model. The organisations that notice early will build better controls. The ones that do not will keep polishing the front door while the interesting traffic walks through the walls.

Cognaptus: Automate the Present, Incubate the Future.

¹ Xiaohu Li, Yunfeng Ning, Zepeng Bao, Mayi Xu, Jianhao Chen, and Tieyun Qian, “CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations,” arXiv:2507.06043, 2025.
