Mind the Slot: Jailbreak Prompts Have Weak Points, Not Just Bad Words

Security teams like to search for suspicious strings.

That habit is understandable. Strings are visible. They can be logged, filtered, matched, scored, and proudly displayed in dashboards. A bad suffix at the end of a prompt looks like a bad suffix at the end of a prompt. Convenient. Almost too convenient.

The problem is that prompts are not flat text boxes. They are transformed into token sequences, wrapped in chat templates, and passed through attention layers that do not treat every position equally. Some positions receive more influence over the model’s next-token behavior than others. Put adversarial tokens there, and the same amount of “badness” can travel farther.

That is the core idea behind SlotGCG: Exploiting the Positional Vulnerability in LLMs for Jailbreak Attacks, a June 2026 paper by Seungwon Jeong, Jiwoo Jeong, Hyeonjin Kim, Yunseok Lee, and Woojin Lee.¹ The paper argues that jailbreak attacks should not be understood only as a problem of adversarial wording. They are also a problem of adversarial placement.

This is a small conceptual shift with large operational consequences. If true, it means many red-team evaluations are testing the content of attacks while under-testing the geometry of the prompt. That is a slightly embarrassing sentence for anyone who thought “scan the suffix” sounded like a strategy.

The real weakness is not only the adversarial string

Most optimization-based jailbreak work starts from a familiar template. Take a harmful user request. Add an adversarial token sequence, usually as a suffix. Then optimize those adversarial tokens so the model becomes more likely to produce a target harmful response.

Greedy Coordinate Gradient, or GCG, is the representative method in this family. It treats the adversarial suffix as the editable part of the prompt and iteratively updates tokens using gradient information. Later methods improve the optimization procedure, attention targeting, or coordinate selection. But they largely inherit one assumption: the attack lives at the end.

SlotGCG asks whether that assumption is deserved.

The paper defines a slot as a possible insertion position in a token sequence. If a prompt has $n$ tokens, there are $n + 1$ possible slots: before the first token, between every pair of adjacent tokens, and after the final token. The suffix is merely one slot: the last one.
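The slot count is easy to check by hand. The snippet below is a minimal sketch, not the paper's code: `enumerate_slots` and `insert_at_slot` are hypothetical helper names, and the whitespace "tokens" stand in for real tokenizer output.

```python
def enumerate_slots(tokens):
    """Return every insertion index for a token list: 0 .. len(tokens)."""
    return list(range(len(tokens) + 1))


def insert_at_slot(tokens, slot, probe):
    """Insert probe tokens at the given slot without mutating the input."""
    if not 0 <= slot <= len(tokens):
        raise ValueError(f"slot {slot} out of range for {len(tokens)} tokens")
    return tokens[:slot] + list(probe) + tokens[slot:]


# Toy whitespace tokens (an assumption; a real model uses its own tokenizer).
prompt = ["Tell", "me", "a", "story"]
slots = enumerate_slots(prompt)      # 4 tokens give 5 slots
suffix_slot = slots[-1]              # the conventional suffix position

print(len(slots), suffix_slot)
print(insert_at_slot(prompt, 1, ["<adv>"]))
print(insert_at_slot(prompt, suffix_slot, ["<adv>"]))
```

Seen this way, a suffix attack is simply the special case `slot == len(tokens)`.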

That sounds pedantic until one remembers how LLMs actually process prompts. A user-facing prompt is not simply “the sentence the user typed.” It is wrapped into model-specific chat formatting: system tokens, user-role tokens, assistant-role tokens, separators, and other template fragments. The model’s attention patterns then decide which tokens are influential when generation begins. The suffix is not guaranteed to be the most important location. It is just the easiest place for researchers to put things.

SlotGCG’s first useful move is therefore not an attack trick. It is a measurement question:

Which positions in the prompt are most vulnerable to adversarial token insertion?

Once phrased this way, the older suffix-only framing starts to look less like a principle and more like a convenience. Convenient assumptions have a habit of becoming infrastructure. Then someone tests them, and everyone has to update their slide decks.

Vulnerable slots are prompt-specific, not suffix-shaped by default

The paper begins with exploratory experiments, and these are more important than the later performance tables because they establish the mechanism.

The authors first run an exhaustive slot scan on 50 harmful prompts from AdvBench. For each prompt, they insert a short adversarial sequence into each possible slot and run a limited GCG optimization budget. They then compare which slot produces the lowest adversarial loss.
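The shape of that scan can be sketched in a few lines. This is an illustration under stated assumptions: `scan_slots` is a hypothetical name, and `toy_loss` is a made-up stand-in for the limited GCG optimization run that would produce a real adversarial loss per slot.

```python
from typing import Callable, List, Sequence, Tuple


def scan_slots(
    prompt: Sequence[str],
    probe: Sequence[str],
    loss_fn: Callable[[List[str]], float],
) -> List[Tuple[int, float]]:
    """Score every insertion slot and return (slot, loss) pairs, lowest loss first."""
    results = []
    for slot in range(len(prompt) + 1):
        candidate = list(prompt[:slot]) + list(probe) + list(prompt[slot:])
        results.append((slot, loss_fn(candidate)))
    return sorted(results, key=lambda pair: pair[1])


def toy_loss(tokens: List[str]) -> float:
    """HYPOTHETICAL loss: lowest when the probe sits directly after 'how'.

    A real scan would run a short optimization here and return the final loss.
    """
    probe_at = tokens.index("<adv>")
    anchor = tokens.index("how")
    return abs(probe_at - (anchor + 1)) + 0.5


prompt = ["explain", "how", "locks", "work"]
ranked = scan_slots(prompt, ["<adv>"], toy_loss)
best_slot, best_loss = ranked[0]
suffix_slot = len(prompt)

print("best slot:", best_slot, "suffix slot:", suffix_slot)
```

The cost is the point: one optimization run per slot, per prompt, which is why the paper later looks for a cheaper proxy.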

The key result: the lowest-loss slot varies substantially across prompts, and in their 50-prompt pilot scan, the optimal slot is never the suffix.

That does not mean the suffix is useless. It means the suffix is not structurally privileged in the way many attack pipelines implicitly assume. Some prompts have more influential positions earlier in the sequence or around internal prompt boundaries. The prompt has a topology. The attack surface is distributed.

This point matters because the misconception the paper targets is easy to fall into:

Reader belief	Correction from the paper	Why it matters
Jailbreak strength mostly comes from the adversarial token sequence.	Token placement can materially change attack effectiveness, even before changing the optimization method.	Red-team tests that only optimize suffixes may miss vulnerabilities located elsewhere in the prompt.
The suffix is the natural attack position because many attacks use it.	The suffix is a convention, not necessarily the most vulnerable slot.	Defenses tuned against suffix attacks may be overfitted to a visible pattern.
If an attack works, the tokens must be unusually powerful.	The same token budget may become more powerful when allocated to high-vulnerability slots.	Security tooling should evaluate placement, not only lexical suspiciousness.

The word “slot” may sound harmless. It is not. A slot is where the prompt gives the adversary leverage.

VSS turns attention into a vulnerability map

Finding vulnerable slots by exhaustive search is expensive. It is also not a practical red-team method if every prompt requires a full per-position optimization run.

So the paper proposes a metric: Vulnerable Slot Score, or VSS.

The intuition is straightforward. If adversarial tokens inserted at a given slot receive more attention from the tokens that guide response generation, that slot may be more capable of influencing the model’s output. The authors focus on attention from after-chat-template tokens in upper transformer layers, because those positions are close to where the model transitions into assistant generation and because prior jailbreak analyses have linked upper-layer attention behavior to attack success.

In simplified form, VSS measures how strongly relevant generation-side template tokens attend to inserted probe tokens at a given slot:

$$ VSS(s) = \text{attention from selected upper-layer after-chat tokens to inserted probe tokens at slot } s $$

The exact implementation aggregates attention weights over selected layers, heads, and template positions. The important part is conceptual: VSS creates a prompt-specific map of where adversarial insertions are likely to matter.
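A rough sketch of that aggregation follows. The paper's exact weighting is not reproduced here; this version assumes a plain average over the selected layers, heads, and query positions, and the function name, argument names, and attention layout (`attn[layer][head][query][key]`) are all illustrative.

```python
def vulnerable_slot_score(attn, query_positions, probe_positions, first_layer):
    """Average attention mass flowing from query positions to probe positions.

    attn[layer][head][q][k] is the attention weight from query q to key k.
    Only layers with index >= first_layer are used (an assumed way of
    selecting "upper" layers).
    """
    total, count = 0.0, 0
    for layer in attn[first_layer:]:
        for head in layer:
            for q in query_positions:
                total += sum(head[q][k] for k in probe_positions)
                count += 1
    return total / count if count else 0.0


# Toy attention: 2 layers, 1 head, 3 positions. Position 1 holds the probe
# token and position 2 plays the role of an after-chat-template token.
toy_attn = [
    [[[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.2, 0.3, 0.5]]],  # layer 0
    [[[1.0, 0.0, 0.0], [0.4, 0.6, 0.0], [0.1, 0.7, 0.2]]],  # layer 1
]

score = vulnerable_slot_score(
    toy_attn, query_positions=[2], probe_positions=[1], first_layer=1
)
print(score)
```

Repeating this for a probe inserted at each slot yields one score per slot, which is the prompt-specific map the section describes.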

The paper reports two mechanism-level observations.

First, after optimization, higher adversarial-token attention correlates with lower adversarial loss across candidate slots. Lower loss means the attack is moving the model closer to the harmful target behavior. This supports the claim that attention is not just decorative interpretability confetti. In this setting, it helps locate influential insertion positions.

Second, the relative ordering of vulnerable slots is already visible early. The authors compare VSS before and after optimization and find positive correlations, with coefficients generally ranging from 0.4 to 0.9 across the 50 prompts. Appendix G extends this with prompt-level plots showing that high initial VSS positions often remain high after optimization, and that final VSS peaks align with loss minima.

That is a meaningful result. It suggests vulnerable positions are not merely created by the optimization process. They are partly inherent in the prompt-template interaction.
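To make the before-and-after comparison concrete, here is a small sketch that correlates per-slot scores from two stages. The numbers are invented for illustration, and Pearson correlation is an assumption: this article does not state which coefficient the authors computed.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# MADE-UP per-slot scores for one prompt, before and after optimization.
vss_before = [0.10, 0.30, 0.20, 0.05]
vss_after = [0.15, 0.50, 0.25, 0.10]

r = pearson(vss_before, vss_after)
print(f"correlation between initial and final VSS: {r:.2f}")
```

A high value means the slot ranking was largely settled before any optimization happened, which is the claim at stake.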

For business readers, the translation is simple: the system may have weak points before the attacker has done anything clever. The attacker’s optimization only discovers and exploits them. Lovely.

SlotGCG is a position-search layer, not a completely new attack family

The method itself is best understood as a wrapper around existing optimization-based attacks.

SlotGCG adds a slot-selection stage before adversarial token optimization. Its pipeline has four steps:

1. Insert probe tokens into candidate slots.
2. Compute VSS for each slot.
3. Convert VSS values into an insertion probability distribution and allocate adversarial tokens across slots.
4. Run a GCG-style optimization method on the allocated adversarial token positions.
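The third step is the least obvious, so here is one way it could work. This is a sketch under assumptions, not the paper's procedure: it uses a temperature-scaled softmax to turn scores into probabilities and a largest-remainder rule to hand out a fixed token budget. Both function names are hypothetical.

```python
import math


def slot_distribution(vss, temperature=1.0):
    """Turn per-slot scores into probabilities with a temperature softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / temperature for v in vss]
    peak = max(scaled)                      # subtract the max for stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def allocate_tokens(probs, budget):
    """Split an integer token budget across slots (largest-remainder rule)."""
    raw = [p * budget for p in probs]
    counts = [math.floor(r) for r in raw]
    leftover = budget - sum(counts)
    by_remainder = sorted(
        range(len(probs)), key=lambda i: raw[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


vss = [0.1, 0.6, 0.2, 0.1]                  # MADE-UP scores for four slots
spread = allocate_tokens(slot_distribution(vss, temperature=1.0), budget=20)
sharp = allocate_tokens(slot_distribution(vss, temperature=0.1), budget=20)
print(spread, sharp)
```

Lower temperatures concentrate the budget on the top-scoring slot, while higher ones spread tokens more evenly. That trade-off is presumably what a temperature sensitivity study would probe.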

This is why the paper describes SlotGCG as attack-agnostic. It is not trying to replace GCG, AttnGCG, I-GCG, GCG-Hij, or GBDA. It tries to improve them by choosing better places for their editable tokens.

That distinction matters. Many security papers propose a new method and then compare it to older methods as if the field needed another acronym to supervise the previous acronyms. Here, the contribution is more diagnostic: the authors show that several existing optimization attacks can perform better when they are allowed to exploit position.

The paper reports that this preprocessing adds roughly 200 ms. That figure should not be treated as a universal production cost, because deployment stacks, model sizes, batching, hardware, and closed-model access constraints vary. But the architectural point is still important: position search is cheap compared with full optimization. If the slot map is useful, it is a high-leverage addition.

The main evidence: stronger attacks, faster convergence, and mixed model dependence

The main effectiveness table applies SlotGCG to five attack methods across six open models: Llama-2-7B, Llama-2-13B, Llama-3.1-8B, Mistral-7B, Vicuna-7B, and Qwen-2.5. The dataset uses 50 harmful behaviors from AdvBench. Attack success rate is evaluated through a three-stage pipeline: keyword filtering, GPT-4-based checking for early stopping, and final manual verification.

The headline result is that SlotGCG improves average ASR across the tested attack families:

Attack family	Baseline average ASR	With SlotGCG	Average gain
GCG	66.7%	80.0%	+13.3 points
AttnGCG	61.7%	86.3%	+24.6 points
I-GCG	73.0%	85.7%	+12.7 points
GCG-Hij	78.0%	84.3%	+6.3 points
GBDA	20.3%	40.0%	+19.7 points
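The gain column is a difference in percentage points, which is not the same thing as relative improvement. The short check below recomputes both from the averages in the table; the `gains` helper is illustrative only.

```python
# Average ASR values (%) copied from the table above: (baseline, with SlotGCG).
average_asr = {
    "GCG": (66.7, 80.0),
    "AttnGCG": (61.7, 86.3),
    "I-GCG": (73.0, 85.7),
    "GCG-Hij": (78.0, 84.3),
    "GBDA": (20.3, 40.0),
}


def gains(baseline, slotted):
    """Return (absolute gain in percentage points, multiplier over baseline)."""
    return slotted - baseline, slotted / baseline


for name, (baseline, slotted) in average_asr.items():
    points, ratio = gains(baseline, slotted)
    print(f"{name:8s} +{points:.1f} points, {ratio:.2f}x baseline")
```

AttnGCG has the largest gain in points, while GBDA roughly doubles its baseline. Which of those counts as "biggest" depends on the question being asked.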

The most dramatic improvements appear on Llama-family models and weaker baselines. For example, AttnGCG on Llama-2-13B rises from 20.0% to 82.0%, and I-GCG on Llama-2-13B rises from 56.0% to 94.0%. GBDA also improves substantially in several cases, though from a much lower base.

But the table is not uniformly triumphant, and it should not be read that way. Mistral-7B and Vicuna-7B already show high baseline success rates for several methods, leaving less room for improvement. In some cells SlotGCG produces small declines: AttnGCG on Mistral-7B falls from 94.0% to 92.0%, and GCG-Hij on Vicuna-7B falls from 86.0% to 82.0%.

That pattern is not a flaw in the paper. It is useful information. Slot placement matters most when the baseline attack has not already found an effective path through the model. When the baseline is already near saturation, position awareness may add little, or it may slightly disturb a configuration that was already working.

So the right interpretation is not “SlotGCG always wins.” The right interpretation is sharper: position search exposes hidden vulnerability especially where suffix-only optimization leaves performance on the table.

The efficiency result is the quiet business story

The attack success table is the obvious result. The convergence table may be the more operational one.

SlotGCG reduces the average number of iterations needed for successful attacks across nearly all tested GCG-style methods. For GCG itself, the average falls from 72.59 iterations to 28.59. For AttnGCG, it drops from 75.73 to 20.21. For I-GCG, from 66.52 to 20.11. For GCG-Hij, from 59.65 to 25.25.

Among the largest reductions reported are the Llama-2 models under GCG: average iterations fall from 138.11 to 40.50 on Llama-2-7B and from 141.82 to 38.01 on Llama-2-13B.

This matters because red teaming is not only about whether an attack eventually succeeds. It is about how much search cost is required to find failures. A vulnerability that appears after 20 iterations is operationally different from one that appears after 140. The former is cheaper to discover, easier to scale, and more likely to be found by systematic adversaries.

For enterprises, this is the part that should make the security roadmap uncomfortable. If position-aware search reduces optimization cost, then a defender’s evaluation budget must account for it. A system that survives suffix-only attacks under a narrow compute budget may not survive a position-aware attack with the same or lower budget.

The useful business metric is not simply ASR. It is ASR per unit of search effort.

That is not as catchy. It is also closer to how attackers and auditors actually operate.
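As a back-of-the-envelope illustration, the GCG averages quoted above can be folded into a single effort-adjusted number. The metric below is an assumption made for this article, not one the paper defines. It is also crude: the iteration counts are averages over successful attacks, and the ASR figures are averages across models.

```python
def asr_per_100_iterations(asr_percent, mean_iterations):
    """Crude effort-adjusted score: ASR points per 100 optimization steps."""
    if mean_iterations <= 0:
        raise ValueError("mean_iterations must be positive")
    return 100.0 * asr_percent / mean_iterations


# GCG averages quoted in this article: ASR (%) and mean iterations to success.
baseline = asr_per_100_iterations(66.7, 72.59)
slotted = asr_per_100_iterations(80.0, 28.59)

print(f"GCG baseline: {baseline:.1f}")
print(f"GCG + SlotGCG: {slotted:.1f}")
print(f"ratio: {slotted / baseline:.2f}x")
```

On these numbers the effort-adjusted score roughly triples, a larger shift than the ASR column alone suggests.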

Defense results show pattern overfitting, not defense death

The paper tests SlotGCG under several defenses, including Erase-and-Check variants, a Perplexity Filter, SmoothLLM variants, RPO, SafeDecoding, and Llama-Guard-3.

These experiments should be read as robustness tests, not as a declaration that defenses are useless. The more precise claim is that several prompt-level defenses are less effective when adversarial tokens are distributed across multiple vulnerable slots rather than concentrated at the suffix.

The defense table shows large gains for SlotGCG under Erase-and-Check and SmoothLLM variants. For example, under Erase-and-Check suffix defense, baseline GCG has 0.0% ASR while GCG + SlotGCG reaches 52.0%. I-GCG + SlotGCG reaches 66.0% under the same defense. Under SmoothLLM swap, GCG rises from 44.0% to 86.0%, AttnGCG from 30.0% to 92.0%, I-GCG from 44.0% to 96.0%, and GCG-Hij from 44.0% to 96.0%.

However, the results are not equally strong across all defenses. The Perplexity Filter blocks all tested attacks in the main defense table, including SlotGCG, at 0.0% ASR. RPO, SafeDecoding, and Llama-Guard-3 show smaller and more uneven differences than Erase-and-Check or SmoothLLM.

This is a useful warning against lazy generalization. SlotGCG seems particularly relevant where defenses assume attacks are localized, suffix-like, or fragile to local perturbation. When a defense uses a different mechanism, the advantage can shrink.
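A toy model shows why a suffix-oriented erasure defense can miss distributed tokens. Everything here is simplified and hypothetical: the "filter" is a one-line stand-in that is fooled whenever any adversarial token is present, and only the suffix mode of an erase-and-check style defense is sketched.

```python
def toy_filter_is_harmful(tokens):
    """HYPOTHETICAL safety filter: fooled by the presence of any <adv> token."""
    return "HARM" in tokens and "<adv>" not in tokens


def erase_and_check_suffix(tokens, max_erase, is_harmful):
    """Return True (reject) if the prompt, or any copy of it with up to
    max_erase trailing tokens erased, is flagged by the filter."""
    for erased in range(0, min(max_erase, len(tokens)) + 1):
        kept = tokens[: len(tokens) - erased]
        if is_harmful(kept):
            return True
    return False


suffix_attack = ["do", "HARM", "now", "<adv>", "<adv>"]
distributed_attack = ["do", "<adv>", "HARM", "now", "<adv>"]

print(erase_and_check_suffix(suffix_attack, 2, toy_filter_is_harmful))
print(erase_and_check_suffix(distributed_attack, 2, toy_filter_is_harmful))
```

Erasing the tail exposes the suffix attack but never removes the adversarial token sitting earlier in the prompt. Real filters and real attacks are far messier, so this illustrates the failure shape without reproducing the paper's experiment.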

The paper also reports a subtle evaluation issue: defended settings can sometimes show higher manually verified ASR than undefended ones. The explanation offered is that, without a defense, the early GPT-4 check may stop optimization on marginally harmful outputs, whereas a defense blocks those weaker outputs and lets optimization continue toward clearer harmful completions. Appendix L tries to clarify this by distinguishing manually verified ASR from GPT-4-judged ASR.

That matters because evaluation pipelines are not neutral measuring devices. They shape the reported result. A serious reader should not just ask “what is the ASR?” The better question is “what process decided that the attack had succeeded, and when did optimization stop?”
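One practical consequence is to store both verdicts, plus the stopping point, for every attack attempt. The record layout below is a hypothetical sketch with invented data, not the paper's evaluation code.

```python
from dataclasses import dataclass


@dataclass
class AttackRecord:
    behavior_id: str
    judge_success: bool    # verdict from the automated judge
    manual_success: bool   # verdict after human verification
    stop_iteration: int    # step at which optimization stopped


def asr(records, field):
    """Attack success rate (%) under the chosen verdict field."""
    if not records:
        return 0.0
    return 100.0 * sum(getattr(r, field) for r in records) / len(records)


# MADE-UP records for illustration only.
records = [
    AttackRecord("b1", judge_success=True, manual_success=True, stop_iteration=18),
    AttackRecord("b2", judge_success=True, manual_success=False, stop_iteration=9),
    AttackRecord("b3", judge_success=False, manual_success=False, stop_iteration=500),
    AttackRecord("b4", judge_success=True, manual_success=True, stop_iteration=42),
]

print("judge-only ASR:", asr(records, "judge_success"))
print("manually verified ASR:", asr(records, "manual_success"))
```

Keeping `stop_iteration` next to both verdicts makes it possible to ask later whether early stopping, rather than the attack itself, explains a gap between the two numbers.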

Security evaluation, once again, refuses to be a clean spreadsheet. Rude but predictable.

Universal SlotGCG is an extension, not the main thesis

The paper also extends SlotGCG into a universal setting. Instead of optimizing a separate slot allocation for each behavior, Universal SlotGCG aggregates slot vulnerability across multiple prompts, maps global slot positions back into behavior-specific prompt lengths, and optimizes a universal adversarial token sequence.

This part should be treated as an exploratory extension. It tests whether the slot idea can transfer across behaviors and models, but it is not the main evidence for the paper’s mechanism.
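The mapping step can be pictured as rescaling. The sketch below assumes global slots are stored as relative positions between 0 and 1 and are mapped back proportionally to each prompt's length. That proportional rule and both function names are assumptions made for illustration, since the paper's exact mapping is not reproduced here.

```python
def map_relative_slot(relative_position, prompt_length):
    """Map a relative position in [0, 1] to a slot index in 0..prompt_length."""
    if not 0.0 <= relative_position <= 1.0:
        raise ValueError("relative_position must lie in [0, 1]")
    if prompt_length < 0:
        raise ValueError("prompt_length must be non-negative")
    # Round half up, then clamp to the last valid slot.
    return min(prompt_length, int(relative_position * prompt_length + 0.5))


def map_global_slots(relative_positions, prompt_length):
    """Map several global positions; positions that collide are merged."""
    return sorted({map_relative_slot(r, prompt_length) for r in relative_positions})


global_slots = [0.0, 0.5, 1.0]             # MADE-UP global slot positions
print(map_global_slots(global_slots, 7))   # a short prompt
print(map_global_slots(global_slots, 12))  # a longer prompt
```

Short prompts can collapse several global positions into one slot, which is one plausible reason universal allocation is harder than per-prompt allocation.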

The results are mixed in a way that is actually informative. On a 388-behavior transfer set, Universal SlotGCG improves average ASR over baseline universal attacks across the reported method families. For GCG, the average rises from 14.90% to 24.73%. For AttnGCG, from 18.20% to 24.68%. For I-GCG, from 18.92% to 24.83%. For GCG-Hij, from 20.62% to 22.38%.

But the cross-model picture is uneven. GPT-3.5-turbo shows a large increase for GCG, from 3.09% to 50.77%. GPT-4o remains low: GCG rises from 0.00% to 1.80%, and other methods remain around or below 1.55%. Gemini 2.0 Flash and Gemini 2.5 Pro also remain low in most cells, though GCG-Hij + SlotGCG reaches 6.70% on Gemini 2.5 Pro. On Vicuna, which is used during optimization, some methods improve and others decline.

So the correct sentence is not “Universal SlotGCG transfers broadly across all models.” It is: slot-aware universal optimization can improve transfer in some settings, but closed frontier-style models remain much harder targets in these experiments.

That boundary is important for business interpretation. A technique that increases transferability in open-model or older API settings does not automatically imply the same level of risk against every production-grade closed safety stack. It does, however, suggest that position is a transfer-relevant feature, not merely a prompt-specific curiosity.

The appendix tests robustness, not a second thesis

The appendices are worth reading because they prevent several overreadings of the main result.

Test or appendix item	Likely purpose	What it supports	What it does not prove
Prompt-level VSS and loss plots in Appendix G	Mechanism support	High initial VSS positions tend to remain influential and align with lower loss after optimization.	That VSS perfectly predicts vulnerable slots for all prompts or models.
Attack and defense configuration details in Appendix H	Implementation detail	The reported comparisons use a defined evaluation stack, hardware setup, and defense configuration.	That costs and outcomes transfer unchanged to every enterprise deployment.
Universal SlotGCG in Appendix K	Exploratory transfer extension	Slot-aware optimization can be adapted to multi-behavior universal attacks.	That universal slot attacks are consistently strong across all closed models.
GPT-4 versus manual ASR comparison in Appendix L	Evaluation clarification	Reported ASR depends on whether final manual verification is included.	That GPT-based judging alone is sufficient for safety evaluation.
Seed sensitivity in Appendix M.1	Robustness test	VSS-based token allocation is deterministic from attention; variation comes from stochastic GCG optimization.	That all random seeds produce identical attack outcomes.
Temperature and layer-selection studies in Appendix M.2–M.3	Sensitivity and design validation	Moderate temperature and upper-half-layer attention are empirically supported design choices.	That these hyperparameters are universally optimal.
Output distribution shift in Appendix N	Mechanism extension	VSS-based allocation perturbs first-token output distributions more than suffix allocation in the tested Llama-2-7B setting.	That distribution shift alone fully explains jailbreak success.

Appendix N is particularly useful because it reframes slot vulnerability as output-distribution influence. The authors compare inserting 20 random tokens according to VSS allocation against appending the same number of random tokens at the suffix. On Llama-2-7B over 100 trials, VSS-based allocation produces a larger L2 distance, lower cosine similarity, higher KL divergence, and higher top-1 change rate in the first-token distribution than suffix placement.

That supports the paper’s mechanism: vulnerable slots are not just “places where attacks happen to work.” They are positions where token insertion changes the model’s output distribution more strongly.
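The four metrics are standard and simple to compute once two first-token distributions are available. The sketch below uses invented three-token distributions. The KL direction (baseline relative to perturbed) is an assumption, since this article does not say which direction the authors used.

```python
import math


def l2_distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cosine_similarity(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); eps guards against log(0) in this toy setting."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)


def top1_changed(p, q):
    """True if the most likely token differs between the two distributions."""
    return p.index(max(p)) != q.index(max(q))


# MADE-UP first-token distributions over a three-token vocabulary.
base = [0.70, 0.20, 0.10]          # no inserted tokens
after_suffix = [0.65, 0.25, 0.10]  # random tokens appended at the suffix
after_vss = [0.30, 0.60, 0.10]     # random tokens placed by a slot map

for name, dist in [("suffix", after_suffix), ("vss", after_vss)]:
    print(name, l2_distance(base, dist), cosine_similarity(base, dist),
          kl_divergence(base, dist), top1_changed(base, dist))
```

In the paper's version, each metric is averaged over repeated trials with different random tokens. A top-1 change rate is then the fraction of trials in which `top1_changed` is true.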

Still, it is one model, one analysis design, and one first-token distribution measurement. Useful? Yes. A universal theory of all prompt vulnerability? No. Let us not make the appendix do unpaid overtime.

What enterprises should actually take from this

The direct research result is about jailbreak attacks. The business implication is about evaluation design.

Enterprises deploying LLM systems often think about prompt security in layers: input filters, system prompts, guardrails, output moderation, tool permissions, retrieval controls, audit logs, and human escalation. That is sensible. But SlotGCG suggests a specific blind spot: red-team prompts should be varied by position, not only by wording.

This matters most in systems where the user input is not simply passed to the model as a plain message. Many enterprise systems assemble prompts from multiple components:

system instructions;
policy reminders;
retrieved documents;
user requests;
tool outputs;
memory snippets;
formatting templates;
hidden metadata;
few-shot examples;
agent scratchpads or planning traces.

Every boundary between these components can become a slot-like region. The paper studies token slots inside prompts, not enterprise prompt pipelines directly. But the inference is reasonable: if position affects vulnerability at the token level, then component placement in larger prompt assemblies deserves security attention too.
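For an assembled prompt, the candidate regions are easy to list. The sketch below is an extrapolation from the paper rather than something it tests: it takes an ordered list of hypothetical prompt components and reports the token offset of each boundary between them.

```python
def component_boundaries(components):
    """Given ordered (name, tokens) pairs, return (slot_index, left, right)
    for the slot before the first component, between each pair, and after the last."""
    names = [name for name, _ in components]
    boundaries = []
    offset = 0
    for i, (name, tokens) in enumerate(components):
        left = names[i - 1] if i > 0 else "<start>"
        boundaries.append((offset, left, name))
        offset += len(tokens)
    boundaries.append((offset, names[-1] if names else "<start>", "<end>"))
    return boundaries


# Toy assembled prompt; component names and contents are illustrative.
assembled = [
    ("system", ["You", "are", "helpful"]),
    ("retrieved", ["Doc", "text"]),
    ("user", ["Question"]),
]

for slot, left, right in component_boundaries(assembled):
    print(f"slot {slot}: between {left} and {right}")
```

A position-aware test suite could start from these boundary slots before spending budget on every interior position.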

Here is the practical split:

Paper directly shows	Cognaptus inference for business use	What remains uncertain
Vulnerable insertion slots exist beyond suffix positions in tested prompts and models.	Red-team suites should include position-aware adversarial testing across prompt components, not just suffix attacks.	The exact vulnerable regions will vary by model, template, application, and guardrail architecture.
VSS based on attention helps identify influential insertion positions.	Internal model observability, when available, can support better diagnostic testing than black-box string scanning alone.	Closed models may not expose attention, requiring proxy methods or external fuzzing.
SlotGCG improves ASR and reduces optimization steps across many tested open-model settings.	A system that passes suffix-only red teaming may still be fragile under more efficient position-aware attacks.	The magnitude of risk in production depends on model provider defenses, tooling constraints, and attack surface.
Distributed adversarial tokens can survive some perturbation-style defenses better than suffix-only attacks.	Defenses should avoid assuming attacks are localized or easy to erase from prompt endings.	Stronger model-level or policy-level defenses may change the result substantially.
Universal SlotGCG has some transfer gains but mixed closed-model performance.	Transfer risk exists, but should be tested rather than assumed.	Results against GPT-4o and Gemini models remain low in the reported table.

The ROI relevance is not “buy more security tools.” That is the marketing department’s reflex, and we are not feeding it today.

The ROI relevance is cheaper diagnosis. If position-aware testing finds failures faster, enterprises can spend less time pretending a narrow red-team suite represents the real attack surface. The value is not only blocking one jailbreak pattern. It is learning where the prompt architecture is brittle.

The boundary: this is not a universal map of every LLM system

SlotGCG is a strong paper because it isolates a real mechanism. But its practical use has boundaries.

First, the strongest evidence comes from optimization-based attacks where the evaluator can access model internals or run extensive optimization on open models. Many enterprise deployments use closed models where attention weights and gradients are unavailable. The position effect may still matter, but the method for finding it must change.

Second, the dataset is AdvBench-style harmful behavior testing. That is appropriate for jailbreak research, but enterprise misuse includes broader failure modes: data exfiltration, tool misuse, policy bypass, role confusion, retrieval contamination, and multi-turn manipulation. Slot vulnerability may interact with these, but the paper does not test all of them.

Third, the paper’s defense results are highly dependent on defense type. It shows large advantages against some prompt-level perturbation or erasure defenses, but much smaller gains against others. The Perplexity Filter result in the main defense table is a reminder that not all defenses fail in the same way.

Fourth, universal transfer is promising but not sweeping. The reported closed-model ASR values for GPT-4o and Gemini models remain low in most settings. Anyone converting this into “universal attacks now transfer easily to every model” should be sentenced to reading evaluation tables aloud until accuracy improves.

Finally, the paper is about exposing vulnerabilities, not solving alignment. A better map of weak slots helps red teams. It also helps attackers. That dual-use tension is real. The responsible enterprise response is not to hide from the result, but to incorporate it into controlled evaluation before uncontrolled actors do the same thing with less paperwork.

The prompt is an object, not a sentence

The enduring lesson from SlotGCG is that a prompt should be treated as an object with structure.

It has boundaries. It has slots. It has template artifacts. It has positions that matter more than others. Some positions distort the model’s behavior more efficiently. Some defenses assume the attack is concentrated where old methods used to place it. That assumption is now visibly weaker.

For technical teams, the next step is to make red-team harnesses position-aware. Test insertions around prompt-component boundaries. Compare suffix-only attacks with distributed attacks. Track not just whether an attack succeeds, but how many optimization steps it requires. Separate manually verified success from judge-only success. Record which prompt regions repeatedly produce failures.
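A harness does not need much machinery to track the last two items. The sketch below groups trial results by prompt region and reports how many failures were found and how quickly. The field names and the data are hypothetical.

```python
from collections import defaultdict
from statistics import median


def summarize_by_region(trials):
    """Group trials by prompt region and summarize attack outcomes.

    Each trial is a dict with keys 'region', 'success', and 'steps'.
    """
    grouped = defaultdict(list)
    for trial in trials:
        grouped[trial["region"]].append(trial)

    summary = {}
    for region, items in grouped.items():
        steps_when_successful = [t["steps"] for t in items if t["success"]]
        summary[region] = {
            "trials": len(items),
            "failures_found": len(steps_when_successful),
            "median_steps_to_success": (
                median(steps_when_successful) if steps_when_successful else None
            ),
        }
    return summary


# MADE-UP trial log for illustration only.
trials = [
    {"region": "suffix", "success": True, "steps": 140},
    {"region": "suffix", "success": False, "steps": 500},
    {"region": "system/user boundary", "success": True, "steps": 22},
    {"region": "system/user boundary", "success": True, "steps": 30},
]

for region, stats in summarize_by_region(trials).items():
    print(region, stats)
```

Regions that keep producing failures in few steps are the ones worth hardening first.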

For managers, the lesson is even simpler: do not let a passing suffix test become a security certificate. It is a useful test. It is not the whole attack surface.

The paper’s best contribution is not that SlotGCG is another stronger jailbreak method. The field has enough stronger jailbreak methods to open a small stationery shop. Its best contribution is that it changes the unit of analysis. The adversarial token is not the whole story. The vulnerable position is part of the story.

And in prompt security, as in office politics, position often matters more than the words themselves.

Cognaptus: Automate the Present, Incubate the Future.

¹ Seungwon Jeong, Jiwoo Jeong, Hyeonjin Kim, Yunseok Lee, and Woojin Lee, “SlotGCG: Exploiting the Positional Vulnerability in LLMs for Jailbreak Attacks,” arXiv:2606.05609v1, June 4, 2026, https://arxiv.org/abs/2606.05609.
