A customer asks an AI assistant a question. The assistant begins answering, continues answering, wanders into repetition, and eventually reaches the maximum output limit.
Nobody stole a password. No prohibited content appeared. The model may even have remained grammatically competent throughout the ordeal.
It simply consumed far more computation than the request deserved.
For an individual user, this is an irritatingly long response. For a shared AI service handling thousands of requests, it is a resource-allocation problem. A prompt that reliably keeps a model generating can occupy inference capacity, increase latency for other users, and inflate the cost of serving traffic.
The paper Prompt-Induced Over-Generation as Denial-of-Service: A Black-Box Attack-Side Benchmark treats this failure to stop as a measurable attack surface.1 Its central contribution is not merely another clever prompt that makes a model babble. It constructs a common benchmark for comparing several levels of attacker sophistication under a query-only threat model.
That comparison matters. Ordinary prompts, explicit requests for endless text, evolutionary prompt search, and a trained reinforcement-learning attacker do not produce the same level of risk. Nor do different models respond similarly to the same attack method.
The useful question is therefore not whether an LLM can be made to talk too much. We already knew that.
The useful question is how much attacker effort is required, which models are most susceptible, and which operational controls can prevent a long answer from becoming an expensive one.
Denial-of-service does not always arrive as a traffic spike
The familiar denial-of-service story is volumetric: an attacker sends enough requests to overwhelm a service.
Prompt-induced over-generation changes the arithmetic. Instead of only increasing the number of requests, the attacker increases the cost of each request. A small input can cause a disproportionately large output, forcing the provider to perform many more sequential decoding steps than the request would normally require.
The attack targets the model’s stopping behaviour. During generation, a language model continues producing tokens until it emits an end-of-sequence token or reaches an externally imposed generation cap. If a prompt reduces the likelihood of timely termination, the model may continue until the cap intervenes.
This is not identical to demonstrating a service-wide outage. The paper does not test concurrent traffic, queue collapse, cloud billing losses, or the protections used by commercial API providers. It studies the lower-level condition that makes such operational harm possible: whether crafted prompts can systematically delay termination under a controlled generation interface.
That distinction is important. A benchmark can establish a resource-exhaustion primitive without proving that every deployed service can be taken offline with it.
The paper calls this primitive stopping-time vulnerability.
A benchmark turns “the model talked too long” into comparable evidence
Response length alone is difficult to compare across models. Generating 4,000 tokens is extreme for a model with a 4,000-token nominal context window, but less informative for a model designed around a much larger window.
The authors therefore define the Over-Generation Factor, or OGF:
where $L$ is the number of newly generated tokens and $C$ is the victim model’s nominal context window.
An OGF of 1 means the model generated output equal in length to one nominal context window. An OGF of 2 means twice that amount. The metric does not convert output directly into dollars, GPU energy, or lost throughput. It provides a normalized measure of how far generation continued relative to the model’s stated context scale.
The benchmark supplements OGF with several diagnostics:
| Metric | What it measures | Operational interpretation |
|---|---|---|
| Success@OGF ≥ $k$ | Fraction of trials reaching at least $k$ times the nominal context window | Frequency of severe length amplification |
| Stall rate | Fraction of trials hitting the generation cap without emitting EOS | How often an external limit, rather than the model, terminates generation |
| Latency | Wall-clock duration of victim-model generation | Direct evidence of slower request completion |
| Tail persistence | Repetition or lack of novelty near the end of output | Whether long generation degrades into persistent looping |
Together, these metrics separate three behaviours that are often carelessly grouped together.
A model may generate a long response and still stop by itself. It may reach a high OGF only occasionally because sampling is stochastic. Or it may repeatedly hit the configured cap without emitting EOS, creating a more dependable resource-consumption mechanism.
The last case is the most operationally uncomfortable. It means the system’s safety boundary is not the model deciding to stop. It is the infrastructure cutting the model off.
Four prompt sources reveal four different levels of attacker effort
The paper’s comparison is easier to understand as an escalation ladder.
At the bottom are ordinary or naive prompts. Above them are short random prefixes and handcrafted requests for repetition or endless text. EOGen then searches systematically for effective short prefixes. Finally, RL-GOAL learns a policy for constructing prefixes that target long continuations.
| Attacker level | Construction method | Information required | Main question tested |
|---|---|---|---|
| Ordinary baseline | No attacker prefix or standard instruction | None | How long does the model naturally continue? |
| Naive adversarial prompt | Repetition requests, “infinite babble,” or random short words | Basic prompt access | Is obvious prompting sufficient? |
| EOGen | Evolutionary search over short word-like token sequences | Query access and known tokenizer | Can inexpensive automated search find stronger prefixes? |
| RL-GOAL | Goal-conditioned reinforcement-learning policy | Many training-time victim queries and known tokenizer | Does learned prompt construction produce a stronger, more transferable attack? |
Both automated methods operate without victim-model gradients, hidden states, or logits. The attacker supplies prefixes, observes the generated continuation, and adjusts future prompts based on the result.
This is black-box in the sense that the model’s internal parameters remain inaccessible. It is not a fully opaque setting: the tokenizer is assumed to be known, and the experiments use direct access to a text-generation interface with controlled decoding settings.
The distinction keeps the threat model credible without making it theatrical. Many open models publish their tokenizers, while commercial services usually place additional layers between a user and the underlying generation call.
Naive prompts create noise; EOGen finds a signal
EOGen searches for prefixes containing three to seven word-like tokens. Candidate prompts are evolved through selection, crossover, and mutation. Prompts that induce longer continuations receive higher fitness, while prompt length and early EOS emission are penalized.
Restricting the search to English-like tokens serves two purposes. It keeps the search space manageable, and it avoids relying entirely on visibly synthetic token fragments. The resulting prompts are not necessarily natural requests, but they are closer to readable text than arbitrary vocabulary IDs.
The main EOGen results compare the discovered prefixes with repeat-style prompts, an explicit infinite-babble instruction, random short prefixes, and WizardLM-style instructions. Each prompt is evaluated repeatedly under stochastic decoding, using a generation budget equal to four times the victim model’s nominal context window.
On Phi-3-mini-4k-instruct, EOGen produces a mean OGF of 1.39 ± 1.14. It reaches OGF ≥ 2 in 25.2% of trials and OGF ≥ 4 in 6.8%.
The explicit infinite-babble baseline performs noticeably worse at the higher thresholds: its mean OGF is 0.96 ± 0.74, with 9.4% reaching OGF ≥ 2 and 0.5% reaching OGF ≥ 4.
Random short prefixes produce a mean OGF of 0.51 ± 0.72 on the same model, with 5.2% reaching OGF ≥ 2.
The cheap attack, then, is not simply telling the model to continue forever. Nor is extreme generation a routine consequence of attaching a few random words. Automated search identifies prefixes whose effects are materially stronger than those naive alternatives.
The pattern also appears across LLaMA-2-7B and DeepSeek-Coder-7B, although the magnitude is lower and individual metrics are not uniformly superior in every comparison cell. That variation is itself informative: the same attack family does not impose a fixed level of risk across models.
EOGen results across three victim models
| Victim model | EOGen mean OGF | EOGen Success@OGF ≥ 2 | EOGen Success@OGF ≥ 4 |
|---|---|---|---|
| Phi-3-mini-4k-instruct | 1.39 ± 1.14 | 25.2% | 6.8% |
| LLaMA-2-7B-HF | 0.47 ± 0.68 | 4.5% | 0.7% |
| DeepSeek-Coder-7B-Base-v1.5 | 0.49 ± 0.87 | 7.7% | 1.6% |
Phi-3 is clearly the most vulnerable of the three under this EOGen configuration. LLaMA-2 and DeepSeek-Coder are not invulnerable, but their aggregate severity is substantially lower.
That does not prove that Phi-3’s architecture is inherently less secure. The models differ in training, alignment, context configuration, tokenizer, and intended use. The benchmark identifies the outcome; it does not isolate which design choice caused it.
Appending an attack to a normal instruction changes the result
EOGen’s strongest results come from using the discovered sequence as the full input. The paper also evaluates an EOGen-suffix condition, where the adversarial sequence is appended to a WizardLM instruction.
This is an important test because many real requests contain meaningful user content before any adversarial addition. An attack that works only as a standalone prefix may be easier to filter and less representative of deployed chat interactions.
The suffix condition produces mixed results.
On Phi-3, mean OGF falls from 1.39 for direct EOGen prompts to 0.74 for EOGen-suffix. Success@OGF ≥ 2 falls from 25.2% to 8.2%.
On LLaMA-2, however, the suffix version raises mean OGF from 0.47 to 0.58 and raises Success@OGF ≥ 1 from 17.4% to 25.4%, even while its more severe OGF ≥ 2 and OGF ≥ 4 rates decline.
The lesson is not that suffix attacks always transfer cleanly into realistic instructions. They do not. The lesson is that prompt placement and surrounding context change termination behaviour in model-specific ways.
For an operator, this complicates testing. Evaluating only isolated adversarial strings can exaggerate some risks and miss others. A useful red-team suite should test attack material as standalone input, as a suffix to benign requests, and within the actual chat template used in production.
Stalling is heavy-tailed, so averages hide the prompts worth finding
EOGen does not make every prompt equally dangerous.
The paper’s prompt-level analysis shows a heavy-tailed distribution: a relatively small subset of prompts produces a disproportionate share of cap-hitting behaviour. Many discovered prefixes generate modest effects, while a narrower tail repeatedly drives the model toward non-termination.
This matters for both attackers and defenders.
For an attacker, the relevant outcome is not the average candidate produced during search. It is the small set of reliable high-severity prompts that can be retained and reused.
For a defender, an average response-length dashboard may look acceptable while a few prompt patterns create unusually expensive requests. Monitoring only the mean is a fine way to discover that the server is healthy on average while a queue is quietly catching fire.
The paper also notes that a per-prompt maximum over repeated trials is an optimistic statistic. A prompt that produces one extreme continuation out of ten attempts is different from a prompt that stalls consistently. That is why the stall-rate distribution and repeated stochastic evaluation are necessary.
RL-GOAL converts prompt search into a trained capability
EOGen treats each prompt as a candidate to be evolved. RL-GOAL takes a more expensive route: it trains a compact transformer policy to construct attacker prefixes conditioned on a desired continuation length.
The victim model is not fine-tuned. Instead, the attacker learns which token choices tend to push the victim toward a target output length. Training uses Proximal Policy Optimization, a goal curriculum, replay, and hindsight goal relabelling.
The curriculum begins with more attainable continuation targets and gradually expands toward longer outputs. This reduces wasted computation early in training, when asking the attacker to reach the maximum generation budget would provide little useful learning signal.
The resulting attacker is much more expensive to develop than EOGen. It requires many victim-model rollouts, long decoding budgets, and a dedicated training process. But the evaluation shows what that extra investment buys.
RL-GOAL is trained using LLaMA-2-7B as the victim. At evaluation time, its generated prefixes are also tested against other models by decoding them with the LLaMA tokenizer and re-encoding them with each victim’s tokenizer.
The comparison therefore tests both attack strength and partial cross-model transfer.
RL-GOAL versus no attacker prefix
| Victim model | Mean OGF: RL-GOAL | Mean OGF: no prefix | Success@OGF ≥ 2: RL-GOAL | Success@OGF ≥ 2: no prefix | Stall: RL-GOAL | Stall: no prefix |
|---|---|---|---|---|---|---|
| LLaMA-2-7B-HF | 2.04 ± 1.37 | 0.36 ± 0.58 | 36.0% | 3.1% | 28.7% | 0.2% |
| LLaMA-2-13B-Chat-HF | 0.50 ± 0.51 | 0.09 ± 0.13 | 0.1% | 0.0% | 0.0% | 0.0% |
| Phi-3-mini-4k-instruct | 2.70 ± 1.43 | 0.66 ± 0.82 | 64.3% | 7.8% | 46.0% | 2.3% |
| EleutherAI/Pythia-6.9B | 1.52 ± 1.28 | 0.82 ± 1.54 | 13.6% | 10.9% | 0.3% | 1.8% |
The most striking result is Phi-3. RL-GOAL reaches OGF ≥ 2 in 64.3% of trials and stalls in 46.0%. Its mean continuation length is 11,061 ± 5,855 tokens, compared with 2,687 ± 3,343 tokens for the no-prefix baseline.
LLaMA-2-7B, the training victim, also shows substantial amplification: 36.0% of trials reach OGF ≥ 2, while 28.7% stall.
LLaMA-2-13B-Chat behaves very differently. RL-GOAL raises the share of outputs reaching OGF ≥ 1 from 0.5% to 35.2%, but almost none reach OGF ≥ 2, none reach OGF ≥ 4, and none stall.
Pythia falls between the two extremes. It shows moderate over-generation, but the trained attack does not increase stalls above the no-prefix baseline.
This is the paper’s most useful operational finding. Attack sophistication matters, but model choice matters just as much. A trained attacker that is highly effective against one victim may transfer strongly to another, weakly to a third, and barely at all to a fourth.
A single “LLM vulnerable” label is therefore almost useless. Stopping-time resilience must be measured for the exact model, decoding configuration, and prompt wrapper being deployed.
The random-policy ablation shows that learning, not prefix length, drives RL-GOAL
A long or unusual prefix can change model behaviour even without deliberate optimization. The paper therefore replaces the trained RL-GOAL policy with a uniform-random policy while preserving the same evaluation interface and prefix budget.
This is an ablation, not a second main experiment. Its purpose is to isolate whether RL-GOAL succeeds because it has learned useful token preferences and goal conditioning, or merely because it adds extra text before the victim generates.
On LLaMA-2-7B, the trained policy reaches OGF ≥ 2 in 36.0% of trials and stalls in 28.7%.
The uniform-random policy reaches OGF ≥ 2 in only 3.4% of trials and stalls in 0.4%.
That difference is substantial. RL-GOAL is not simply consuming prompt space or perturbing the model with arbitrary tokens. The learned policy discovers prefixes that reliably change stopping behaviour.
For defenders, the implication is slightly unpleasant: filters designed around obvious repetition or random-looking strings address only the cheapest attack tier. Once prompt construction is optimized, the relevant patterns may not resemble the examples defenders initially imagined.
The EOGen ablations explain why a narrower search can be stronger
The EOGen ablations test two design choices: limiting candidate tokens to word-like English tokens and penalizing prompt length.
Removing the word-like filter might appear to give the attacker more freedom. It opens almost the entire vocabulary for mutation, excluding only EOS and padding tokens. Yet the broader search performs worse under the limited query budget.
In the all-token ablation on LLaMA-2-7B, mean OGF remains below 1, while Success@OGF ≥ 2 is roughly 3%.
The likely mechanism is search efficiency. A broader space contains more possible attacks, but it is also harder for a budget-constrained evolutionary algorithm to navigate. The linguistic filter acts as an inductive bias, concentrating search on a smaller region where useful prefixes are easier to discover.
Removing the prompt-length penalty also weakens performance. Even though prompts remain constrained to three to seven tokens, the penalty appears to help stabilize the search and discourage degenerate candidates.
These are not robustness tests showing the attack works under every possible configuration. They are component ablations explaining why the proposed EOGen setup performs as it does.
| Appendix test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| All-token EOGen search | Ablation of the linguistic search restriction | A narrower word-like space improves search efficiency under the tested budget | Word-like prompts are always the strongest possible attacks |
| EOGen without length regularization | Ablation of the fitness function | Penalizing prompt length helps stabilize the tested search | Longer prompts are generally safer |
| Uniform-random RL policy | Ablation of learned behaviour | RL-GOAL’s gains come from learned token preferences and goal conditioning | The learned policy is optimal or stealthy |
| EOGen-suffix evaluation | Variant test of prompt placement | Surrounding benign context changes attack severity | The attack transfers unchanged to production chat systems |
| Cross-model RL-GOAL evaluation | Exploratory transfer test | A policy trained on one victim can affect other models unevenly | Vulnerability differences are caused by any single architectural feature |
Latency makes the resource cost visible
Output length is an attack proxy. Latency is the more immediate operational consequence.
For RL-GOAL, the reported median generation time on LLaMA-2-7B rises from 11.92 seconds with no prefix to 135.85 seconds with the learned attack. Using the paper’s medians, that is roughly an 11.4-fold increase.
On Phi-3, median latency rises from 24.78 seconds to 267.88 seconds, or roughly 10.8 times.
Pythia’s median rises from 9.21 seconds to 67.86 seconds, while LLaMA-2-13B-Chat shows a smaller increase from 6.44 seconds to 14.31 seconds.
These measurements are not universal throughput estimates. Latency depends on hardware, batching, implementation, decoding settings, and concurrent load. The experiments measure victim-side wall-clock generation under the authors’ evaluation setup.
Still, the relationship is operationally clear. Longer continuations occupy inference resources for longer periods. A request that remains active for several minutes instead of several seconds consumes capacity that could otherwise serve other users.
The business risk is not limited to token-based billing. It also appears in queueing delays, reduced throughput, timeout handling, worker availability, and degraded service-level performance.
What the paper directly shows, and what businesses should infer
The paper directly shows that:
- Prompt-only, query-only attackers can systematically increase continuation length under fixed decoding interfaces.
- Evolutionary search finds short word-like prefixes that generally outperform naive prompt families.
- A trained goal-conditioned attacker can produce much stronger over-generation on susceptible models.
- Attack effectiveness varies sharply across victim models.
- Severe over-generation often corresponds with large latency increases and, for some models, frequent cap-hitting without EOS.
- Search restrictions and learned token preferences materially affect attack performance.
From those results, Cognaptus infers that operators should treat termination behaviour as a first-class abuse surface.
That means model evaluation should not end with answer correctness, toxicity rates, jailbreak resistance, or average latency under benign prompts. A model can remain accurate and policy-compliant while consuming excessive resources.
A practical stopping-time red-team programme should compare candidate models under the same decoding policy and record at least:
- output-length distributions;
- OGF or an equivalent normalized length measure;
- cap-hit and EOS rates;
- latency amplification relative to benign prompts;
- repeated-trial consistency for suspicious prompts;
- behaviour when adversarial content is embedded in normal requests.
The goal is not to recreate the paper’s exact attack infrastructure in every company. It is to stop assuming that setting a maximum output limit completes the security design.
A hard limit prevents infinite generation. It does not prevent an attacker from repeatedly forcing the service to consume the entire allowance.
Deployment controls should limit cost, not merely recognize bad words
Static prompt filters are attractive because they are simple. They may catch requests that explicitly demand endless repetition or contain known attack strings.
The comparison in this paper shows why that is insufficient. The strongest attacks are discovered or learned, and the heavy-tailed results suggest that a small number of unusual prompts may account for much of the risk.
More useful controls operate across several layers.
Before generation: assign a request budget
Providers can estimate how much output a request should reasonably require based on task type, user history, plan tier, and recent behaviour. A request for a one-line classification should not automatically receive the same generation allowance as a long-form report.
This is a business-policy inference, not something directly tested in the paper. Its value follows from the demonstrated asymmetry between short inputs and long outputs.
During generation: monitor termination behaviour
A system can track output length, repetition, token-generation time, and the probability or absence of stopping signals. When generation enters an anomalous regime, it can reduce the remaining budget, apply a stop rule, or terminate the request.
The paper does not evaluate specific defensive algorithms. It establishes the metrics such systems would need to monitor.
Across requests: detect repeated cost amplification
A single long response may be legitimate. Repeated requests that consistently drive generation toward the cap are more suspicious.
Adaptive throttling should therefore consider consumed inference resources, not only request counts. Ten inexpensive requests and ten cap-hitting requests should not be treated as equivalent traffic.
During model selection: test the exact deployment configuration
The contrast between Phi-3, LLaMA-2-7B, LLaMA-2-13B-Chat, and Pythia shows that termination resilience cannot be inferred reliably from parameter count or model family name.
Evaluation should use the production prompt template, stop sequences, repetition penalties, sampling controls, and output caps. Changing any of these may change the result.
The paper benchmarks a vulnerability primitive, not a finished outage
The findings are significant, but their scope is specific.
First, the experiments use a limited set of relatively small open-source models. They do not establish the vulnerability rate of frontier hosted models or long-context systems.
Second, the threat model assumes consistent access to a generation interface and a known tokenizer. Commercial APIs may impose rate limits, filtering, hidden system prompts, dynamic throttling, stop sequences, or provider-side model changes. Those controls may weaken attacks, change their economics, or reduce reproducibility.
Third, the attack objectives prioritize continuation length rather than stealth, naturalness, or semantic plausibility. Some discovered prefixes may therefore be easier to identify than an attacker optimized to resemble normal user traffic.
Fourth, RL-GOAL is expensive to train. Its stronger results do not imply that every attacker can immediately deploy it profitably. They demonstrate what becomes possible when an attacker is willing to spend more queries and computation during attack development.
Finally, the paper does not simulate a shared production service under load. It measures per-query stopping behaviour and latency. Turning those effects into a denial-of-service outage would depend on system capacity, batching, concurrency, admission controls, and the economics of repeated requests.
These boundaries do not make the benchmark unimportant. They clarify what it provides: a controlled way to identify whether the model layer contains a resource-amplification weakness before the surrounding infrastructure is forced to discover it in production.
The real security property is not EOS; it is bounded resource consumption
It is tempting to describe the problem as an EOS failure. The model does not emit its stopping token, so the obvious solution is to improve termination.
But a model can impose excessive cost even if it eventually stops. Conversely, infrastructure can remain resilient even when a model would happily generate forever, provided the service assigns and enforces sensible resource budgets.
The business-level security property is therefore broader:
An untrusted user should not be able to obtain disproportionate inference resources through a cheap, repeatable prompt strategy.
The paper contributes a useful attack-side benchmark for measuring one part of that property. EOGen shows that relatively inexpensive automated search can uncover weaknesses that naive prompts miss. RL-GOAL shows that a determined attacker can learn a considerably stronger prompt-construction policy. The cross-model results show that susceptibility is neither uniform nor easily guessed.
Most importantly, the comparison changes the defensive question.
The question is no longer, “Does the model sometimes generate an absurdly long answer?”
It is, “Under the strongest affordable prompt attack we can test, how often does this deployment consume its full resource allowance—and what happens to everyone else when it does?”
An AI service does not need to be persuaded to reveal secrets before it becomes vulnerable.
Sometimes it only needs to be persuaded not to stop.
Cognaptus: Automate the Present, Incubate the Future.
-
Manu Yi Guo et al., “Prompt-Induced Over-Generation as Denial-of-Service: A Black-Box Attack-Side Benchmark,” arXiv:2512.23779, https://arxiv.org/abs/2512.23779. ↩︎