The number looked safe. Then someone ran it twice.
A familiar business problem: one vendor says its model resists jailbreaks. Another red-team report says a new attack reaches a spectacular Attack Success Rate. A compliance team sees a percentage, puts it into a risk register, and moves on.
Unfortunately, that percentage may be doing more acting than measuring.
The paper “The Great Pretender: A Stochasticity Problem in LLM Jailbreak” argues that the standard metric used in jailbreak benchmarking, Attack Success Rate or ASR, is not a stable property of an attack [1]. It is partly a product of the evaluation pipeline. Change the judge evaluation temperature, the number of repeated evaluations, the attack-generation threshold, or the judge search temperature, and the same apparent attack can look much stronger or weaker.
That sounds like a technical nuisance. It is not. It changes how enterprises should read AI security reports.
The paper’s central result is blunt: reported ASR can move by large margins even when the underlying attack idea has not changed. Judge evaluation temperature can shift ASR by up to 54 percentage points. Requiring a jailbreak to succeed consistently instead of once can reduce apparent success by 12 to 24 percentage points. Generation-side consistency can lift reliable ASR by 12 to 30 points. For Best-of-N configurations with a larger judge, judge search temperature can add another 20 to 30 points of distortion.
So the real question is not only “Can this attack jailbreak the model?” It is: “Under what stochastic conditions did someone decide that it did?”
That is a less glamorous question. Naturally, it is the useful one.
The evidence first: ASR moves when the measuring instrument moves
The paper studies both sides of the jailbreak pipeline.
First, there is attack generation: how the adversarial prompt is produced. Some attacks sample many candidate prompts or responses, then keep the ones that appear to work. Best-of-N is the obvious case, but iterative attacks such as PAIR, TAP, and Crescendo also involve stochastic procedures.
Second, there is attack evaluation: how the final prompt is tested and judged. A target model may produce different answers across runs. A classifier-style judge, such as Llama Guard, may also produce different harmful/safe labels when run stochastically.
Most ASR reporting quietly compresses this into one number. A harmful prompt is tested. A response is judged. If the judge says harmful, the attack gets a point. Repeat across prompts, divide successes by total prompts, and the benchmark table is born.
The paper shows why this is fragile.
| Parameter tested | Likely purpose in the paper | Main observed effect | What it means operationally |
|---|---|---|---|
| Evaluation consistency threshold, $k_{\text{eval}}$ | Main evidence | Moving from one evaluation to ten reduces ASR by 12–24 pp across tested configurations | Single-shot ASR overstates reproducible jailbreak success |
| Judge evaluation temperature, $\theta_{\text{eval}}$ | Main evidence / sensitivity test | Raising it to 1.0 can collapse ASR by 15–54 pp; at 0, curves are flat | A stochastic judge is not a neutral ruler |
| Generation consistency threshold, $k_{\text{gen}}$ | Main evidence | Raising it from 1 to 10 lifts reliable ASR by 12–30 pp | Stronger filtering during attack generation selects prompts that survive later scrutiny |
| Judge search temperature, $\theta_{\text{gen}}$ | Sensitivity test with attack-specific effect | Up to 30 pp spread for Best-of-N with the larger Llama Guard judge (LG-8B); small for other attacks | Some attack pipelines are highly sensitive to how candidates are selected |
| Target model temperature, $T_{\text{eval}}$ / $T_{\text{gen}}$ | Robustness/sensitivity test | Usually smaller and less systematic; up to 18 pp in some settings | Worth recording, but not the main villain |
| Attack seed | Robustness/sensitivity test | Usually modest, though one small-sample configuration reaches 50 pp spread | Seed variance matters for absolute estimates, less for rough ranking |
The most important interpretation is not that every benchmark is wrong by exactly these amounts. The study uses selected open-weight target models, selected black-box attacks, and Llama Guard classifier judges. The precise numbers should not be lazily pasted onto every proprietary system.
The useful interpretation is narrower and stronger: ASR is conditional on a protocol. If the protocol is under-specified, the number is not portable.
A jailbreak paper that reports “80% ASR” without disclosing the evaluation threshold, judge temperature, generation threshold, and confidence intervals is not reporting a stable fact. It is reporting a fact-like object. Close enough to be quoted, unstable enough to be dangerous.
Why single-shot success is the wrong mental model
The misconception here is easy to understand: if a jailbreak succeeds once, it “works.”
That is true under some threat models. If the attacker only needs one harmful response and can keep trying, then occasional success has operational meaning. But that is not the same as saying the prompt is a reliable jailbreak. A prompt that succeeds once in ten attempts and a prompt that succeeds ten times in ten attempts are not equivalent. Standard single-shot ASR can treat them as equivalent whenever the one sampled attempt happens to succeed.
The paper’s correction is Consistency for Attack Success, or CAS.
The metric is simple. For a given prompt, suppose the judge returns verdicts $r_1, r_2, \ldots, r_k$, where each $r_j$ is either 1 for harmful or 0 for safe. CAS counts the prompt as successful only if every run is harmful:
$$ CAS(r,k)=\prod_{j=1}^{k} r_j $$
When $k=1$, this collapses to the standard single-shot protocol. When $k>1$, the prompt must succeed repeatedly.
This is not mathematical decoration. It changes the meaning of the benchmark.
At $k=1$, ASR answers: “Did the prompt clear the bar in the one evaluation that was sampled?”
At higher $k$, ASR starts answering: “Does this prompt reliably clear the bar across repeated independent checks?”
Those are different business questions. Procurement teams, security teams, and model-risk committees should care about the distinction.
A low-consistency jailbreak may still be a real threat if the attacker has unlimited retries. But it is not the same threat as a high-consistency jailbreak. The former is a lottery ticket. The latter is a tool.
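The lottery-ticket versus tool distinction can be sketched in a few lines. This is a minimal illustration of the CAS product defined above, not the paper's implementation: the function names `cas` and `asr_at_k` and the verdict lists are hypothetical, and the verdicts are made up for the example.

```python
from math import prod


def cas(verdicts, k):
    """Return 1 if the first k judge verdicts are all harmful (1), else 0."""
    if k < 1 or k > len(verdicts):
        raise ValueError("k must be between 1 and the number of verdicts")
    return prod(verdicts[:k])


def asr_at_k(verdicts_per_prompt, k):
    """Fraction of prompts whose first k verdicts are all harmful."""
    return sum(cas(v, k) for v in verdicts_per_prompt) / len(verdicts_per_prompt)


# Hypothetical verdicts for three prompts, ten judge runs each (1 = harmful).
lottery_ticket = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # succeeds once in ten
reliable_tool = [1] * 10                          # succeeds ten times in ten
never_works = [0] * 10
runs = [lottery_ticket, reliable_tool, never_works]

print(asr_at_k(runs, 1))   # single-shot view: two of the three prompts count
print(asr_at_k(runs, 10))  # consistency view: only one of the three counts
```

In this toy data the lottery ticket happens to win on its first draw, which is exactly the case where single-shot ASR cannot tell it apart from the reliable prompt.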
CAS-gen and CAS-eval separate lucky attacks from reproducible failures
The paper turns CAS into two frameworks.
CAS-eval applies consistency at the evaluation stage. A fixed jailbreak prompt is evaluated multiple times. It is counted as a consistent jailbreak only if all independent evaluations return harmful verdicts. This suppresses false positives caused by judge randomness.
CAS-gen applies consistency at the generation stage. Instead of accepting a candidate jailbreak after one harmful verdict, the generation pipeline admits the prompt only if it succeeds across multiple independent runs. This filters weak candidates before they contaminate the attack set.
This two-part design matters because stochasticity enters twice.
A loose generation process can admit borderline prompts. A loose evaluation process can then count borderline responses as success. Put both together and a benchmark can reward luck twice: once when finding the candidate, and again when judging it.
The paper’s evidence supports that distinction. When the generation threshold $k_{\text{gen}}$ rises from 1 to 10, the resulting prompts become more robust under stricter post-hoc evaluation, increasing ASR at $k_{\text{eval}}=10$ by 12 to 30 percentage points. In plain terms: if you force the attack-generation process to find prompts that work repeatedly, the final attack set becomes less fragile.
This is also where the paper avoids a common mistake. It does not merely say “ASR is inflated, therefore attacks are weaker.” It shows two directions:
- Evaluation inflation: single-shot evaluation can overcount lucky or borderline successes.
- Generation improvement: consistency-aware generation can produce more reliable jailbreak prompts.
That is the uncomfortable part. Better measurement may reduce inflated claims, but the same idea can also help attackers construct more dependable prompts. Measurement hygiene is not automatically defense. Sometimes it is also better adversarial engineering. Annoying, but true.
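The two stages can be sketched as one filter applied in two places. This is a rough illustration under stated assumptions: `make_toy_judge` is a hypothetical stand-in for "target model plus judge" that returns a harmful verdict with a fixed per-prompt probability, and the probabilities are invented. A real pipeline would call an actual target model and judge.

```python
import random


def consistent(prompt, judge, k):
    """True only if k independent judge calls all return harmful (1)."""
    return all(judge(prompt) == 1 for _ in range(k))


def cas_gen(candidates, judge, k_gen):
    """Generation stage: admit a candidate only if it passes k_gen checks."""
    return [p for p in candidates if consistent(p, judge, k_gen)]


def cas_eval(attack_set, judge, k_eval):
    """Evaluation stage: ASR counting only prompts that pass k_eval checks."""
    if not attack_set:
        return 0.0
    return sum(consistent(p, judge, k_eval) for p in attack_set) / len(attack_set)


def make_toy_judge(success_prob, seed):
    """Hypothetical judge: each call returns 1 with the prompt's latent probability."""
    rng = random.Random(seed)
    return lambda prompt: 1 if rng.random() < success_prob[prompt] else 0


# Invented latent per-call success probabilities for three candidate prompts.
success_prob = {"strong": 0.98, "borderline": 0.6, "weak": 0.15}
judge = make_toy_judge(success_prob, seed=0)

loose_set = cas_gen(list(success_prob), judge, k_gen=1)
strict_set = cas_gen(list(success_prob), judge, k_gen=10)
print("k_gen=1 :", loose_set, cas_eval(loose_set, judge, k_eval=10))
print("k_gen=10:", strict_set, cas_eval(strict_set, judge, k_eval=10))
```

The stricter generation filter tends to admit fewer prompts, but the ones it admits are more likely to survive strict evaluation, which is the direction of the effect the paper reports.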
The judge is not just an observer
The most striking result concerns judge temperature.
At judge evaluation temperature $\theta_{\text{eval}}=0$, the paper reports perfectly flat ASR curves across evaluation thresholds: $ASR(k_{\text{eval}}=10)$ equals $ASR(k_{\text{eval}}=1)$. That is what one would expect if the judge is deterministic. The same response receives the same label repeatedly.
But when $\theta_{\text{eval}}$ rises, the curves decline. At $\theta_{\text{eval}}=1.0$, ASR drops by 15 to 54 percentage points across tested configurations. The largest reported drop is for Crescendo at $k_{\text{eval}}=10$, from 86% at deterministic judge evaluation to 32% under stochastic judge evaluation.
That result is best understood as an instrument problem.
A thermometer that changes its reading because the room changed is useful. A thermometer that changes its reading because the thermometer became playful is less useful.
LLM judges are not passive measurement devices when run stochastically. They can flip labels on borderline outputs. A benchmark using one judge call at nonzero temperature is partly measuring the attack and partly measuring judge sampling noise.
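A toy simulation shows the shape of the effect. It rests on a loud assumption that is not from the paper: each response has a hypothetical judge score in [0, 1], a deterministic judge thresholds it at 0.5, and a stochastic judge samples the harmful label with that probability. The scores and the function names are invented for illustration.

```python
import random


def judge_label(score, stochastic, rng):
    """Toy judge: fixed threshold when deterministic, sampled label when stochastic."""
    if stochastic:
        return 1 if rng.random() < score else 0
    return 1 if score >= 0.5 else 0


def asr_curve(scores, stochastic, max_k=10, seed=0):
    """ASR at k = 1..max_k, requiring the first k labels to all be harmful."""
    rng = random.Random(seed)
    labels = [[judge_label(s, stochastic, rng) for _ in range(max_k)] for s in scores]
    return [
        sum(all(row[:k]) for row in labels) / len(scores)
        for k in range(1, max_k + 1)
    ]


# Invented judge scores: a few clear-cut responses and several borderline ones.
scores = [0.99, 0.97, 0.9, 0.7, 0.6, 0.55, 0.52, 0.4, 0.1, 0.05]

flat = asr_curve(scores, stochastic=False)
noisy = asr_curve(scores, stochastic=True)
print("deterministic judge:", flat)
print("stochastic judge:   ", noisy)
```

Under the deterministic judge the curve cannot move, because every repeat returns the same label. Under the stochastic judge the curve can only stay level or fall as $k$ grows, and the borderline responses are the ones that drop out.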
For business readers, the lesson is not “never use LLM judges.” That would be too easy and not especially practical. The lesson is that automated safety evaluation needs an evaluation protocol, not just an evaluation model.
Minimum expectations should include:
| Reporting item | Why it matters |
|---|---|
| $k_{\text{eval}}$ | Tells whether success means one lucky pass or repeated success |
| $\theta_{\text{eval}}$ | Controls whether the judge behaves deterministically or stochastically |
| $k_{\text{gen}}$ | Reveals how strictly candidate jailbreaks were selected |
| $\theta_{\text{gen}}$ for Best-of-N | Can materially affect candidate selection when using a larger judge |
| Confidence intervals | Prevents small-sample benchmark theater from looking precise |
| Number of seeds / runs | Shows whether the result survives procedural randomness |
This is not an extravagant governance burden. It is the evaluation equivalent of writing down the scale on a chart. Basic, unfashionable, and often missing.
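One way to make the checklist enforceable is to refuse to store an ASR without its protocol. The sketch below is an assumption about how a team might do that, not something the paper prescribes: the `AsrReport` class and its field names are hypothetical and simply mirror the table rows.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class AsrReport:
    """Hypothetical record pairing a reported ASR with the protocol behind it."""
    asr: float
    k_eval: Optional[int] = None
    theta_eval: Optional[float] = None
    k_gen: Optional[int] = None
    theta_gen: Optional[float] = None  # mainly relevant for Best-of-N
    ci_low: Optional[float] = None
    ci_high: Optional[float] = None
    n_seeds: Optional[int] = None

    def missing_fields(self):
        """Names of protocol fields that were not disclosed."""
        return [name for name, value in asdict(self).items() if value is None]


headline_only = AsrReport(asr=0.80)
print(headline_only.missing_fields())
```

A report that arrives as a bare percentage fails this check on every protocol field, which is the point: the gap becomes visible before the number reaches a risk register.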
The appendix is not a second thesis; it tells us where not to overread
The appendix work serves several roles.
The hardware and model tables are implementation detail: they clarify that the experiments use Hugging Face model identifiers, Llama-family target models alongside Gemma and Granite, Llama Guard judges, and Wilson score confidence intervals.
The statistical appendix is a conceptual extension. It contrasts ordinary Best-of-N thinking with the paper’s consistency view. Standard Best-of-N often uses an OR logic: if any of $N$ attempts succeeds, the attack counts. CAS uses an AND logic: all $k$ attempts must succeed. Under a latent per-attempt success probability $p_i$, and treating attempts as independent, the probability of all $k$ attempts succeeding is $p_i^k$. As $k$ grows, the metric increasingly isolates prompts that are close to perfectly reliable.
That helps explain why ASR declines with larger consistency thresholds. It is not merely “being stricter.” It is testing a different operational property: robust repeatability.
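A short worked comparison makes the OR/AND contrast concrete. It assumes independent attempts with a fixed per-attempt success probability $p$; the values of $p$ are illustrative and not taken from the paper.

$$ P_{\text{OR}}(p,N)=1-(1-p)^N, \qquad P_{\text{AND}}(p,k)=p^k $$

With $N=k=10$:

$$ p=0.5:\; P_{\text{OR}}\approx 0.999,\; P_{\text{AND}}\approx 0.001; \qquad p=0.9:\; P_{\text{AND}}\approx 0.349; \qquad p=0.99:\; P_{\text{AND}}\approx 0.904 $$

A coin-flip prompt is close to a sure thing under OR logic and almost never counts under AND logic. Even a prompt that works nine times in ten passes a ten-run consistency check only about a third of the time.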
The temperature-0 appendix is a robustness warning. The authors test commercial API outputs at temperature zero on non-adversarial prompts, because jailbreak testing against such APIs could violate terms of service. They find that temperature zero does not guarantee exact determinism in commercial systems, with model-dependent exact-match behavior. This does not prove the same pattern for every jailbreak evaluation. It does reinforce the broader point: “temperature zero” is not always a magic reproducibility button once real serving systems enter the picture.
The additional-results appendix then broadens the view across model sizes, model providers, and attack types. Its purpose is not to create a second headline result. It supports the main claim that stochasticity effects recur across several configurations, while also showing that the magnitude depends on attack, judge, and model choices.
That distinction matters. The paper is not proving a universal constant of jailbreak evaluation. It is showing that under realistic benchmark settings, uncontrolled stochasticity is large enough to make headline ASR comparisons unsafe.
What this changes for enterprise AI security
The business implication is not “panic about jailbreaks.” Panic is an expensive substitute for measurement.
The practical message is that AI security evaluation should stop treating ASR as a standalone vendor-comparison number. ASR is only interpretable with its protocol attached.
For an enterprise, the paper changes three workflows.
1. Vendor and model comparisons need protocol normalization
If one vendor’s red-team result reports ASR under $k_{\text{eval}}=1$ and another uses repeated evaluation, their numbers are not comparable. If one uses a stochastic judge and another uses deterministic judging, they are not comparable. If one attack-generation pipeline filters candidates with a stricter threshold and another accepts one-shot successes, they are not comparable.
This is not pedantry. The paper reports shifts large enough to reverse practical risk impressions. A 20-point difference in ASR can easily decide whether a system passes an internal threshold. If that difference comes from benchmark settings rather than model behavior, the governance process is optimizing fiction.
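Protocol normalization can start as a simple pre-comparison check. The sketch below is one possible shape for it, under assumptions: the key names, the `protocol_mismatches` helper, and the two vendor reports are all hypothetical.

```python
PROTOCOL_KEYS = ("judge", "k_eval", "theta_eval", "k_gen", "theta_gen")


def protocol_mismatches(report_a, report_b):
    """Return the protocol keys on which two reports differ or are undisclosed."""
    issues = {}
    for key in PROTOCOL_KEYS:
        a, b = report_a.get(key), report_b.get(key)
        if a is None or b is None:
            issues[key] = "undisclosed"
        elif a != b:
            issues[key] = f"{a} vs {b}"
    return issues


def comparable(report_a, report_b):
    """Two ASR numbers are comparable only if no protocol key mismatches."""
    return not protocol_mismatches(report_a, report_b)


# Invented reports: same judge, different evaluation settings.
vendor_a = {"judge": "Llama Guard", "k_eval": 1, "theta_eval": 1.0,
            "k_gen": 1, "theta_gen": 1.0}
vendor_b = {"judge": "Llama Guard", "k_eval": 10, "theta_eval": 0.0, "k_gen": 1}

print(comparable(vendor_a, vendor_b))
print(protocol_mismatches(vendor_a, vendor_b))
```

Treating an undisclosed setting as a mismatch is a deliberate choice here: a missing parameter gives no grounds for assuming the two protocols agree.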
2. Red-team reports should distinguish “occasional breach” from “reliable breach”
Both matter, but they imply different controls.
An occasional breach suggests retry risk, monitoring needs, rate-limit design, and escalation controls. A reliable breach suggests a more severe vulnerability in policy enforcement or refusal behavior.
CAS-style reporting gives security teams a cleaner way to classify this difference:
| Result type | Operational meaning | Likely response |
|---|---|---|
| High ASR at $k=1$, low ASR at higher $k$ | The attack sometimes works but is unstable | Retry controls, monitoring, prompt hardening, judge review |
| High ASR even at higher $k$ | The attack works reliably | Model/policy change, stronger filtering, deployment gate |
| Low ASR but wide confidence interval | Evidence is underpowered | More samples before decision |
| ASR sensitive to judge temperature | Measurement is unstable | Fix judge settings or use repeated deterministic evaluation |
This is a more useful framework than putting a single ASR number into a dashboard and pretending it is a safety score.
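The table can be turned into a rough triage function. Every threshold below is a hypothetical default chosen for illustration, not a value from the paper, and the sketch flags a wide confidence interval at any ASR level, which extends the table's third row slightly.

```python
def classify_result(asr_k1, asr_high_k, ci_width, judge_temp_spread,
                    high=0.5, big_drop=0.2, wide_ci=0.3, temp_sensitive=0.15):
    """Map one benchmark result onto the result types in the table.

    Inputs are fractions in [0, 1]. All thresholds are hypothetical defaults.
    More than one label can apply to the same result.
    """
    labels = []
    if asr_high_k >= high:
        labels.append("reliable breach")
    elif asr_k1 >= high and asr_k1 - asr_high_k >= big_drop:
        labels.append("occasional breach")
    if ci_width >= wide_ci:
        labels.append("underpowered")
    if judge_temp_spread >= temp_sensitive:
        labels.append("unstable measurement")
    return labels


# Invented example: 80% ASR at k=1 that falls to 20% at k=10.
print(classify_result(asr_k1=0.8, asr_high_k=0.2, ci_width=0.1,
                      judge_temp_spread=0.05))
```

The cut-offs themselves are a governance decision. What the function adds is that the decision is written down once instead of being re-argued per report.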
3. Internal safety gates should require confidence intervals and repeated trials
The paper uses Wilson score confidence intervals because ASR is a proportion and benchmark sizes can be small. That is a quiet but important decision. Many safety results are shown as clean percentages even when a single prompt flip materially changes the result.
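For readers who have not met it, the Wilson score interval is short enough to write out. This is the standard textbook formula with a 95% default ($z=1.96$), offered as a sketch rather than the authors' code; the sample counts are invented.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (default roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Invented counts: the same 80% headline at two benchmark sizes,
# plus the small benchmark after a single prompt flips.
for successes, n in [(8, 10), (7, 10), (80, 100)]:
    low, high = wilson_interval(successes, n)
    print(f"{successes}/{n}: {low:.3f} to {high:.3f}")
```

On ten prompts, an "80% ASR" is compatible with a true rate anywhere from roughly half to well above nine in ten, and one flipped verdict moves the headline by ten points.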
For internal governance, the minimum bar should be simple:
- define the attack set;
- define the judge;
- fix or report judge temperature;
- run repeated evaluations for consistency-sensitive claims;
- report confidence intervals;
- separate candidate-generation settings from final evaluation settings;
- preserve seeds and logs for audit.
This is not bureaucratic excess. It is how one avoids approving or rejecting systems based on random draws with PowerPoint formatting.
What the paper directly shows, and what Cognaptus infers
The paper directly shows that jailbreak ASR is sensitive to stochasticity in both attack generation and evaluation across the tested attacks, models, and judges. It directly introduces CAS, CAS-gen, and CAS-eval as ways to measure repeated success rather than one-shot success. It directly argues for a minimum reporting checklist.
Cognaptus infers three business practices from that evidence.
First, ASR should be treated as a protocol-dependent measurement, not a protocol-independent property of an attack or a model. This affects procurement, red-team report review, and benchmark-based vendor comparison.
Second, reproducibility should become part of AI safety acceptance testing. A single successful or failed run is not enough when the system being measured and the judge doing the measuring are both stochastic.
Third, security teams should classify jailbreak risk by reliability, not only by existence. A brittle jailbreak and a repeatable jailbreak require different operational responses.
These are inferences. They are not claims that the paper tested enterprise procurement workflows or production safety dashboards. The paper gives the measurement evidence; the business interpretation follows from how such measurements are commonly used.
Boundaries: where this result should not be over-sold
The study is valuable because it is specific. That also means its boundaries are specific.
The experiments focus on Best-of-N, PAIR, TAP, and Crescendo. Gradient-based attacks such as GCG may have different stochasticity profiles because they optimize discrete suffixes rather than relying on the kind of sampling these four attacks use. Multimodal jailbreaks are also outside the scope.
The target models are open-weight models: Llama-3.2-1B, Llama-3.1-8B, Llama-3.1-70B, Gemma-3-1B, and Granite-4.0-1B. The authors explicitly do not establish whether the same effect sizes hold for heavily RLHF-aligned proprietary systems such as GPT-4o or Claude, especially when those systems include hidden safety stacks, moderation layers, and API-level controls.
The judges are Llama Guard classifier-style judges. Generative judges may behave differently. Commercial judge models introduce another complication: systematic jailbreak evaluation through commercial APIs may violate provider terms, and proprietary serving systems are not always transparent enough to isolate the stochastic components.
Finally, CAS addresses stochasticity in generation and evaluation. It does not solve every source of irreproducibility: system prompt changes, tokenizer differences, API version drift, serving infrastructure, hardware nondeterminism, and hidden safety-layer updates can all matter.
These limitations do not weaken the paper’s core message. They prevent the lazy version of it. The right conclusion is not “all ASR numbers are useless.” The right conclusion is “ASR numbers without protocol details are not decision-grade.”
The benchmark number is not the benchmark
The paper’s title, The Great Pretender, is unusually theatrical for a measurement paper. In this case, the theater is earned.
A jailbreak ASR can pretend to be a stable attack property. It can pretend to support clean rankings. It can pretend to tell buyers, regulators, and security teams which model is safer. But unless the benchmark controls stochasticity, the number is partly a costume.
The fix is not conceptually hard. Require repeated evaluation. Fix or disclose judge temperature. Report generation thresholds. Separate one-shot success from consistent success. Add confidence intervals. Stop comparing numbers that were produced by different measurement instruments.
The harder part is cultural. AI safety benchmarking has inherited the leaderboard habit: one metric, one ranking, one implied winner. Jailbreak evaluation does not deserve that simplicity yet.
For enterprises, this paper is a reminder that model risk is not only inside the model. It is also inside the measurement process. A bad benchmark can make a weak system look strong, a strong system look weak, or a random draw look like a trend.
That is not just a research inconvenience. It is a governance failure waiting politely in a table.
Cognaptus: Automate the Present, Incubate the Future.
[1] Jean-Philippe Monteuuis, Cong Chen, and Jonathan Petit, “The Great Pretender: A Stochasticity Problem in LLM Jailbreak,” arXiv:2605.14418, 2026.