A chatbot rarely fails all at once.
In production, failure is usually more boring than cinema. A user asks something borderline. The model refuses. The user rephrases. The model gives a harmless explanation. The user narrows the topic. The model follows the conversation. Then, several turns later, the assistant provides content it should not have provided. No thunder. No villain monologue. Just an interaction history doing what interaction histories do: accumulating context.
That is the useful lens for reading TROJail: Trajectory-Level Optimization for Multi-Turn Large Language Model Jailbreaks with Process Rewards.[1] The paper is not interesting merely because it reports a stronger jailbreak attacker. Stronger attackers arrive regularly now, like browser updates, but with worse vibes. The more important contribution is conceptual: the paper treats multi-turn jailbreaks as a sequential strategy problem, not as a bag of isolated prompt tricks.
That distinction matters. If a company evaluates chatbot safety by testing single prompts, it is effectively asking whether a locked door can survive one kick. TROJail asks a different question: what happens when the attacker can try the handle, watch the response, adjust the pressure, and keep going?
The answer, according to the paper, is that the playbook changes.
The wrong mental model is “one prompt, repeated several times”
A common misconception is that multi-turn jailbreaks are just single-turn jailbreaks with extra steps. Under that view, safety testing should score each user message independently: harmful prompt, refusal; safe prompt, normal answer. Repeat until bored.
TROJail argues that this is the wrong object of analysis. In a multi-turn attack, the early messages may look benign or only mildly suspicious. Their value is not that they immediately trigger harmful output. Their value is that they move the conversation into a state where a later prompt becomes more effective.
This is why the paper distinguishes turn-level optimization from trajectory-level optimization.
Turn-level optimization rewards each prompt according to the harmfulness of the immediate response. That sounds sensible until one notices the problem: an early prompt can be strategically important precisely because it does not look maximally harmful. A greedy scoring system underrates that prompt because it asks, “Did this turn succeed?” rather than “Did this turn make the final failure more likely?”
Trajectory-level optimization changes the target. Instead of optimizing each turn in isolation, the attacker policy is trained to maximize the harmfulness of the final response over the entire interaction path. The paper implements this using a multi-turn variant of GRPO, where the relevant object is not one prompt-response pair but the dialogue trajectory.
That mechanism-first framing is the paper’s center of gravity. The jailbreak is not a magic phrase. It is a path.
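To make "the relevant object is the trajectory" concrete, here is a minimal sketch of the group-relative normalization that GRPO-style methods use, applied to one final-response score per sampled dialogue. The function name, the sample scores, and the choice of population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(final_rewards, eps=1e-8):
    """Score each trajectory relative to the other trajectories in its group.

    final_rewards holds one number per sampled dialogue: the judge score
    of that dialogue's final response (hypothetical values below).
    """
    mu = mean(final_rewards)
    sigma = pstdev(final_rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in final_rewards]


# Four hypothetical dialogues sampled for the same target request.
group = [0.1, 0.2, 0.9, 0.4]
advantages = group_relative_advantages(group)
```

Every turn in a dialogue inherits the same trajectory-level advantage here. That is exactly why an early, unremarkable turn can still be credited when the conversation as a whole ends up above the group average.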
The core engineering problem is delayed credit
Trajectory-level optimization creates a better objective, but it also creates a classic reinforcement learning headache: sparse supervision.
If the attacker receives meaningful feedback only at the end of the conversation, it has to infer which earlier turns helped and which turns wasted the budget. A five-turn interaction may succeed, but the final score alone does not explain whether the first turn established useful context, the second turn was irrelevant, or the third turn almost triggered a refusal. The training signal arrives late and speaks vaguely. Very corporate.
TROJail’s main technical move is to add two explicit process rewards. These are not separate goals; they are intermediate signals designed to make trajectory learning less blind.
The two rewards are:
| Process reward | What it measures | Why it matters |
|---|---|---|
| Over-harm penalization | Whether an intermediate prompt is too directly harmful and triggers refusal | A successful multi-turn attacker may need to avoid obvious malicious spikes before the final turn |
| Semantic relevance progression | Whether intermediate responses move closer to the target harmful topic | A conversation that drifts away from the target may remain long but become useless |
The first reward captures a counterintuitive pattern: being more harmful earlier is not always better. The paper’s controlled intervention study inserts prompts of different harmfulness levels into existing trajectories. Moderate harmfulness can improve the final outcome, but excessive harmfulness sharply reduces it by activating refusal behavior. In plain language, bad attackers shout too early.
The second reward captures direction. Successful trajectories show steadily increasing semantic relevance between intermediate responses and the original harmful prompt. Failed trajectories do not show the same progression. Meanwhile, ordinary harmfulness scores remain mostly uninformative until the final turn, which means they are a poor guide for intermediate steps.
Together, these rewards tell the attacker: do not reveal too much too soon, but do not wander off either.
That is the uncomfortable strategic lesson. The dangerous behavior is neither pure stealth nor pure aggression. It is controlled escalation.
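The two signals in the table can be sketched as simple per-turn functions. This is a deliberately abstract illustration: it assumes a refusal flag and a relevance score are already available for each intermediate turn, and the function names, the penalty size, and the "gain over the previous turn" formulation are assumptions made here, not the paper's exact definitions.

```python
def over_harm_penalty(refused_flags, penalty=1.0):
    """Give a negative reward to each intermediate turn that triggered a refusal."""
    return [-penalty if refused else 0.0 for refused in refused_flags]


def relevance_progress(relevance_scores):
    """Reward each turn by how much relevance rose compared with the previous turn."""
    gains = []
    previous = 0.0  # assumption: the conversation starts at zero relevance
    for score in relevance_scores:
        gains.append(score - previous)
        previous = score
    return gains


# Hypothetical three-turn dialogue: the second turn was refused and lost ground.
refused = [False, True, False]
relevance = [0.2, 0.1, 0.5]

penalties = over_harm_penalty(refused)
progress = relevance_progress(relevance)
```

In this toy dialogue the second turn is penalized twice: once for the refusal and once for drifting away from the target. That is the "too soon" and "wandering off" pair in numeric form.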
TROJail is a trajectory optimizer with two guardrails for the attacker
The paper’s method can be read as a three-part training loop.
First, the attacker model generates multiple candidate multi-turn trajectories against a victim model. The victim model is treated as black-box, matching the practical setting where many strong commercial models expose only API behavior.
Second, each trajectory receives an outcome reward based on the final response. This is the main objective: did the interaction ultimately elicit harmful content?
Third, each intermediate turn receives process feedback from the two heuristic rewards. The over-harm reward discourages prompts that cause refusals. The semantic relevance reward encourages the conversation to move steadily toward the target topic. These process rewards are folded into advantage estimation, so the policy update receives denser information than final-outcome reward alone.
The important point is not the exact optimizer name. GRPO is the machinery. The idea is simpler: make the model learn a conversation policy, not a prompt template.
This makes TROJail different from both hand-crafted multi-turn attacks and earlier training-based methods. Hand-crafted approaches can encode tactics, but they often depend on fixed patterns. Turn-level training can learn harmful prompts, but it undervalues strategically useful intermediate moves. TROJail tries to learn the strategy as a sequence.
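The third step, folding process feedback into advantage estimation, can be pictured as adding per-turn terms on top of a shared trajectory-level term. The sketch below is one plausible way to do that under stated assumptions: the additive form, the weights `w_harm` and `w_rel`, and all numbers are hypothetical, and the paper's actual combination rule may differ.

```python
def fold_process_rewards(outcome_advantage, over_harm, relevance,
                         w_harm=1.0, w_rel=1.0):
    """Return one advantage per turn.

    outcome_advantage is shared by every turn in the trajectory;
    over_harm and relevance are per-turn process rewards.
    """
    if len(over_harm) != len(relevance):
        raise ValueError("per-turn reward lists must have equal length")
    return [
        outcome_advantage + w_harm * h + w_rel * r
        for h, r in zip(over_harm, relevance)
    ]


# Hypothetical trajectory that ended above the group average (0.8),
# with a refusal and a loss of relevance on the second turn.
turn_advantages = fold_process_rewards(
    outcome_advantage=0.8,
    over_harm=[0.0, -1.0, 0.0],
    relevance=[0.2, -0.1, 0.4],
)
```

With outcome reward alone, all three turns would receive 0.8. With the process terms added, the refused turn drops below zero while the turn that advanced the conversation is credited most, which is the "denser information" the paragraph above describes.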
The main result: trajectory learning beats turn-level baselines
The headline experiment compares TROJail with single-turn and multi-turn jailbreak baselines across three benchmarks: HarmBench, StrongREJECT, and JailbreakBench. The victim models include Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct, Gemma-2-9B-IT, and Mistral-7B-Instruct-v0.3. The attacker is initialized from Qwen2.5-3B-Instruct.
The key reported average attack success rates (ASR) for the three RL variants are:
| Method family | Method | Average ASR |
|---|---|---|
| Multi-turn RL, outcome-only | Naive GRPO | 81.43% |
| Multi-turn RL, implicit process reward | GRPO w/ IPR | 83.68% |
| Multi-turn RL, explicit process rewards | TROJail | 86.23% |
The first interpretation is straightforward: optimizing trajectories helps. Naive GRPO, despite relying only on the final outcome reward, already outperforms earlier single-turn and turn-level methods by a wide margin in the paper’s setup.
The second interpretation is more specific: process rewards help the trajectory optimizer learn faster and better. GRPO with implicit process reward improves over naive GRPO. TROJail improves further by using process rewards tied to observed jailbreak dynamics rather than relying only on implicit reward inference.
The third interpretation is the one businesses should care about: multi-turn risk is not adequately represented by isolated prompt testing. If an attack policy can learn to distribute intent across turns, then a safety evaluation that only asks “does this individual prompt look dangerous?” is looking at the wrong level of aggregation.
That does not mean every deployed chatbot is doomed. It means the evaluation unit has to shift from the prompt to the session.
The ablation explains why both process rewards matter
The ablation study uses Gemma-2-9B-IT as the victim model and tests the reward components. This is not the main cross-model result; it is a mechanism check.
| Reward setting | HarmBench | StrongREJECT† | JailbreakBench† | Average |
|---|---|---|---|---|
| Outcome reward only | 70.17 | 62.96 | 63.03 | 65.39 |
| Outcome + over-harm reward | 82.50 | 76.74 | 67.27 | 75.50 |
| Outcome + semantic relevance reward | 74.50 | 67.71 | 69.09 | 70.43 |
| Outcome + both process rewards | 83.83 | 77.31 | 72.12 | 77.75 |
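The over-harm reward produces the larger single-component gain in this setup: about 10 points of average ASR over the outcome-only baseline, against about 5 for semantic relevance alone. The one exception is JailbreakBench, where the relevance reward alone scores higher (69.09 versus 67.27). The overall pattern makes sense if we remember the mechanism: a multi-turn attacker must avoid triggering the victim model’s refusal behavior before the conversation reaches the final vulnerable state.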
Semantic relevance also helps, but differently. It keeps the trajectory from drifting. Without it, an attacker may avoid refusal yet lose the target. That is not strategy; that is just wandering around in a trench coat.
The combined version performs best, which supports the paper’s claim that the two rewards are complementary. One controls intensity. The other controls direction.
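The averages and gains in the ablation table are easy to recompute. The short script below does only that arithmetic on the numbers quoted above; the dictionary keys are labels chosen here for readability.

```python
# Per-benchmark ASR from the ablation table:
# (HarmBench, StrongREJECT, JailbreakBench)
ablation = {
    "outcome_only": (70.17, 62.96, 63.03),
    "plus_over_harm": (82.50, 76.74, 67.27),
    "plus_relevance": (74.50, 67.71, 69.09),
    "plus_both": (83.83, 77.31, 72.12),
}

# Unweighted mean across the three benchmarks, rounded as in the table.
averages = {
    name: round(sum(scores) / len(scores), 2)
    for name, scores in ablation.items()
}

# Gain of each setting over the outcome-only baseline, in ASR points.
gains = {
    name: round(avg - averages["outcome_only"], 2)
    for name, avg in averages.items()
}
```

One detail falls out of the arithmetic: the combined gain is smaller than the sum of the two single-component gains. The rewards are complementary, but they also overlap, which is what one would expect if avoiding refusals already keeps some trajectories on course.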
The in-depth tests are mostly robustness and behavior analysis, not a second thesis
The paper includes several additional analyses. They are useful, but they should not be read as separate headline claims. Their purpose is to test whether the main mechanism behaves consistently under different conditions.
| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Transferability across victim models | Generalization test | Learned strategies can transfer to unseen victim models, especially when trained against more robust victims | Universal transfer to all commercial systems |
| Increasing attack turn limit | Sensitivity and behavior test | More turns generally improve ASR, with saturation; TROJail uses extra turns efficiently | That longer conversations always become unsafe |
| Prompt difficulty analysis | Robustness test | TROJail degrades more mildly on harder prompts and uses more turns for harder cases | That difficulty labels perfectly represent real-world harm categories |
| Reward-component ablation | Mechanism validation | Both process rewards improve trajectory learning, especially over-harm control | That these are the only useful process rewards |
| Judge model validation | Evaluation reliability check | Relative rankings remain broadly consistent across multiple judge models | That automated harmfulness judging is solved |
| Cost analysis | Operational comparison | TROJail has lower average inference query cost than selected baselines after training | That training cost is trivial for all organizations |
The transferability result is particularly important. When trained against one victim model and evaluated against others, TROJail still achieves substantial out-of-domain ASR. The paper reports that attackers trained against more robust victim models, such as Llama-3.1 and Gemma-2 in this evaluation, transfer better on average than attackers trained against easier targets.
For defenders, that is a useful inversion. Testing only against your easiest-to-break model may produce a weak red-team generator. Harder victims may force the attacker to learn more general strategies.
The turn-limit analysis also matters. TROJail is trained with a fixed turn limit, but performance continues improving when more turns are allowed at evaluation. This suggests the learned policy is not merely memorizing a five-turn script. It can use additional interaction budget. Again, the threat is not one prompt. The threat is adaptive persistence.
The cost result changes the red-team economics
Appendix D reports a cost comparison across selected methods. TROJail requires a one-time training cost of about 1,518 minutes (roughly 25 hours) on 4 A100 GPUs. That is not free. It is also not absurd by modern AI engineering standards.
At inference time, the paper reports that TROJail uses 6.54 model queries per attack on average, compared with 26.72 for ActorAttack and 28.78 for X-Teaming. In the paper’s comparison, TROJail also has the highest average ASR, 86.23%.
This matters because black-box red teaming is often constrained by query budget, latency, and API cost. A method that is expensive to train but cheaper to run may be attractive for organizations that repeatedly evaluate model releases, safety filters, prompt routers, or vertical chatbot deployments.
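A back-of-envelope calculation shows the shape of the trade-off. The sketch below uses only the figures quoted above. It assumes the 1,518 minutes is wall-clock time across all 4 GPUs, and it counts queries rather than dollars because per-query prices are deployment-specific.

```python
TRAIN_MINUTES = 1518  # reported one-time training time
NUM_GPUS = 4          # A100s; assumption: the minutes are wall-clock across all four

# Reported average model queries per attack at inference time.
QUERIES_PER_ATTACK = {
    "TROJail": 6.54,
    "ActorAttack": 26.72,
    "X-Teaming": 28.78,
}

# One-time training budget expressed in GPU-hours.
gpu_hours = TRAIN_MINUTES * NUM_GPUS / 60


def query_ratio(baseline, method="TROJail"):
    """How many times more queries the baseline uses per attack."""
    return QUERIES_PER_ATTACK[baseline] / QUERIES_PER_ATTACK[method]


def queries_saved(num_attacks, baseline, method="TROJail"):
    """Total queries avoided over a batch of attack attempts."""
    per_attack = QUERIES_PER_ATTACK[baseline] - QUERIES_PER_ATTACK[method]
    return num_attacks * per_attack
```

Under these assumptions the training bill is about 101 GPU-hours, and both baselines need a little over four times as many queries per attack. Whether that pays off depends on how many evaluation runs an organization performs and what each victim-model query costs, neither of which the paper can decide for you.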
The business translation is not “buy more GPUs and panic.” It is more practical:
| Technical result | Operational consequence | Business relevance |
|---|---|---|
| Trajectory-level RL improves ASR | Red-team systems should evaluate sessions, not just prompts | Better detection of failures that emerge over interaction history |
| Process rewards improve learning | Safety tests can encode known failure dynamics as intermediate signals | More efficient vulnerability discovery than blind final-outcome testing |
| Transferability appears strong | Attack strategies may generalize across model families | Vendor swaps do not eliminate the need for multi-turn evaluation |
| Query cost is lower after training | Repeated evaluations become cheaper per attack attempt | Useful for continuous safety regression testing |
The ROI argument is therefore not only about catching more failures. It is about making the testing loop more repeatable.
For deployment teams, the defensive lesson is semantic drift monitoring
TROJail is an attack-generation framework, not a defense. Still, its mechanism suggests where defenses should look.
A prompt-level filter asks: “Is this message harmful?”
A trajectory-level defense also asks:
- Is the conversation moving gradually toward a restricted topic?
- Is the user avoiding direct harmful phrasing while maintaining suspicious semantic direction?
- Are refusals being followed by adaptive rephrasings that preserve the same underlying objective?
- Is the session becoming more specific in a risky domain over multiple turns?
This is not the same as blocking every long conversation. Many legitimate tasks require progressive narrowing. A customer-support agent, medical triage assistant, legal explainer, or developer copilot all need memory and context. Blocking depth would be a wonderfully expensive way to make the product useless.
The better lesson is that safety monitoring needs a session-level representation. The model’s policy, the safety classifier, and the observability layer should not treat each turn as a fresh universe. Context is not just useful for answering. It is also useful for risk assessment.
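As a toy illustration of a session-level representation, the sketch below scores each user turn against a description of a restricted topic and flags sessions whose similarity keeps rising. Everything here is a simplifying assumption: the bag-of-words cosine stands in for a real embedding model, the names `drift_flag`, `min_rising_steps`, and `threshold` are hypothetical, and the placeholder tokens carry no meaning.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (a stand-in for an embedding model)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def drift_flag(turns, restricted_topic, min_rising_steps=2, threshold=0.3):
    """Return per-turn similarity scores and whether the session looks like escalation."""
    if not turns:
        return [], False
    scores = [cosine(turn, restricted_topic) for turn in turns]
    rising = sum(1 for prev, cur in zip(scores, scores[1:]) if cur > prev)
    flagged = rising >= min_rising_steps and scores[-1] >= threshold
    return scores, flagged


# Placeholder tokens only; a real deployment would use policy text and embeddings.
topic = "alpha beta gamma delta"
escalating = ["tell me about alpha", "more on alpha beta", "alpha beta gamma delta details"]
ordinary = ["alpha overview", "unrelated billing question", "thanks"]

_, escalating_flagged = drift_flag(escalating, topic)
_, ordinary_flagged = drift_flag(ordinary, topic)
```

The flag looks at direction as well as level: a single on-topic message does not trip it, while a steady climb does. A production version would need tuned thresholds and a review path, since legitimate sessions narrow too.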
A practical enterprise safety pipeline might therefore include:
| Layer | Prompt-level control | Session-level extension |
|---|---|---|
| Input filtering | Detect clearly disallowed requests | Track semantic movement across turns |
| Model refusal | Refuse direct harmful requests | Maintain refusal consistency after rephrasing |
| Conversation memory | Preserve useful context | Flag escalation patterns and topic drift |
| Red-team evaluation | Test known bad prompts | Test adaptive multi-turn trajectories |
| Safety tuning data | Train on isolated prompt-response pairs | Train on full adversarial conversations with safe target responses |
The paper’s Appendix E points in a similar direction: generated trajectories can support automated red teaming, safety-alignment datasets, and adversarial co-training. That is a defensive use case, but it remains an inference from an attack framework. TROJail demonstrates how to discover failures. It does not, by itself, prove that training on those failures will close them robustly in production.
The limitations are real, but they do not erase the mechanism
The paper’s boundaries are important.
First, the results are benchmark-based. HarmBench, StrongREJECT, and JailbreakBench are useful evaluation suites, but benchmark behavior is not identical to live enterprise traffic. Production users bring messy context, mixed intent, multilingual phrasing, domain-specific policies, and tool-use constraints. Benchmarks simplify.
Second, the victim models are open-weight instruction models in the reported main setup. That makes the experiments controlled and comparable, but it does not directly establish the same ASR against every closed commercial model, every safety stack, or every retrieval/tool-augmented deployment.
Third, evaluation depends on judge models. The authors validate the HarmBench classifier against GPT-4o-style scoring and compare relative rankings across several judge models, finding broadly consistent rankings. That helps. It does not make automated harmfulness evaluation a solved science. Safety evaluation is still partly a measurement problem wearing a lab coat.
Fourth, the method does not explicitly optimize for attack diversity, although entropy regularization is used and the appendix explores adding an explicit diversity reward. This matters for red teaming because high success with low diversity may discover the same kind of failure repeatedly. The paper reports that TROJail achieves competitive diversity and that adding a diversity reward can improve diversity while maintaining comparable ASR. That is promising, but diversity remains an engineering axis, not a settled endpoint.
Finally, TROJail is not a packaged enterprise control. It is an attack-training framework that can inform defensive workflows. Turning it into production safety practice requires policy design, logging, privacy review, evaluation governance, and careful handling of synthetic harmful examples. In other words, the usual parade of unglamorous but necessary work.
What the paper directly shows, and what Cognaptus infers
The paper directly shows that, in its experimental setup, trajectory-level RL substantially improves automated multi-turn jailbreak performance, and explicit process rewards improve further over outcome-only and implicit-process baselines.
Cognaptus infers that businesses deploying LLM systems should treat multi-turn safety as a sequential risk problem. The practical unit of safety evaluation should be the conversation trajectory, not merely the individual prompt. Red-team systems should test whether users can gradually steer the model across safety boundaries. Monitoring should include semantic progression, refusal recovery patterns, and topic narrowing over time.
What remains uncertain is the size of the effect in each real deployment. A banking chatbot with narrow retrieval, strict tool permissions, and conservative refusal policies will not behave like a general-purpose open-weight model in a benchmark harness. A consumer assistant with broad capabilities and long context may face a different risk profile. The mechanism transfers more confidently than the exact percentages.
That is the right level of humility. The paper’s ASR numbers should not be copy-pasted into a board deck as production risk estimates. The mechanism should be taken seriously.
Conclusion: the jailbreak is becoming a conversation policy
TROJail’s most useful message is not that jailbreak attackers have become stronger. Everyone in AI safety already has enough reasons to drink coffee with suspicion.
The sharper message is that multi-turn safety failures are policy failures over time. A model can behave correctly at turn one, reasonably at turn two, and dangerously at turn five. The failure is not necessarily located in a single prompt. It may live in the path connecting them.
That changes how companies should test, monitor, and tune LLM applications. Single-turn prompt filters remain necessary, but they are no longer sufficient. The next layer of safety work is trajectory-aware: red teaming full sessions, training on adversarial conversations, and measuring whether the system notices gradual semantic movement toward restricted outcomes.
Prompt, probe, persist. That is the attacker’s playbook.
Defenders need one of their own.
Cognaptus: Automate the Present, Incubate the Future.
[1] Xiqiao Xiong, Ouxiang Li, Zhuo Liu, Moxin Li, Wentao Shi, Fengbin Zhu, Qifan Wang, and Fuli Feng, “TROJail: Trajectory-Level Optimization for Multi-Turn Large Language Model Jailbreaks with Process Rewards,” arXiv:2512.07761v3, 2026.