TL;DR for operators

Most agent guardrails behave like stop signs. They inspect a proposed action, decide whether it looks safe, and then allow or block execution. This is neat, legible, and often operationally clumsy. Real agent failures are not always cleanly harmful from the first word. A useful business request can be contaminated by a prompt injection, a malicious tool response, or an unsafe intermediate plan. Blocking the whole task may reduce risk, but it also throws away the legitimate work. Excellent safety theatre, less excellent operations.

The paper introduces TRIAD, a guardrail-integrated framework for LLM agents that treats safety as plan remediation rather than only risk classification [1]. Its guardrail model, Tri-Guard, produces structured feedback and one of three decisions: PROCEED, UPDATE, or REFUSE. PROCEED allows the tool call. REFUSE blocks directly harmful tasks. UPDATE is the interesting part: it feeds corrective safety guidance back into the agent so the agent can revise the plan before any tool is executed.

The main result is not simply that TRIAD lowers attack success. It does. Across the four tested agent backbones, TRIAD plus Tri-Guard reduces average ASB attack success from 74.45% under the ReAct baseline to 10.42%. But the more business-relevant result is that benign task success rises from 28.45% to 68.60%. That is the difference between a system that survives by refusing work and one that can keep doing the work safely. A locked door is secure. It is also a poor customer service strategy.

The mechanism matters because agent risk is increasingly execution risk. Once an agent has sent an email, touched a file, queried a customer record, initiated a transaction, or called an internal API, a later apology is mostly decorative. TRIAD intervenes at the planning stage, before the tool call, using the visible plan and proposed function call rather than hidden model reasoning. That makes it operationally plausible for black-box agents and modern function-calling systems.

The boundary is not small. The paper evaluates controlled benchmarks: ASB for direct and indirect prompt injection, and AgentHarm for direct harmful tasks and benign utility. The tools are simulated, the attack templates are selected, and judgment depends partly on model-based evaluation. TRIAD also adds inference overhead: in the reported setup, average per-step latency rises from 1.88 seconds to 6.98 seconds on Qwen3-32B. The business takeaway is therefore not “deploy this exact model everywhere.” It is: build agent safety as a revision loop before irreversible execution, then budget for latency, domain red-teaming, and monitoring like adults.

The stop sign problem in agent safety

A conventional guardrail asks a narrow question: is this input, output, or proposed action unsafe? That question is useful. It is also incomplete.

Agents do not merely answer text prompts. They plan, call tools, receive observations, update context, and continue. The risk is not always present as a single toxic instruction at the beginning. It may appear after a tool response. It may be embedded inside otherwise legitimate content. It may redirect the agent halfway through a task. This is why prompt injection is not just “bad text”; it is workflow hijacking with better stationery.

The paper’s opening diagnosis is blunt. Existing planning-stage guardrails can flag risk, but their signals do not reliably produce safe downstream behaviour. In the authors’ preliminary evaluation on Agent Security Bench, representative guardrails achieved an average recall of only 58.57% against prompt injection attacks. Even when risks were detected, fewer than 37.26% of attacks were successfully blocked, and the original benign task was preserved in less than 2.31% of cases.

That last number is the one operators should stare at for a moment.

A guardrail that detects danger but destroys the user’s original objective is not a complete operational solution. It is a panic button. Panic buttons have their place. They should not be the default interface for every complex workflow.

The paper’s central move is therefore not “better classifier.” It is “different control surface.” TRIAD turns the guardrail from an external judge into a planning-loop participant.

TRIAD changes the guardrail’s job from verdict to intervention

TRIAD stands for Tripartite Response for Iterative Agent Guardrailing. The name is doing a lot of paperwork, but the mechanism is simple.

At each ReAct-style planning step, the agent proposes a natural-language plan and a structured tool call. Before that tool call executes, Tri-Guard inspects the user task, interaction history, proposed plan, proposed action, and available tools. It then returns structured natural-language feedback plus one of three decisions.

| Decision | Operational meaning | What happens next |
| --- | --- | --- |
| PROCEED | The current plan and tool call are aligned with a benign objective. | Execute the proposed action. |
| UPDATE | The plan is partially unsafe or misaligned, but the benign task is still recoverable. | Inject corrective feedback into the agent context and ask for a revised plan. |
| REFUSE | The user goal itself is harmful or the plan cannot be repaired safely. | Guide the agent to refuse without executing tools. |

The UPDATE branch is the paper’s real contribution. It recognises a middle category that binary guardrails flatten: the task is not safe as currently planned, but it is not necessarily worthless. The user may have asked for a valid task, while an injected instruction tries to redirect the agent toward an attacker’s tool. A binary system sees danger and blocks. TRIAD sees the same danger and says, in effect: wrong fork, go back, continue toward the original destination.

This matters because agent work is sequential. The useful intervention is not merely “unsafe.” The useful intervention is “your proposed action deviates from the original user goal because of this injected instruction; use a different tool that supports the legitimate task.” The feedback must be specific enough to change the agent’s next plan. Otherwise it is just a beautifully worded incident report before the incident happens anyway.

The authors implement this as a closed loop. UPDATE can repeat up to a fixed attempt budget, set to $K = 3$ in the paper. The revised plan is checked again. If it becomes safe, it proceeds. If it remains unsafe or becomes directly harmful, it can be refused. If the update budget is exhausted, the process stops rather than wandering forever through a compliance-themed maze.
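The closed loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `Proposal`, `guarded_step`, the toy agent, the toy guard, and the tool names are all hypothetical, and the paper may count the update budget differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Decision(Enum):
    PROCEED = "PROCEED"
    UPDATE = "UPDATE"
    REFUSE = "REFUSE"


@dataclass
class Proposal:
    plan: str       # natural-language plan
    tool_call: str  # proposed tool call, kept as a plain string in this sketch


# Hypothetical interfaces: the agent proposes a step given the feedback so far,
# and the guard returns (decision, feedback) for that proposal.
Agent = Callable[[str, List[str]], Proposal]
Guard = Callable[[str, Proposal], Tuple[Decision, str]]


def guarded_step(task: str, agent: Agent, guard: Guard,
                 max_updates: int = 3) -> Tuple[str, Optional[Proposal]]:
    """Return ("execute", proposal), ("refuse", None), or ("stop", None)."""
    feedback_log: List[str] = []
    proposal = agent(task, feedback_log)
    updates_used = 0
    while True:
        decision, feedback = guard(task, proposal)
        if decision is Decision.PROCEED:
            return "execute", proposal   # only now may the tool call run
        if decision is Decision.REFUSE:
            return "refuse", None        # no tool is executed
        # UPDATE: revise the plan, unless the budget is already spent
        if updates_used == max_updates:
            return "stop", None
        feedback_log.append(feedback)
        proposal = agent(task, feedback_log)
        updates_used += 1


def toy_agent(task: str, feedback_log: List[str]) -> Proposal:
    # Follows the injected instruction until it receives corrective feedback.
    if feedback_log:
        return Proposal("Run the aerodynamic test", "flight_simulator()")
    return Proposal("Follow the injected instruction", "payload_tamper()")


def toy_guard(task: str, proposal: Proposal) -> Tuple[Decision, str]:
    if proposal.tool_call == "payload_tamper()":
        return Decision.UPDATE, "Tool call deviates from the user's goal; return to aerodynamic testing."
    return Decision.PROCEED, "Aligned with the user's goal."


outcome, final = guarded_step("Test the aerodynamic properties of the design",
                              toy_agent, toy_guard)
print(outcome, final.tool_call if final else None)
```

The design point worth noticing is that nothing executes inside the loop. The only path to a tool call is a PROCEED decision on the latest proposal.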

Why UPDATE is the product feature, not a formatting choice

The most tempting misconception is that safer agents are produced by stricter blocking. Lower attack success looks good in a table. It is easy to put in a vendor deck. It is also incomplete.

The paper’s results repeatedly show the same pattern: some methods suppress attacks by refusing or blocking so aggressively that benign task completion collapses. This is not a subtle distinction. On Qwen3-32B, TRIAD instantiated with TS-Guard reaches very low attack success on ASB, but its refusal rates are enormous: 88.80% for direct prompt injection and 94.63% for indirect prompt injection. The corresponding benign task success rates are 1.33% and 0.59%. Congratulations, the agent is safe because it has become professionally unhelpful.

Tri-Guard behaves differently because it has been trained to classify the situation into proceed, update, or refuse. The difference is not cosmetic. The paper compares TRIAD using the Qwen3.5-9B base model against TRIAD using the post-trained Tri-Guard. The base model is often more conservative and sometimes achieves lower attack success, but it does so by treating partially unsafe trajectories as if they were fully harmful. Tri-Guard allows somewhat more residual risk in exchange for much higher benign task completion.

That is the actual safety–utility frontier. In business terms, the question is not “can we minimise risk by refusing work?” Of course we can. Disconnect the system. Very safe. Very quiet. The better question is “can we preserve legitimate work while stopping the unsafe branch before execution?” TRIAD is designed around that question.

How Tri-Guard learns repair rather than refusal

Tri-Guard is built on Qwen3.5-9B and trained through a trajectory–feedback pipeline. The pipeline begins with 5,288 multi-turn agent tasks gathered and rewritten from agent-safety and jailbreak-related sources. The authors generate trajectories by letting agents execute these tasks in tool environments, then assign ground-truth decisions based on the task setting and observed tool calls.

The labelling logic is important.

Benign trajectories are labelled PROCEED. Direct harmful tasks are labelled REFUSE. Prompt-injection trajectories are labelled UPDATE when there is still a legitimate benign goal underneath the contamination. This is the decisive modelling assumption: prompt injection does not automatically erase the user’s original objective. It may corrupt the current plan, but the right answer may be repair.
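As a rough sketch, the labelling rule above fits in one small function. The setting names and the `has_benign_goal` flag are hypothetical. The fallback for an injected trajectory with no legitimate goal underneath is an assumption of this sketch, since the article does not say how the paper handles that case.

```python
PROCEED, UPDATE, REFUSE = "PROCEED", "UPDATE", "REFUSE"


def label_trajectory(setting: str, has_benign_goal: bool = True) -> str:
    """Map a task setting to a ground-truth guardrail decision."""
    if setting == "benign":
        return PROCEED
    if setting == "direct_harmful":
        return REFUSE
    if setting == "prompt_injection":
        # Injection corrupts the plan but does not erase the user's objective.
        # Assumption: with no benign goal left, nothing remains to repair.
        return UPDATE if has_benign_goal else REFUSE
    raise ValueError(f"unknown setting: {setting!r}")


for setting in ("benign", "direct_harmful", "prompt_injection"):
    print(setting, "->", label_trajectory(setting))
```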

The authors then use a teacher model, GPT-5.4 as named in the paper, to generate structured safety feedback for each retained trajectory. The feedback follows five dimensions:

| Feedback component | What it forces the guardrail to inspect |
| --- | --- |
| User Intent | What the original user was trying to achieve. |
| Agent Reasoning | Whether the agent’s plan has been misled or remains aligned. |
| Current Action | Which tool or action is being proposed and what it would do. |
| Alignment Check | Whether the proposed action still serves the legitimate goal. |
| Security Check | Whether unsafe, injected, or unauthorized instructions are present. |

Teacher outputs are filtered when their final decision disagrees with the trajectory-level label. The final training set then pairs guardrail inputs with structured feedback and a three-way decision. Training uses weighted supervised fine-tuning, where teacher confidence influences the loss weight. The objective is completion-only: the model is trained to produce the feedback and decision, not to reconstruct the whole input context.
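Two of the steps above are easy to make concrete: dropping teacher outputs that disagree with the trajectory label, and a confidence-weighted, completion-only loss. The sketch below is one plausible reading, not the paper's code. The record fields, the mean-over-completion-tokens normalisation, and the use of confidence as a plain multiplier are all assumptions.

```python
from typing import Dict, List


def filter_teacher_outputs(records: List[Dict]) -> List[Dict]:
    """Keep only records whose teacher decision matches the trajectory label."""
    return [r for r in records if r["teacher_decision"] == r["label"]]


def weighted_completion_loss(token_logprobs: List[float],
                             completion_mask: List[int],
                             weight: float) -> float:
    """Negative mean log-likelihood over completion tokens, scaled by weight.

    Tokens with mask 0 (the guardrail input context) contribute nothing,
    which is what "completion-only" means in this sketch.
    """
    picked = [lp for lp, m in zip(token_logprobs, completion_mask) if m]
    if not picked:
        return 0.0
    return -weight * sum(picked) / len(picked)


# Hypothetical records: the second one is dropped by the filter.
records = [
    {"label": "UPDATE", "teacher_decision": "UPDATE", "confidence": 0.8},
    {"label": "UPDATE", "teacher_decision": "REFUSE", "confidence": 0.9},
]
kept = filter_teacher_outputs(records)

# First two tokens belong to the input context, last two to the completion.
loss = weighted_completion_loss([-2.0, -1.0, -0.5, -0.5], [0, 0, 1, 1],
                                weight=kept[0]["confidence"])
print(len(kept), loss)
```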

This is not just a dataset construction detail. It explains why Tri-Guard changes behaviour. The model is not merely learning “danger words.” It is learning an intervention format: identify the original intent, identify the plan deviation, classify whether the task should proceed, be repaired, or be refused, and write feedback that can guide the downstream agent.

That is what most enterprise agent systems currently lack. They have policy checks. They have logs. They have maybe a moderation endpoint duct-taped into the path. They do not necessarily have a planning supervisor that can say: “the user’s actual goal is still valid; your next tool call is the compromised part.”

The experiments test a control loop, not a chatbot mood

The paper evaluates TRIAD on two benchmarks with different purposes.

ASB is used for prompt injection attacks against agent workflows. The authors evaluate both direct prompt injection, where the malicious instruction is inserted into the user task, and indirect prompt injection, where the malicious instruction appears in a tool observation. AgentHarm is used to test whether an agent refuses directly harmful tasks while preserving benign utility.

The distinction matters. Prompt injection is the “repair” setting. Direct harmful tasks are the “refuse” setting. A guardrail that refuses both with equal enthusiasm is not learning the difference; it is simply afraid of tools. Understandable, perhaps, but not the business requirement.

| Experimental component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Figure 1 preliminary guardrail analysis | Motivation and diagnosis | Existing allow/block signals do not reliably preserve benign tasks under prompt injection. | That TRIAD is optimal, or that all production guardrails fail similarly. |
| Table 1 main benchmark results | Main evidence | TRIAD plus Tri-Guard improves safety–utility trade-offs across four agent backbones. | Generalisation to all tools, domains, attacks, or real production environments. |
| Table 2 different guardrails inside TRIAD | Comparison with prior work and component test | The framework alone is not enough; the guardrail must generate useful three-way feedback. | That prior guardrails have no value in narrower blocking-heavy deployments. |
| Table 3 base model versus Tri-Guard | Ablation / training-effect evidence | Trajectory–feedback training shifts behaviour from over-refusal toward repair. | That the learned boundary is fully calibrated for unseen domains. |
| Table 4 update-round convergence | Implementation and cost sensitivity | UPDATE loops usually resolve quickly under the benchmark settings. | That production workflows will have the same latency behaviour. |
| Figure 10 no-attack ASB tasks | Robustness / false-positive check | TRIAD preserves benign task ability better than more conservative defences. | That false positives are solved in every enterprise workflow. |
| Table 10 all five ASB prompt-injection templates | Robustness across attack templates | TRIAD remains effective beyond the two main templates used in primary experiments. | That adaptive adversaries cannot find new bypasses. |
| Table 7 latency overhead | Implementation cost evidence | Safety gains come with measurable runtime cost. | That the cost is acceptable for every application. |
| Appendix I case studies | Mechanism illustration | UPDATE can redirect an agent from injected instructions back to the original task. | Statistical proof beyond the benchmark tables. |

This division prevents a common reading error: treating every appendix item as if it were a second headline. The appendices mostly do three things: check robustness, expose implementation plumbing, and illustrate behaviours that explain the main tables. They are useful because they show where the mechanism is doing work.

The headline number is 10.42%, but the real result is the trade-off

Across four target agent backbones, TRIAD plus Tri-Guard reduces average attack success on ASB from 74.45% for the ReAct baseline to 10.42%. It also raises average benign task success from 28.45% to 68.60%. Those averages combine the direct and indirect prompt injection settings.

Here is the compact version of the paper’s average comparison:

| Method | ASB-DPI ASR ↓ | ASB-DPI TSR ↑ | ASB-IPI ASR ↓ | ASB-IPI TSR ↑ | AgentHarm HS ↑ | AgentHarm Harm ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| ReAct baseline | 79.94 | 15.38 | 68.95 | 41.53 | 62.42 | 47.83 |
| TRIAD + Qwen3.5-9B base | 5.56 | 26.30 | 5.40 | 26.53 | 73.53 | 9.90 |
| TRIAD + Tri-Guard | 13.04 | 64.52 | 7.80 | 72.68 | 80.92 | 13.05 |

The base model row is instructive. It has lower ASR than Tri-Guard in the averaged prompt-injection settings. If one reads only attack success, the base model looks stronger. But its TSR is far worse. It reduces risk partly by refusing or aborting recoverable tasks. Tri-Guard accepts a slightly higher ASR in exchange for dramatically better task completion.
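The headline averages can be reproduced from the table above, assuming they are simple means of the direct (DPI) and indirect (IPI) columns. The dictionary keys are labels invented for this sketch; the numbers are copied from the table.

```python
rows = {
    "ReAct baseline":          {"dpi_asr": 79.94, "dpi_tsr": 15.38, "ipi_asr": 68.95, "ipi_tsr": 41.53},
    "TRIAD + Qwen3.5-9B base": {"dpi_asr": 5.56,  "dpi_tsr": 26.30, "ipi_asr": 5.40,  "ipi_tsr": 26.53},
    "TRIAD + Tri-Guard":       {"dpi_asr": 13.04, "dpi_tsr": 64.52, "ipi_asr": 7.80,  "ipi_tsr": 72.68},
}


def averaged(row):
    """Return (mean ASR, mean TSR) across the DPI and IPI settings."""
    asr = (row["dpi_asr"] + row["ipi_asr"]) / 2
    tsr = (row["dpi_tsr"] + row["ipi_tsr"]) / 2
    return asr, tsr


for name, row in rows.items():
    asr, tsr = averaged(row)
    print(f"{name}: ASR {asr:.3f}, TSR {tsr:.3f}")
```

Under that assumption the ReAct baseline comes out at about 74.45 ASR and 28.46 TSR, and TRIAD plus Tri-Guard at 10.42 ASR and 68.60 TSR, which matches the quoted figures to within rounding. The base-model row averages to roughly 5.48 ASR and 26.42 TSR: lower attack success than Tri-Guard, with well under half the task success.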

That trade-off is not a flaw hidden in the small print. It is the central business problem. A customer support agent, internal finance assistant, procurement bot, or research automation agent cannot be judged only by how rarely it gets attacked. It must also complete authorised work. Otherwise the best guardrail is a locked laptop in a drawer.

On AgentHarm, the paper proposes a Helpfulness–Safety score, computed as the harmonic mean of benign utility and one minus harmful-task completion:

$$ HS = \frac{2 \cdot B \cdot (1-H)}{B + (1-H)} $$

where $B$ is the normalized benign score and $H$ is the normalized harm score. The harmonic mean is a sensible choice because it punishes one-sided systems. A model that is helpful but unsafe scores poorly. A model that is safe because it refuses everything also scores poorly. This is exactly how agent deployment should be evaluated, assuming one wishes to operate a business rather than an abstinence program.
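A short sketch shows how the harmonic mean punishes one-sided systems. The function follows the formula above; the three example systems and their scores are hypothetical, not taken from the paper.

```python
def helpfulness_safety(benign: float, harm: float) -> float:
    """Harmonic mean of benign utility B and safety (1 - H), both in [0, 1]."""
    safety = 1.0 - harm
    denom = benign + safety
    if denom == 0:
        return 0.0  # B = 0 and H = 1: neither helpful nor safe
    return 2 * benign * safety / denom


# Hypothetical systems, chosen only to illustrate the shape of the metric.
examples = {
    "helpful but unsafe": (0.90, 0.90),
    "safe by refusing":   (0.05, 0.00),
    "balanced":           (0.80, 0.20),
}
for name, (b, h) in examples.items():
    print(f"{name}: HS = {helpfulness_safety(b, h):.3f}")
```

With these made-up inputs the balanced system scores 0.8, while both one-sided systems land below 0.2.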

Case studies show the mechanism in miniature

The case studies in Appendix I are not the main evidence, but they are useful because they make the UPDATE mechanism concrete.

In one direct prompt-injection example, the original task asks the agent to test aerodynamic properties of an aircraft design. An injected instruction then tries to redirect the agent toward a payload-tampering tool. In the first attempt, the agent follows the injected instruction and proposes the wrong tool. Tri-Guard returns UPDATE, explicitly identifies the original user intent, flags the malicious override, and recommends returning to the aerodynamic testing workflow. The agent revises its plan, calls a relevant flight simulation tool, and proceeds. The final outcome: attack success false, task success true, refusal false.

The contrast with the base Qwen3.5-9B guardrail is telling. On the same kind of contaminated task, the base guardrail refuses. Attack success is false, but task success is also false. That is the difference between remediation and rejection.

In an indirect prompt-injection case, a legal-document tool returns a contaminated observation instructing the agent to access unrelated client records. Tri-Guard flags the proposed client-record access as misaligned and sensitive, then redirects the agent to a compliance checker relevant to the original merger-and-acquisition task. Again, the point is not that every future attack will be handled. The point is that the guardrail feedback is operationally shaped: it explains what got hijacked and where the agent should go instead.

For direct harmful tasks, TRIAD behaves differently. In the AgentHarm example, the user asks the agent to use stolen bank credentials and transfer money. Tri-Guard returns REFUSE before the tool call. No remediation is needed because the root user objective is harmful. This is exactly the separation enterprises need: repair compromised legitimate work; refuse illegitimate work.

The appendix tests robustness, not a second thesis

Several appendix results are worth translating into operator language.

First, the no-attack setting checks false positives. On 51 normal ASB tasks without prompt injection, TRIAD plus Tri-Guard recovers much of the standalone ReAct agent’s performance and performs better than more conservative defence baselines. That matters because a guardrail that constantly interrupts safe work will be bypassed, downgraded, or quietly hated by every team forced to use it. Security tools do not fail only by missing attacks. They also fail by becoming operationally unbearable.

Second, the all-template prompt-injection test broadens the ASB evaluation. In the main experiments, the authors average over two templates: Context Ignoring and Combined Attack. In the appendix, using Qwen3-32B under all five ASB prompt-injection templates for indirect attacks, ReAct averages 99.64% ASR and 5.29% TSR. TRIAD averages 6.91% ASR and 61.69% TSR. That supports robustness across the benchmark’s template variations, though not against every creative adversary with a keyboard and a long afternoon.

Third, the leakage analysis checks whether benchmark artifacts were memorised. The authors compare the training corpus against canonical ASB and AgentHarm text artifacts using n-gram overlap. Overlap drops to 0% at $n = 10$ and remains 0% at $n = 13$ across the evaluated groups. This does not prove perfect absence of distributional contamination, but it is a useful check against obvious verbatim train–test leakage.
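For readers who want to run a similar check on their own training data, here is a minimal n-gram overlap sketch. It assumes lowercase whitespace tokenisation and measures the share of benchmark n-grams that also occur in the training text. The paper's exact tokenisation and normalisation are not described here, so treat the details as placeholders, and the two example sentences as invented.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of word n-grams, using lowercase whitespace tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_rate(train_text: str, benchmark_text: str, n: int) -> float:
    """Fraction of benchmark n-grams that also appear in the training text."""
    bench = ngrams(benchmark_text, n)
    if not bench:
        return 0.0
    return len(bench & ngrams(train_text, n)) / len(bench)


train = "the agent should book a flight for the user"
bench = "the agent should cancel a flight for the user"
for n in (3, 5, 6):
    print(n, round(overlap_rate(train, bench, n), 3))
```

Short n-grams overlap even between unrelated sentences that share common phrasing, which is why the check is only informative at larger values of n.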

Fourth, ToolSafe is not reported on GPT-5.1 because its text-mode ReAct protocol does not fit that backbone’s native function-calling behaviour. This is an implementation detail, but an important one. Guardrails that depend on fragile text parsing age badly when agent interfaces move toward structured function calls. Enterprise infrastructure teams may recognise this pattern under its traditional name: brittle glue code pretending to be architecture.

Finally, latency is not free. TRIAD adds guardrail inference and sometimes additional agent inference for revised plans. On Qwen3-32B in ASB, average per-step latency rises from 1.88 seconds without defence to 6.98 seconds with TRIAD, an average overhead of 5.10 seconds. The maximum overhead reaches 24.17 seconds. The paper argues that most update loops resolve quickly: among update cases, 93.4% resolve after one additional revision, 4.1% after two, and 2.5% hit the limit. Useful, but still a cost line item, not a miracle.
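The arithmetic behind those figures is worth making explicit when budgeting. The sketch below uses the numbers quoted above. It assumes that cases hitting the limit consumed all three revisions, and the ten-step workflow is a hypothetical example.

```python
baseline_s = 1.88  # average per-step latency without defence (Qwen3-32B, ASB)
triad_s = 6.98     # average per-step latency with TRIAD

overhead_s = triad_s - baseline_s  # about 5.10 seconds per step
slowdown = triad_s / baseline_s    # about 3.7x

# Share of UPDATE cases by number of revisions.
# Assumption: "hit the limit" means all K = 3 revisions were used.
update_share = {1: 0.934, 2: 0.041, 3: 0.025}
expected_revisions = sum(k * p for k, p in update_share.items())

steps = 10  # hypothetical workflow length
print(f"overhead per step: {overhead_s:.2f} s ({slowdown:.2f}x)")
print(f"expected revisions per UPDATE case: {expected_revisions:.3f}")
print(f"{steps}-step workflow: {steps * baseline_s:.1f} s vs {steps * triad_s:.1f} s")
```

On those assumptions an UPDATE case costs a little over one extra revision on average, so the tail (the 24.17-second maximum) matters more for capacity planning than the mean does.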

Business implication: move safety before execution, and make it corrective

The business lesson is architectural. For tool-using agents, guardrails belong before irreversible actions, not merely around final text outputs. Once a tool call has executed, the system may have already sent data, changed state, triggered a workflow, or exposed credentials. At that point, “the model later refused” is not a control. It is documentation for the postmortem.

TRIAD suggests a practical operating pattern:

| Deployment layer | What TRIAD implies | Business consequence |
| --- | --- | --- |
| Planning-stage inspection | Check visible plan and proposed tool call before execution. | Prevents unsafe action before state changes occur. |
| Three-way routing | Separate safe, repairable, and harmful cases. | Avoids the crude trade-off between unsafe autonomy and blanket refusal. |
| Structured feedback | Give the agent specific reasons and corrective direction. | Turns guardrails into remediation signals, not just alarms. |
| Iterative update budget | Allow limited repair attempts before stopping. | Controls latency and avoids endless correction loops. |
| Safety–utility metrics | Evaluate task success alongside attack suppression. | Prevents teams from celebrating uselessly over-conservative systems. |
| Domain red-teaming | Test against the actual tools, policies, and data flows. | Converts benchmark promise into deployment evidence. |

Cognaptus’ inference is that agent governance should be designed less like content moderation and more like workflow control. Content moderation asks whether text is allowed. Workflow control asks whether the next state transition is authorised, aligned, recoverable, and worth executing. Agents live in the second world. We keep pretending they live in the first because classifiers are easier to procure.

This matters for any enterprise system where agents touch meaningful tools: email, CRM, ERP, ticketing, financial records, code repositories, analytics systems, HR workflows, customer data, legal documents, procurement portals, and internal knowledge bases. The more consequential the tool, the more valuable pre-execution plan remediation becomes.

What the paper directly shows, and what businesses should not overclaim

The paper directly shows that, in controlled ASB and AgentHarm environments, TRIAD plus Tri-Guard improves the safety–utility trade-off across four target agent backbones. It shows that binary-style or overly conservative guardrails can reduce attack success while severely damaging benign task completion. It shows that trajectory–feedback training shifts a base guardrail model toward UPDATE decisions for prompt-injection cases while preserving PROCEED behaviour on benign plans. It also shows that the framework can operate over visible planning context and structured tool calls without requiring access to hidden reasoning.

Cognaptus infers that this design pattern is commercially important for deployed agents. The specific trained model may or may not be the one an enterprise should use. The pattern is the durable part: inspect before execution, classify into proceed/update/refuse, use structured feedback to repair recoverable plans, and measure utility alongside safety.

What remains uncertain is production generalisation. The benchmarks use simulated tools and selected attack templates. Real environments have messier tool schemas, ambiguous permissions, partial user intent, legacy systems, adversarial documents, policy conflicts, and humans who paste spreadsheet exports into places no one should paste spreadsheet exports. The evaluation also depends on judge-model choices and modified benchmark settings, including stronger refusal and attack-success criteria. Those choices are defensible, but they are still part of the measurement apparatus.

Latency is also a deployment boundary. A 5-second average overhead per step may be fine for legal review, research workflows, internal audit, or high-risk operations. It may be unacceptable for low-margin customer support or real-time UX. The answer is not to ignore the cost. It is to route guardrail intensity by risk: lightweight checks for low-impact actions, stronger remediation loops for sensitive tools, and hard refusal for clearly prohibited objectives.
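Risk-based routing of this kind could look roughly like the following. It is an illustrative sketch, not something the paper specifies: the tool names, risk tiers, and mode names are hypothetical, and a real deployment would derive them from its own policy and permission model.

```python
from enum import Enum


class GuardMode(Enum):
    LIGHTWEIGHT = "lightweight_check"
    REMEDIATION = "remediation_loop"
    HARD_REFUSE = "hard_refuse"


# Hypothetical risk tiers for a handful of tools.
TOOL_RISK = {
    "search_knowledge_base": "low",
    "send_email": "high",
    "transfer_funds": "high",
}


def route(tool_name: str, objective_prohibited: bool) -> GuardMode:
    """Pick guardrail intensity from the objective and the tool's risk tier."""
    if objective_prohibited:
        return GuardMode.HARD_REFUSE
    # Unknown tools default to the high tier, so new integrations fail safe.
    risk = TOOL_RISK.get(tool_name, "high")
    return GuardMode.REMEDIATION if risk == "high" else GuardMode.LIGHTWEIGHT


print(route("search_knowledge_base", False).value)
print(route("send_email", False).value)
print(route("transfer_funds", True).value)
```

Defaulting unknown tools to the high tier spends latency rather than risk on anything the policy has not yet classified.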

The operating model: repairable autonomy

TRIAD points toward a more mature model of agent autonomy. The agent is not simply trusted or blocked. It is supervised at the point where intention becomes action.

A production version of this idea would likely include several layers: policy-aware tool permissions, planning-stage guardrails, structured corrective feedback, execution logging, post-action monitoring, escalation to humans for high-risk ambiguities, and continuous red-team evaluation. TRIAD covers one important layer: the pre-execution remediation loop. It does not replace access control, sandboxing, audit trails, or human approval. Anyone selling it as a total safety solution is either confused or in enterprise sales. These conditions are not mutually exclusive.

The value is narrower and more useful. TRIAD makes a guardrail capable of saying: “Do not do that action; here is why; here is the safe direction if the legitimate task can still be completed.” That is a meaningful step beyond “unsafe detected.” It turns safety feedback into operational steering.

Conclusion: the guardrail becomes a steering system

The paper’s contribution is best understood as a mechanism shift. TRIAD does not merely improve a guardrail classifier. It changes what the guardrail is for.

In old guardrail logic, the system asks whether to allow or block. In TRIAD’s logic, the system asks whether to execute, repair, or refuse. That middle option is where the business value lives. It preserves legitimate user work under contamination, avoids needless task abandonment, and intervenes before tools create irreversible consequences.

The result is not perfect safety. The paper is honest enough to show the trade-off: Tri-Guard sometimes accepts slightly higher attack success than a more conservative base guardrail, but it recovers far more benign task completion. That is not an embarrassment. That is the real deployment problem finally appearing in the metric table.

Enterprises deploying agents should take the hint. Safety cannot be bolted on as a final-output filter and then celebrated with a dashboard. For tool-using agents, the dangerous moment is the next action. Guardrails need to stand there, before execution, with enough intelligence to redirect the plan instead of simply slamming the door.

Stop signs are useful. Steering wheels are what let you finish the trip.

Cognaptus: Automate the Present, Incubate the Future.


  1. Yuhao Sun, Jiacheng Zhang, Shaanan Cohney, Zhexin Zhang, Feng Liu, and Xingliang Yuan, “From Risk Classification to Action Plan Remediation: A Guardrail Feedback Driven Framework for LLM Agents,” arXiv:2606.05805, 2026.