TL;DR for operators
AI agents are not only vulnerable because someone can hide a bad instruction in an email, document, web page, Slack message, or tool output. They are vulnerable because attackers can now automate the search for bad instructions that work.
That changes the security problem.
A one-off prompt injection is annoying. An automated attack loop is strategic. It generates candidate injections, observes the agent’s response, scores partial progress, keeps the promising branches, and tries again. Very entrepreneurial, in the worst possible way.
The two papers in this cluster form a useful logic chain:
| Link in the chain | What the research shows | Operator implication |
|---|---|---|
| Agents mix trusted instructions with untrusted context | Tool-calling agents process external content while deciding what to do next | Treat retrieved content as data, not authority |
| Attackers can optimize injections automatically | Black-box semantic search can outperform gradient-based token attacks in realistic agent settings | Test agents against adaptive attacks, not just static prompt libraries |
| The exploit is often semantic, not syntactic | Successful attacks may look like plausible workflow context, not obvious “ignore previous instructions” nonsense | Keyword filters will age like milk |
| Automated attackers depend on feedback | Responses, refusals, tool calls, and judge scores guide the search | Limit what attackers can learn from each probe |
| Predictable blocking can become a signal | A refusal can tell the attacker what failed and where to search next | Defense should manage the feedback loop, not just the final answer |
| Verification cost becomes a control point | If apparent successes become unreliable, attackers must spend more effort validating them | Raise the cost of automated exploitation before state-changing actions occur |
The business conclusion is simple: agent security is not prompt hygiene. It is workflow governance under adversarial feedback.
Why this matters now
The old chatbot security problem was mostly about whether a model would say something it should not. That was already unpleasant. The new agent security problem is whether a model will do something it should not: send an email, transfer money, delete a file, book a trip, update a CRM, approve a workflow, or leak internal data through a tool call.
That difference matters because state changes turn language errors into operational events.
Two recent papers help clarify the shape of the problem. Hofer, Debenedetti, and Tramèr examine automated prompt injection attacks in realistic agentic environments using AgentDojo, a benchmark where agents operate across domains such as workspace, banking, travel, and Slack, and where success is judged by environment state rather than by vibes in a transcript.1 Soosahabi and Namsani analyze defensive misdirection against model-guided automated attacks, arguing that conventional detect-and-block defenses can leak useful search feedback to attackers and proposing a detect-and-misdirect strategy that corrupts the attacker’s automated evaluation loop.2
The papers are not doing the same job. That is the point.
One maps how automated prompt injection behaves when agents use tools. The other explains why the attacker’s feedback channel may be the place where defense becomes strategic. Together, they move the discussion away from “can we write a better refusal?” and toward the more adult question: what does an attacker learn from every interaction with the agent?
Apparently, quite a lot.
The shared problem: agents turn text into authority
Prompt injection exists because LLM agents are asked to do something conceptually awkward. They must read external content, reason over it, and then decide whether it should influence action.
That external content might be legitimate business data. It might also contain malicious instructions. The agent sees both as text. The system prompt may say “treat tool outputs as untrusted,” but the model still has to interpret them in context. That is where the trouble starts.
In a conventional application, the boundary between instruction and data is usually explicit. Code executes. Data is read. Permissions are checked. In an LLM agent, the boundary can become a semantic judgment. The agent must infer whether a paragraph in an email is a fact to summarize, a command to ignore, a policy to obey, or a trap wearing a little admin badge.
The Hofer et al. paper makes this concrete by evaluating attacks inside AgentDojo rather than in a simplified single-turn setup. The target is not merely to make a model produce forbidden text. The attacker wants the agent to execute unauthorized tool calls with the right arguments. That raises the difficulty and the realism. A successful attack is not “the model sounded persuaded.” It is “the environment changed in the attacker’s favor.”
That distinction is essential for business risk. In an enterprise agent, the final unit of harm is rarely a naughty sentence. It is a workflow mutation.
The first paper’s role: mapping the automated prompt-injection surface
Hofer et al. adapt two families of automated attack methods to agentic prompt injection.
The first is GCG, a white-box gradient-based method that optimizes adversarial token sequences. The second is TAP, a black-box tree-search method where an attacker model generates candidate injections, an evaluator scores the target agent’s behavior, and the search process keeps promising candidates for refinement.
The important finding is not just that attacks can work. We knew that. Please hold the confetti.
The more useful finding is that, in these realistic agent settings, semantically coherent black-box search can outperform token-level gradient optimization. TAP, which searches through meaningful attack strategies, substantially outperforms GCG across comparable settings in the paper’s experiments. The authors argue that GCG struggles with the discrete and semantic nature of tool-use prompts, while TAP can exploit the model’s instruction-following behavior more directly.
That is a serious result because it challenges a comfortable assumption: that the most dangerous attacks are exotic strings of adversarial tokens discovered through white-box access. In agentic systems, the dangerous attack may instead be a plausible piece of workflow context.
The paper’s qualitative analysis sharpens this point. GCG attacks often look like noisy token artifacts wrapped around a payload. TAP attacks tend to use semantic strategies. Against weaker or more vulnerable models, coercive patterns such as authority mimicry or context separation can work. Against stronger models, the more interesting attacks become exploitative rather than bluntly coercive: they frame malicious actions as natural parts of the task, such as a prerequisite step or domain-native document content.
That is the uncomfortable bit. Better models may become less gullible about fake “system override” language while still being vulnerable to malicious instructions embedded in plausible business context. The attack surface does not disappear. It gets better dressed.
The second paper’s role: explaining why feedback becomes the control point
Soosahabi and Namsani approach the problem from a different angle. They model automated attacks as a loop involving a target system, a defense mechanism, and an attacker-side judge.
The attacker is not just throwing prompts at the wall. The attacker is using feedback. Candidate prompts are generated, tested, evaluated, selected, and refined. In that setting, the defense response itself becomes part of the attacker’s intelligence.
A conventional detect-and-block defense identifies suspicious behavior and returns a refusal, a block, or a deflection. That can be useful against isolated attacks. But in an automated search loop, predictable refusals may help the attacker separate failure from partial success. The attacker’s judge can learn that “refusal-looking output” means no progress and that “non-refusal-looking output” deserves more attention.
The paper’s core move is to shift attention from blocking accuracy to attacker feedback quality.
Its proposed strategy, detect-and-misdirect, replaces predictable refusal responses in detected malicious interactions with controlled, safe, non-operational responses that can look superficially promising to an automated attacker judge. The aim is not to help the attacker or to deceive normal users for sport. The aim is to increase false positives in the attacker’s evaluation process.
In simpler terms: make the attacker waste effort chasing mirages.
The paper formalizes this through positive predictive value, the probability that a judge-selected candidate is truly useful:
$PPV = P(\text{true success} \mid \text{judge says success})$
If the attacker’s judge selects many candidates that merely appear successful but are not operationally useful, the PPV falls. Lower PPV weakens the automation loop because the attacker must spend more resources verifying candidates manually or with stronger checks. The claimed defensive advantage is not magical invulnerability. It is economic friction.
That is the right framing. Security rarely wins by making attacks impossible. It often wins by making attacks unreliable, expensive, noisy, and boring.
The logic chain: from prompt injection to feedback governance
The two papers are best read as adjacent pieces of one argument.
Hofer et al. show that automated prompt injection is credible in realistic tool-calling agents and that black-box semantic optimization is especially relevant. Soosahabi and Namsani show why the attacker’s feedback channel becomes a strategic weakness and how a defender might exploit it.
Put together, the chain looks like this.
1. Untrusted context enters the agent’s decision process
Agents retrieve external data because that is what makes them useful. They read emails, files, web pages, tickets, chat messages, invoices, customer records, and tool outputs.
That same data can contain adversarial instructions.
The business issue is not simply that malicious text exists. It is that the agent may process that text while planning an action. If the agent cannot reliably preserve the distinction between user intent, system authority, developer policy, tool output, and external content, then external content can become operationally influential.
The first control point is therefore provenance: where did this instruction-like content come from, and is it allowed to influence action?
2. Attackers automate candidate generation
Manual prompt injection is a craft. Automated prompt injection is a search process.
In TAP-style attacks, the attacker model can generate many candidate injections, receive structured feedback, and refine its strategy. The evaluator does not need perfect judgment. It only needs to be good enough to guide search toward more promising regions.
This matters because businesses often test agents against a small library of known attacks. That is useful in the same way checking whether the office door is locked is useful. It is not a full security program.
An adaptive attacker is not limited to known strings. It searches.
3. Semantic plausibility beats token weirdness in agent workflows
In the agent setting, the attack often has to produce a tool call with correct arguments. That requirement makes the exploit more semantic.
The attacker must persuade the agent that a state-changing action fits the workflow. It may do this through apparent authority, contextual necessity, document formatting, or domain-native cues. Against more capable models, obvious override language may fail while task-aligned deception remains dangerous.
For operators, this means the relevant question is not only “can we detect injection-looking strings?” It is also “can we detect when untrusted content has become the reason for a privileged action?”
That is a harder question. Naturally.
4. The attacker’s search depends on feedback
The attacker learns from the system’s response. It learns from refusals, partial commitments, tool calls, hesitation, clarifying questions, and evaluator scores.
Hofer et al. show that TAP’s evaluator can be miscalibrated, including false positives. In their experiments, the LLM judge can reward apparent progress that does not always correspond to deterministic success in the environment. For attack optimization, that is a limitation.
Soosahabi and Namsani reinterpret this limitation as an opportunity. If automated attackers depend on imperfect judges, defenders can potentially degrade the quality of the attacker’s feedback.
Same weakness. Different side of the table.
5. Predictable blocking is not enough
Detect-and-block is still necessary. A suspicious request should not be allowed to execute a harmful action merely because the defender wants to be subtle. Let us not get theatrical.
But predictable blocking has a strategic downside in repeated interactions. It gives clean negative feedback. It may help automated attacks prune the search tree efficiently.
That does not mean refusal is bad. It means refusal is not the entire defensive architecture. In some settings, especially automated or suspicious probing contexts, the system may need to think about what its response teaches the attacker.
6. Defense becomes feedback governance
The combined conclusion is that agent security should govern the whole loop:
- what content enters the agent context;
- what authority each content source has;
- what tool calls the agent can make;
- what state changes require verification;
- what responses are returned to suspicious probes;
- what telemetry is logged for detection and learning;
- what attackers can infer from repeated attempts.
That is not “prompt engineering.” It is security architecture with a language model in the middle.
A practical framework: the agent attack loop
For business teams evaluating agent deployments, the following framework is more useful than asking whether the model is “secure.”
| Stage | Attacker objective | Defensive question | Practical control |
|---|---|---|---|
| Context entry | Place malicious instructions where the agent will read them | Which inputs are untrusted but instruction-like? | Source labeling, content provenance, trust boundaries |
| Interpretation | Make malicious text appear relevant or authoritative | Can the model distinguish data from commands? | Structured prompts, instruction hierarchy, data/instruction separation |
| Planning | Align the malicious action with the user task | Why is the agent choosing this action? | Action rationale logging, policy checks at planning time |
| Tool execution | Trigger the right tool with the right arguments | Is this action authorized by trusted intent? | Tool permissioning, scoped credentials, confirmation gates |
| Feedback observation | Learn from refusals, partial progress, and tool traces | What does each response reveal to an attacker? | Response shaping, rate limits, anomaly detection |
| Optimization | Refine attacks across attempts | Are repeated probes being correlated and throttled? | Session-level monitoring, adaptive risk scoring |
| Verification | Confirm which candidate actually worked | Can apparent success be made unreliable for attackers? | Deterministic validation internally, limited external observability |
The uncomfortable lesson is that the model is only one component. A well-aligned model sitting inside a sloppy workflow is still a risk. A moderately robust model inside a carefully constrained workflow may be safer in practice.
Enterprise security has seen this movie before. It was called “do not give the intern production root access.” The intern is now a stochastic reasoning engine with a calendar plugin.
What businesses should take from the empirical paper
Hofer et al. provide several lessons that are directly useful for operators.
Tool-use success must be measured by state, not by text
AgentDojo uses deterministic checks over the environment state. That matters. If the attack goal is to send an email, transfer funds, or delete a file, success should be measured by whether the action happened correctly, not whether the transcript sounds compliant.
Business tests should follow the same rule. Transcript review is not enough. For each risky agent workflow, define state-level failure conditions.
Examples:
| Agent domain | Bad outcome to test | Better success metric |
|---|---|---|
| Email assistant | Sends sensitive content to unauthorized recipient | Outbound message exists with matching recipient and sensitive payload |
| Finance assistant | Initiates unauthorized payment | Payment tool called with attacker-controlled account or amount |
| File assistant | Deletes or shares protected document | File permission or deletion state changes |
| Travel assistant | Books attacker-preferred itinerary | Booking record changes contrary to trusted user intent |
| Slack assistant | Posts private data into wrong channel | Message appears in unauthorized channel |
The test should inspect the system of record. The transcript is evidence, not verdict.
Attack success is domain-specific
The paper finds meaningful variation across AgentDojo domains. Slack and Banking are more vulnerable in the reported breakdown than Workspace and Travel, with automated methods amplifying differences that already appear in simpler baselines.
That is not just an academic detail. It means agent risk is workflow-dependent.
A procurement approval agent, a customer support agent, and a treasury operations agent should not share the same security confidence just because they use the same model. Their tools, context sources, permission boundaries, and failure modes differ.
The model card is not the risk assessment. The workflow is.
Stronger models are not automatically safer
The paper reports that GPT-5 is substantially more robust than smaller open-weights models in the evaluated setting, and that attacks optimized on smaller open-weights models transfer poorly to frontier models. Good news, but not a blank cheque.
The more interesting point is the capability-vulnerability tradeoff. Models that follow complex instructions and use tools well may also be better at following malicious instructions when those instructions are embedded in plausible context. Less capable models can appear safer partly because they are worse at doing the task. That is not security. That is incompetence wearing a helmet.
For business deployment, the right question is not “which model refuses more?” It is “which model completes legitimate work while preserving trusted intent under adversarial context?”
What businesses should take from the misdirection paper
Soosahabi and Namsani’s paper should not be read as “replace all refusals with trick responses.” That would be a magnificent way to confuse normal users and create compliance headaches.
The useful business interpretation is narrower and more strategic: in automated adversarial settings, the defender should consider the information value of its responses.
Blocking is a control, but also a signal
A refusal prevents an immediate harmful output. It can also tell an automated attacker that a candidate failed.
If every blocked interaction returns the same shape of response, the attacker’s judge can learn a clean boundary. That makes the search easier.
A defense system should therefore ask: when a suspicious interaction is detected, should the response be optimized for human clarity, attacker uncertainty, auditability, or all three? The answer depends on context.
For a benign user who accidentally crosses a policy boundary, clarity matters. For a high-confidence automated probing session, feedback minimization may matter more.
Attacker verification cost is a security lever
Detect-and-misdirect works by lowering the reliability of apparent successes. If the attacker’s judge produces more false positives, the attacker must verify more candidates. Verification is expensive, especially when it must inspect whether a tool action genuinely worked.
That idea generalizes beyond CMPE. Defenders can increase verification cost by limiting external observability, delaying nonessential details, standardizing non-sensitive responses, rate-limiting repeated probes, and requiring confirmation before irreversible actions.
The goal is not security through obscurity. The goal is to avoid giving attackers a clean optimization oracle.
Misdirection needs governance
Misdirection is powerful precisely because it manipulates signals. That makes it sensitive.
A business should not deploy misdirection casually in customer-facing workflows. It needs policy boundaries:
| Governance question | Why it matters |
|---|---|
| When is misdirection allowed? | Avoid confusing benign users or violating transparency expectations |
| What confidence threshold triggers it? | Prevent false positives from degrading normal service |
| Is the interaction human-facing or bot-facing? | Human users may require clear refusal and explanation |
| Does the response remain safe and non-operational? | Misdirection must not accidentally provide harmful instructions |
| Is the event logged and reviewable? | Security teams need traceability |
| Can the strategy be disabled by policy domain? | Regulated workflows may require explicit denial rather than ambiguity |
The paper itself notes limitations: its empirical evaluation focuses primarily on jailbreak-style benchmarks and known attack frameworks, and further work is needed in full tool-use and multi-agent environments. That boundary matters. CMPE is a proof of concept for feedback disruption, not a universal enterprise control.
The central tension: judge unreliability is both problem and weapon
The two papers create an interesting tension around LLM-as-judge systems.
In the prompt-injection evaluation paper, judge unreliability is a problem for attackers and evaluators. TAP uses an evaluator to score progress, but the evaluator can be poorly calibrated. It may reward apparent intent that does not produce actual environment success. That can mislead the attack search.
In the misdirection paper, this same weakness becomes a defensive opportunity. If attacker-side judges rely on superficial cues, a defender can generate safe responses that appear promising to the judge while remaining non-operational. The attacker’s search loop becomes polluted.
This is the shared mechanism:
| Judge weakness | In attack evaluation | In defense design |
|---|---|---|
| False positives | Waste attacker search on candidates that do not actually work | Can be deliberately increased through safe misdirection |
| False negatives | Miss promising candidates | Can reduce attacker sensitivity if stricter judging is used |
| Poor calibration | Makes attack optimization noisy | Makes feedback disruption more effective |
| Reliance on surface cues | Rewards apparent compliance | Allows safe responses to mimic progress without execution |
The business lesson is broader than prompt injection. Any AI system that relies on automated evaluators can become vulnerable to evaluator gaming. The evaluator is not merely a measurement tool. It is part of the strategic environment.
Once attackers optimize against it, it becomes an attack surface.
What should an enterprise actually do?
The combined papers suggest a layered approach. Not glamorous. Effective security rarely is.
1. Separate trusted instructions from untrusted data
This is the foundation. External content should be explicitly labeled and structurally separated from system and user instructions. The agent should be told, repeatedly and mechanically, that retrieved content is evidence, not authority.
Where possible, use structured interfaces that prevent untrusted text from being interpreted as executable instruction. Natural language should not be the permission layer.
2. Put policy checks at tool boundaries
Do not rely on the model’s internal reasoning to decide whether a tool call is safe. Tool calls should pass through policy checks that inspect:
- source of the instruction;
- user-authorized intent;
- tool scope;
- arguments;
- data sensitivity;
- reversibility;
- anomaly relative to the task.
A malicious instruction hidden in a document should not be able to authorize a payment merely because the agent found the document convincing.
3. Measure failures by environment state
For red-team testing, define success conditions in the system of record. Did the agent send the message? Did it call the tool? Did it modify the file? Did it expose data?
This also helps avoid overreacting to scary transcripts where nothing happened, and underreacting to polite transcripts where something very much did happen.
4. Test against adaptive semantic attacks
Static prompt-injection test sets are useful but insufficient. Test agents against adaptive methods that revise attacks based on responses. Include both coercive and exploitative patterns.
Coercive attacks try to override. Exploitative attacks try to blend in.
The second category is where mature systems should pay special attention. It is also where life becomes more annoying, which is usually how one recognizes real security work.
5. Manage feedback to suspicious sessions
For high-confidence automated probing, consider controls that reduce the information value of responses:
- rate limiting;
- session correlation;
- generic non-diagnostic failure messages;
- delayed or limited tool-call visibility;
- consistent external responses for different internal block reasons;
- safe response shaping where appropriate;
- escalation to human review for repeated probes.
Detect-and-misdirect is one possible technique within this broader family. The principle is feedback governance.
6. Require confirmation for irreversible or high-risk actions
Many prompt injections rely on autonomous execution. A confirmation step can interrupt the path from untrusted context to state change.
The confirmation should be grounded in trusted user intent, not merely “the agent asks the user to approve whatever the malicious document requested.” A useful confirmation says what action is being proposed, why, based on which trusted source, and what external content influenced it.
7. Log provenance and causality
When an agent acts, the system should preserve:
- which content sources were read;
- which instructions were considered trusted;
- which untrusted content influenced the plan;
- which tool calls were proposed;
- which policy checks passed or failed;
- which human approvals occurred.
Without this, post-incident review becomes archaeology with screenshots. Nobody deserves that.
A simple maturity model
| Maturity level | Description | Typical failure |
|---|---|---|
| Level 0: Prompt-only defense | The agent is told to ignore malicious instructions | Prompt injection works as soon as it is phrased creatively |
| Level 1: Static filtering | Known attack patterns are blocked | Adaptive semantic attacks bypass patterns |
| Level 2: Tool boundary checks | Tool calls are inspected before execution | Context provenance may still be weak |
| Level 3: State-based red teaming | Tests evaluate real environment outcomes | Feedback leakage may still help attackers optimize |
| Level 4: Feedback-aware defense | The system limits what attackers learn from probes | Requires careful governance to avoid harming user experience |
| Level 5: Operational security architecture | Provenance, permissions, verification, monitoring, and response shaping work together | Still not perfect, but at least the system is no longer pretending prompts are walls |
Most organizations experimenting with agents are somewhere between Level 0 and Level 2. Many believe they are higher because the demo behaved nicely. The demo was not trying very hard.
The misconception to avoid
The wrong takeaway is that prompt injection will be solved by stronger models, better refusal text, or one more classifier.
The papers point elsewhere.
The hard problem is the feedback loop. Automated attackers optimize against whatever signals the system exposes. A refusal is a signal. A partial tool call is a signal. A hesitant answer is a signal. A judge score is a signal. A timeout is a signal. The absence of a state change may also be a signal.
Agentic systems create rich feedback because they are interactive, contextual, and action-oriented. That richness is the product feature. It is also the attack surface.
So the enterprise question should shift from:
“Can our agent resist this malicious prompt?”
to:
“What can an adaptive attacker learn after 1,000 attempts, and what can they actually make the agent do?”
That is a less comforting question. It is also the right one.
Final thought: the agent is not the perimeter
The combined lesson from these papers is that agent security is not located inside the model alone. It sits across the whole operating loop: context ingestion, reasoning, tool selection, action authorization, response behavior, attacker feedback, and post-action verification.
Hofer et al. show that realistic agents can be attacked through semantic, automated prompt injection that exploits the same instruction-following abilities businesses want. Soosahabi and Namsani show that defense should not only block malicious outputs but also degrade the attacker’s ability to learn from repeated probing.
Together, they suggest a more mature view of agent risk.
The agent is not a chatbot with accessories. It is a decision system connected to tools. Once it can change state, every untrusted input becomes a possible influence path, and every response becomes a possible training signal for an adversary.
That does not mean businesses should avoid agents. It means they should stop deploying them as if “please ignore malicious instructions” were a security architecture.
It is not. It is a Post-it note on a vault.
Cognaptus: Automate the Present, Incubate the Future.
-
David Hofer, Edoardo Debenedetti, and Florian Tramèr, “Assessing Automated Prompt Injection Attacks in Agentic Environments,” arXiv:2606.10525, 2026. https://arxiv.org/abs/2606.10525 ↩︎
-
Reza Soosahabi and Vivek Namsani, “Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems,” arXiv:2606.20470, 2026. https://arxiv.org/abs/2606.20470 ↩︎