Feedback Is the New Attack Surface

TL;DR for operators

AI agents are not only vulnerable because someone can hide a bad instruction in an email, document, web page, Slack message, or tool output. They are vulnerable because attackers can now automate the search for bad instructions that work.

That changes the security problem.

A one-off prompt injection is annoying. An automated attack loop is strategic. It generates candidate injections, observes the agent’s response, scores partial progress, keeps the promising branches, and tries again. Very entrepreneurial, in the worst possible way.

The two papers in this cluster form a useful logic chain:

Link in the chain	What the research shows	Operator implication
Agents mix trusted instructions with untrusted context	Tool-calling agents process external content while deciding what to do next	Treat retrieved content as data, not authority
Attackers can optimize injections automatically	Black-box semantic search can outperform gradient-based token attacks in realistic agent settings	Test agents against adaptive attacks, not just static prompt libraries
The exploit is often semantic, not syntactic	Successful attacks may look like plausible workflow context, not obvious “ignore previous instructions” nonsense	Keyword filters will age like milk
Automated attackers depend on feedback	Responses, refusals, tool calls, and judge scores guide the search	Limit what attackers can learn from each probe
Predictable blocking can become a signal	A refusal can tell the attacker what failed and where to search next	Defense should manage the feedback loop, not just the final answer
Verification cost becomes a control point	If apparent successes become unreliable, attackers must spend more effort validating them	Raise the cost of automated exploitation before state-changing actions occur

The business conclusion is simple: agent security is not prompt hygiene. It is workflow governance under adversarial feedback.

Why this matters now

The old chatbot security problem was mostly about whether a model would say something it should not. That was already unpleasant. The new agent security problem is whether a model will do something it should not: send an email, transfer money, delete a file, book a trip, update a CRM, approve a workflow, or leak internal data through a tool call.

That difference matters because state changes turn language errors into operational events.

Two recent papers help clarify the shape of the problem. Hofer, Debenedetti, and Tramèr examine automated prompt injection attacks in realistic agentic environments using AgentDojo, a benchmark where agents operate across domains such as workspace, banking, travel, and Slack, and where success is judged by environment state rather than by vibes in a transcript.¹ Soosahabi and Namsani analyze defensive misdirection against model-guided automated attacks, arguing that conventional detect-and-block defenses can leak useful search feedback to attackers and proposing a detect-and-misdirect strategy that corrupts the attacker’s automated evaluation loop.²

The papers are not doing the same job. That is the point.

One maps how automated prompt injection behaves when agents use tools. The other explains why the attacker’s feedback channel may be the place where defense becomes strategic. Together, they move the discussion away from “can we write a better refusal?” and toward the more adult question: what does an attacker learn from every interaction with the agent?

Apparently, quite a lot.

The shared problem: agents turn text into authority

Prompt injection exists because LLM agents are asked to do something conceptually awkward. They must read external content, reason over it, and then decide whether it should influence action.

That external content might be legitimate business data. It might also contain malicious instructions. The agent sees both as text. The system prompt may say “treat tool outputs as untrusted,” but the model still has to interpret them in context. That is where the trouble starts.

In a conventional application, the boundary between instruction and data is usually explicit. Code executes. Data is read. Permissions are checked. In an LLM agent, the boundary can become a semantic judgment. The agent must infer whether a paragraph in an email is a fact to summarize, a command to ignore, a policy to obey, or a trap wearing a little admin badge.

The Hofer et al. paper makes this concrete by evaluating attacks inside AgentDojo rather than in a simplified single-turn setup. The target is not merely to make a model produce forbidden text. The attacker wants the agent to execute unauthorized tool calls with the right arguments. That raises the difficulty and the realism. A successful attack is not “the model sounded persuaded.” It is “the environment changed in the attacker’s favor.”

That distinction is essential for business risk. In an enterprise agent, the final unit of harm is rarely a naughty sentence. It is a workflow mutation.

The first paper’s role: mapping the automated prompt-injection surface

Hofer et al. adapt two families of automated attack methods to agentic prompt injection.

The first is GCG, a white-box gradient-based method that optimizes adversarial token sequences. The second is TAP, a black-box tree-search method where an attacker model generates candidate injections, an evaluator scores the target agent’s behavior, and the search process keeps promising candidates for refinement.

The important finding is not just that attacks can work. We knew that. Please hold the confetti.

The more useful finding is that, in these realistic agent settings, semantically coherent black-box search can outperform token-level gradient optimization. TAP, which searches through meaningful attack strategies, substantially outperforms GCG across comparable settings in the paper’s experiments. The authors argue that GCG struggles with the discrete and semantic nature of tool-use prompts, while TAP can exploit the model’s instruction-following behavior more directly.

That is a serious result because it challenges a comfortable assumption: that the most dangerous attacks are exotic strings of adversarial tokens discovered through white-box access. In agentic systems, the dangerous attack may instead be a plausible piece of workflow context.

The paper’s qualitative analysis sharpens this point. GCG attacks often look like noisy token artifacts wrapped around a payload. TAP attacks tend to use semantic strategies. Against weaker or more vulnerable models, coercive patterns such as authority mimicry or context separation can work. Against stronger models, the more interesting attacks become exploitative rather than bluntly coercive: they frame malicious actions as natural parts of the task, such as a prerequisite step or domain-native document content.

That is the uncomfortable bit. Better models may become less gullible about fake “system override” language while still being vulnerable to malicious instructions embedded in plausible business context. The attack surface does not disappear. It gets better dressed.

The second paper’s role: explaining why feedback becomes the control point

Soosahabi and Namsani approach the problem from a different angle. They model automated attacks as a loop involving a target system, a defense mechanism, and an attacker-side judge.

The attacker is not just throwing prompts at the wall. The attacker is using feedback. Candidate prompts are generated, tested, evaluated, selected, and refined. In that setting, the defense response itself becomes part of the attacker’s intelligence.

A conventional detect-and-block defense identifies suspicious behavior and returns a refusal, a block, or a deflection. That can be useful against isolated attacks. But in an automated search loop, predictable refusals may help the attacker separate failure from partial success. The attacker’s judge can learn that “refusal-looking output” means no progress and that “non-refusal-looking output” deserves more attention.

The paper’s core move is to shift attention from blocking accuracy to attacker feedback quality.

Its proposed strategy, detect-and-misdirect, replaces predictable refusal responses in detected malicious interactions with controlled, safe, non-operational responses that can look superficially promising to an automated attacker judge. The aim is not to help the attacker or to deceive normal users for sport. The aim is to increase false positives in the attacker’s evaluation process.

In simpler terms: make the attacker waste effort chasing mirages.

The paper formalizes this through positive predictive value, the probability that a judge-selected candidate is truly useful:

$PPV = P(\text{true success} \mid \text{judge says success})$

If the attacker’s judge selects many candidates that merely appear successful but are not operationally useful, the PPV falls. Lower PPV weakens the automation loop because the attacker must spend more resources verifying candidates manually or with stronger checks. The claimed defensive advantage is not magical invulnerability. It is economic friction.

That is the right framing. Security rarely wins by making attacks impossible. It often wins by making attacks unreliable, expensive, noisy, and boring.

The logic chain: from prompt injection to feedback governance

The two papers are best read as adjacent pieces of one argument.

Hofer et al. show that automated prompt injection is credible in realistic tool-calling agents and that black-box semantic optimization is especially relevant. Soosahabi and Namsani show why the attacker’s feedback channel becomes a strategic weakness and how a defender might exploit it.

Put together, the chain looks like this.

1. Untrusted context enters the agent’s decision process

Agents retrieve external data because that is what makes them useful. They read emails, files, web pages, tickets, chat messages, invoices, customer records, and tool outputs.

That same data can contain adversarial instructions.

The business issue is not simply that malicious text exists. It is that the agent may process that text while planning an action. If the agent cannot reliably preserve the distinction between user intent, system authority, developer policy, tool output, and external content, then external content can become operationally influential.

The first control point is therefore provenance: where did this instruction-like content come from, and is it allowed to influence action?

2. Attackers automate candidate generation

Manual prompt injection is a craft. Automated prompt injection is a search process.

In TAP-style attacks, the attacker model can generate many candidate injections, receive structured feedback, and refine its strategy. The evaluator does not need perfect judgment. It only needs to be good enough to guide search toward more promising regions.

This matters because businesses often test agents against a small library of known attacks. That is useful in the same way checking whether the office door is locked is useful. It is not a full security program.

An adaptive attacker is not limited to known strings. It searches.

3. Semantic plausibility beats token weirdness in agent workflows

In the agent setting, the attack often has to produce a tool call with correct arguments. That requirement makes the exploit more semantic.

The attacker must persuade the agent that a state-changing action fits the workflow. It may do this through apparent authority, contextual necessity, document formatting, or domain-native cues. Against more capable models, obvious override language may fail while task-aligned deception remains dangerous.

For operators, this means the relevant question is not only “can we detect injection-looking strings?” It is also “can we detect when untrusted content has become the reason for a privileged action?”

That is a harder question. Naturally.

4. The attacker’s search depends on feedback

The attacker learns from the system’s response. It learns from refusals, partial commitments, tool calls, hesitation, clarifying questions, and evaluator scores.

Hofer et al. show that TAP’s evaluator can be miscalibrated, including false positives. In their experiments, the LLM judge can reward apparent progress that does not always correspond to deterministic success in the environment. For attack optimization, that is a limitation.

Soosahabi and Namsani reinterpret this limitation as an opportunity. If automated attackers depend on imperfect judges, defenders can potentially degrade the quality of the attacker’s feedback.

Same weakness. Different side of the table.

5. Predictable blocking is not enough

Detect-and-block is still necessary. A suspicious request should not be allowed to execute a harmful action merely because the defender wants to be subtle. Let us not get theatrical.

But predictable blocking has a strategic downside in repeated interactions. It gives clean negative feedback. It may help automated attacks prune the search tree efficiently.

That does not mean refusal is bad. It means refusal is not the entire defensive architecture. In some settings, especially automated or suspicious probing contexts, the system may need to think about what its response teaches the attacker.

6. Defense becomes feedback governance

The combined conclusion is that agent security should govern the whole loop:

what content enters the agent context;
what authority each content source has;
what tool calls the agent can make;
what state changes require verification;
what responses are returned to suspicious probes;
what telemetry is logged for detection and learning;
what attackers can infer from repeated attempts.

That is not “prompt engineering.” It is security architecture with a language model in the middle.

A practical framework: the agent attack loop

For business teams evaluating agent deployments, the following framework is more useful than asking whether the model is “secure.”

Stage	Attacker objective	Defensive question	Practical control
Context entry	Place malicious instructions where the agent will read them	Which inputs are untrusted but instruction-like?	Source labeling, content provenance, trust boundaries
Interpretation	Make malicious text appear relevant or authoritative	Can the model distinguish data from commands?	Structured prompts, instruction hierarchy, data/instruction separation
Planning	Align the malicious action with the user task	Why is the agent choosing this action?	Action rationale logging, policy checks at planning time
Tool execution	Trigger the right tool with the right arguments	Is this action authorized by trusted intent?	Tool permissioning, scoped credentials, confirmation gates
Feedback observation	Learn from refusals, partial progress, and tool traces	What does each response reveal to an attacker?	Response shaping, rate limits, anomaly detection
Optimization	Refine attacks across attempts	Are repeated probes being correlated and throttled?	Session-level monitoring, adaptive risk scoring
Verification	Confirm which candidate actually worked	Can apparent success be made unreliable for attackers?	Deterministic validation internally, limited external observability

The uncomfortable lesson is that the model is only one component. A well-aligned model sitting inside a sloppy workflow is still a risk. A moderately robust model inside a carefully constrained workflow may be safer in practice.

Enterprise security has seen this movie before. It was called “do not give the intern production root access.” The intern is now a stochastic reasoning engine with a calendar plugin.

What businesses should take from the empirical paper

Hofer et al. provide several lessons that are directly useful for operators.

Tool-use success must be measured by state, not by text

AgentDojo uses deterministic checks over the environment state. That matters. If the attack goal is to send an email, transfer funds, or delete a file, success should be measured by whether the action happened correctly, not whether the transcript sounds compliant.

Business tests should follow the same rule. Transcript review is not enough. For each risky agent workflow, define state-level failure conditions.

Examples:

Agent domain	Bad outcome to test	Better success metric
Email assistant	Sends sensitive content to unauthorized recipient	Outbound message exists with matching recipient and sensitive payload
Finance assistant	Initiates unauthorized payment	Payment tool called with attacker-controlled account or amount
File assistant	Deletes or shares protected document	File permission or deletion state changes
Travel assistant	Books attacker-preferred itinerary	Booking record changes contrary to trusted user intent
Slack assistant	Posts private data into wrong channel	Message appears in unauthorized channel

The test should inspect the system of record. The transcript is evidence, not verdict.

Attack success is domain-specific

The paper finds meaningful variation across AgentDojo domains. Slack and Banking are more vulnerable in the reported breakdown than Workspace and Travel, with automated methods amplifying differences that already appear in simpler baselines.

That is not just an academic detail. It means agent risk is workflow-dependent.

A procurement approval agent, a customer support agent, and a treasury operations agent should not share the same security confidence just because they use the same model. Their tools, context sources, permission boundaries, and failure modes differ.

The model card is not the risk assessment. The workflow is.

Stronger models are not automatically safer

The paper reports that GPT-5 is substantially more robust than smaller open-weights models in the evaluated setting, and that attacks optimized on smaller open-weights models transfer poorly to frontier models. Good news, but not a blank cheque.

The more interesting point is the capability-vulnerability tradeoff. Models that follow complex instructions and use tools well may also be better at following malicious instructions when those instructions are embedded in plausible context. Less capable models can appear safer partly because they are worse at doing the task. That is not security. That is incompetence wearing a helmet.

For business deployment, the right question is not “which model refuses more?” It is “which model completes legitimate work while preserving trusted intent under adversarial context?”

What businesses should take from the misdirection paper

Soosahabi and Namsani’s paper should not be read as “replace all refusals with trick responses.” That would be a magnificent way to confuse normal users and create compliance headaches.

The useful business interpretation is narrower and more strategic: in automated adversarial settings, the defender should consider the information value of its responses.

Blocking is a control, but also a signal

A refusal prevents an immediate harmful output. It can also tell an automated attacker that a candidate failed.

If every blocked interaction returns the same shape of response, the attacker’s judge can learn a clean boundary. That makes the search easier.

A defense system should therefore ask: when a suspicious interaction is detected, should the response be optimized for human clarity, attacker uncertainty, auditability, or all three? The answer depends on context.

For a benign user who accidentally crosses a policy boundary, clarity matters. For a high-confidence automated probing session, feedback minimization may matter more.

Attacker verification cost is a security lever

Detect-and-misdirect works by lowering the reliability of apparent successes. If the attacker’s judge produces more false positives, the attacker must verify more candidates. Verification is expensive, especially when it must inspect whether a tool action genuinely worked.

That idea generalizes beyond CMPE. Defenders can increase verification cost by limiting external observability, delaying nonessential details, standardizing non-sensitive responses, rate-limiting repeated probes, and requiring confirmation before irreversible actions.

The goal is not security through obscurity. The goal is to avoid giving attackers a clean optimization oracle.

Misdirection needs governance

Misdirection is powerful precisely because it manipulates signals. That makes it sensitive.

A business should not deploy misdirection casually in customer-facing workflows. It needs policy boundaries:

Governance question	Why it matters
When is misdirection allowed?	Avoid confusing benign users or violating transparency expectations
What confidence threshold triggers it?	Prevent false positives from degrading normal service
Is the interaction human-facing or bot-facing?	Human users may require clear refusal and explanation
Does the response remain safe and non-operational?	Misdirection must not accidentally provide harmful instructions
Is the event logged and reviewable?	Security teams need traceability
Can the strategy be disabled by policy domain?	Regulated workflows may require explicit denial rather than ambiguity

The paper itself notes limitations: its empirical evaluation focuses primarily on jailbreak-style benchmarks and known attack frameworks, and further work is needed in full tool-use and multi-agent environments. That boundary matters. CMPE is a proof of concept for feedback disruption, not a universal enterprise control.

The central tension: judge unreliability is both problem and weapon

The two papers create an interesting tension around LLM-as-judge systems.

In the prompt-injection evaluation paper, judge unreliability is a problem for attackers and evaluators. TAP uses an evaluator to score progress, but the evaluator can be poorly calibrated. It may reward apparent intent that does not produce actual environment success. That can mislead the attack search.

In the misdirection paper, this same weakness becomes a defensive opportunity. If attacker-side judges rely on superficial cues, a defender can generate safe responses that appear promising to the judge while remaining non-operational. The attacker’s search loop becomes polluted.

This is the shared mechanism:

Judge weakness	In attack evaluation	In defense design
False positives	Waste attacker search on candidates that do not actually work	Can be deliberately increased through safe misdirection
False negatives	Miss promising candidates	Can reduce attacker sensitivity if stricter judging is used
Poor calibration	Makes attack optimization noisy	Makes feedback disruption more effective
Reliance on surface cues	Rewards apparent compliance	Allows safe responses to mimic progress without execution

The business lesson is broader than prompt injection. Any AI system that relies on automated evaluators can become vulnerable to evaluator gaming. The evaluator is not merely a measurement tool. It is part of the strategic environment.

Once attackers optimize against it, it becomes an attack surface.

What should an enterprise actually do?

The combined papers suggest a layered approach. Not glamorous. Effective security rarely is.

1. Separate trusted instructions from untrusted data

This is the foundation. External content should be explicitly labeled and structurally separated from system and user instructions. The agent should be told, repeatedly and mechanically, that retrieved content is evidence, not authority.

Where possible, use structured interfaces that prevent untrusted text from being interpreted as executable instruction. Natural language should not be the permission layer.

2. Put policy checks at tool boundaries

Do not rely on the model’s internal reasoning to decide whether a tool call is safe. Tool calls should pass through policy checks that inspect:

source of the instruction;
user-authorized intent;
tool scope;
arguments;
data sensitivity;
reversibility;
anomaly relative to the task.

A malicious instruction hidden in a document should not be able to authorize a payment merely because the agent found the document convincing.

3. Measure failures by environment state

For red-team testing, define success conditions in the system of record. Did the agent send the message? Did it call the tool? Did it modify the file? Did it expose data?

This also helps avoid overreacting to scary transcripts where nothing happened, and underreacting to polite transcripts where something very much did happen.

4. Test against adaptive semantic attacks

Static prompt-injection test sets are useful but insufficient. Test agents against adaptive methods that revise attacks based on responses. Include both coercive and exploitative patterns.

Coercive attacks try to override. Exploitative attacks try to blend in.

The second category is where mature systems should pay special attention. It is also where life becomes more annoying, which is usually how one recognizes real security work.

5. Manage feedback to suspicious sessions

For high-confidence automated probing, consider controls that reduce the information value of responses:

rate limiting;
session correlation;
generic non-diagnostic failure messages;
delayed or limited tool-call visibility;
consistent external responses for different internal block reasons;
safe response shaping where appropriate;
escalation to human review for repeated probes.

Detect-and-misdirect is one possible technique within this broader family. The principle is feedback governance.

6. Require confirmation for irreversible or high-risk actions

Many prompt injections rely on autonomous execution. A confirmation step can interrupt the path from untrusted context to state change.

The confirmation should be grounded in trusted user intent, not merely “the agent asks the user to approve whatever the malicious document requested.” A useful confirmation says what action is being proposed, why, based on which trusted source, and what external content influenced it.

7. Log provenance and causality

When an agent acts, the system should preserve:

which content sources were read;
which instructions were considered trusted;
which untrusted content influenced the plan;
which tool calls were proposed;
which policy checks passed or failed;
which human approvals occurred.

Without this, post-incident review becomes archaeology with screenshots. Nobody deserves that.

A simple maturity model

Maturity level	Description	Typical failure
Level 0: Prompt-only defense	The agent is told to ignore malicious instructions	Prompt injection works as soon as it is phrased creatively
Level 1: Static filtering	Known attack patterns are blocked	Adaptive semantic attacks bypass patterns
Level 2: Tool boundary checks	Tool calls are inspected before execution	Context provenance may still be weak
Level 3: State-based red teaming	Tests evaluate real environment outcomes	Feedback leakage may still help attackers optimize
Level 4: Feedback-aware defense	The system limits what attackers learn from probes	Requires careful governance to avoid harming user experience
Level 5: Operational security architecture	Provenance, permissions, verification, monitoring, and response shaping work together	Still not perfect, but at least the system is no longer pretending prompts are walls

Most organizations experimenting with agents are somewhere between Level 0 and Level 2. Many believe they are higher because the demo behaved nicely. The demo was not trying very hard.

The misconception to avoid

The wrong takeaway is that prompt injection will be solved by stronger models, better refusal text, or one more classifier.

The papers point elsewhere.

The hard problem is the feedback loop. Automated attackers optimize against whatever signals the system exposes. A refusal is a signal. A partial tool call is a signal. A hesitant answer is a signal. A judge score is a signal. A timeout is a signal. The absence of a state change may also be a signal.

Agentic systems create rich feedback because they are interactive, contextual, and action-oriented. That richness is the product feature. It is also the attack surface.

So the enterprise question should shift from:

“Can our agent resist this malicious prompt?”

to:

“What can an adaptive attacker learn after 1,000 attempts, and what can they actually make the agent do?”

That is a less comforting question. It is also the right one.

Final thought: the agent is not the perimeter

The combined lesson from these papers is that agent security is not located inside the model alone. It sits across the whole operating loop: context ingestion, reasoning, tool selection, action authorization, response behavior, attacker feedback, and post-action verification.

Hofer et al. show that realistic agents can be attacked through semantic, automated prompt injection that exploits the same instruction-following abilities businesses want. Soosahabi and Namsani show that defense should not only block malicious outputs but also degrade the attacker’s ability to learn from repeated probing.

Together, they suggest a more mature view of agent risk.

The agent is not a chatbot with accessories. It is a decision system connected to tools. Once it can change state, every untrusted input becomes a possible influence path, and every response becomes a possible training signal for an adversary.

That does not mean businesses should avoid agents. It means they should stop deploying them as if “please ignore malicious instructions” were a security architecture.

It is not. It is a Post-it note on a vault.

Cognaptus: Automate the Present, Incubate the Future.

David Hofer, Edoardo Debenedetti, and Florian Tramèr, “Assessing Automated Prompt Injection Attacks in Agentic Environments,” arXiv:2606.10525, 2026. https://arxiv.org/abs/2606.10525 ↩︎
Reza Soosahabi and Vivek Namsani, “Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems,” arXiv:2606.20470, 2026. https://arxiv.org/abs/2606.20470 ↩︎

TL;DR for operators#

Why this matters now#

The shared problem: agents turn text into authority#

The first paper’s role: mapping the automated prompt-injection surface#

The second paper’s role: explaining why feedback becomes the control point#

The logic chain: from prompt injection to feedback governance#

1. Untrusted context enters the agent’s decision process#

2. Attackers automate candidate generation#

3. Semantic plausibility beats token weirdness in agent workflows#

4. The attacker’s search depends on feedback#

5. Predictable blocking is not enough#

6. Defense becomes feedback governance#

A practical framework: the agent attack loop#

What businesses should take from the empirical paper#

Tool-use success must be measured by state, not by text#

Attack success is domain-specific#

Stronger models are not automatically safer#

What businesses should take from the misdirection paper#

Blocking is a control, but also a signal#

Attacker verification cost is a security lever#

Misdirection needs governance#

The central tension: judge unreliability is both problem and weapon#

What should an enterprise actually do?#

1. Separate trusted instructions from untrusted data#

2. Put policy checks at tool boundaries#

3. Measure failures by environment state#

4. Test against adaptive semantic attacks#

5. Manage feedback to suspicious sessions#

6. Require confirmation for irreversible or high-risk actions#

7. Log provenance and causality#

A simple maturity model#

The misconception to avoid#

Final thought: the agent is not the perimeter#