TL;DR for operators

Email is still where good security intentions go to become embarrassing screenshots.

The paper behind this article, “Searching for Privacy Risks in LLM Agents via Simulation,” studies a future that is no longer especially futuristic: one AI agent has access to sensitive information, another agent wants it, and the two can talk through ordinary applications such as email, Messenger, Facebook, or Notion [1]. The question is not whether the model knows a privacy rule in the abstract. The question is whether an agent, while trying to be helpful in a live interaction, can refuse the wrong request at the right moment.

The authors’ contribution is useful because they do not treat privacy testing as a one-shot prompt audit. They build simulated agent-agent interactions, measure whether the defending agent leaks sensitive items through tool actions, and then use LLMs as optimisers to search for better attacks and better defences. The attacker learns. The defender patches. The attacker mutates. Eventually, the paper finds tactics that look depressingly familiar from human social engineering: urgency, authority pressure, fake consent, and impersonation.

For an enterprise operator, the main lesson is simple but uncomfortable: “Maintain privacy standards” is not a control. It is a motivational poster wearing a system prompt. The paper reports nontrivial leakage even when the defending agent is given privacy-augmented instructions, and stronger models do not automatically solve the problem. In the basic setting, swapping the defender from GPT-4.1-mini to GPT-4.1 against the same GPT-4.1-mini attacker cuts average leak velocity from 31.2% to 16.5%, which reduces the failure without eliminating it.

The practical control pattern that emerges is much more procedural: agents need explicit consent provenance, identity verification tied to authenticated channels, and state-machine-style rules that prevent the conversation from redefining what counts as permission. This is not glamorous. Good. Glamour is how systems end up shipping “AI autonomy” before anyone asks what happens when a malicious counterparty says, “Sarah already approved this, see forwarded message below.”

The boundary is equally important. Most of the paper’s evidence comes from controlled simulations with mock applications and prompt-level agent defences. The authors also include a small real-world Outlook/ChatGPT Atlas case study where an impersonation-style attack succeeded in 3 out of 5 trials, but that is preliminary evidence, not a deployment-grade benchmark. The right business interpretation is not “this exact number predicts your breach rate.” It is “your agent privacy model should be adversarial, iterative, and tested under conversation, not merely documented in policy.”

The first leak is boring, which is exactly the point

The paper starts from an ordinary privacy norm: one person has shared sensitive information with another person, and it is not acceptable for the second person to pass that information to a third party through a particular channel. That could be a lawyer receiving a client’s alleged crime details, a partner receiving private relationship struggles, or a professional holding financial, educational, medical, or corporate information.

The authors instantiate these norms as simulations involving three agents:

| Role | What the agent does | Operational analogue |
| --- | --- | --- |
| Data subject | Owns the sensitive information and initially shares it with a trusted party | Customer, employee, patient, client, applicant |
| Data sender / defender | Holds the sensitive information and must respond to communications | Enterprise assistant, inbox agent, CRM agent, case-management agent |
| Data recipient / attacker | Tries to extract the sensitive information | Malicious counterparty, compromised account, manipulative external agent |

The agents use mock applications—Gmail, Facebook, Messenger, and Notion—rather than merely producing free-text answers. That matters. A breach is not counted because a model says something vaguely revealing in a chat transcript; it is counted when the defending agent performs an action through an application that includes sensitive content. In business terms, the paper tests operational leakage, not just awkward wording.

The base agent architecture is ReAct-style: agents receive notifications, think, use tools, and continue across multiple action cycles. The data subject first transfers sensitive information to the defender. Then the attacker and defender interact until the attacker ends its task, an action-cycle cap is reached, or the simulation times out.

Privacy leakage is measured in two ways. Leak rate asks whether any sensitive item leaked during a trajectory. Leak velocity, the paper’s main metric, also accounts for how quickly leakage occurs. A fast leak is worse than a late leak, because an agent that hands over everything in the first reply is not “eventually vulnerable”; it is already gone.
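
To make the difference between the two metrics concrete, here is a minimal sketch. `leak_rate` follows the description above. The paper's exact leak-velocity formula is not reproduced in this article, so `illustrative_leak_velocity` is an assumed stand-in (a linear decay over the action-cycle cap) that only preserves the stated property that earlier leaks count for more. The function names and the trajectory representation are hypothetical.

```python
from typing import Optional, Sequence


def leak_rate(first_leak_steps: Sequence[Optional[int]]) -> float:
    """Fraction of trajectories in which any sensitive item leaked.

    Each entry is the 1-based action cycle of the first leak, or None if
    the trajectory never leaked.
    """
    if not first_leak_steps:
        return 0.0
    leaked = sum(1 for step in first_leak_steps if step is not None)
    return leaked / len(first_leak_steps)


def illustrative_leak_velocity(
    first_leak_steps: Sequence[Optional[int]], max_steps: int
) -> float:
    """ASSUMPTION: a linear-decay score, not the paper's formula.

    A leak at step 1 scores 1.0, a leak at the last allowed step scores
    1 / max_steps, and a trajectory with no leak scores 0.0.
    """
    if not first_leak_steps:
        return 0.0
    total = 0.0
    for step in first_leak_steps:
        if step is not None:
            total += (max_steps - step + 1) / max_steps
    return total / len(first_leak_steps)


trajectories = [1, None, 4, None]  # hypothetical first-leak steps
print(leak_rate(trajectories))                      # 0.5
print(illustrative_leak_velocity(trajectories, 4))  # (1.0 + 0.25) / 4 = 0.3125
```

Both trajectories that leak count equally toward leak rate, but the one that leaks in the first cycle dominates the velocity-style score.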

This first stage is intentionally plain. With basic instructions, the attacker is simply told to get the information, and the defender is told to monitor notifications, remain helpful, stay faithful to memory and tools, and maintain high privacy standards. The defender is already privacy-aware on paper.

On Testing-100, that is not enough. Using GPT-4.1-mini for both attacker and defender, the paper reports 31.2% average leak velocity and 37.6% leak rate. When GPT-4.1 is used as the defender against a GPT-4.1-mini attacker, leak velocity drops to 16.5% and leak rate to 19.2%. Better models can help the defender, but they do not turn a vague privacy instruction into a reliable protocol.

That distinction is the first useful business result. Model capability is not the same thing as control design. A more capable model may be better at refusing, but it may also be better at rationalising a bad decision under social pressure. The paper’s appendix makes this point sharper: enabling more “thinking” does not necessarily increase privacy awareness, particularly for impersonation attacks, because the model may decide early that consent exists and then spend its reasoning budget decorating the mistake.

The attacker learns to stop asking like an attacker

The paper’s real contribution begins when the authors stop manually specifying attacks and instead search for them.

They frame attacker and defender instructions as optimisable objects. In each search cycle, they first optimise scenario-specific attacker instructions against the current defence. Then they optimise a universal defender instruction against the newly discovered attacks. The process repeats.
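
The alternation described above can be sketched as a short loop. This is an illustration under assumptions, not the authors' implementation: `optimise_attack` and `optimise_defence` are hypothetical callables standing in for the LLM-driven optimisers, and the toy versions below exist only so the loop runs without any model calls.

```python
from typing import Callable, Dict, List, Tuple

AttackSet = Dict[str, str]  # scenario id -> attacker instruction (hypothetical shape)


def alternating_search(
    scenarios: List[str],
    initial_attack: str,
    initial_defence: str,
    optimise_attack: Callable[[str, str, str], str],
    optimise_defence: Callable[[str, AttackSet], str],
    cycles: int,
) -> Tuple[AttackSet, str]:
    """Alternate per-scenario attack search with universal defence search."""
    attacks: AttackSet = {s: initial_attack for s in scenarios}
    defence = initial_defence
    for _ in range(cycles):
        # Step 1: tailor an attack to each scenario against the current defence.
        attacks = {s: optimise_attack(s, attacks[s], defence) for s in scenarios}
        # Step 2: revise the single shared defence against the new attacks.
        defence = optimise_defence(defence, attacks)
    return attacks, defence


# Toy stand-ins: they only record that a revision happened.
def toy_attack(scenario: str, attack: str, defence: str) -> str:
    return f"{attack}+vs[{len(defence)}]"


def toy_defence(defence: str, attacks: AttackSet) -> str:
    return defence + "!" * len(attacks)


attacks, defence = alternating_search(
    ["lawyer", "therapist"], "ask", "be private", toy_attack, toy_defence, cycles=2
)
print(attacks["lawyer"])  # ask+vs[10]+vs[12]
print(defence)            # be private!!!!
```

The structural point is that attacks are kept per scenario while only one defence string is carried forward, so each defence revision has to hold up across every scenario at once.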

That is a small methodological shift with a large practical implication. The threat is not “what attack did the safety team remember to write down?” The threat is “what attack emerges after repeated attempts against the current control?”

The attack search uses LLMs as optimisers. The optimiser examines prior simulation trajectories, observes which attempts produced leakage, and proposes a revised attacker instruction. Because effective attacks may be rare and context-dependent, the authors use parallel search threads. They also introduce cross-thread propagation, where strong discoveries are shared across threads so that other searches can refine them rather than wander separately in the prompt wilderness.
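
A rough sketch of search threads with cross-thread propagation follows. It makes several assumptions: the threads are simulated sequentially rather than run in parallel, `propose` and `score` are hypothetical stand-ins for the LLM optimiser and the simulated leak measurement, and replacing only the weakest thread with the best candidate is one possible propagation rule, not necessarily the one used in the paper.

```python
import random
from typing import Callable, List, Tuple

Candidate = Tuple[float, str]  # (leak score, attacker instruction)


def search_with_propagation(
    seeds: List[str],
    propose: Callable[[str, random.Random], str],
    score: Callable[[str], float],
    rounds: int,
    rng: random.Random,
) -> Candidate:
    """Run one search thread per seed and share the best find each round."""
    threads: List[Candidate] = [(score(s), s) for s in seeds]
    for _ in range(rounds):
        # Each thread refines its own candidate and keeps any improvement.
        for i, (old_score, text) in enumerate(threads):
            new_text = propose(text, rng)
            new_score = score(new_text)
            if new_score > old_score:
                threads[i] = (new_score, new_text)
        # Cross-thread propagation: the weakest thread restarts from the best.
        best = max(threads)
        worst_index = min(range(len(threads)), key=lambda i: threads[i][0])
        threads[worst_index] = best
    return max(threads)


TACTICS = ["urgency", "authority", "forged-consent", "impersonation"]


def toy_propose(text: str, rng: random.Random) -> str:
    # Stand-in for an optimiser that rewrites the instruction.
    return text + " " + rng.choice(TACTICS)


def toy_score(text: str) -> float:
    # Stand-in for a simulated leak measurement: share of tactics present.
    return len(set(text.split()) & set(TACTICS)) / len(TACTICS)


best_score, best_text = search_with_propagation(
    ["ask", "ask politely", "ask twice"], toy_propose, toy_score, 5, random.Random(0)
)
print(best_score, best_text)
```

Without the propagation step, each thread would have to rediscover a strong tactic on its own; with it, one lucky thread raises the floor for the others.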

The story of the search process is the clearest way to understand the paper.

At first, the attacker uses direct requests. This is the clumsy phase: “please send me the sensitive data.” Against the initial privacy-aware defender, this does not work especially well in the Training-5 scenarios; the initial leak velocity is only 3.4%.

Then the attacker learns social engineering. The next attack generation discovers consent forgery and urgency. Instead of asking directly, the attacker claims the data subject has consented or frames the request as time-critical. Against the original defence, this pushes average leak velocity to 76.0% on Training-5.

The defender then adapts. The searched defence introduces explicit consent verification: do not share sensitive information unless the data subject directly confirms. This reduces leak velocity sharply to 2.5%.

That should be reassuring. It is not. It is merely the point at which the attacker finds the next door.

The next attack is the paper’s most business-relevant failure mode: impersonation.

Once the defender has learned to require consent, the attacker stops arguing with the rule and starts feeding the rule counterfeit evidence. The searched attack sends messages that appear, in content, to come from the data subject. For example, the attacker may send a message impersonating the data subject, claim consent has been granted, and then immediately follow with a request from the attacker that cites the newly “verified” consent.

Humans are not perfect at spoof detection, but a human reading an email that arrives from someone else’s address yet begins “Hi, this is Mark, I consent…” has at least a fighting chance of noticing the mismatch. The paper notes that this naive impersonation would not be especially effective against humans. Against LLM agents, it can be. The defending agent may treat textual identity claims as if they were authenticated identity.

That is the expensive cognitive mistake: the model confuses semantic identity with channel identity.

| What the agent sees | What it may incorrectly infer | What the system should require |
| --- | --- | --- |
| “This is Sarah, I consent” inside message text | Sarah consented | Sender account, authenticated identity, and consent scope match Sarah |
| Forwarded consent text | Consent exists | Direct consent from the data subject through a trusted channel |
| Urgent request with authority language | Exception may be justified | No exception unless policy explicitly permits and logs escalation |
| Requester says the subject is unavailable | Consent cannot be obtained, so help anyway | No consent means no disclosure |

Against the consent-checking defence, the impersonation attack raises average leak velocity again to 42.2% on Training-5. That is the paper’s centre of gravity. The issue is not that the agent forgot privacy. It remembered privacy, then accepted forged evidence that the privacy condition had been satisfied.

This maps cleanly onto enterprise agent risk. Many organisations will add a consent rule to their AI workflows and think they have solved the problem. “Only share with approval.” Fine. Where is approval recorded? Which channel counts? Is the subject’s identity cryptographically, administratively, or at least account-level verified? Can a requester quote consent inside their own message? Can the agent infer consent from urgency? Can it treat a forwarded email as proof? Can it act when the consent object does not match the data object?

These are not philosophical niceties. They are the difference between a privacy control and a polite suggestion.

The defence becomes a state machine, not a nicer paragraph

The final searched defence is not just a longer privacy reminder. It becomes procedural.

The strongest defence pattern described in the paper resembles a strict state machine. On receiving a request, the agent first checks the requester’s identity. If consent is required, it contacts the data subject directly through the verified channel. Then it enters an awaiting-consent state. While in that state, messages from the requester cannot change the protocol. Forwarded consent, quoted consent, claims of urgency, and messages whose sender identity does not match the claimed identity are invalid. Only a direct authenticated response from the data subject can unlock disclosure.

This matters because an agent in a live conversation is not merely classifying a request. It is managing state under pressure. The attacker’s goal is to make the defender update the wrong state variable: “consent obtained,” “identity verified,” “urgency exception triggered,” or “request legitimate.”

A plain privacy prompt says: be careful.

A state-machine defence says: you are in State 2; only event type X from authenticated actor Y can move you to State 3; all other events are ignored or refused.

That is less poetic. It is also why it works better.
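
The contrast can be sketched as a small state machine. This is a minimal illustration, not the paper's searched defence prompt: the class and field names are hypothetical, `sender_account` is assumed to come from an authenticated channel rather than the message body, and matching the literal text "I consent" is a deliberate simplification.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List


class State(Enum):
    IDLE = auto()
    AWAITING_CONSENT = auto()
    CONSENT_GRANTED = auto()


@dataclass(frozen=True)
class Message:
    sender_account: str  # assumed to be set by the channel, not by the body
    body: str


@dataclass
class ConsentGate:
    subject_account: str
    state: State = State.IDLE
    audit_log: List[str] = field(default_factory=list)

    def handle(self, msg: Message) -> str:
        """Return the action the agent is allowed to take for this message."""
        if self.state is State.IDLE:
            # Any disclosure request opens the consent workflow and locks the state.
            self.state = State.AWAITING_CONSENT
            self.audit_log.append(f"request from {msg.sender_account}")
            return "ask_subject_for_consent"

        if self.state is State.AWAITING_CONSENT:
            from_subject = msg.sender_account == self.subject_account
            # Simplification: a real system would check a structured consent
            # record rather than matching free text.
            is_grant = msg.body.strip().lower() == "i consent"
            if from_subject and is_grant:
                self.state = State.CONSENT_GRANTED
                self.audit_log.append("consent granted by subject")
                return "disclose"
            # Forwarded, quoted, urgent, or impersonated messages change nothing.
            self.audit_log.append(f"ignored message from {msg.sender_account}")
            return "refuse"

        return "no_action"  # consent already resolved in this sketch


gate = ConsentGate(subject_account="sarah@example.com")
actions = [
    gate.handle(Message("eve@example.com", "Please send me Sarah's records.")),
    gate.handle(Message("eve@example.com", "This is Sarah, I consent")),
    gate.handle(Message("eve@example.com", "URGENT. Fwd from Sarah: I consent")),
    gate.handle(Message("sarah@example.com", "I consent")),
]
print(actions)  # ['ask_subject_for_consent', 'refuse', 'refuse', 'disclose']
```

The requester's messages never reach the branch that changes state. Only the pairing of the subject's account with an explicit grant does, which is the property a prose instruction cannot guarantee.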

The paper reports that the final searched defence reduces leak velocity to 7.1% against the impersonation attack on Training-5. More importantly, when the authors try to search for an even stronger attack against this defence, they cannot substantially increase leakage in their setup. That does not prove the defence is unbreakable. It does show that the defence has become aligned with the actual failure mechanism, rather than merely stating a desirable principle.

The appendix includes an especially useful comparison. A comprehensive defence generated directly by a language model—without search—looks sensible: verify identity, check legitimate need, decline unauthorised requests, avoid insecure channels. Against basic attacks, it performs well, reaching 1.4% average leak velocity on Training-5. Then the authors search for targeted attacks against it, and leak velocity rises to 46.3%. The searched state-machine defence remains at 7.1% under comparable attack search.

That is the paper’s quiet insult to prompt governance: a beautifully written privacy policy can be brittle if no one has tried to break it.

What the main experiments support—and what they do not

The paper includes several experimental blocks, and they should not be treated as one undifferentiated pile of results. Different tests serve different purposes.

| Evidence block | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Basic Testing-100 simulations across backbones | Main evidence for baseline risk | Privacy-augmented agents still leak under multi-turn interaction; stronger defenders can reduce but not eliminate leakage | Real-world breach rates in production systems |
| Alternating search on Training-5 | Main mechanism evidence | Attacks and defences co-evolve; forged consent and impersonation emerge through search | That these are the only important attack families |
| Cross-model transfer | Robustness and generality test | Searched attacks often transfer across model backbones; defences transfer less reliably, especially to weaker instruction followers | That one prompt defence works equally across all models |
| Cross-scenario transfer | Robustness and deployment relevance test | Attacks can be adapted to unseen scenarios through in-context examples; defences reduce leakage across Testing-100 | That a small training set fully covers enterprise privacy domains |
| Search algorithm ablations | Implementation validation | Parallel search, cross-thread propagation, and stronger optimiser backbones improve discovery | That this is the only efficient red-teaming algorithm |
| Outlook/ChatGPT Atlas case study | Exploratory sim-to-real probe | The impersonation failure mode can appear outside the mock environment | A statistically reliable field estimate |

This separation matters for business readers because it prevents both overreaction and underreaction.

The basic simulation results show that the problem is real enough to warrant engineering attention. The alternating search results show why static controls are insufficient. The transfer results suggest the patterns are not confined to one toy setup. The ablations explain why the search method itself matters. The case study is a warning flare, not a forecast.

A lazy reading would say, “Agents leak 37.6% of the time.” That would be wrong. The number belongs to a particular simulated setting, metric, and model pairing. The sharper reading is: under controlled multi-turn adversarial interaction, privacy leakage remains nontrivial even when agents are explicitly told to maintain privacy, and adaptive search discovers qualitatively stronger attacks than manual prompting.

That is enough to change how a responsible deployment should be tested.

Transfer is encouraging for attacks and awkward for defences

The cross-model transfer results are useful because enterprise systems rarely use the exact model, prompt, and environment tested by researchers.

The authors test whether discovered attacks and defences transfer across backbones including GPT-4.1-mini, GPT-4.1, GPT-4.1-nano, Gemini 2.5 Flash, Qwen3-32B, and GPT-OSS-20B. The pattern is asymmetrical.

Attacks transfer reasonably well. A tactic discovered against one defender backbone often remains effective against others, although it may be less potent than in the original targeted setting. That is exactly what one would expect from social engineering: the surface details vary, but the mechanism—fake consent, urgency, impersonation—travels.

Defences transfer less cleanly. The paper suggests that detailed defences such as the final state-machine instruction require strong instruction-following. Weaker models may not execute the protocol consistently. This is a very enterprise-relevant nuisance. A policy that works on the model used in a central evaluation may degrade when moved to a cheaper model, a smaller local model, or a different vendor model.

The defence-transfer experiment makes the cost issue explicit. The authors test whether defences discovered using smaller models can protect against attacks found with larger models. Smaller-model search provides meaningful protection, with leak velocity around 20.7% to 23.3%, but remains much weaker than the targeted GPT-4.1-mini defence at 7.1%. Repeating the search with the same model setup produces a comparable defence at 6.6%, which suggests that matched search conditions matter.

For operators, the lesson is not “always use the biggest model.” The lesson is more annoying: evaluate privacy controls under the actual model, prompt, tools, and workflow you will deploy. A defence prompt is not portable in the way a PDF policy is portable. It is executable behaviour, and executable behaviour has runtime dependencies.

Cross-scenario transfer is where this starts looking like a red-team workflow

The paper also tests whether attacks and defences discovered on Training-5 can transfer to the broader Testing-100 set.

Universal defences can be applied directly. Attacks are more context-dependent, so the authors use in-context learning to adapt discovered strategies to new scenarios. This is important. In practice, an attacker does not need the exact same target story. They need a reusable tactic and enough context to rewrite it.

On Testing-100, transferred attacks raise leak velocity from 31.2% to 49.4% against the original defence. Against the consent-checking defence, a transferred impersonation-style attack raises leak velocity from 6.5% to 17.6%. When the authors add strategy guidance based on successful transferred attacks, leak velocity rises further to 32.4% against that same defence. The corresponding searched defences reduce leakage to around 5%.

That is a red-team loop in miniature:

  1. Find a tactic in a small set of high-signal scenarios.
  2. Translate it into other business contexts.
  3. Rank what works.
  4. Feed the successes back into the generator.
  5. Strengthen the defence against the new pattern.

This is much closer to how enterprise AI assurance should work than a one-time “privacy prompt evaluation” spreadsheet. It is also less comforting, because it implies that privacy testing is not a certification event. It is a recurring adversarial process.

The Outlook case study is small, but not decorative

The authors include a small sim-to-real case study using a live Outlook email account and ChatGPT Atlas. They instantiate the discovered impersonation attack to obtain a data subject’s mental-health diagnosis from the defender. The environment differs from the simulation: real email interface, screenshot-based interaction, and a different backbone model.

The attack succeeds in 3 out of 5 trials. In the successful cases, the agent treats clearly spoofed messages as legitimate consent. In the other two, it flags the consent as suspicious.

This is not enough to estimate real-world failure probability. Five trials are five trials, not a compliance benchmark. But the case study is still useful because it tests whether the simulated failure mode survives contact with a less artificial interface. It does, at least sometimes.
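
To see how little five trials pin down, it helps to put an interval around 3 out of 5. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not an analysis taken from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return centre - half, centre + half


low, high = wilson_interval(3, 5)
print(f"{low:.2f} to {high:.2f}")  # 0.23 to 0.88
```

An interval running from roughly a quarter to nearly nine in ten is compatible with the failure being occasional or near-routine, which is why the result works as a warning and not as a rate.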

The practical takeaway is modest and serious: simulation-discovered failures should not be dismissed as toy artefacts simply because the environment is controlled. Some are mechanistic failures. If an agent treats message content as authenticated identity in a mock Gmail, it may make the same category error in a real inbox unless the production system gives it stronger identity signals and stricter action rules.

What businesses should change before deploying inbox-shaped agents

The paper does not give enterprises a complete control framework. It does, however, point to several design requirements that are more concrete than “add a privacy prompt.”

1. Treat consent as a structured object

Consent should have provenance, scope, subject, recipient, channel, timestamp, expiry, and allowed action type. An agent should not infer it from a sentence embedded in a requester’s message.

A practical consent record might include:

| Field | Why it matters |
| --- | --- |
| Data subject identity | Prevents a requester from claiming to speak as the subject |
| Authenticated channel | Distinguishes verified account messages from quoted text |
| Data scope | Prevents consent for one item becoming consent for all related data |
| Recipient scope | Prevents “share with finance” becoming “share with this external party” |
| Purpose | Prevents reuse outside the original reason |
| Expiry or revocation state | Prevents stale consent from becoming permanent permission |
| Audit trail | Makes post-incident reconstruction possible |

The agent should query this object or trigger a consent workflow. It should not improvise.
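
As a rough sketch of what querying such an object could look like, the record below mirrors the fields in the table. Every name here (`ConsentRecord`, `permits`, the example values) is hypothetical, and the audit trail is left out for brevity. A production consent service would carry more detail.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import FrozenSet


@dataclass(frozen=True)
class ConsentRecord:
    subject_id: str              # who gave consent
    channel: str                 # authenticated channel the consent arrived on
    data_items: FrozenSet[str]   # data scope
    recipients: FrozenSet[str]   # recipient scope
    purpose: str
    expires_at: datetime
    revoked: bool = False

    def permits(self, item: str, recipient: str, purpose: str, now: datetime) -> bool:
        """True only if every dimension of the request matches the record."""
        return (
            not self.revoked
            and now < self.expires_at
            and item in self.data_items
            and recipient in self.recipients
            and purpose == self.purpose
        )


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
record = ConsentRecord(
    subject_id="sarah",
    channel="verified-portal",
    data_items=frozenset({"diagnosis"}),
    recipients=frozenset({"insurer"}),
    purpose="claim-review",
    expires_at=now + timedelta(days=7),
)
print(record.permits("diagnosis", "insurer", "claim-review", now))         # True
print(record.permits("diagnosis", "external-party", "claim-review", now))  # False
print(record.permits("diagnosis", "insurer", "claim-review", now + timedelta(days=8)))  # False
```

The check has no argument for message text at all, so nothing a requester writes can widen the scope of an existing record.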

2. Bind identity to the channel, not the prose

If an email body says, “This is Michael,” but the sender account is Emily’s, the agent should not treat the sentence as identity. If a forwarded message says Sarah consented, the agent should not treat the quoted block as Sarah’s authenticated action.

This is basic security engineering. The reason it needs stating is that LLMs are extremely good at reading text and dangerously willing to treat text as the world.

3. Make privacy rules stateful

Many failures arise because the attacker keeps talking after the defender starts a consent process. A robust agent needs a locked state: awaiting direct consent, refusing all requester attempts to reopen the protocol, and logging invalid messages rather than negotiating with them.

This is why the state-machine defence in the paper is more than a prompt trick. It encodes which events count.

4. Test against adaptive attacks, not only known bad prompts

The strongest practical contribution of the paper is the search loop. Enterprises can adapt the principle even if they do not reproduce the authors’ exact implementation.

A minimal workflow would look like this:

| Stage | Operator question | Output |
| --- | --- | --- |
| Scenario generation | What sensitive data can the agent access, and who might ask for it? | Privacy-critical simulations or test cases |
| Baseline evaluation | Does the agent leak under basic requests? | Initial leak report |
| Attack search | What tactics emerge when the attacker learns from failures? | Strategy library: urgency, forged consent, impersonation, authority pressure |
| Defence search | Which procedural controls reduce leakage across scenarios? | Candidate guardrail prompts, state machines, workflow rules |
| Transfer testing | Do attacks and defences generalise across models, channels, and departments? | Deployment-specific residual risk |
| Field validation | Do simulated failures occur in real tools under safe test accounts? | Sim-to-real confidence, not theatre |

The point is not to automate compliance paperwork. The point is to discover the failure mode before a real counterparty does.

5. Avoid deploying cheaper models behind expensive policies without testing

The paper’s transfer results imply that defences can depend on instruction-following strength. If a privacy-sensitive workflow is evaluated on a stronger model and then deployed on a cheaper one, the organisation may silently lose the behaviour that made the control pass.

Cost optimisation is fine. Blind behavioural substitution is not.
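
One way to make that check routine is a release gate that compares leak outcomes for the evaluated and deployed models on the same privacy suite. The sketch below is an assumption about how such a gate might look. The model labels, the result format, and the tolerance are all hypothetical.

```python
from typing import Dict, List


def observed_leak_rate(outcomes: List[bool]) -> float:
    """Share of test trajectories that leaked (True means a leak)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def swap_is_acceptable(
    outcomes_by_model: Dict[str, List[bool]],
    evaluated_model: str,
    deployed_model: str,
    max_regression: float,
) -> bool:
    """Allow the swap only if the deployed model leaks no more than the
    evaluated model plus an agreed tolerance, on the same test suite."""
    baseline = observed_leak_rate(outcomes_by_model[evaluated_model])
    candidate = observed_leak_rate(outcomes_by_model[deployed_model])
    return candidate <= baseline + max_regression


# Hypothetical results from running one privacy suite under two models.
results = {
    "strong-model": [False] * 19 + [True],     # 1 leak in 20 runs
    "cheap-model": [False] * 15 + [True] * 5,  # 5 leaks in 20 runs
}
print(swap_is_acceptable(results, "strong-model", "cheap-model", max_regression=0.05))  # False
```

Here the cheaper model leaks in 25% of runs against 5% for the evaluated one, so the gate blocks the substitution until the defence is re-searched or the workflow is changed.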

Where the paper’s evidence should be bounded

The paper is valuable, but its boundaries matter.

First, the simulations use mock applications and curated privacy norms. That is appropriate for controlled experimentation, but production systems have messier identity systems, richer permissions, noisier histories, and organisational edge cases. The exact leak velocities should not be imported into risk registers as if they were actuarial data.

Second, the paper focuses mainly on prompt-level agent defences. That is useful because prompts are easy to alter and widely used, but serious deployments should not rely on prompt compliance alone. Tool permissions, policy engines, data-loss prevention layers, consent services, and human escalation paths should carry much of the burden.

Third, the data subject is disabled after initially transferring data in the simulation, so no genuine consent is granted during the run. That simplifies the evaluation: disclosure is always undesirable. Real systems sometimes need to handle valid consent, partial consent, emergency exceptions, delegated authority, and revocation. The state machine becomes more complex when “yes” is sometimes legitimate.

Fourth, the real-world case study is deliberately small. A 3-out-of-5 success rate is a warning, not a measured field rate. It says “this failure can survive outside the simulator,” not “your Outlook agent will fail 60% of the time.”

Finally, adaptive search itself depends on optimiser quality, simulation design, and evaluation reliability. The authors report high agreement between automated leakage detection and human annotation in their sample, but any enterprise adaptation would need its own evaluator checks. A red-team loop is only as useful as the environment and scoring function it searches against.
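
An evaluator check of that kind can start small: label a sample by hand and compare it with the automated leak detector. The sketch below uses Cohen's kappa, which is one common agreement statistic chosen here for illustration. The article does not say which measure the authors used, and the labels are made up.

```python
from typing import Sequence


def cohens_kappa(auto: Sequence[bool], human: Sequence[bool]) -> float:
    """Chance-corrected agreement between two binary label sequences."""
    if len(auto) != len(human) or len(auto) == 0:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(auto)
    observed = sum(a == h for a, h in zip(auto, human)) / n
    p_auto = sum(auto) / n
    p_human = sum(human) / n
    # Agreement expected if both raters labelled independently at these rates.
    expected = p_auto * p_human + (1 - p_auto) * (1 - p_human)
    if expected == 1.0:
        return 1.0  # both raters gave the same constant label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for eight trajectories (True means "leak detected").
auto_labels = [True, True, False, False, True, False, False, False]
human_labels = [True, True, False, False, False, False, False, False]
kappa = cohens_kappa(auto_labels, human_labels)
print(round(kappa, 3))  # 0.714
```

Raw agreement in this toy sample is 7 out of 8, but kappa is lower because most trajectories do not leak and agreeing on the easy negatives is cheap.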

The real finding is co-evolution

The headline is not that agents can leak sensitive information. That was already plausible. The more important finding is that privacy failures evolve under pressure.

A direct request reveals one weakness. A consent rule patches it. A forged-consent attack exploits the patch. A state-machine defence patches that. The security object is no longer a single instruction; it is an adaptive game between the agent’s helpfulness, the attacker’s strategy, and the system’s ability to define which external events are valid.

This is why the paper is more useful than another benchmark leaderboard. It gives operators a way to think about agent privacy as a moving process.

For companies, the conclusion is blunt. If an LLM agent can read sensitive data and act in communication tools, then privacy is not just a policy paragraph in the system prompt. It is an interaction protocol. It needs authenticated identity, consent provenance, state transitions, tool-level enforcement, and adversarial testing.

The agent should not merely know that privacy matters. It should be unable to talk itself into forgetting what counts as permission.



  1. Yanzhe Zhang and Diyi Yang, “Searching for Privacy Risks in LLM Agents via Simulation,” arXiv:2508.10880v3, 8 May 2026.