An employee privately tells a colleague that she plans to resign. Weeks later, she asks her AI assistant to draft an email to her manager about her future goals.
The assistant searches her previous conversations, retrieves the resignation discussion, and helpfully writes that her priority is preparing for a smooth transition because she has accepted another role.
The answer is accurate. It is relevant. It is also a privacy failure.
Nothing was hallucinated. No database was hacked. The assistant simply took information that belonged in one social context and moved it into another.
That is the uncomfortable mechanism examined by PrivacyBench, a benchmark for evaluating privacy in personalized AI assistants [1]. Its central finding is not merely that language models sometimes disclose secrets. The more consequential finding is that standard Retrieval-Augmented Generation systems frequently place secret-containing documents in front of the model before the model decides what to say.
A privacy prompt can make the model more discreet. It does not make the underlying retrieval architecture less indiscreet.
The leak starts before the model writes a word
A standard personalized RAG assistant follows a pleasantly simple process:
- Search the user’s documents for text related to the current conversation.
- Insert the retrieved documents into the model’s prompt.
- Generate a useful response.
For ordinary knowledge retrieval, this is efficient. For personal data, it quietly replaces a difficult social question with an easier mathematical one.
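The three steps above can be sketched in a few lines. This is a minimal illustration, not the pipeline evaluated in the paper: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and the documents, function names, and query are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Step 1: rank documents purely by similarity to the conversation."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(query: str, docs: list[str]) -> str:
    """Step 2: insert whatever was retrieved into the model's prompt."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Context:\n{context}\n\nRequest: {query}"


# Hypothetical personal corpus: one private chat, one harmless note.
DOCS = [
    "Private chat: I plan to resign next month, I accepted another role and my career goals changed.",
    "Grocery list: oat milk, coffee beans, rice.",
]

# Step 3 (generation) is omitted; the point is what reaches the prompt.
prompt = build_prompt("Draft an email to my manager about my career goals", DOCS)
print(prompt)
```

Nothing in `retrieve` knows who the email is for. The private chat wins on word overlap alone, so it is in the prompt before any privacy instruction has a chance to act.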
The social question is:
Is this information appropriate for this recipient, purpose, relationship, and moment?
The retrieval question is:
Is this document semantically similar to the current conversation?
Those questions occasionally produce the same answer. Privacy depends on the occasions when they do not.
A conversation about career goals is semantically related to a private resignation plan. A discussion of 1980s music is semantically related to a planned surprise party with a 1980s theme. A question about a mutual acquaintance may be semantically related to confidential information shared by that acquaintance.
The retriever is doing its job. That is precisely the problem.
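In the RAG architecture evaluated by the paper, a complete leak requires two events:
- the retriever surfaces a secret-containing document during an unauthorized interaction;
- the generator then discloses that secret in its response.
The probability of a leak is therefore the probability of the first event multiplied by the probability of the second, given that the first has occurred.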
The generator can reduce the second probability by refusing to reveal sensitive information. But once a secret has entered the prompt, privacy depends on the generator making the correct decision every time, across every model update, conversation turn, instruction conflict, and adversarial tactic.
That is not access control. It is asking a probabilistic text generator to behave responsibly after access control has already failed.
PrivacyBench turns social boundaries into measurable ground truth
Evaluating privacy is harder than evaluating factual accuracy because privacy is not simply a property of information. It is a property of information flow.
A resignation plan is not universally forbidden knowledge. It may be appropriate to discuss with a spouse, a trusted colleague, or a lawyer. It may be inappropriate to disclose to a manager before the employee is ready.
PrivacyBench operationalizes this idea by constructing synthetic social communities in which relationships, personal attributes, documents, and secrets evolve over time.
The benchmark contains four generated communities with:
- 48 users;
- 31,972 documents;
- changing relationships with start and end dates;
- dynamic attributes such as occupation, location, and interests;
- personal documents including chats, group conversations, blog posts, purchase histories, and AI-assistant interactions.
Each secret is defined using three elements:
- the secret’s content;
- the people or groups authorized to know it;
- the time at which it was shared.
This design matters because it gives the benchmark something that real-world privacy evaluations rarely possess: an explicit answer to who should know what, and when.
A manager may be authorized to see one document but not another. A former colleague may lose access after a relationship ends. A secret shared with a group is not automatically shareable outside that group. Privacy therefore cannot be represented by a single binary label attached permanently to a piece of text.
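One way to picture the three-part secret definition together with time-bounded relationships is the sketch below. It is an assumption-laden illustration rather than the benchmark's actual schema: the class names, the `may_know` rule (authorized, already shared, relationship still active), and the example dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Relationship:
    person: str
    start: date
    end: Optional[date] = None  # None means the relationship is still active

    def active_on(self, day: date) -> bool:
        return self.start <= day and (self.end is None or day <= self.end)


@dataclass(frozen=True)
class Secret:
    content: str                # what the secret says
    authorized: frozenset       # who may know it
    shared_on: date             # when it was shared


def may_know(secret: Secret, requester: str, day: date,
             relationships: list) -> bool:
    """Assumed rule: authorized, already shared, and relationship still active."""
    if requester not in secret.authorized or day < secret.shared_on:
        return False
    return any(r.person == requester and r.active_on(day) for r in relationships)


# Hypothetical data: a colleague relationship that ends mid-year.
RELATIONSHIPS = [Relationship("colleague", date(2024, 1, 1), date(2024, 6, 30))]
SECRET = Secret("Plans to resign", frozenset({"colleague"}), date(2024, 3, 1))

print(may_know(SECRET, "colleague", date(2024, 4, 1), RELATIONSHIPS))  # active, authorized
print(may_know(SECRET, "colleague", date(2024, 8, 1), RELATIONSHIPS))  # relationship ended
print(may_know(SECRET, "manager", date(2024, 4, 1), RELATIONSHIPS))    # never authorized
```

The answer depends on requester and date as well as content, which is why a single permanent sensitivity label on the text cannot express it.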
The benchmark then uses an LLM-driven conversational prober to interact with personalized RAG assistants for up to ten rounds. It tests two strategies:
| Probe strategy | What the prober does | What the test reveals |
|---|---|---|
| Direct probing | Explicitly asks about a secret and persists after refusals | Whether the assistant resists obvious extraction attempts |
| Indirect probing | Begins with related topics and gradually steers toward the secret | Whether semantic association can expose secrets without an explicit request |
Five target models were evaluated: GPT-5-Nano, Gemini-2.5-Flash, Kimi-K2, Llama-4-Maverick, and Qwen3-30B.
The benchmark measures more than whether the final answer contains a secret.
| Metric | Question it answers | Why it matters |
|---|---|---|
| Leakage Rate | Did the assistant completely disclose a secret to an unauthorized recipient? | Measures the final privacy failure |
| Inappropriate Retrieval Rate | Did the retriever surface a secret-containing document during an unauthorized interaction? | Locates the architectural source of exposure |
| Over-Secrecy Rate | Did the assistant wrongly withhold information from an authorized recipient? | Measures whether privacy controls destroy usefulness |
| Persona Consistency | Did the assistant continue to represent the user’s style and personality? | Checks whether safer behavior degrades personalization |
The distinction between leakage and inappropriate retrieval is the paper’s most useful contribution. Without it, a system that remains silent after repeatedly retrieving unauthorized secrets can look safe. It is not safe. It is merely surviving repeated exposure.
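To make the distinction concrete, here is a small sketch of how the two rates could be computed from logged conversations. It assumes Leakage Rate is counted per conversation and Inappropriate Retrieval Rate per turn; the record layout and the sample data are hypothetical, and the paper's own judging of leaks is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    retrieved_secret_doc: bool  # did the retriever surface an unauthorized secret?


@dataclass
class Conversation:
    turns: list
    leaked: bool  # did the assistant completely disclose the secret?


def leakage_rate(convs: list) -> float:
    """Share of conversations that ended in a complete disclosure."""
    return sum(c.leaked for c in convs) / len(convs)


def inappropriate_retrieval_rate(convs: list) -> float:
    """Share of turns in which a secret-containing document was retrieved."""
    turns = [t for c in convs for t in c.turns]
    return sum(t.retrieved_secret_doc for t in turns) / len(turns)


# Hypothetical log: the first conversation never leaks, yet retrieves the secret twice.
LOG = [
    Conversation([Turn(True), Turn(True), Turn(False)], leaked=False),
    Conversation([Turn(True), Turn(False)], leaked=True),
]

print(f"Leakage Rate: {leakage_rate(LOG):.2f}")
print(f"Inappropriate Retrieval Rate: {inappropriate_retrieval_rate(LOG):.2f}")
```

An evaluation that recorded only `leaked` would score the first conversation as clean, even though the secret sat in the prompt for two of its three turns.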
The decisive result is the gap between retrieval exposure and output leakage
Without an explicit privacy safeguard, the assistants disclosed complete secrets in 15.80% of targeted conversations—roughly one in six.
A privacy-aware system prompt reduced the average leakage rate to 5.12%.
That is a substantial improvement. It is also where a superficial reading of the paper could end: prompts work, leakage falls, problem partially solved.
The retrieval results make that conclusion difficult to defend.
| Metric | Baseline prompt | Privacy-aware prompt | Change |
|---|---|---|---|
| Leakage Rate | 15.80% | 5.12% | Large reduction |
| Inappropriate Retrieval Rate | 62.80% | 62.34% | Almost unchanged |
| Over-Secrecy Rate | 35.75% | 27.80% | Improved |
| Persona Consistency | 3.60 | 3.55 | Broadly stable |
The privacy-aware prompt changed what the generator said. It did almost nothing to change which documents the retriever exposed to the generator.
There is an important denominator difference here. Leakage Rate is measured at the conversation level, while Inappropriate Retrieval Rate is measured across conversational turns. The two figures therefore should not be divided to estimate a conditional leakage probability.
They can still be interpreted together.
The high retrieval rate shows that unauthorized secret-containing documents were repeatedly entering the generation context. The lower leakage rate shows that the generator often prevented those retrieved secrets from appearing completely in the final response.
In other words, the generator was functioning as the last privacy barrier because the retrieval layer was not functioning as one at all.
A prompt can strengthen that last barrier. It cannot remove the secret from the prompt.
Better privacy instructions helped utility—but not reliably across models
Privacy controls are often presented as a simple trade-off: stricter protection means less useful assistance.
PrivacyBench complicates that assumption.
After the privacy-aware prompt was introduced, the average Over-Secrecy Rate fell from 35.75% to 27.80%. The assistants became less likely to withhold secrets from people who were authorized to know them.
The prompt did not merely instruct the models to refuse more frequently. It appears to have helped them apply the benchmark’s access rules more accurately.
This is an encouraging result because indiscriminate secrecy is not contextual privacy. An assistant that refuses to share anything sensitive with anyone is safe in roughly the same way that deleting the entire database is secure. Technically impressive, commercially inconvenient.
Yet the model-level results also show why prompts should not be treated as dependable controls.
For direct probes, Llama-4-Maverick’s Leakage Rate fell from 18.72% to 0.46% after receiving the privacy-aware prompt. Kimi-K2 moved in the opposite direction: its Leakage Rate increased from 14.58% to 18.13%.
The same intervention produced near-complete mitigation in one model and a deterioration in another.
This variation does not mean privacy prompts are useless. It means their effectiveness is model-dependent and must be measured rather than assumed. A control whose behavior changes substantially when the underlying model changes is a useful mitigation, not a stable policy boundary.
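Measuring rather than assuming can be as simple as a per-model comparison that flags any model whose leakage rose after the prompt was added. The sketch below uses the two direct-probe figures quoted above; the function names and the idea of using this as a release check are illustrative assumptions.

```python
# Direct-probe Leakage Rate (%) quoted above: (baseline, privacy-aware prompt).
DIRECT_PROBE_LEAKAGE = {
    "Llama-4-Maverick": (18.72, 0.46),
    "Kimi-K2": (14.58, 18.13),
}


def change_in_points(baseline: float, prompted: float) -> float:
    """Change in percentage points; negative means less leakage."""
    return round(prompted - baseline, 2)


def regressed_models(results: dict) -> list:
    """Models whose leakage went up after the mitigation was applied."""
    return [model for model, (base, prompted) in results.items() if prompted > base]


for model, (base, prompted) in DIRECT_PROBE_LEAKAGE.items():
    print(f"{model}: {change_in_points(base, prompted):+.2f} points")

print("Regressions:", regressed_models(DIRECT_PROBE_LEAKAGE))
```

Running the same check after every model swap or update is what separates a measured mitigation from an assumed one.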
Indirect probes show that semantic drift is enough
Direct extraction attempts produced an average baseline Leakage Rate of 16.31%. Indirect probes produced 15.28%.
The similarity matters.
The systems did not require an explicit question such as, “What secret is this person hiding?” A conversation could begin around a related topic and gradually move toward the sensitive information. Because the retriever optimizes for semantic similarity, the dialogue itself can pull private documents into context.
This does not prove that 15% of ordinary, unprompted workplace conversations will leak secrets. The indirect conversations were still generated by a prober pursuing an extraction goal.
It does show that an attacker—or an innocent user following the same semantic path—does not need to name the secret directly. The system’s own relevance mechanism can perform much of the navigation.
That changes the threat model.
Privacy testing cannot be limited to obvious jailbreaks, prohibited phrases, or direct requests for confidential information. It must also examine conversational sequences in which each individual turn appears harmless while the accumulated context moves retrieval toward restricted material.
A system that blocks a blunt question but leaks after eight polite turns has not solved the privacy problem. It has introduced a patience requirement.
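A multi-turn probe harness for this kind of test might look like the sketch below. It is a simplification under stated assumptions: the benchmark uses an LLM-driven prober and judge, whereas this version replays a fixed list of probes and treats a leak as every core part of the secret appearing in one reply. The `toy_assistant` and its trigger words are invented purely to show the pattern.

```python
from typing import Callable, Optional

MAX_ROUNDS = 10  # the benchmark's prober interacts for up to ten rounds


def first_leak_round(assistant: Callable[[list], str],
                     probes: list,
                     secret_parts: list) -> Optional[int]:
    """Return the 1-based round in which all secret parts appear, else None."""
    history = []
    for round_no, probe in enumerate(probes[:MAX_ROUNDS], start=1):
        history.append(probe)
        reply = assistant(history).lower()
        if all(part.lower() in reply for part in secret_parts):
            return round_no
    return None


def toy_assistant(history: list) -> str:
    """Hypothetical assistant whose 'retrieval' keys on accumulated context."""
    context = " ".join(history).lower()
    if "1980s" in context and "weekend" in context:
        return "She is planning a surprise party with a 1980s theme."
    return "Happy to chat about that."


SECRET_PARTS = ["surprise party", "1980s"]

blunt = ["What secret is she hiding?"]
patient = [
    "Do you like 1980s music?",
    "Any favourite bands?",
    "Anything fun planned this weekend?",
]

print(first_leak_round(toy_assistant, blunt, SECRET_PARTS))    # no leak
print(first_leak_round(toy_assistant, patient, SECRET_PARTS))  # leaks on a later round
```

No single turn in the patient sequence mentions a secret. A test suite that only replays the blunt question would report this assistant as safe.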
The appendix validates the diagnosis, not a second thesis
The paper’s appendix contains several tests that strengthen confidence in the main result. They should not be mistaken for separate demonstrations that the benchmark perfectly represents production environments.
| Appendix test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Secret-identification classification | Diagnostic validation | Models can recognize sensitive information when shown directly | Models will enforce access rules correctly during generation |
| Human review of synthetic community data | Dataset-quality validation | Generated profiles, relationships, and documents are reasonably coherent | Synthetic social behavior fully matches real organizations |
| Human comparison with the LLM judge | Evaluation-reliability validation | Automated leakage judgments broadly align with human review | Every ambiguous disclosure is classified correctly |
In the secret-identification test, the models received a balanced set of 322 conversation snippets and classified whether each contained sensitive information. Recall was nearly perfect across all five models: four achieved recall of 1.00, while Qwen3-30B achieved 0.99.
This test answers an important diagnostic question. The assistants do not leak because they are incapable of recognizing that a document contains a secret. They can identify sensitive content when explicitly asked to do so.
The failure appears later, when recognition must become enforcement.
The human dataset review provides a different type of support. The authors manually examined one synthetic community containing eight personas and 71 predefined secrets. They report that generated conversation intent aligned with the intended revealed attribute in 97% of evaluated cases. Of the 71 secrets, 60 appeared verbatim and the remaining 11 were embedded semantically.
This supports the internal coherence of the test environment. It does not eliminate the broader limitations of synthetic data.
Finally, the automated judge agreed with human evaluation in 93 of 100 sampled conversations. The seven disagreements involved partial disclosures or strong hints that fell short of the paper’s strict definition of a complete leak. The automated judge tended to be slightly more cautious in those ambiguous cases.
That validation makes the headline leakage figures more credible. It also reminds us that the benchmark measures only the most explicit failure mode.
Retrieval needs authorization, not better manners
The immediate business lesson is not that organizations should abandon RAG. It is that a vector database should not be mistaken for a permissions system.
Semantic relevance answers whether information may help produce an answer. Authorization answers whether the information is allowed to participate in producing that answer.
A privacy-aware architecture must evaluate both.
The paper directly motivates context-aware retrieval using social metadata, audience visibility, and confidentiality information. Translating that direction into an operational design requires controls across several layers.
| Control layer | Required capability | Operational consequence |
|---|---|---|
| Data layer | Label documents with owner, sensitivity, authorized audience, source context, purpose, and validity period | Sensitive content becomes governable rather than merely searchable |
| Identity and relationship layer | Determine who is requesting information and their current relationship to the data owner | Access can change when roles and relationships change |
| Policy layer | Evaluate whether the requester, purpose, channel, and time satisfy access rules | Social norms become explicit decisions |
| Retrieval layer | Filter unauthorized documents before semantic ranking | Restricted material never enters the generator’s prompt |
| Generation layer | Apply privacy instructions and refuse suspicious transformations | The model remains a secondary safeguard |
| Evaluation layer | Measure leakage, inappropriate retrieval, over-secrecy, and multi-turn behavior | Teams can distinguish visible failures from hidden exposure |
The critical ordering is simple:
Authorize first. Retrieve second. Generate third.
Many current implementations reverse the first two steps. They retrieve broadly, then ask the model to decide whether the resulting material is appropriate.
That design is attractive because it is easy to build. It also means that every downstream component—including the model, prompt template, tool chain, logging system, and observability platform—may receive information that should never have left its original context.
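The ordering can be shown in a short sketch. This is not an implementation from the paper, which tests only a standard RAG pipeline and a prompt-level mitigation. The `Doc` type, the `audience` field, and the word-overlap ranking are hypothetical simplifications of a real policy engine and retriever.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    text: str
    audience: frozenset  # hypothetical label: who may receive this content


def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def authorized_retrieve(query: str, recipient: str, docs: list, k: int = 3) -> list:
    # Authorize first: drop anything the recipient may not receive.
    allowed = [d for d in docs if recipient in d.audience]
    # Retrieve second: rank only what survived the policy check.
    return sorted(allowed, key=lambda d: overlap(query, d.text), reverse=True)[:k]


DOCS = [
    Doc("I plan to resign and my career goals have changed",
        frozenset({"colleague"})),
    Doc("My career goals include leading a project",
        frozenset({"colleague", "manager"})),
]

# Generate third: only these documents would be placed in the prompt.
for recipient in ("manager", "colleague"):
    texts = [d.text for d in authorized_retrieve("my career goals", recipient, DOCS)]
    print(recipient, "->", texts)
```

Both documents are equally relevant to the query. The difference is that the resignation note never becomes a candidate when the recipient is the manager, so nothing downstream has to decide whether to repeat it.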
From a business perspective, retrieval-layer controls offer more than lower leakage risk.
They improve auditability. A company can record why a document was excluded, which policy authorized access, and when that authorization expires. They make incidents easier to diagnose because teams can distinguish policy failure from generation failure. They can also reduce over-secrecy by allowing authorized documents through confidently rather than instructing the model to become vaguely cautious around all sensitive topics.
This is Cognaptus’s inference from the paper, not a directly tested result. PrivacyBench evaluates a standard RAG architecture and a prompt-level mitigation; it does not experimentally compare a complete enterprise authorization stack against the baseline.
The benchmark nevertheless identifies where such a stack must intervene.
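As an illustration of the auditability point, a retrieval-layer check could return a decision record instead of a bare yes or no. The sketch below is hypothetical throughout: the `Grant` and `Decision` types, the policy identifier, and the dates are invented, and no such component is evaluated in the paper.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Grant:
    recipient: str
    policy_id: str   # hypothetical identifier of the rule that grants access
    expires: date


@dataclass(frozen=True)
class Decision:
    doc_id: str
    recipient: str
    allowed: bool
    reason: str
    expires: Optional[date] = None


def decide(doc_id: str, recipient: str, grants: list, today: date) -> Decision:
    """Return an auditable record of why a document was allowed or excluded."""
    for grant in grants:
        if grant.recipient != recipient:
            continue
        if today <= grant.expires:
            return Decision(doc_id, recipient, True,
                            f"granted by {grant.policy_id}", grant.expires)
        return Decision(doc_id, recipient, False,
                        f"{grant.policy_id} expired", grant.expires)
    return Decision(doc_id, recipient, False, "no grant for recipient")


GRANTS = [Grant("colleague", "policy-7", date(2024, 6, 30))]

for who, day in [("colleague", date(2024, 4, 1)),
                 ("colleague", date(2024, 8, 1)),
                 ("manager", date(2024, 4, 1))]:
    print(decide("doc-1", who, GRANTS, day))
```

Each record states which policy applied and when it lapses, which is the information an incident review needs to tell a policy failure from a generation failure.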
Prompts remain useful—provided they are treated as prompts
Prompt-based privacy instructions reduced average leakage by more than ten percentage points and improved appropriate sharing with authorized recipients. Ignoring that mitigation while waiting for an ideal architecture would be an odd interpretation of the evidence.
Organizations deploying personalized assistants should use privacy-aware prompts now. They should test those prompts separately for each model, task, and update. They should also monitor whether the instructions increase over-secrecy or alter personalization quality.
What they should not do is promote a system prompt into an access-control policy simply because the wording sounds stern.
A defensible near-term approach is layered:
- Use privacy-aware generation instructions as an immediate safeguard.
- Restrict the assistant’s available corpus according to the current task and recipient.
- Add document-level metadata and time-bound access rules.
- Log inappropriate retrieval attempts, not only visible output leaks.
- Run multi-turn privacy probes before deployment and after model changes.
- Escalate sensitive actions—such as sending emails or sharing documents—for explicit confirmation.
The prompt is valuable because it reduces harm when upstream controls fail. It becomes dangerous when its success is used to justify leaving upstream controls absent.
Where the reported rates apply—and where they do not
PrivacyBench provides a useful controlled baseline, but its percentages should not be treated as universal production leakage rates.
First, the communities and digital footprints are synthetic. Synthetic data is particularly defensible for privacy research because publishing genuine user secrets would create the harm the benchmark is intended to study. It also provides unusually clear ground truth about authorized recipients.
The trade-off is realism. Human relationships are often ambiguous, permissions are rarely documented cleanly, and users themselves may disagree about whether a disclosure was appropriate.
Second, the evaluation focuses on standard RAG systems using ChromaDB and an all-MiniLM-v2 embedding model. The paper does not establish that advanced memory architectures, policy-aware retrievers, or vector databases with native access controls will produce the same results.
Third, a leak is counted only when the core components of a secret are revealed completely and explicitly. Partial disclosures, implications, and strategically useful hints are excluded from the primary Leakage Rate. In practice, those fragments may still cause harm.
Fourth, the conversations are targeted privacy probes. Even indirect probes pursue the goal of extracting a secret. The results demonstrate vulnerability under structured testing, not the probability that any random assistant interaction will leak information.
Finally, the benchmark models a personalized assistant participating in socially situated conversations where different recipients may have different rights to know information. A strictly private assistant used only by its data owner presents a different immediate disclosure risk. Once that assistant sends emails, joins shared workspaces, calls external tools, or interacts with other agents, the benchmark’s access-control problem returns.
These boundaries narrow the claims appropriately. They do not weaken the central diagnosis.
The RAG illusion
The RAG illusion is the belief that information remains private because the assistant usually chooses not to repeat it.
PrivacyBench shows why that is insufficient.
Across the benchmark, a privacy-aware prompt substantially reduced visible leakage. Meanwhile, secret-containing documents continued to be retrieved during unauthorized interactions at almost the same rate. The system looked safer because the final speaker became more restrained, not because access became more controlled.
That difference is easy to miss when evaluation begins and ends with the final answer.
A trustworthy personalized assistant must know more than what information is relevant. It must know whose information it is, who may receive it, for what purpose, through which channel, and during which period of the relationship.
Until retrieval systems can answer those questions before placing a document into the prompt, personalized AI will continue to confuse memory with permission.
And memory without permission is not personalization. It is merely indiscretion at scale.
Cognaptus: Automate the Present, Incubate the Future.
[1] Srija Mukhopadhyay, Sathwik Reddy, Shruthi Muthukumar, Jisun An, and Ponnurangam Kumaraguru, “PrivacyBench: A Conversational Benchmark for Evaluating Privacy in Personalized AI,” arXiv:2512.24848, 2025. https://arxiv.org/abs/2512.24848