Customer support bots are not supposed to have enemies.
They sit politely inside enterprise websites, read policy documents, retrieve relevant snippets, and answer questions with the soft confidence of a well-trained assistant. The selling point is simple: Retrieval-Augmented Generation, or RAG, should make large language models less likely to hallucinate because the answer is grounded in external evidence.
That promise is mostly true. It is also incomplete.
The uncomfortable part is that external evidence has to come from somewhere, and the user query has to travel through something. A customer question may pass through a browser extension, chatbot widget, analytics middleware, API gateway, logging layer, orchestration service, retriever, reranker, and prompt template before the model sees anything. The knowledge base may be fed by crawlers, uploaded PDFs, internal wikis, customer tickets, partner portals, and automated ingestion jobs. Very polished pipeline. Very professional. Very attackable.
That is the point of the paper behind PIDP-Attack: a RAG system can be manipulated not only by poisoning the knowledge base, and not only by injecting a malicious prompt, but by combining both so that the system retrieves the wrong evidence and then confidently answers a different question from the one the user asked.1
The key word is combining. The paper is not just another reminder that prompts are fragile or that databases can contain bad documents. Those lessons have already made their way into every AI governance deck, usually somewhere between the lock icon and the stock photo of a server room. The sharper claim is that RAG security is not modular. A query-path compromise can make a small poisoning footprint much more effective. A poisoning compromise can make an otherwise brittle prompt injection look like grounded evidence. Two mediocre attack surfaces become one coherent attack path.
This is why the mechanism matters more than the headline attack success rate.
The old comfort story: RAG grounds the model, so the answer improves
A standard RAG pipeline has three moving parts:
| Component | Normal role | Security assumption that quietly sneaks in |
|---|---|---|
| Knowledge corpus | Stores external passages, documents, or records | The indexed content is trustworthy enough to ground answers |
| Retriever | Selects top-$k$ passages relevant to the query | Semantic relevance is a proxy for useful evidence |
| Generator | Produces the final answer from query plus retrieved context | Retrieved context can be treated as evidence, not instruction |
This architecture was built to solve a real problem. A standalone model may know old facts, guess under uncertainty, or produce fluent nonsense. RAG gives it fresh documents and asks it to answer based on them. For enterprise AI, this is attractive because it feels auditable: the system can show the source, log the retrieval trace, and claim that the answer came from company data rather than model intuition.
The paper attacks exactly that comfort.
The usual misconception is that securing one layer is enough. Secure the model prompt, and the knowledge base can be treated as a quality-management problem. Secure the corpus, and prompt injection becomes a front-end nuisance. PIDP-Attack says the dangerous case is the joint case: the attacker influences the query path and the corpus path at the same time.
Not by retraining the model. Not by changing model weights. Not by needing white-box access to the retriever. The attack works in the data plane and the input plane, which is precisely where many enterprise systems are messiest.
PIDP turns the query into a steering wheel
The attack begins with a target chosen by the attacker. The target is not the victim’s real question. It is the question the attacker wants the RAG system to answer instead.
For example, a user might ask about a tax rule, a warranty period, or an internal reimbursement policy. The attacker wants the system to output some specific wrong answer tied to a separate target question. PIDP-Attack appends a malicious suffix to the user query so that the combined query contains traces of this attacker-chosen target. The user’s original question is still present, but the retriever now receives a query whose semantic center has been nudged.
That nudge matters because dense retrievers do not read intent the way lawyers read contracts. They rank passages by embedding similarity. If the injected suffix contains the target question, the query representation shifts toward documents aligned with that target. The paper’s language is technical; the business translation is blunt: once a query can be rewritten in transit, it is no longer just user input. It becomes a control channel.
This is the first half of the mechanism.
The second half sits in the corpus. Before the victim query arrives, the attacker inserts a small number of poisoned passages into the retrieval database. Each passage is constructed to be semantically aligned with the attacker’s target question and to support the attacker’s desired wrong answer. The poisoned passage is not just random misinformation. It is retrieval bait plus generation evidence.
Put together, the pipeline looks like this:
| Stage | What the attacker changes | What the system thinks is happening | What actually changes |
|---|---|---|---|
| Query path | A suffix is appended to an arbitrary user query | The retriever receives a normal query string | Retrieval is steered toward the attacker’s target topic |
| Corpus path | A few target-aligned poisoned passages are indexed | The corpus contains another set of passages | Poisoned evidence becomes retrievable |
| Retrieval | Top-$k$ context is assembled | The system selects relevant evidence | The retrieved context may include attacker-crafted support |
| Generation | The LLM answers from query plus context | The answer is grounded | The model emits the attacker’s chosen wrong answer |
That is the elegant little disaster. The attacker does not need to know the victim’s original query in advance. The query suffix and the poisoned documents meet inside the retriever.
It is also why the attack is more interesting than ordinary prompt injection. A prompt-only attack asks the model to ignore instructions. Sometimes the model refuses; sometimes the retrieved context contradicts the injection; sometimes the output format misses the attacker’s target string. PIDP gives the model something much more persuasive than an instruction: it gives it evidence.
Poisoned evidence, yes. But evidence all the same. RAG systems are very polite to evidence.
The attack is simple because the pipeline is simple
The paper describes PIDP as a two-phase process.
First, in offline preparation, the attacker chooses a target question and a target answer. The attack then synthesizes poisoned passages that support that answer and inserts them into the retrieval corpus. In the paper’s implementation, poison generation uses an auxiliary instruction model to produce plausible supporting passages, and the default poison budget is $n=5$ passages per target, with sweeps from $n=1$ to $n=5$.
Second, at inference time, the attacker appends a fixed injection suffix to whatever query the victim submits. The system retrieves the top-$k$ context from a mixture of clean and poisoned candidates. The paper’s default context budget is top-$k=5$, with sweeps from $k=1$ to $k=10$. The LLM then receives the standard RAG prompt: retrieved context plus user query, followed by the answer instruction.
This matters operationally because the attack surfaces are not exotic.
A compromised browser plugin, chatbot wrapper, API proxy, logging middleware, or observability layer can alter the query path. A weak ingestion channel, open contribution workflow, partner document feed, automated crawler, or compromised ETL job can alter the corpus path. Neither compromise needs to look like “the model was hacked.” The model can remain untouched, updated, hosted by a reputable provider, and still be led to a wrong answer by its own retrieval pipeline.
A small note of discipline: this does not mean every RAG deployment is immediately vulnerable in the same way. The paper’s experiments use benchmark QA datasets, a Contriever-style dense retriever, strict matching against attacker-chosen answer strings, and a relatively direct RAG wrapper. Production systems vary. Some separate instructions from retrieved content more carefully. Some apply provenance checks. Some rewrite or sanitize queries. Some use rerankers, citations, access controls, and policy layers. Good. Please continue being less convenient to attackers.
But the mechanism maps cleanly to real enterprise architecture: the query path and corpus path are often managed by different teams, monitored with different tools, and governed by different assumptions. That separation is exactly where compound risk likes to live.
The main result is not “98%”; it is “the two parts amplify each other”
The paper evaluates PIDP on Natural Questions, HotpotQA, and MS-MARCO, using eight instruction-following LLMs in the main comparison. The headline result is strong: PIDP reports an average ASR of 98.125% in its attack-capability comparison table, above PoisonedRAG at 92%, PR-Attack at 97.167%, Disinformation Attack at 88.333%, GGPP at 82.875%, Clean-RAG at 45.778%, GCG at 3.125%, and Corpus poisoning at 1.875%.
Those numbers are attention-grabbing. They are also not the most useful thing for a business reader.
The useful question is: what kind of evidence does each experiment provide?
| Experiment or result | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Main comparison across attacks | Main evidence and comparison with prior work | Compound query-plus-corpus manipulation is more reliable than most single-surface baselines under matched settings | Universal vulnerability across all production RAG systems |
| Retrieval F1 reporting | Mechanism evidence | Attack success depends on whether adversarial passages enter top-$k$ context | That retrieval alone guarantees final answer manipulation |
| Prompt-only and Clean-RAG diagnostics | Ablation | Query injection can steer retrieval or topic direction, but strict targeted misdirection is much weaker without poisoned evidence | That clean retrieval alone is enough for reliable attack success |
| Poison budget sweep | Sensitivity test | More poison passages increase the chance of poisoned evidence entering context; some datasets saturate quickly | That every corpus needs only two poisoned passages |
| Context budget sweep | Robustness/sensitivity test | Larger top-$k$ can both expose more poison and dilute it with clean evidence | That increasing context length is always safer or always riskier |
| Failure-mode discussion | Boundary analysis | PIDP can fail when retrieval is noisy, generation resists the poisoned evidence, or context dilution weakens the poison | That the method is harmless outside the tested settings |
On Natural Questions, PIDP improves over PoisonedRAG by +4% to +16% across seven of eight models in the reported table. On MS-MARCO, the improvement is +5% to +12% across the evaluated models. On HotpotQA, several results are already near saturation, so PIDP often matches rather than dramatically exceeds PoisonedRAG; when the baseline is already close to 100%, there is not much ceiling left. This is not a failure of the thesis. It is a reminder that benchmark saturation can hide mechanism differences.
The retrieval statistics help explain the pattern. For poisoning-oriented methods, retrieval F1 is nearly saturated on Natural Questions and HotpotQA: PIDP reaches 0.992 and 1.000 respectively. On MS-MARCO, PIDP’s retrieval F1 is lower at 0.836, but still above GGPP’s 0.598. That gap matters because MS-MARCO is a noisier retrieval setting. If the poisoned passages do not reliably enter the prompt, the generator has less reason to emit the attacker’s chosen answer.
This distinction between retrieval and generation is the center of the paper.
ASR asks whether the model outputs the attacker-chosen string. Retrieval F1 asks whether the poisoned passages enter the retrieved context. High retrieval F1 is usually necessary for strict attack success, but it is not sufficient. A model can see poisoned evidence and still ignore it. A query injection can shift the topic and still fail to produce the exact target answer. The attack works best when both things happen together: the retriever surfaces the poison, and the generator treats it as answer-worthy evidence.
That is why PIDP is a compound attack, not a clever suffix with better branding.
The ablations show where the machinery actually bites
The paper’s ablation section is easy to skim and easy to misunderstand. It is not a second thesis. It is a diagnostic panel.
The prompt-only condition removes retrieval and poisoning. This isolates whether the injected query alone can redirect the model toward the target. In strict targeted terms, this is weak. In some relaxed diagnostics, it can create topic drift, but topic drift is not the same as forcing the exact wrong answer.
Clean-RAG enables retrieval on the injected query but does not add poisoned passages. This is more interesting. On some model-dataset combinations, Clean-RAG produces high relaxed success. For example, the paper reports relaxed success of 0.68 for qwen2-7b on Natural Questions, 0.90 for llama-3.1-8b on Natural Questions, 0.98 for llama-3.1-8b on HotpotQA, and 0.96 for qwen2-7b on MS-MARCO.
That does not mean clean retrieval reliably produces the attacker’s exact answer. The paper is careful here: these are relaxed diagnostics, counting target-topic steering rather than strict incorrect-answer emission. In plain language, the injected query can drag the system toward the attacker’s topic, but without poisoned passages repeatedly supporting the wrong answer, the final answer is less stable.
This is the correct interpretation:
| Component present | What it can do | Why it is not enough alone |
|---|---|---|
| Query injection only | Shift attention, topic, and sometimes model behavior | Lacks supporting retrieved evidence |
| Clean retrieval on injected query | Retrieve passages related to the attacker’s target topic | Clean passages may not support the attacker’s wrong answer |
| Poisoned corpus only | Supply false evidence for expected target queries | Traditional poisoning often assumes the attacker knows the query |
| Query injection plus poisoned corpus | Steer retrieval toward false evidence for arbitrary victim queries | Still depends on retrieval quality, model behavior, and context composition |
The business implication is subtle but important. A query-path attack may already degrade what evidence users see, even if the corpus is clean. But reliable targeted deception becomes more dangerous when the attacker can also plant evidence. That is the difference between “the system got distracted” and “the system answered from a forged record.”
A forged record is much more useful to an attacker. Also much harder to explain to a regulator with a straight face.
Poison budget is a footprint, not just a parameter
The poison budget $n$ is the number of poisoned passages the attacker can insert or maintain in the corpus. In an experiment, $n$ is a knob. In a business system, it is an operational footprint: how many fake entries survive ingestion, deduplication, moderation, source ranking, and indexing.
The paper’s poison-budget sweep tests $n \in {1,2,3,4,5}$ while keeping $k=5$. Natural Questions and HotpotQA saturate quickly: PIDP reaches above 95% ASR with only $n=2$ poisoned passages for the tested Llama-3 and Qwen models. MS-MARCO behaves differently. It climbs from roughly 30% ASR at $n=1$ to above 90% at $n=5$.
That difference is exactly what a security team should care about.
It says the attack is not merely about the LLM’s willingness to obey bad instructions. It is also about whether the retriever reliably places poisoned evidence in front of the model. On some corpora, a tiny poison footprint is enough. On noisier retrieval corpora, the attacker needs more coverage. That turns ingestion control into a defensive lever.
Not a magical lever. A lever.
If a system can reduce duplicate low-quality entries, rate-limit suspicious submissions, track source provenance, quarantine new documents before indexing, and compare near-duplicate claims across sources, it may push the attacker from the quick-saturation regime into the budget-limited regime. The paper does not prove that these controls defeat PIDP. It does show why they would attack the right part of the chain.
Top-$k$ is not a safety knob wearing a lab coat
Many RAG teams tune top-$k$ as a quality parameter. Too small, and the model misses useful evidence. Too large, and the prompt becomes noisy. Security is often treated as a side effect: more context should give the model more truth, therefore more safety.
The paper makes that assumption look too neat.
In the context-budget sweep, the authors vary $k$ from 1 to 10 while holding $n=5$. Increasing $k$ has two opposing effects. It can increase the chance that poisoned passages enter the prompt, because the retriever includes more passages. But it can also dilute any single poison with additional clean content, reducing the poison’s relative influence.
MS-MARCO again shows the tradeoff clearly. For qwen2.5-7b, the paper reports ASR dropping from 97% at $k=5$ to 82% at $k=10$, consistent with dilution. That does not mean large $k$ is safe. It means $k$ changes the mixture. In some settings it increases exposure; in others it dilutes influence. Either way, it is not a substitute for provenance, sanitization, and instruction separation.
This is one of the paper’s more practical lessons: retrieval configuration is not just an accuracy setting. It is part of the attack surface.
A system that passes ten untrusted passages directly into a generator is not merely “providing richer context.” It is expanding the amount of untrusted text that can compete for the model’s attention. Sometimes the extra text helps. Sometimes it hurts. Always, it should be treated as untrusted.
The three failure modes are more useful than the success story
The paper identifies three practical failure modes: retrieval-limited, generation-limited, and dilution-limited.
Retrieval-limited failure occurs when poisoned passages do not enter top-$k$ reliably. The authors note that on MS-MARCO with $n=1$, retrieval F1 can fall below 30%, suppressing ASR below 50%. The generator may then answer the original query or refuse. This is the cleanest defensive angle: make poisoning hard to retrieve, not merely hard to generate from.
Generation-limited failure occurs when poisoned passages are retrieved but the model does not emit the attacker’s target. A model may rely on prior knowledge, follow system instructions more rigidly, resist the injected query, or refuse. This sounds comforting until one remembers the tradeoff: refusal-prone models can reduce attack success but may also reduce useful answer rates. Enterprise buyers rarely enjoy paying for a model that responds to half their knowledge base with a principled shrug.
Dilution-limited failure occurs when larger context weakens the relative influence of the poison. This is the least intuitive one because it cuts against simple advice. More context is not automatically good; less context is not automatically safe. The effect depends on the corpus, retriever, model, prompt structure, and poison placement.
These failure modes are better than a generic limitation paragraph because they tell defenders where to inspect the pipeline:
| Failure mode | Where to look | Defensive question |
|---|---|---|
| Retrieval-limited | Retriever, index, source ranking, deduplication | Why did this passage enter top-$k$ for this query? |
| Generation-limited | Prompt hierarchy, model behavior, refusal policy | Did the model treat retrieved text as evidence or as instruction? |
| Dilution-limited | Context assembly, reranking, passage ordering | How does changing top-$k$ alter the evidence mixture? |
The paper’s own boundary is clear: PIDP is powerful, but not universal. It performs best when the query string is trusted, corpus ingestion is weakly governed, and the generator follows the retrieved evidence plus injected instruction. If a deployment strictly separates trusted instructions from untrusted content, strips suspicious suffixes, verifies document provenance, and audits retrieved context for instruction-like text, the attack surface shrinks.
That should not be read as “problem solved.” It should be read as “finally, the problem is located.”
What businesses should take from this paper
The direct finding is that a compound query-path plus corpus-path attack can outperform single-surface attacks in benchmark RAG settings. The practical inference is that enterprise RAG security has to be governed end-to-end. The uncertainty is how much risk transfers to any specific production stack.
Those three statements should not be mixed.
| Level | Statement | Confidence |
|---|---|---|
| What the paper directly shows | PIDP improves targeted attack success over several baselines across NQ, HotpotQA, MS-MARCO, and eight LLMs under the authors’ RAG setup | High within the experimental setting |
| What Cognaptus infers for business use | Query rewriting plus weak corpus ingestion is a serious combined risk for enterprise RAG assistants | Strong, but architecture-dependent |
| What remains uncertain | Transfer to production systems with stronger provenance controls, rerankers, query sanitization, access control, and instruction isolation | Requires deployment-specific testing |
For a company deploying RAG, the controls should align with the mechanism, not with a vague fear of “AI hallucination.”
First, treat the query path as a security boundary. Queries should be logged, normalized, and checked for anomalous suffixes or semantic jumps. Middleware that can rewrite user queries should be treated as sensitive infrastructure, not a harmless analytics accessory. If a plugin can append text, it can append policy.
Second, treat ingestion as governance, not data plumbing. Every document entering a retrieval corpus should carry provenance metadata: source, time, author or system, trust tier, update route, and review status. Newly ingested content should not receive the same retrieval authority as long-standing verified policy documents. Freshness is useful. Freshly poisoned is also fresh.
Third, separate evidence from instruction. Retrieved passages should be quoted, delimited, ranked, and treated as untrusted content. The prompt template should make clear that retrieved text is evidence to evaluate, not instructions to obey. If a retrieved passage says “ignore previous instructions,” the system should not admire its initiative.
Fourth, audit retrieval traces, not only final answers. A final answer may look reasonable because the retrieved evidence was already compromised. The paper’s distinction between retrieval F1 and ASR is useful here: a monitoring system should ask not only whether the final answer was wrong, but whether suspicious passages repeatedly entered top-$k$ across unrelated queries.
Fifth, red-team compound paths. Many evaluations test prompt injection against a clean database or poisoning against known queries. PIDP shows why that is too tidy. Realistic red-teaming should combine query rewriting, corpus poisoning, retrieval perturbation, and generation evaluation. Attackers do not respect org charts. They are inconsiderate that way.
The real risk is not hallucination; it is grounded deception
RAG was adopted partly because it made AI systems feel less magical. The model would no longer answer from opaque memory alone. It would retrieve documents, cite sources, and ground its output. That is an improvement.
But grounding changes the failure mode. A hallucinating model invents. A poisoned RAG system cites. The second can be more persuasive because it carries the appearance of evidence.
PIDP-Attack is therefore less about one clever adversarial method and more about a broader architectural lesson: once generation depends on retrieval, retrieval becomes part of the security perimeter. Once retrieval depends on a query, the query path becomes part of the security perimeter. Once the corpus is continuously updated, ingestion becomes part of the security perimeter.
The pipeline does not fail at one point. It learns to lie on cue because several ordinary components do ordinary things in the wrong order: accept a rewritten query, retrieve semantically aligned passages, feed them into a generator, and produce a concise answer.
No dramatic breach. No sci-fi takeover. Just a polished system confidently doing exactly what its compromised inputs asked it to do.
That is the boring version of AI risk. Naturally, it is the one most likely to show up in production.
Cognaptus: Automate the Present, Incubate the Future.
-
Haozhen Wang, Haoyue Liu, Jionghao Zhu, Zhichao Wang, Yongxin Guo, and Xiaoying Tang, “PIDP-Attack: Combining Prompt Injection with Database Poisoning Attacks on Retrieval-Augmented Generation Systems,” arXiv:2603.25164, 2026. https://arxiv.org/abs/2603.25164 ↩︎