RAG’s Receipt Problem: When Correct Answers Don’t Prove Retrieval
Retrieval-augmented generation has become the respectable outfit enterprise AI wears when it wants to look grounded. Add a document store, retrieve a few passages, attach citations, and the answer suddenly appears more disciplined than a free-floating chatbot.
That appearance is useful. It is not proof.
The uncomfortable problem is simple: a model can receive retrieved context, produce a correct answer, and still rely mainly on what it already stored in its parameters. The answer brought a receipt. That does not prove it paid.
Two recent papers make this problem harder to ignore. “Generating Leakage-Free Benchmarks for Robust RAG Evaluation” proposes SeedRG, a benchmark-generation framework designed to reduce cases where benchmark questions are already answerable from model memory.[1] “The Attribution Blind Spot: Detecting When Language Models Rely on Memory Rather Than Retrieved Context” proposes Computational Reality Monitoring, or CRM, a diagnostic framework for detecting internal representational differences when a model is conditioned on retrieved context versus no context.[2]
The useful reading is not “Paper A says this, Paper B says that.” That would be tidy and mostly pointless. The stronger interpretation is a complementary logic chain:
- RAG benchmarks can be contaminated by parametric memory.
- Leakage-resistant benchmarks are needed to test whether retrieval actually matters.
- Even when context is supplied, output-level correctness does not prove context use.
- Internal diagnostics can provide partial evidence about whether retrieval changes model computation.
- RAG trust should therefore be treated as staged evidence collection, not as a single score on a public QA dataset.
For business users, the lesson is blunt: if a vendor says, “Our RAG system answers correctly with citations,” the next question should be, “Compared with what?” If the base model could answer the same question without retrieval, the RAG layer may be mostly theater. Sophisticated theater, perhaps. Still theater.
The shared problem: memory can impersonate retrieval
RAG is supposed to solve a practical business problem: the model should answer from current, authorized, domain-specific evidence instead of from stale or opaque training data. This is especially important in legal review, compliance, customer support, internal knowledge management, finance, healthcare support, and any setting where “the model seemed confident” is not a governance policy.
But the moment a model already knows the answer, standard RAG evaluation becomes ambiguous. Did the retrieved passage improve the answer? Did it merely sit in the prompt while the model answered from memory? Did the citation reflect actual computational reliance, or only post-hoc compatibility?
The two papers approach that ambiguity at different layers:
| Layer of the problem | Practical question | Paper role | What it catches | What it does not fully prove |
|---|---|---|---|---|
| Benchmark validity | Can the question be answered without retrieval? | SeedRG | Public or static benchmarks whose questions are already inside model knowledge | That every generated item perfectly preserves real-world benchmark value |
| Source attribution | Did retrieved context actually change model computation? | CRM | Internal divergence invisible in the final text | Per-answer certification that the model used retrieval rather than memory |
This division matters. SeedRG works before evaluation, trying to make the test fair. CRM works after or during generation, trying to reveal whether the model’s internal pathway changes when context is supplied. One cleans the exam. The other inspects the student’s scratch work.
Step 1: first clean the exam
SeedRG begins from a problem many RAG evaluations politely walk around: if a language model can answer benchmark questions with no retrieved context, the benchmark is no longer testing retrieval. It is testing a mixture of retrieval, memorization, reasoning, and benchmark exposure.
The paper formalizes this using two ideas:
| Criterion | Meaning | Why it matters |
|---|---|---|
| Leakage Error | How often the model answers correctly without retrieved context | High leakage means retrieval is not needed for many items |
| Answerability Accuracy | How much the correct context helps the model answer | High answerability means retrieval has room to matter |
In the authors’ experiments, existing benchmarks show substantial no-context accuracy, with leakage varying by dataset and model. On the original benchmarks, models answered anywhere from roughly one-third to well over half of the questions without retrieval, and more still in some reported QASC settings. That is not a small nuisance. It is a measurement problem.
If a model answers a HotpotQA-style or QASC-style item correctly without seeing the support documents, adding retrieval afterward does not tell us much about retrieval quality. It tells us the model was already prepared for the quiz. Congratulations to the model; fewer congratulations to the evaluator.
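The two criteria can be computed from paired evaluation runs. The sketch below is a minimal illustration, assuming each benchmark item records whether the model answered correctly with no context and with the gold context. The `ItemResult` record and both function names are hypothetical, and the paper's exact formulas may differ.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ItemResult:
    # Hypothetical per-item record; field names are illustrative assumptions.
    question_id: str
    correct_no_context: bool    # correct with retrieval disabled
    correct_gold_context: bool  # correct when the gold passage is supplied


def leakage_error(results: Sequence[ItemResult]) -> float:
    """Fraction of items answered correctly without any retrieved context."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(r.correct_no_context for r in results) / len(results)


def answerability_accuracy(results: Sequence[ItemResult]) -> float:
    """Fraction of items answered correctly when the gold context is given."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(r.correct_gold_context for r in results) / len(results)


# Toy data for illustration only, not results from the paper.
results = [
    ItemResult("q1", True, True),
    ItemResult("q2", False, True),
    ItemResult("q3", False, True),
    ItemResult("q4", False, False),
]
print(leakage_error(results))           # 0.25
print(answerability_accuracy(results))  # 0.75
```

A benchmark in good shape would show the first number low and the second high: the model cannot answer alone, but can once the right evidence is in front of it.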
SeedRG’s proposed fix is not simply “ask an LLM to make new questions.” The paper tests that kind of direct generation and finds it unreliable: generated questions may still use familiar entities, leak parametric knowledge, or create factual inconsistencies. Instead, SeedRG starts from existing multi-hop benchmark instances and transforms them while trying to preserve the reasoning structure.
Its pipeline has two central moves.
First, type-constrained entity replacement. If the seed item involves a composer, the replacement should also be composer-like; if it involves a city, the replacement should remain city-like. The goal is to move entities outside the model’s likely parametric knowledge while preserving the semantic role of each entity in the reasoning chain.
Second, reasoning graph verification. Multi-hop questions are not difficult only because of vocabulary. They are difficult because of dependency structure: which facts connect, how many hops are required, and whether the answer depends on a particular graph of entities and relations. SeedRG extracts reasoning graphs before and after transformation and rejects generated items when the structure is not preserved.
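The two moves can be sketched together. This is a simplified illustration, not the paper's implementation: it assumes entity types are already known, that reasoning graphs are plain sets of (head, relation, tail) triples, and that the replacement pool (invented names here) has been screened for unfamiliarity. All function names are hypothetical, and in a real pipeline the new graph would be re-extracted from the generated item rather than built directly.

```python
import random
import re

# Hypothetical replacement pool grouped by entity type. The names are made up.
REPLACEMENT_POOL = {
    "composer": ["Aldric Venholm", "Mira Tesselane"],
    "city": ["Port Quellan", "Varnholt"],
    "country": ["Tessaly", "Ormund"],
}


def choose_replacements(entity_types, rng):
    """Pick one unused same-type replacement for each seed entity."""
    mapping, used = {}, set()
    for entity, etype in entity_types.items():
        candidates = [c for c in REPLACEMENT_POOL.get(etype, []) if c not in used]
        if not candidates:
            raise ValueError(f"no unused replacement for type {etype!r}")
        mapping[entity] = rng.choice(candidates)
        used.add(mapping[entity])
    return mapping


def rewrite_text(text, mapping):
    """Apply all replacements in a single pass so swaps cannot cascade."""
    if not mapping:
        return text
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(n) for n in names))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


def structure_preserved(seed_graph, new_graph, mapping):
    """True if new_graph equals seed_graph after renaming entities via mapping."""
    if len(set(mapping.values())) != len(mapping):
        return False  # two seed entities would collapse into one node
    renamed = {(mapping.get(h, h), rel, mapping.get(t, t)) for h, rel, t in seed_graph}
    return renamed == set(new_graph)


seed_question = "In which country is the birthplace of the composer Edvard Grieg?"
seed_graph = {("Edvard Grieg", "born_in", "Bergen"), ("Bergen", "located_in", "Norway")}
entity_types = {"Edvard Grieg": "composer", "Bergen": "city", "Norway": "country"}

mapping = choose_replacements(entity_types, random.Random(0))
new_question = rewrite_text(seed_question, mapping)
new_graph = {(mapping[h], rel, mapping[t]) for h, rel, t in seed_graph}
dropped_hop = {(mapping["Edvard Grieg"], "born_in", mapping["Bergen"])}

print(new_question)
print(structure_preserved(seed_graph, new_graph, mapping))    # True
print(structure_preserved(seed_graph, dropped_hop, mapping))  # False
```

The rejected case is the interesting one: a transformed item that silently loses a hop is unfamiliar, but no longer the same test.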
This is the key business insight from SeedRG: a good RAG benchmark needs to be both unfamiliar and comparable. If the generated benchmark is unfamiliar but much easier, it flatters retrieval. If it is unfamiliar but incoherent, it punishes retrieval unfairly. If it is familiar, it tests memory. Benchmark design becomes less like making trivia questions and more like constructing a controlled experiment.
The results support that framing. SeedRG reduces no-context leakage much more effectively than direct generation and restores more meaningful differentiation among RAG systems. On original benchmarks, the tested RAG systems often cluster tightly, making retrieval methods look similar. On SeedRG-generated benchmarks, the spread widens, revealing differences that were previously hidden.
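One simple way to quantify "clustered" versus "spread out" is to compare the range and standard deviation of system accuracies on each benchmark version. The sketch below assumes that reading; the accuracy figures are invented purely to show the calculation and are not results from the paper.

```python
from statistics import pstdev


def spread(scores):
    """Range and population standard deviation of per-system accuracies."""
    values = list(scores.values())
    if len(values) < 2:
        raise ValueError("need at least two systems to measure spread")
    return {"range": max(values) - min(values), "stdev": pstdev(values)}


# Invented accuracies for three hypothetical RAG systems.
original = {"rag_a": 0.71, "rag_b": 0.72, "rag_c": 0.70}
regenerated = {"rag_a": 0.58, "rag_b": 0.44, "rag_c": 0.35}

print(spread(original))
print(spread(regenerated))
```

If the spread on a leakage-resistant version is much wider than on the original, the original was probably hiding differences between systems behind shared memorization.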
That is the first checkpoint: before asking whether a RAG pipeline works, make sure the test requires retrieval in the first place.
Step 2: then inspect whether context changes computation
SeedRG handles the upstream problem: contaminated benchmarks. CRM handles a different problem: even with retrieved context in the prompt, the output may not reveal whether context actually governed generation.
The CRM paper calls this the attribution blind spot. When retrieved context overlaps with data seen during pretraining, two computational routes can lead to the same surface answer:
- The model reads from the provided context.
- The model recalls from parametric memory.
If both routes produce a correct and context-consistent answer, ordinary output-level checks cannot distinguish them. Faithfulness metrics, citation checks, and answer correctness can all look good while source reliance remains unknown.
CRM’s idea is to compare model behavior under paired conditions: with retrieved context and without retrieved context. Instead of relying only on the final text, it examines three levels of divergence:
| CRM level | Signal type | Access requirement | Interpretation |
|---|---|---|---|
| Level 1 | Sequence-level semantic difference between generated outputs | Black-box | Does the final generated text change? |
| Level 2 | Token-level distributional divergence | Grey-box | Do output probabilities shift during generation? |
| Level 3 | Latent trajectory shift in hidden states | White-box | Do internal representations move differently with context? |
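The Level 2 row can be made concrete with a small sketch. It assumes next-token probability distributions are available for both conditions at aligned positions, and it uses Jensen-Shannon divergence as the measure. Both choices are assumptions for illustration; the paper's actual divergence measure and alignment procedure are not reproduced here.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same vocabulary")
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def mean_token_divergence(with_context, no_context):
    """Average per-step divergence over the positions both runs share."""
    steps = min(len(with_context), len(no_context))
    if steps == 0:
        raise ValueError("need at least one aligned step")
    total = sum(js_divergence(with_context[i], no_context[i]) for i in range(steps))
    return total / steps


# Toy three-token vocabulary: step 1 shifts with context, step 2 does not.
with_context = [[0.8, 0.1, 0.1], [0.2, 0.6, 0.2]]
no_context = [[0.2, 0.7, 0.1], [0.2, 0.6, 0.2]]
print(mean_token_divergence(with_context, no_context))
```

A score near zero means the context left the output probabilities untouched, which is exactly the situation the final text alone cannot reveal.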
The paper’s central empirical result is that the meaningful signal is largely latent. Across nine model variants, CRM distinguishes member-conditioned from non-member-conditioned generation with AUC values reported in the 0.71–0.95 range, while token-likelihood baselines remain near 0.55–0.60. Removing the surface-level features changes performance very little, indicating that the detection signal lives mainly in hidden representations rather than in the visible output.
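For readers weighing those AUC figures, it helps to recall what the number means: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with 0.5 being chance. The sketch below computes it by direct pairwise comparison. Treating member-conditioned generations as the positive class is an assumption made here for illustration, and the scores are toy values.

```python
def roc_auc(positive_scores, negative_scores):
    """AUC as the share of positive/negative pairs ranked correctly.

    Ties count as half a correct ranking.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Toy detector scores, not data from the paper.
member_scores = [0.9, 0.8, 0.4]
non_member_scores = [0.7, 0.3, 0.2]
print(roc_auc(member_scores, non_member_scores))  # 8 of 9 pairs ranked correctly
```

Read this way, a baseline near 0.55 to 0.60 is barely better than a coin flip, while the upper end of the reported CRM range separates the two conditions in most pairs.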
This is exactly why RAG evaluation cannot stop at “the answer matches the document.” The answer may match because the document was used. It may also match because the model already knew. If the final output is the only evidence, both worlds look the same.
The CRM paper also adds useful caution. It does not claim to certify the source of every individual generation. It detects membership-conditioned representational divergence: whether internal computation differs in aggregate when supplied context comes from data the model likely saw during pretraining versus held-out data. That is a proxy, not a final source-attribution certificate.
This caveat is not a weakness to hide. It is the paper being honest, which is refreshing in a field where every dashboard wants to be called “trustworthy.” CRM gives evidence that internal source-related signals exist and can be measured. It does not yet give a production guarantee that a specific answer came from a specific retrieved passage.
For business interpretation, that boundary is crucial. CRM is not a magic lie detector for models. It is closer to an audit instrument: useful for controlled testing, model comparison, failure analysis, and possibly middleware auditing where the enterprise controls the model stack. For closed-source API-only systems, the deeper CRM levels may not be available unless the provider exposes internal traces or suitable diagnostic endpoints.
The combined logic chain
Put the two papers together and a clearer RAG evaluation architecture appears.
| Chain step | Failure mode | Better evidence |
|---|---|---|
| 1. Test without context | The model already knows the answer | No-context baseline and leakage measurement |
| 2. Test with gold context | The question may be unanswerable or badly generated | Gold-context answerability check |
| 3. Preserve reasoning structure | Synthetic questions become easier, harder, or incoherent | Reasoning graph preservation |
| 4. Compare real RAG systems | Original benchmarks hide retrieval differences | Leakage-resistant renewable benchmarks |
| 5. Inspect internal change | Correct output still does not prove context use | Paired with-context / no-context representation diagnostics |
| 6. State the evidence boundary | Evaluation is mistaken for certification | Clear separation between benchmark validity, attribution evidence, and per-answer proof |
This is the spine of the argument: RAG trust should be layered.
SeedRG says: do not use stale or leaky exams and pretend they measure retrieval. CRM says: do not use correct outputs and pretend they prove source reliance. Together, they push RAG evaluation away from leaderboard comfort and toward operational evidence.
That shift matters because enterprise RAG systems are increasingly sold as governance-friendly. They promise answers grounded in approved documents, policy manuals, internal knowledge bases, case histories, and customer records. But if the evaluation does not separate memory from retrieval, governance teams may approve systems based on an illusion of control.
The illusion is especially dangerous when the model’s memory is partially correct but stale. A model might answer from old policy knowledge while the retrieved document contains an updated rule. If the final answer happens to remain plausible, shallow evaluation may not catch the mismatch. Worse, the system may attach a citation to a document it did not computationally rely on. That is not grounding. That is decoration.
What each paper contributes to a business due-diligence workflow
A practical RAG due-diligence process should not ask only, “What is your accuracy?” It should ask how the accuracy was earned.
1. Require no-context baselines
Every serious RAG evaluation should include a no-context condition. If the model performs well without retrieval, the benchmark is partly measuring parametric memory. That does not make the model bad. It makes the evaluation ambiguous.
For business buyers, this is a simple procurement question:
“Show us the same benchmark with retrieval disabled.”
If the score barely changes, either retrieval is unnecessary for that task or the benchmark is too leaky to evaluate retrieval. In both cases, the RAG claim weakens.
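The procurement check reduces to a subtraction and two thresholds. The sketch below is one possible framing: the function name and both default thresholds (`min_lift`, `max_leakage`) are hypothetical and would need to be set per task.

```python
def assess_retrieval_claim(acc_with_retrieval, acc_no_context,
                           min_lift=0.10, max_leakage=0.30):
    """Compare the same benchmark with retrieval on and off.

    Thresholds are illustrative assumptions, not recommended values.
    """
    lift = acc_with_retrieval - acc_no_context
    return {
        "lift": lift,
        "retrieval_matters": lift >= min_lift,
        "benchmark_leaky": acc_no_context > max_leakage,
    }


# Invented numbers: a vendor report where retrieval barely moves the score.
report = assess_retrieval_claim(acc_with_retrieval=0.82, acc_no_context=0.78)
print(report["retrieval_matters"])  # False
print(report["benchmark_leaky"])    # True
```

When retrieval does not matter and the benchmark is leaky, the score says more about the base model than about the RAG layer.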
2. Use gold-context checks before testing retrieval
A failed RAG answer can come from poor retrieval, poor generation, unclear questions, or impossible context. Gold-context evaluation helps separate these factors. If the correct supporting document is supplied and the model still fails, the issue is not retrieval alone.
SeedRG’s leakage and answerability framing is valuable here because it separates two questions that are often mixed together:
- Is the item answerable only with context?
- Does the correct context actually enable the model to answer?
A good benchmark needs both.
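Those two questions, plus the retrieval-versus-generation split, fit in a small triage routine. This is a sketch under simple assumptions: each item has been graded under no-context and gold-context conditions, and the pipeline logs whether the gold document was retrieved. The labels and function names are hypothetical.

```python
from enum import Enum


class ItemStatus(Enum):
    LEAKY = "answerable without context"
    VALID = "needs context, and gold context suffices"
    UNANSWERABLE = "fails even with gold context"


def triage_item(correct_no_context, correct_gold_context):
    """Classify a benchmark item before using it to judge retrieval."""
    if correct_no_context:
        return ItemStatus.LEAKY
    if correct_gold_context:
        return ItemStatus.VALID
    return ItemStatus.UNANSWERABLE


def diagnose_rag_result(status, gold_retrieved, rag_correct):
    """Attribute a pipeline outcome, but only on items that can be diagnostic."""
    if status is not ItemStatus.VALID:
        return "item not diagnostic"
    if rag_correct:
        return "ok"
    return "generation failure" if gold_retrieved else "retrieval failure"


status = triage_item(correct_no_context=False, correct_gold_context=True)
print(status.name)                                                         # VALID
print(diagnose_rag_result(status, gold_retrieved=False, rag_correct=False))  # retrieval failure
print(diagnose_rag_result(status, gold_retrieved=True, rag_correct=False))   # generation failure
```

Only items in the `VALID` bucket can say anything about the retriever; the rest measure memory or question quality.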
3. Renew the benchmark
Static public benchmarks age badly. As models are trained on broader corpora, benchmark content may become part of parametric memory. The business equivalent is evaluating a new employee using an exam whose answer key has been posted online for years.
SeedRG’s renewable-generation idea is useful beyond academic leaderboards. Enterprises can create rotating evaluation sets from internal documents, with entity replacement or controlled perturbation where appropriate, then test whether retrieval remains necessary.
This needs care. In regulated domains, synthetic transformation must not distort legal, medical, financial, or contractual meaning. But the principle is sound: evaluation data should be refreshed faster than models can memorize it.
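A refresh policy can start very small. The sketch below assumes a team keeps, for each evaluation item, a record of whether successive models answered it with retrieval disabled, and retires items once more than a chosen number of models have done so. The function name and the `max_leaky_models` threshold are hypothetical.

```python
def items_to_retire(no_context_hits, max_leaky_models=1):
    """Return ids of items answered without context by too many models.

    no_context_hits maps item id -> list of booleans, one per model checked.
    """
    return sorted(
        item_id
        for item_id, hits in no_context_hits.items()
        if sum(hits) > max_leaky_models
    )


# Toy history across three model versions.
history = {
    "q1": [False, False, False],
    "q2": [False, True, True],
    "q3": [True, True, True],
}
print(items_to_retire(history))  # ['q2', 'q3']
```

Retired items are candidates for regeneration or replacement, subject to the domain review described above.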
4. Preserve task difficulty, not just topic
A synthetic RAG benchmark is not automatically valid because it is new. Direct generation may create questions that leak, contradict themselves, or change reasoning difficulty. SeedRG’s reasoning-graph check points toward a more disciplined method: preserve the structure of the task, not merely its theme.
For enterprise teams, this suggests a useful review question:
“When you generated or refreshed the evaluation set, how did you verify that the new questions preserved the difficulty and reasoning structure of the original task?”
If the answer is “we asked an LLM,” keep asking.
5. Add internal diagnostics where the model stack allows it
For open models or provider-controlled deployments with sufficient instrumentation, CRM-like diagnostics can become part of RAG audit infrastructure. The point is not to replace answer evaluation. The point is to add another evidence layer when output-level signals are insufficient.
The CRM paper’s prototype suggests that compact layer-level trajectory features can be logged and monitored with practical latency in an experimental setup. That does not make it production-ready across all domains, but it makes the direction concrete: RAG observability should include not only retrieved documents and final answers, but also evidence about whether context changes model computation.
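What a compact layer-level feature might look like can be sketched as follows. This is an assumption-laden illustration, not the CRM feature set: it presumes an open model from which one pooled hidden-state vector per layer has already been extracted for each condition, and it uses cosine distance as the shift measure. The extraction step itself is outside the sketch.

```python
import math


def cosine_distance(u, v):
    """1 minus cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero vector has no direction")
    return 1.0 - dot / (norm_u * norm_v)


def trajectory_shift(hidden_with_context, hidden_no_context):
    """Per-layer distance between pooled hidden states of the paired runs."""
    if len(hidden_with_context) != len(hidden_no_context):
        raise ValueError("both runs must cover the same layers")
    return [
        cosine_distance(a, b)
        for a, b in zip(hidden_with_context, hidden_no_context)
    ]


# Toy pooled states for a three-layer model with two-dimensional activations.
with_context = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
no_context = [[1.0, 0.0], [0.8, 0.6], [1.0, 0.0]]
print(trajectory_shift(with_context, no_context))
```

A list of one number per layer is small enough to log alongside the retrieved documents and the final answer, which is what makes this kind of signal plausible as an audit trail.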
For many businesses, this will remain a medium-term capability. Today, most teams using closed APIs cannot inspect hidden states. Still, the procurement implication is immediate: vendors claiming “grounded generation” should explain what evidence they use to distinguish context reliance from memory recall.
What the papers show — and what business interpretation adds
It is worth separating the research claims from the business reading.
| Question | What the papers show | Business interpretation |
|---|---|---|
| Are existing RAG benchmarks always reliable? | SeedRG shows that original benchmarks can have high no-context accuracy, reducing their ability to test retrieval. | Do not accept RAG scores unless no-context leakage has been measured. |
| Can synthetic benchmark generation solve leakage automatically? | Direct generation reduces leakage only partially and can introduce factual inconsistency or difficulty shifts. | Synthetic evaluation needs validation rules, not just fresh-looking questions. |
| Can output correctness prove context use? | CRM argues that context-consistent outputs cannot distinguish memory from retrieval when both lead to the same answer. | Citations and correctness are necessary but not sufficient evidence of grounding. |
| Can internal diagnostics help? | CRM finds latent representational signals that distinguish member-conditioned from non-member-conditioned generation across tested models, with important boundary conditions. | For controllable model stacks, internal monitoring can improve RAG auditability, but it is not per-answer proof. |
| Is the problem solved? | No. SeedRG depends on generation and filtering quality; CRM remains a proxy for source attribution. | Treat RAG governance as staged risk reduction, not certification theater. |
This distinction prevents overclaiming. The papers do not prove that every deployed RAG system is fake. They also do not give a complete method for certifying every answer’s source. What they do provide is a sharper evaluation vocabulary: leakage, answerability, benchmark aging, reasoning-structure preservation, latent trajectory shift, and source-attribution proxy gaps.
That vocabulary is already valuable. Many RAG projects fail not because teams lack vector databases, but because they lack precise questions about what their evaluation is measuring.
A practical RAG evaluation stack
A business-ready RAG evaluation stack inspired by these papers might look like this:
| Stage | Evaluation action | Minimum evidence | Stronger evidence |
|---|---|---|---|
| Benchmark hygiene | Remove or flag questions answerable without context | No-context accuracy report | Renewable leakage-resistant benchmark generation |
| Context usefulness | Verify that correct context improves answers | Gold-context evaluation | Per-question answerability scoring |
| Difficulty control | Prevent synthetic items from changing task complexity | Human review or heuristic checks | Reasoning-graph or dependency-structure validation |
| Retrieval comparison | Compare RAG methods under leakage-resistant tests | Accuracy by retriever/generator pair | Stability across regenerated benchmark versions |
| Attribution audit | Test whether context changes model computation | Output-delta and probability-delta checks | Hidden-state or layer-level diagnostics where available |
| Governance boundary | Document what is and is not proven | Evaluation memo | Audit logs, threshold calibration, domain-specific failure analysis |
This is not cheap compared with throwing a PDF folder into a vector database and calling it a product. But neither is cleaning up after a model that cited the right policy while reasoning from the wrong one.
For small teams, the practical starting point is simple:
- Always run a no-context baseline.
- Always run a gold-context baseline.
- Separate retrieval failure from generation failure.
- Refresh evaluation items regularly.
- Treat citations as claims requiring validation, not as proof.
For larger teams or AI vendors, the bar should be higher:
- Build leakage-resistant internal benchmarks.
- Track benchmark aging.
- Validate synthetic questions structurally.
- Instrument open models for context-sensitivity diagnostics.
- Report what the system cannot prove.
That last point is not legal pessimism. It is product maturity.
The hidden risk: RAG can pass the wrong test
The most dangerous RAG system is not always the one that fails obviously. It may be the one that passes a weak benchmark beautifully.
If the benchmark is leaky, the model can look retrieval-competent because it remembers the answers. If the output is correct, the system can look grounded because the citation is compatible. If the retrieved document overlaps with pretraining data, even a faithful-looking answer may not reveal whether the model used the document or its own memory.
This is the receipt problem again. A cited answer is a receipt-like object. But serious evaluation asks whether the receipt corresponds to the transaction.
SeedRG helps by making the transaction harder to fake: the question should require retrieval. CRM helps by looking for internal signs that the model’s computation changes when context is supplied. Neither method is complete alone. Together, they suggest a better standard.
RAG evaluation should not be a beauty contest among polished answers. It should be an evidence chain:
- Was retrieval necessary?
- Was the context sufficient?
- Was task difficulty preserved?
- Did the retrieval system find the right evidence?
- Did the model’s computation respond to that evidence?
- What remains unproven?
That final question is the one mature AI teams should learn to enjoy. It is where governance becomes useful rather than decorative.
Closing: from grounded-looking to grounded-enough
RAG is still one of the most practical patterns for enterprise AI. The point is not to dismiss it. The point is to stop evaluating it as if the presence of retrieval automatically creates grounding.
The two papers point toward a more disciplined view. SeedRG shows that RAG benchmarks need to resist parametric shortcuts and preserve reasoning difficulty. CRM shows that even correct, context-consistent outputs may hide an attribution blind spot, and that some of the missing evidence lives inside the model’s representations.
The combined message is not glamorous, which is probably why it is useful: RAG trust is not a single score. It is a chain of controls.
For business leaders, the practical takeaway is straightforward. Do not ask whether the system can answer. Ask whether it needed the approved evidence to answer, whether the test made that necessity visible, and whether the vendor can show where the evidence boundary lies.
A grounded answer should not merely look grounded. It should survive an audit of how grounding was tested.
Cognaptus: Automate the Present, Incubate the Future.
[1] Generating Leakage-Free Benchmarks for Robust RAG Evaluation, arXiv:2605.08838, 2026. HTML full text: https://arxiv.org/html/2605.08838. PDF fallback: https://arxiv.org/pdf/2605.08838.
[2] The Attribution Blind Spot: Detecting When Language Models Rely on Memory Rather Than Retrieved Context, arXiv:2605.26778, 2026. HTML full text: https://arxiv.org/html/2605.26778. PDF fallback: https://arxiv.org/pdf/2605.26778.