RAG and the Art of Not Dropping the Answer

A RAG team usually starts with a familiar ambition: make the retrieved context smarter.

The raw document feels too long. The search snippet feels too primitive. The page structure looks messy. A query-focused summary sounds more elegant. A proposition list sounds more machine-readable. A paraphrase from a strong LLM sounds, at least cosmetically, like an upgrade. So the team builds another representation layer between retrieval and generation, hoping the model will reward the extra sophistication.

Sometimes it does. Often it does not. And when it does, the reason may be less glamorous than the architecture diagram suggests.

The paper “On the impact of retrieved content representations in RAG pipelines” studies exactly this middle layer: after documents are retrieved, but before they are passed to the generator, what form should those documents take?[1] Its answer is not that representation never matters. It is sharper than that: across a controlled comparison of fourteen retrieved-document representations, the dominant mechanism is whether the transformation preserves answer-bearing content. When the answer survives, wording, structure, length, query-dependence, and even whether the text was produced by an LLM explain surprisingly little of the remaining accuracy difference.

That sounds obvious only until you notice what it quietly does to a large portion of RAG optimization work. If a compression method, snippet generator, proposition extractor, or LLM rewrite appears to improve generation, the first question should not be, “What clever representational form did it create?” The first question should be, “Did it simply keep the answer while the other method dropped it?”

A little rude to the prettier methods, perhaps. But RAG systems are not paid by the adjective.

The paper studies the representation layer, not retrieval quality

The study is designed to isolate one part of the RAG pipeline. Retrieval is held fixed. The authors use KILT-NQ, a single-hop, short-answer question answering dataset based on Natural Questions and Wikipedia. They filter the validation set down to 2,391 queries where answer and provenance information can support their evaluation design. For every query, five Wikipedia pages are retrieved. If the retrieved set does not already contain a gold answer-bearing page, the authors substitute one into the fifth position. This “gold injection” happens for 933 queries, or 39% of the final set.

That design choice matters. In a normal production RAG system, retrieval failure is often the main failure. Here, the study deliberately reduces that problem. The question is not “Can the retriever find the right page?” It is “Once the answer-bearing page is present, what happens when we transform the retrieved documents before generation?”
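The gold-injection rule is simple enough to write down. The following is a minimal Python sketch of the substitution as the article describes it, not the authors' code: the function name `inject_gold`, the use of page ids as strings, and the choice of which gold page to inject are all assumptions made for illustration.

```python
from typing import List, Set


def inject_gold(retrieved: List[str], gold_ids: Set[str], k: int = 5) -> List[str]:
    """Return the top-k page ids, guaranteeing at least one gold page.

    If none of the top-k ids is gold, the last slot (position k) is
    overwritten with a gold id. Taking the smallest gold id is an
    arbitrary but deterministic choice made for this sketch.
    """
    if not gold_ids:
        raise ValueError("query has no known gold page")
    top_k = list(retrieved[:k])
    if any(page_id in gold_ids for page_id in top_k):
        return top_k  # retrieval already succeeded; ranking is left untouched
    return top_k[: k - 1] + [min(gold_ids)]


# Hypothetical page ids, for illustration only.
print(inject_gold(["a", "b", "c", "d", "e"], {"g"}))  # ['a', 'b', 'c', 'd', 'g']
print(inject_gold(["a", "g", "c", "d", "e"], {"g"}))  # ['a', 'g', 'c', 'd', 'e']
```

Keeping the injected page in the last position means the other four retrieved pages stay as the retriever ranked them, so the context still contains realistic non-gold material.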

The authors then compare fourteen representations:

| Category | Representation type | Examples in the paper | Main purpose of the test |
| --- | --- | --- | --- |
| Baseline | Original prepared document | Original | Establish performance when retrieved pages are passed through after preparation. |
| Selection | Keep selected spans or tokens | BM25 snippets, cross-encoder snippets, RECOMP-extractive-50, LLMLingua2-50, Selective-context-50 | Test whether shorter selected content can preserve accuracy. |
| Summarisation | Generate shorter text | Section summaries; query-focused abstractive snippets | Test whether LLM-generated compression helps the generator. |
| Reformulation | Restructure or rewrite content | Paraphrases; proposition lists | Test whether wording or structure matters when content is mostly preserved. |

The answer generators are four open models in the 8B–12B range: Qwen 3.5 9B, Gemma 3 12B, Mistral-Nemo 12B, and Llama 3.1 8B. Accuracy is judged by Qwen 2.5 32B rather than exact match, because exact string matching is brittle for short-answer QA. The same judge is also used to measure answer retention: whether the transformed gold document still supports the known gold answer.

That retention measure is the paper’s central instrument. It changes the experiment from a beauty contest among representation formats into a mechanism test.
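To make the instrument concrete, here is a minimal Python sketch of a retention rate. It is an illustration under stated assumptions, not the paper's implementation: the paper uses an LLM judge, while the default `substring_judge` below is a crude stand-in that inherits the brittleness of string matching. The names `Judge`, `substring_judge`, and `retention_rate` are hypothetical, the sample documents are made up, and the sketch ignores the query, so a query-dependent transformation would need the query passed in as well.

```python
from typing import Callable, Iterable, Tuple

# (transformed_text, gold_answer) -> does the text still support the answer?
Judge = Callable[[str, str], bool]


def substring_judge(text: str, answer: str) -> bool:
    """Crude stand-in for an LLM judge: case-insensitive substring match."""
    return answer.lower() in text.lower()


def retention_rate(
    gold_pairs: Iterable[Tuple[str, str]],
    transform: Callable[[str], str],
    judge: Judge = substring_judge,
) -> float:
    """Fraction of gold documents that still support their answer after transform."""
    pairs = list(gold_pairs)
    if not pairs:
        raise ValueError("need at least one (document, answer) pair")
    kept = sum(judge(transform(doc), answer) for doc, answer in pairs)
    return kept / len(pairs)


# Toy data, invented for this example.
pairs = [
    ("Paris is the capital of France. It lies on the Seine.", "Paris"),
    ("The Nile flows north. It ends in Egypt.", "Egypt"),
]


def first_sentence(doc: str) -> str:
    """A deliberately lossy 'snippet' transformation."""
    return doc.split(". ")[0]


print(retention_rate(pairs, lambda doc: doc))  # 1.0
print(retention_rate(pairs, first_sentence))   # 0.5
```

Swapping `substring_judge` for a call to a real judge model only changes the `judge` argument; the rate itself stays a plain fraction over gold documents.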

The retention mechanism explains why very different formats behave alike

The main result is easiest to understand through a puzzle.

Paraphrase rewrites every paragraph. Proposition extraction turns prose into bullet-like factual statements. LLMLingua2-50 prunes tokens so aggressively that the output may become incoherent. Snippet-abstractive-Llama compresses a document to roughly 1.7% of its original word count. These are not subtle variations of the same format. They are different ways of presenting retrieved content to a generator.

Yet when they retain the answer at high rates, they often sit near the original baseline.

The original baseline has 98.6% answer retention. Paraphrase-Llama reaches 99.2%. Propositions-Gemma reaches 98.8%. LLMLingua2-50 reaches 98.5%. Snippet-abstractive-Llama is lower but still high at 95.2%, despite its radical compression. Their accuracy figures differ, but not in a way that supports a simple story such as “bullet lists are better,” “LLM text is better,” “shorter is better,” or “query-focused is better.”

The contrast with low-retention snippets is more revealing. Snippet-BM25 reduces documents to 3.6% of original length, but retains the answer in only 52.8% of gold documents. Snippet-cross-encoder is similarly small at 3.5% of original length, with 74.3% retention. Snippet-abstractive-Llama is even smaller at 1.7%, yet retains 95.2% of answers. The size category is similar; the outcome is not. The paper reports that snippet-abstractive-Llama stays within 2.3 accuracy points of baseline across all four generators, while BM25 and cross-encoder snippets lose 13–22 points across Qwen, Gemma, and Llama.

So the operative mechanism is not compression itself. It is answer-preserving compression.

| Representation | Relative size | Retention | What the comparison teaches |
| --- | --- | --- | --- |
| Original | 100.0% | 98.6% | The baseline already preserves nearly all answer-bearing content after document preparation. |
| Paraphrase-Llama | 105.3% | 99.2% | Rewording alone does not create a major accuracy shift when content survives. |
| Propositions-Gemma | 119.0% | 98.8% | Structural conversion into propositions is not automatically superior when retention is already high. |
| LLMLingua2-50 | 47.0% | 98.5% | Coherence can degrade without catastrophic accuracy loss if answer-bearing material remains. |
| Snippet-abstractive-Llama | 1.7% | 95.2% | Extreme compression can work when the transformation keeps the answer. |
| Snippet-BM25 | 3.6% | 52.8% | Very short snippets fail when sentence selection misses the answer. |

The paper’s useful move is not saying that the answer must be present. Everyone can nod at that sentence and feel wise for free. The useful move is using retention to challenge attribution. If two methods differ in accuracy, and one keeps the answer while the other drops it, then the difference should not be casually credited to “better structure,” “LLM rewriting,” or “query awareness.” It may just be evidence survival wearing a nicer lab coat.

Query-dependent transformations trade coverage for safety

Query-dependent representations sound intuitively attractive. A question comes in; the system creates a representation tailored to that question; the generator receives less irrelevant material. Very clean. Very product-demo friendly.

The paper is less impressed.

Across Qwen, Gemma, and Llama, the strongest query-dependent method, snippet-abstractive-Llama, does not significantly beat the original baseline. Nor does it clearly dominate the strongest query-independent methods such as paraphrase and propositions. Incorporating the query into the representation does not, by itself, improve answer accuracy.

The paper’s per-query analysis explains why this can happen. For Qwen, the original RAG setup produces 1,076 “helpful” cases, where closed-book generation is wrong but RAG becomes correct, and 88 “harmful” cases, where closed-book is correct but RAG makes it wrong. Snippet-abstractive-Llama reduces harmful cases from 88 to 78. Good. But it also reduces helpful cases from 1,076 to 1,052. Less good. The avoided harm is 10 queries; the lost help is 24 queries. The net gain over closed-book falls from +988 to +974.

This is a nice example of why RAG evaluation should not only report aggregate accuracy. A query-focused compressor may remove distracting material, which helps when the model already knew the answer or when irrelevant context would confuse it. But the same compressor may remove supporting information that the generator would have used to answer questions it did not know. It becomes safer and less helpful at the same time.

That tradeoff is not a universal law; the paper only demonstrates it clearly for one method and one generator. But as a diagnostic pattern, it is valuable. In business terms, query-dependent context shaping should be evaluated not only by average accuracy, but by the balance between:

| Transition type | Meaning in RAG evaluation | Business interpretation |
| --- | --- | --- |
| Helpful | Closed-book wrong, RAG correct | The retrieval layer creates new value. |
| Harmful | Closed-book correct, RAG wrong | The retrieval layer introduces avoidable damage. |
| Neutral-correct | Both correct | RAG may be unnecessary for this query. |
| Neutral-wrong | Both wrong | Retrieval and generation still fail together. |
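The four transition types can be computed directly from per-query correctness flags. A minimal Python sketch follows; the function names `transition` and `net_gain` are chosen here for illustration, and the only real numbers used are the Qwen counts quoted earlier in this section.

```python
from collections import Counter
from typing import Iterable, Tuple


def transition(closed_book_correct: bool, rag_correct: bool) -> str:
    """Label one query by how RAG changed the closed-book outcome."""
    if rag_correct and not closed_book_correct:
        return "helpful"
    if closed_book_correct and not rag_correct:
        return "harmful"
    return "neutral-correct" if rag_correct else "neutral-wrong"


def net_gain(outcomes: Iterable[Tuple[bool, bool]]) -> Tuple[Counter, int]:
    """Count transitions over (closed_book_correct, rag_correct) pairs.

    Returns the counts and the net gain, defined as helpful minus harmful.
    """
    counts = Counter(transition(cb, rag) for cb, rag in outcomes)
    return counts, counts["helpful"] - counts["harmful"]


# Net gain over closed-book, from the Qwen counts reported in the article.
original_net = 1076 - 88   # helpful - harmful, original documents
snippet_net = 1052 - 78    # helpful - harmful, snippet-abstractive-Llama
print(original_net, snippet_net, original_net - snippet_net)  # 988 974 14
```

Reporting the helpful and harmful counts separately, rather than only their difference, is what exposes the "safer and less helpful" pattern: the net moves by 14, but it is made of 10 avoided harms and 24 lost helps.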

For enterprise deployment, this matters because different products tolerate different failure shapes. A compliance assistant may value lower harm more than higher helpfulness. A research assistant may prefer maximum helpful coverage, accepting some noise. The paper does not choose that tradeoff for the reader. It gives a way to see it.

The latency result makes “smart at query time” expensive

The paper’s second practical contribution is the accuracy-latency view.

Representation transformations do not only affect answer quality. They affect where cost is paid. Query-independent transformations can be computed offline and stored. Query-dependent transformations must be computed after the user asks the question. That difference changes the product economics.

The latency results are blunt. The original representation has about 9.4 seconds of mean query-time latency, mostly from the generator processing the context. Some compressed query-independent methods are much faster: summary is about 3.2 seconds, LLMLingua2-50 about 4.4 seconds, and selective-context-50 about 4.6 seconds. The fastest sentence-level snippets are around 1.4–2.0 seconds, but their accuracy is substantially weaker.

Snippet-abstractive, the strongest query-dependent compressor, is the outlier: about 59.2 seconds of mean query-time latency. It is accurate enough to be interesting, but slow enough to be awkward unless the deployment context is unusually patient. The paper also tests smaller Gemma transformation models for snippet-abstractive. Accuracy improves sharply from 270M to 4B parameters and then plateaus; Gemma-12B essentially matches Gemma-27B accuracy at about half the transformation latency. That is useful engineering information, but it does not overturn the frontier. Even with a smaller transformation model, query-time abstractive compression remains slower than query-independent alternatives with comparable accuracy.
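The relative costs are easy to check from the figures above. A minimal Python sketch, using only the mean latencies quoted in this section; the dictionary keys and the function name `relative_latency` are labels chosen here, not identifiers from the paper.

```python
from typing import Dict

# Mean query-time latencies in seconds, as quoted in the article.
LATENCY_S: Dict[str, float] = {
    "original": 9.4,
    "summary": 3.2,
    "llmlingua2-50": 4.4,
    "selective-context-50": 4.6,
    "snippet-abstractive": 59.2,
}


def relative_latency(latencies: Dict[str, float], baseline: str = "original") -> Dict[str, float]:
    """Express each latency as a multiple of the baseline's latency."""
    base = latencies[baseline]
    return {name: round(value / base, 2) for name, value in latencies.items()}


for name, ratio in relative_latency(LATENCY_S).items():
    print(f"{name:22s} {ratio:.2f}x")
# original               1.00x
# summary                0.34x
# llmlingua2-50          0.47x
# selective-context-50   0.49x
# snippet-abstractive    6.30x
```

On these numbers, offline summaries cost about a third of the baseline at query time, while query-time abstractive snippets cost a little over six times as much.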

The clean business lesson is not “always summarize offline.” It is: decide whether query-time transformation earns its latency.

| Representation choice | Accuracy pattern in the paper | Latency pattern | Operational reading |
| --- | --- | --- | --- |
| Original documents | Strong baseline for three of four generators | Slow, because full context is processed | Good control; often too costly for interactive systems. |
| Query-independent summaries | Retain much of baseline gain | Around one-third of original latency | Attractive when offline preprocessing is feasible. |
| LLMLingua2-50 | High retention and near-baseline behavior for several generators | Roughly half original latency | Useful compression candidate, despite unattractive text. The model is not reading for elegance. |
| Sentence snippets | Large accuracy losses when retention is low | Fast | Cheap, but risky when answer selection is brittle. |
| Query-focused abstractive snippets | Near baseline on three of four generators | Around six times original latency | Powerful but hard to justify for normal interactive workloads. |

This is where the paper becomes directly useful for product teams. A representation layer should not be selected from an offline leaderboard alone. It should be plotted against latency, transformation placement, and retention. Otherwise the team may proudly deploy a “smarter” representation that is simply a slower way to keep the same answer.

A tragedy, but with more GPUs.

LLM-produced text does not get special treatment once retention is accounted for

Another tempting belief is that generators may prefer text produced by LLMs, especially by related model families. The paper tests this in two ways.

First, it compares LLM-produced and non-LLM transformations. At matched retention, the distinction does not explain accuracy. RECOMP-extractive-50 and LLMLingua2-50 are non-LLM methods with high retention, 96.9% and 98.5% respectively. They perform within roughly one to two points of LLM-produced Summary-Gemma on the three generators that track retention most clearly. The weaker non-LLM methods underperform mainly because they discard answer-bearing content: BM25 snippets at 52.8% retention, cross-encoder snippets at 74.3%, and selective-context-50 at 81.9%.

Second, the authors test model-family preference. If Gemma generators preferred Gemma-produced transformations, and Llama generators preferred Llama-produced transformations, one could argue for family-aligned preprocessing. The paper does not find that pattern. For summaries, all four generators prefer Gemma transformations, including Qwen and Mistral, which are not family-aligned with either transformation model. The likely explanation is again retention: Gemma summaries retain answers more often than Llama summaries, 97.8% versus 94.4%. For snippet-abstractive, retention is closely matched, and three of four generators prefer Llama outputs, but the pattern is not family alignment because non-family generators show it too.

The conclusion is narrow but useful. Do not assume that LLM-generated retrieved context has an inherent advantage. Do not assume that same-family preprocessing is a free gain. Test whether the answer survives, then test whether any residual preference remains.
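One way to run that two-step check is to compare two transformations twice: once over all shared queries, and once over only the queries where both kept the answer. The Python sketch below is an assumption about how a team might do this, not the paper's statistical procedure; `Record`, `accuracy`, and `matched_retention_gap` are hypothetical names, and the example rows are invented.

```python
from typing import Dict, List, NamedTuple


class Record(NamedTuple):
    query_id: str
    retained: bool  # transformed gold document still supports the answer
    correct: bool   # generator's answer was judged correct


def accuracy(records: List[Record]) -> float:
    """Share of correct answers; NaN when there is nothing to score."""
    return sum(r.correct for r in records) / len(records) if records else float("nan")


def matched_retention_gap(a: List[Record], b: List[Record]) -> Dict[str, float]:
    """Accuracy gap (a minus b) overall, and on queries where both retained the answer."""
    b_by_id = {r.query_id: r for r in b}
    shared = [(ra, b_by_id[ra.query_id]) for ra in a if ra.query_id in b_by_id]
    both_kept = [(ra, rb) for ra, rb in shared if ra.retained and rb.retained]
    return {
        "overall_gap": accuracy([ra for ra, _ in shared]) - accuracy([rb for _, rb in shared]),
        "matched_gap": accuracy([ra for ra, _ in both_kept]) - accuracy([rb for _, rb in both_kept]),
        "n_matched": len(both_kept),
    }


# Invented rows: method B drops the answer on q2 and gets it wrong.
method_a = [Record("q1", True, True), Record("q2", True, True),
            Record("q3", True, True), Record("q4", True, False)]
method_b = [Record("q1", True, True), Record("q2", False, False),
            Record("q3", True, True), Record("q4", True, False)]
print(matched_retention_gap(method_a, method_b))
```

In the toy rows, method A leads by 0.25 overall, but the gap is 0.0 on the three queries where both methods retained the answer. That is the signature of a retention effect rather than a format preference. A residual `matched_gap` that stays large on real data would be the thing worth investigating further.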

For enterprise RAG, this matters because LLM-based rewriting is expensive, operationally complex, and sometimes hard to audit. If its main advantage is retention, a cheaper extractor or offline compressor may be enough. Conversely, if an LLM transformation has better retention at extreme compression, as snippet-abstractive does here, the advantage is real — but the invoice should be attached to the latency chart, not hidden behind the phrase “semantic compression.”

Mistral-Nemo is the useful exception, not an annoying footnote

The retention account works best for Qwen, Gemma, and Llama. Mistral-Nemo behaves differently.

Its original-retrieved accuracy is 73.9%, around 6–7 points below the other three generators. That weakness does not look like a general lack of knowledge: Mistral’s closed-book accuracy is 47.5%, the second-highest among the four. It also does not look like inability to use a single answer-bearing document: gold-only accuracy is 84.8%, in the middle of the pack.

The problem appears when Mistral receives the full retrieved set. The gold-5x condition, which repeats the gold document five times, scores 80.8%, four points below gold-only. Replacing four copies with retrieved non-gold content drops further to 73.9%. The paper interprets this as evidence that Mistral is sensitive to something about length, repetition, non-gold context, or context composition. The exact property remains unidentified.
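The three context conditions in this comparison are easy to reproduce as a diagnostic for any generator. A minimal Python sketch, assuming documents are plain strings and that the retrieved list has already had the gold document injected; the function name `build_conditions` and the dictionary keys are labels chosen for this illustration.

```python
from typing import Dict, List


def build_conditions(gold_doc: str, retrieved_docs: List[str], k: int = 5) -> Dict[str, List[str]]:
    """Assemble the gold-only, gold-repeated, and original context conditions.

    `retrieved_docs` is assumed to be the top-k list after gold injection,
    so it must already contain `gold_doc`.
    """
    top_k = list(retrieved_docs[:k])
    if gold_doc not in top_k:
        raise ValueError("expected the gold document inside the retrieved set")
    return {
        "gold-only": [gold_doc],          # one copy of the answer-bearing page
        f"gold-{k}x": [gold_doc] * k,     # same page repeated k times
        "original": top_k,                # gold page plus retrieved non-gold pages
    }


conditions = build_conditions("GOLD", ["d1", "d2", "d3", "d4", "GOLD"])
for name, docs in conditions.items():
    print(name, docs)
```

Because the gold page is present in all three conditions, any accuracy difference between them points at length, repetition, or the non-gold material rather than at missing evidence.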

This exception is important because it prevents a lazy version of the retention story. Retention is the dominant factor in the main pattern, but not the only possible factor in RAG behavior. Some generators may be more sensitive to context length, repetition, distractors, or the way multiple documents interact. In production terms, “the answer is still present” is necessary but may not be sufficient for every model and workload.

The right interpretation is therefore:

| What the paper directly shows | What Cognaptus infers for business use | What remains uncertain |
| --- | --- | --- |
| High-retention methods cluster near baseline for three of four generators. | Retention testing should become a standard diagnostic for RAG transformations. | Whether the same pattern holds for larger proprietary models or non-QA tasks. |
| Low-retention snippets lose substantial accuracy despite being short and query-dependent. | Compression should be evaluated by evidence survival, not compression ratio alone. | How to define retention for multi-hop, multimodal, or reasoning-heavy tasks. |
| Query-dependent abstractive snippets can reduce harmful cases but also lose helpful cases. | Teams should measure helpful/harmful transitions, not just average accuracy. | Whether task-specific query-time transformations can improve this tradeoff. |
| Query-time abstractive compression is costly. | Offline high-retention representations may offer better ROI for many systems. | Actual latency depends on infrastructure, batching, model serving, and parallelization. |
| Mistral-Nemo does not fully follow the retention account. | Model-specific context sensitivity should be tested before standardizing a RAG pipeline. | The unidentified context-level factor needs separate experiments. |

That last column is not decorative caution. It tells you where not to overgeneralize.

The study is idealized in ways that make it cleaner, not weaker

The paper’s constraints are unusually important for interpretation.

First, retrieval is idealized. Gold injection ensures an answer-bearing document is always in the retrieved set. This is exactly what allows the authors to isolate representation effects. But it also means the findings should not be read as a complete enterprise RAG benchmark. In a real system, retrieval misses, index freshness, permissions, document chunking, and source quality may dominate. A representation layer cannot preserve an answer that retrieval never brought back.

Second, the task is narrow: textual, single-hop, short-answer QA on KILT-NQ. The retention measure is clean because a known answer-bearing document can be checked after transformation. Multi-hop questions are messier. The answer may be distributed across documents. A transformation could preserve each fact but destroy the relation among facts. In that case, “answer retention” needs a richer definition.

Third, the representational dimensions are bundled. Paraphrase changes wording. Propositions change structure and length. LLMLingua2 changes length and coherence. Snippet-abstractive changes length, wording, and query focus. The paper can show that recognizable high-retention transformations behave similarly, but it does not independently vary every dimension.

Fourth, retention and accuracy are judged by the same LLM, Qwen 2.5 32B. This is reasonable given the weakness of exact match, but it is still a measurement dependency. If the judge fails to detect an answer after a transformation, retention may be underestimated. If the judge has representation sensitivities of its own, both retention and accuracy labels may inherit them.

These boundaries do not damage the paper’s main usefulness. They define it. The study is not claiming to solve all RAG design. It is saying that, under controlled conditions where retrieval has done its job, many representational debates collapse into a simpler question: did the transformation keep the evidence?

The enterprise RAG checklist should start with evidence survival

The practical lesson is not to abandon representation engineering. It is to discipline it.

A team building RAG for policies, contracts, manuals, product documentation, or financial filings can take a direct workflow from this paper:

  1. Build a benchmark set where answer-bearing source material is known.
  2. Run each transformation on those sources.
  3. Measure whether the transformed representation still supports the answer.
  4. Compare downstream accuracy only after retention is visible.
  5. Plot accuracy against latency and query-time cost.
  6. Separate helpful and harmful transitions where possible.
  7. Treat model-specific exceptions as diagnosis targets, not random noise.
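The steps above can be collapsed into one report row per transformation. The Python sketch below assumes the per-query judgments and timings have already been collected; `QueryResult` and `summarize` are hypothetical names, and the example rows are invented to show the shape of the output.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class QueryResult:
    transformation: str
    retained: bool             # step 3: transformed source still supports the answer
    rag_correct: bool          # step 4: downstream answer judged correct
    closed_book_correct: bool  # step 6: same query answered without retrieval
    latency_s: float           # step 5: query-time latency, transformation included


def summarize(results: List[QueryResult]) -> Dict[str, Dict[str, float]]:
    """One report row per transformation, with retention listed before accuracy."""
    by_name: Dict[str, List[QueryResult]] = {}
    for r in results:
        by_name.setdefault(r.transformation, []).append(r)
    report: Dict[str, Dict[str, float]] = {}
    for name, rows in by_name.items():
        report[name] = {
            "retention": sum(r.retained for r in rows) / len(rows),
            "accuracy": sum(r.rag_correct for r in rows) / len(rows),
            "mean_latency_s": mean(r.latency_s for r in rows),
            "helpful": sum(r.rag_correct and not r.closed_book_correct for r in rows),
            "harmful": sum(r.closed_book_correct and not r.rag_correct for r in rows),
        }
    return report


# Invented rows for a single transformation.
rows = [
    QueryResult("summary", True, True, False, 3.0),
    QueryResult("summary", False, False, True, 5.0),
]
print(summarize(rows))
```

Running `summarize` once per generator, rather than pooling generators, keeps model-specific exceptions visible instead of averaging them away.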

This workflow is not glamorous. It will not look as impressive in a vendor slide as “agentic semantic context rewriting.” But it answers the question that matters before the expensive machinery starts spinning: does the system still carry the facts the generator needs?

There is also a governance benefit. Retention testing creates an audit trail for transformations. If a RAG system answers incorrectly after summarization, the team can ask whether the summary dropped the relevant clause, number, date, exception, or entity. That is more actionable than blaming “the model.” In regulated or high-stakes settings, this difference is not philosophical. It is operational.

The paper also suggests a more sober ROI hierarchy:

| Optimization target | Before this paper, a team might ask | After this paper, a better first question is |
| --- | --- | --- |
| Compression | How short can we make the context? | How much answer-bearing evidence survives at that compression level? |
| Query focus | Did we tailor the context to the question? | Did tailoring reduce harm without losing more helpful cases? |
| LLM rewriting | Is the representation more fluent or semantic? | Does the rewrite preserve evidence better than cheaper alternatives? |
| Proposition extraction | Is the format more machine-readable? | Does the structural change add value beyond retention? |
| Model-family alignment | Did the same model family produce the context? | Is there any residual preference after retention is matched? |
| Latency optimization | Is generation faster? | Is the accuracy-latency frontier better after transformation cost is included? |

That is the paper’s business relevance pathway in one sentence: evaluate retrieved-content transformations by answer survival first, elegance second, and latency always.

Conclusion: the best RAG representation is not always the prettiest one

The paper’s mechanism-first contribution is to move RAG representation analysis away from surface form and toward evidence survival. It compares fourteen ways of representing retrieved documents while holding retrieval fixed. It shows that answer retention is the strongest explanation for downstream accuracy across most tested generators. It finds no systematic advantage for query-dependence, LLM-produced transformations, or same-family preprocessing once retention is considered. It adds a latency view showing that query-time transformations can be hard to justify unless their accuracy gain is real enough to pay for their delay.

The most useful sentence for practitioners is not “retention matters.” Of course it does. The useful sentence is: before crediting a representation mechanism, control for whether the answer-bearing content survived.

That sentence can save a team from weeks of optimizing the wrapper around a missing fact.

And in RAG, as in many parts of business, misplacing the obvious thing is still the most expensive mistake.

Cognaptus: Automate the Present, Incubate the Future.


  1. Jonathan J. Ross, Bevan Koopman, Anton van der Vegt, and Guido Zuccon, “On the impact of retrieved content representations in RAG pipelines,” arXiv:2605.30790v1, 29 May 2026, https://arxiv.org/abs/2605.30790