The inbox problem hiding inside RAG

Inbox.

That is the easiest way to understand what goes wrong in many retrieval-augmented generation systems. A query arrives. The system retrieves a few documents. The answer is not obvious. So the system retrieves more. Then more. Then perhaps a web search result. Then a rewritten query. Then another bundle of passages.

Soon the model has a bigger inbox, not a better brief.

For simple questions, that may be survivable. For multi-hop questions, it becomes actively dangerous. Multi-hop reasoning is not just “find a document.” It is “find one fact, use it to identify another fact, then answer from the second fact without losing the first.” If the context window fills with topical but useless material, the answer-bearing bridge can be technically present and still practically invisible. Congratulations: the system has “more context.” It also has more ways to be wrong.

The paper “Replace, Don’t Expand: Mitigating Context Dilution in Multi-Hop RAG via Fixed-Budget Evidence Assembly” introduces SEAL-RAG, a training-free controller designed around a blunt principle: when the evidence window is full, new evidence should replace weak evidence instead of being appended to the pile.[1] This sounds almost too sensible to count as research. Naturally, that is why it matters.

The paper’s central contribution is not a new foundation model, a heroic retriever, or a fashionable agent costume. It is a control policy. SEAL-RAG treats the top-$k$ evidence set as a scarce working memory. It extracts entities and relations from the current evidence, identifies what is missing, issues targeted micro-queries, and then swaps out low-utility passages for evidence that closes the specific gap. The window remains fixed. The composition changes.

That is the mechanism. The business lesson is just as important: in production RAG, the costly resource is not only tokens. It is the model’s attention over evidence. Treating context as a junk drawer is not a scaling strategy. It is a very polished way to misplace the fact you needed.

The mistake is assuming that larger context means better evidence

The familiar RAG pipeline has a comforting rhythm: retrieve top-$k$ passages, place them into the prompt, generate an answer. The approach is easy to explain and easy to deploy. It also hides a fragile assumption: that the retrieved set is already sufficient, or at least that increasing $k$ will move the system closer to sufficiency.

Multi-hop tasks expose the assumption. Suppose the question asks: “Which city hosted the Olympic Games in the same year that Blur released Parklife?” The system needs to connect Parklife to its release year, then connect that year to the Olympic host city. If the initial retrieval finds only the album page, a generator cannot magically reason its way to the Olympic host city unless the bridge evidence is retrieved or already known. Under a strict grounding rule, it should not use parametric memory. Under a loose grounding rule, it may hallucinate politely. Neither is ideal.

The usual repair instinct is additive. Self-RAG-style systems critique the generated answer and retrieve again. CRAG-style systems grade retrieved documents and add corrective search results. Adaptive-$k$ methods retrieve a larger pool and prune it down. These are useful ideas, but they solve different parts of the problem.

SEAL-RAG’s authors frame the deeper issue as context dilution: the evidence set becomes crowded with distractors, duplicates, and half-relevant material. The answer may be somewhere in the room, but the model now has to reason through clutter. In enterprise terms, this is the difference between giving an analyst a two-page evidence memo and giving them a folder called “probably relevant, good luck.”

The paper’s reframing is therefore precise: multi-hop RAG should not be treated as context accumulation. It should be treated as fixed-budget evidence assembly.

SEAL-RAG changes the controller, not the universe

The useful part of SEAL-RAG is its discipline. The method does not assume that the retriever suddenly becomes perfect. It assumes the first retrieval may be incomplete, then gives the controller a structured way to repair the evidence set without expanding the final context.

The loop has four stages: Search, Extract, Assess, Loop.

| SEAL-RAG stage | What the system does | Why it matters operationally |
| --- | --- | --- |
| Search | Initializes a fixed top-$k$ evidence set using standard retrieval. | Starts from the same kind of retrieval stack many teams already use. |
| Extract | Builds an entity ledger from the current passages: entities, aliases, relations, qualifiers, and provenance. | Converts vague “maybe relevant” context into a structured view of what is actually supported. |
| Assess | Checks whether the ledger contains enough evidence to answer the question. | Avoids generating merely because something was retrieved. Retrieval is not evidence until it supports the task. |
| Loop | If something is missing, creates targeted micro-queries and replaces weak passages with gap-closing candidates. | Repairs the evidence set while keeping latency and context size bounded. |
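The four stages can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `seal_loop` and every `toy_*` helper are hypothetical names, the two-entry `INDEX` stands in for a real retriever, and the eviction rule is deliberately crude.

```python
from typing import Callable, List

def seal_loop(
    question: str,
    k: int,
    max_loops: int,
    search: Callable[[str, int], List[str]],
    find_gaps: Callable[[str, List[str]], List[str]],
    replace: Callable[[List[str], List[str], List[str]], List[str]],
    generate: Callable[[str, List[str]], str],
) -> str:
    """Fixed-budget repair loop: the evidence list never grows beyond k."""
    evidence = search(question, k)[:k]                        # Search
    for _ in range(max_loops):
        gaps = find_gaps(question, evidence)                  # Extract + Assess
        if not gaps:                                          # sufficiency gate passed
            break
        candidates = [p for g in gaps for p in search(g, k)]  # micro-queries
        evidence = replace(evidence, candidates, gaps)[:k]    # swap, never append
    return generate(question, evidence)                       # generator runs once

# Toy stand-ins so the sketch runs end to end (all hypothetical).
QUESTION = "Which city hosted the Olympic Games in the year Blur released Parklife?"
INDEX = {
    QUESTION: ["Parklife is a 1994 album by Blur.", "Blur are an English rock band."],
    "1994 Olympic Games host city": ["The 1994 Winter Olympics were held in Lillehammer."],
}

def toy_search(query: str, k: int) -> List[str]:
    return INDEX.get(query, [])[:k]

def toy_find_gaps(question: str, evidence: List[str]) -> List[str]:
    # Hard-coded gap check; a real controller would derive this from a ledger.
    return [] if any("Olympics" in p for p in evidence) else ["1994 Olympic Games host city"]

def toy_replace(evidence: List[str], candidates: List[str], gaps: List[str]) -> List[str]:
    # Toy rule: evict from the tail. A real controller would score utility.
    new = [c for c in candidates if c not in evidence]
    keep = evidence[: max(0, len(evidence) - len(new))]
    return keep + new

calls: List[List[str]] = []   # records what the generator actually saw

def toy_generate(question: str, evidence: List[str]) -> str:
    calls.append(list(evidence))
    return "Lillehammer" if any("Lillehammer" in p for p in evidence) else "insufficient evidence"

answer = seal_loop(QUESTION, 2, 3, toy_search, toy_find_gaps, toy_replace, toy_generate)
```

The point of the shape is that `generate` sits outside the loop: repair effort is spent on the evidence list, and the final call always receives at most `k` passages.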

The important object is the entity ledger. Instead of asking the model to stare at raw passages and somehow feel whether they are enough, SEAL-RAG projects the evidence into a structured state: which entities are present, which relations are supported, which dates or locations are missing, and where each claim came from.

That ledger lets the system formulate explicit gaps. The paper distinguishes missing entities, missing relations, and missing qualifiers. This distinction is not decorative. It changes the query. “Tell me about Blur and Parklife” is a broad rewrite. “1994 Olympic Games host city” is a micro-query. One wanders around the topic. The other goes to work.
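A ledger of this kind can be pictured as a small data structure plus a function that turns each missing item into a micro-query. The sketch below is an assumption-laden illustration: the `Ledger` and `Gap` classes, their field names, and the `required` list are hypothetical, and the paper's actual schema may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

@dataclass
class Ledger:
    # entity name -> known aliases
    aliases: Dict[str, Set[str]] = field(default_factory=dict)
    # (subject, relation) -> (value, id of the passage that supports it)
    relations: Dict[Tuple[str, str], Tuple[str, str]] = field(default_factory=dict)

@dataclass(frozen=True)
class Gap:
    kind: str      # "entity", "relation", or "qualifier"
    subject: str
    target: str

def find_gaps(ledger: Ledger, required: List[Tuple[str, str]]) -> List[Gap]:
    """Return one Gap per required (subject, relation) the ledger cannot support."""
    return [Gap("relation", s, r) for (s, r) in required if (s, r) not in ledger.relations]

def micro_query(gap: Gap) -> str:
    """Turn one explicit gap into a short, targeted query string."""
    return f"{gap.subject} {gap.target}"

# State after reading only the album passage (illustrative values).
ledger = Ledger()
ledger.aliases["Parklife"] = {"Parklife (album)"}
ledger.relations[("Parklife", "release year")] = ("1994", "passage-1")

# What the question needs, expressed as (subject, relation) pairs.
required = [("Parklife", "release year"), ("1994 Olympic Games", "host city")]
queries = [micro_query(g) for g in find_gaps(ledger, required)]
```

Because each supported relation carries a passage id, the same structure also records provenance, which is what makes the sufficiency decision auditable.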

Then comes the part that separates SEAL-RAG from ordinary corrective retrieval: replacement.

The paper defines an entity-first utility function that rewards candidates for gap coverage, corroboration, and non-redundant novelty, while penalizing redundancy. A new candidate does not simply join the prompt. It competes for a slot. If it closes a gap better than a current passage, the current low-utility passage is evicted. The system also uses a hysteresis threshold to avoid pointless churn and a dwell-time guard so newly inserted evidence is not removed before the sufficiency gate can inspect it.
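The replacement rule can be sketched as follows. The weights, the `margin` and `min_dwell` defaults, and the linear form of `utility` are all assumptions made for illustration; the article does not give the paper's exact formula, so treat this as one plausible reading rather than the authors' definition.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Slot:
    text: str
    utility: float
    age: int = 0   # repair loops survived since insertion; the caller increments it

def utility(gap_cov: float, corroboration: float, novelty: float, redundancy: float,
            w=(1.0, 0.5, 0.3, 0.7)) -> float:
    """Hypothetical linear score: reward coverage, corroboration, novelty; penalize redundancy."""
    return w[0] * gap_cov + w[1] * corroboration + w[2] * novelty - w[3] * redundancy

def try_replace(slots: List[Slot], cand: Slot, margin: float = 0.1, min_dwell: int = 1) -> bool:
    """Swap cand into the weakest evictable slot; return True if a swap happened."""
    # Dwell-time guard: slots younger than min_dwell cannot be evicted yet.
    evictable = [i for i, s in enumerate(slots) if s.age >= min_dwell]
    if not evictable:
        return False
    weakest = min(evictable, key=lambda i: slots[i].utility)
    # Hysteresis: the candidate must win by more than `margin`, not just barely.
    if cand.utility <= slots[weakest].utility + margin:
        return False
    slots[weakest] = Slot(cand.text, cand.utility, age=0)
    return True

slots = [
    Slot("A: Parklife album page", 0.9, age=2),
    Slot("B: Parklife release year", 0.8, age=2),
    Slot("distractor: Britpop overview", 0.2, age=2),
]

# A gap-closing candidate clearly beats the distractor and takes its slot.
closer = Slot("C: 1994 Winter Olympics host city", utility(1.0, 0.0, 1.0, 0.0))
swapped_first = try_replace(slots, closer)

# A marginal candidate is rejected: the new slot is protected by dwell time,
# and it does not beat the weakest evictable slot (0.8) by the margin.
marginal = Slot("D: another album review", 0.85)
swapped_second = try_replace(slots, marginal)
```

The window holds three slots before and after both calls. Only its composition changes.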

In other words, SEAL-RAG behaves less like a hoarder and more like an editor. The context window is not expanded. It is revised.

The key mechanism is replacement under a fixed budget

A simple diagram captures the difference:

Additive repair:

Initial evidence:
[A] [B] [distractor]

Repair:
[A] [B] [distractor] + [new candidate] + [another candidate] + [web result]

Problem:
The model receives more material, but not necessarily a cleaner evidence set.


SEAL-RAG repair:

Initial evidence:
[A] [B] [distractor]

Ledger:
A and B are present; missing relation or qualifier C.

Micro-query:
Find C.

Replacement:
[A] [B] [C]

Result:
Same context size, higher evidence density.

This is the paper’s most business-relevant move. In many production systems, the actual constraint is not whether one can technically send more tokens. It is whether the system can answer reliably within a predictable latency, cost, and compliance envelope. Expanding context makes inference cost less predictable and can make evidence harder to audit. Replacement keeps the answer stage fixed: the generator sees the same number of evidence slots, but those slots are supposed to become more informative.

The paper formalizes this as a bounded inference process: the generator is invoked once over a fixed-size context, while repair loops operate before final generation. That distinction matters. A system can spend some effort improving the evidence buffer without turning every answer into an open-ended agent expedition. The authors’ ablation results later support this point: most of the gain comes from the first repair step, not from a long chain of wandering reflections.

This is also where the paper quietly pushes against a lazy interpretation of “agentic RAG.” The value is not that the system loops. Loops are cheap to draw and expensive to trust. The value is that the loop has a structured state, an explicit missing-information target, and a replacement rule. Without those, an agentic RAG system is just a retrieval system with commitment issues.

The experiments test controller logic, not model charisma

The evaluation design is worth slowing down for because it affects how the results should be read.

The authors compare SEAL-RAG with Basic RAG, Self-RAG, CRAG, and Adaptive-$k$ in a shared environment. The systems use the same retriever setup, vector store, generator family, and evaluation protocol. The point is to isolate the controller policy: retrieve once, add and critique, correctively append, dynamically prune, or actively repair through replacement.

The benchmarks are HotpotQA and 2WikiMultiHopQA, both multi-hop question-answering datasets. HotpotQA tests bridge and comparison reasoning. 2WikiMultiHopQA adds more compositional entity reasoning. The paper reports Judge-EM for correctness, using GPT-4o as an external judge under a strict rubric, and evidence quality through gold-title precision, recall, and F1.
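For readers who want the evidence metrics concrete, here is a minimal sketch of title-level precision, recall, and F1 for a single question. The function name and the example titles are illustrative, and how the paper aggregates per-question scores into the reported figures is not stated in this article.

```python
from typing import Iterable, Tuple

def title_prf(retrieved: Iterable[str], gold: Iterable[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 over document titles for one question."""
    r, g = set(retrieved), set(gold)
    hits = len(r & g)
    precision = hits / len(r) if r else 0.0
    recall = hits / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Three retrieved titles, two of which are gold for the question.
p, r, f1 = title_prf(
    ["Parklife", "Blur (band)", "1994 Winter Olympics"],
    ["Parklife", "1994 Winter Olympics"],
)
```

Here recall is perfect, but one of three slots is a distractor, so precision is two thirds. That is exactly the gap the later tables keep pointing at.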

That judge setup is useful but not magic. It reduces parametric leakage by making the answer depend on retrieved context, and it allows paired statistical testing. Still, an LLM judge is not a human adjudication panel. For an academic benchmark paper, this is acceptable evidence. For a medical, legal, or regulatory deployment, it is not a free pass. We will return to that boundary later; high-stakes systems do not become safe just because a table has decimals.
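Paired testing works because both controllers answer the same questions, so only the questions where they disagree carry information. This article does not say which test the authors used; an exact McNemar test is one standard choice for paired correct/incorrect outcomes, sketched here with invented toy verdicts.

```python
from math import comb
from typing import Sequence

def mcnemar_exact(a_correct: Sequence[bool], b_correct: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired binary outcomes."""
    # Discordant pairs: exactly one of the two systems is correct.
    b = sum(1 for x, y in zip(a_correct, b_correct) if x and not y)
    c = sum(1 for x, y in zip(a_correct, b_correct) if y and not x)
    n = b + c
    if n == 0:
        return 1.0
    # Under the null, each discordant pair favors either system with probability 0.5.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Toy judge verdicts on ten shared questions (made up for illustration only).
system_a = [True, True, True, True, True, True, True, True, False, True]
system_b = [False, False, False, False, False, False, False, False, True, True]
p_value = mcnemar_exact(system_a, system_b)
```

In the toy data, system A wins eight of the nine disagreements, so the question the test answers is how surprising an 8-to-1 split would be if the two controllers were equally good.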

For now, the main question is narrower: when the same underlying retrieval and generation environment is used, does fixed-budget replacement produce better evidence sets and better answers than additive or pruning-based controllers?

The answer in the paper is yes.

At one evidence slot, replacement turns an impossible task into a solvable one

The strictest HotpotQA setting uses $k=1$. This is almost unfair to standard multi-hop RAG, but usefully unfair. With one visible evidence slot, a system cannot simply show the generator all hops. It must either retrieve the final answer-bearing evidence or fail.

In this bottleneck regime, SEAL-RAG improves Judge-EM over the best baseline across all tested model backbones. The reported HotpotQA $k=1$ results are:

| Model backbone | Best baseline Judge-EM | SEAL-RAG Judge-EM | Gain | SEAL-RAG evidence precision |
| --- | --- | --- | --- | --- |
| gpt-4o-mini | 55% | 62% | +7 pp | 86% |
| gpt-4o | 59% | 73% | +14 pp | 91% |
| gpt-4.1-mini | 52% | 71% | +19 pp | 87% |
| gpt-4.1 | 63% | 73% | +10 pp | 90% |

The interpretation is not “SEAL-RAG found a way to make one passage contain two documents.” The better interpretation is that the controller moves part of the reasoning upstream. It uses the initial passage to extract the bridge fact, then retrieves the answer-bearing passage that should occupy the single final slot.

This is retrieval-time reasoning. The generator receives less clutter because the controller has already performed evidence assembly. In a business system, that is exactly the kind of work you want done before final answer generation. You do not want the generator to discover, during response writing, that it never received the needed bridge evidence. That is not reasoning. That is expensive embarrassment.

The $k=1$ result also clarifies why replacement is not the same as pruning. Pruning can only choose from what was already retrieved. SEAL-RAG can issue a targeted micro-query for the missing bridge. If the initial pool missed the relevant document, a selector cannot select it. A repairer can go fetch it.

At three slots, the issue becomes evidence composition

The $k=3$ HotpotQA setting is closer to everyday RAG usage. Three passages should be enough for many multi-hop questions, but only if the three passages are complementary. Three near-duplicates are not three pieces of evidence. They are one piece of evidence wearing different hats.
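One rough way to see the “different hats” problem is to count how many passages survive a near-duplicate filter. Token-level Jaccard similarity and the 0.6 threshold below are stand-ins chosen for this sketch, not the redundancy measure used in the paper.

```python
import re
from typing import List, Set

def tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def distinct_passages(passages: List[str], threshold: float = 0.6) -> List[str]:
    """Keep a passage only if it is not a near-duplicate of one already kept."""
    kept: List[str] = []
    for p in passages:
        if all(jaccard(p, q) < threshold for q in kept):
            kept.append(p)
    return kept

window = [
    "Parklife is the third studio album by Blur, released in 1994.",
    "Parklife, released in 1994, is the third studio album by Blur.",
    "The 1994 Winter Olympics were held in Lillehammer, Norway.",
]
kept = distinct_passages(window)
```

The window nominally holds three passages, but only two distinct pieces of evidence survive the filter.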

The paper reports that SEAL-RAG continues to lead in Judge-EM and precision at $k=3$:

| Model backbone | Best baseline Judge-EM | SEAL-RAG Judge-EM | Gain | SEAL-RAG precision | SEAL-RAG recall |
| --- | --- | --- | --- | --- | --- |
| gpt-4o-mini | 63% | 69% | +6 pp | 84% | 44% |
| gpt-4o | 71% | 77% | +6 pp | 89% | 68% |
| gpt-4.1-mini | 67% | 77% | +10 pp | 86% | 49% |
| gpt-4.1 | 73% | 76% | +3 pp | 91% | 73% |

The evidence pattern is the interesting part. Basic RAG can have decent recall because the right evidence may appear somewhere in the top three. But precision is much lower. CRAG’s corrective behavior can also introduce low-precision material. Self-RAG improves some cases but still does not enforce the same fixed-capacity replacement discipline.

SEAL-RAG’s advantage is not merely retrieving more gold evidence. In some rows, its recall is not the highest. The point is that its evidence set is cleaner. Higher precision means fewer distractors per answer. For a generator, especially one asked to obey a grounding rule, that can matter more than raw recall.

This is a useful correction for enterprise teams. Many RAG dashboards over-celebrate recall. Recall is necessary, but recall without precision creates a different failure mode: the answer is present, but surrounded by enough junk that the model cannot reliably use it. A system that retrieves the truth and buries it under distractors has not solved retrieval. It has staged a scavenger hunt.

The 2Wiki results are the context-dilution warning label

The 2WikiMultiHopQA experiments are where the “more context is better” assumption gets the most direct test. The paper evaluates $k=1$, $k=3$, and $k=5$. The $k=5$ setting is especially revealing because increasing capacity should, in theory, help. More slots should mean more opportunities to include the right evidence.

In practice, the baselines show precision collapse. At $k=5$, Basic RAG reaches 34% precision. CRAG falls to 11% precision with GPT-4o-mini and 22% with GPT-4o. Self-RAG is better but still trails SEAL-RAG. SEAL-RAG maintains 89% precision with GPT-4o-mini and 96% with GPT-4o.
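Those precision figures are easier to feel as a ratio of distractors to gold passages, which is (1 - p) / p for precision p. The short calculation below uses only the $k=5$ precision values quoted above; it treats each reported precision as a pooled fraction of gold passages, which is an assumption, since the averaging method is not stated here.

```python
from typing import Dict

def distractors_per_gold(precision: float) -> float:
    """Distractor passages shown to the generator for every gold passage."""
    return (1.0 - precision) / precision

# Precision at k=5 on 2WikiMultiHopQA, as quoted in the text.
reported: Dict[str, float] = {
    "Basic RAG": 0.34,
    "CRAG (GPT-4o-mini)": 0.11,
    "CRAG (GPT-4o)": 0.22,
    "SEAL-RAG (GPT-4o-mini)": 0.89,
    "SEAL-RAG (GPT-4o)": 0.96,
}

ratios = {name: distractors_per_gold(p) for name, p in reported.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} distractors per gold passage")
```

Under that reading, an 11% precision means roughly eight distractors for every gold passage, while 96% means about one distractor for every twenty-four.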

The reported 2Wiki results for SEAL-RAG are:

| Retrieval depth | Model | SEAL-RAG Judge-EM | SEAL-RAG precision | SEAL-RAG recall | SEAL-RAG F1 |
| --- | --- | --- | --- | --- | --- |
| $k=1$ | GPT-4o-mini | 61% | 92% | 45% | 59% |
| $k=1$ | GPT-4o | 76% | 95% | 75% | 82% |
| $k=3$ | GPT-4o-mini | 64% | 91% | 46% | 60% |
| $k=3$ | GPT-4o | 77% | 97% | 77% | 84% |
| $k=5$ | GPT-4o-mini | 68% | 89% | 45% | 59% |
| $k=5$ | GPT-4o | 74% | 96% | 77% | 84% |

The GPT-4o row at $k=5$ is the cleanest version of the paper’s argument: SEAL-RAG achieves 74% Judge-EM, 96% precision, 77% recall, and 84% F1. The system is not merely shrinking the context to protect precision. It is actively repairing the set so that precision and useful coverage coexist.

That distinction matters. A naive “small context” strategy can become brittle because it misses necessary evidence. A naive “large context” strategy can become noisy because it includes too much. SEAL-RAG’s claim is more specific: keep the final context fixed, but use the controller to improve what gets to occupy it.

Adaptive-k is a selector; SEAL-RAG is a repairer

The comparison with Adaptive-$k$ is especially useful for practitioners because Adaptive-$k$ sounds like the obvious answer to context dilution. Retrieve a larger pool, then dynamically choose the right cutoff. Why not let the system decide how much context is enough?

Because selection is bounded by the initial pool.

The paper compares SEAL-RAG with Adaptive-$k$ on 2WikiMultiHopQA. Adaptive-$k$ has two variants: a no-buffer version that aggressively cuts the context, and a buffer version that keeps extra documents for safety.

| Model | Method | Judge-EM | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- |
| GPT-4o-mini | Adaptive-$k$ No Buffer | 40.5% | 86% | 61% | 65% |
| GPT-4o-mini | Adaptive-$k$ Buffer | 60.5% | 26% | 77% | 38% |
| GPT-4o-mini | SEAL-RAG | 68.0% | 89% | 45% | 59% |
| GPT-4o | Adaptive-$k$ No Buffer | 41.5% | 86% | 61% | 65% |
| GPT-4o | Adaptive-$k$ Buffer | 66.5% | 26% | 77% | 38% |
| GPT-4o | SEAL-RAG | 74.5% | 96% | 77% | 84% |

This table is a compact business lesson.

The no-buffer Adaptive-$k$ variant protects precision but misses too much answer-bearing evidence. The buffer variant improves recall, but precision collapses under the extra documents. SEAL-RAG does something different: it can leave the initial candidate pool through micro-queries. That lets it repair missing evidence rather than merely select among initially retrieved candidates.
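The selector's limit is easy to see in code. The sketch below uses a largest-score-gap cutoff with an optional buffer as one plausible way to build such a selector; that rule, the `gap_cutoff` name, the pool, and the scores are assumptions for illustration, not a description of the Adaptive-$k$ implementation evaluated in the paper.

```python
from typing import List

def gap_cutoff(scores: List[float], buffer: int = 0) -> List[int]:
    """Keep items ranked before the largest score drop, plus `buffer` extras."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [scores[i] for i in order]
    if len(ranked) < 2:
        return order
    drops = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    cut = drops.index(max(drops)) + 1
    return order[: min(len(order), cut + buffer)]

# Hypothetical initial pool for the Parklife question; the bridge page is absent.
pool = ["Parklife album page", "Blur band page", "Britpop overview", "Damon Albarn biography"]
scores = [0.91, 0.88, 0.52, 0.50]

no_buffer = [pool[i] for i in gap_cutoff(scores)]
with_buffer = [pool[i] for i in gap_cutoff(scores, buffer=2)]

bridge = "1994 Winter Olympics page"
selector_can_reach_bridge = bridge in no_buffer or bridge in with_buffer
```

Whatever the cutoff, the output is a subset of `pool`. If the bridge page never entered the pool, no choice of cutoff or buffer can surface it, which is the gap a micro-query is meant to close.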

The GPT-4o-mini rows also deserve careful reading. SEAL-RAG’s recall is lower than that of the Adaptive-$k$ buffer variant (45% versus 77%), but its Judge-EM is higher (68.0% versus 60.5%). That means the cleaner evidence set compensates for lower raw recall in that setting. Again, the paper is not saying recall is unimportant. It is saying recall becomes less useful when the generator is forced to reason through noise.

The ablation shows this is not just “more loops”

One possible objection is that SEAL-RAG wins because it performs extra work. Maybe any iterative method would improve if given enough loops.

The loop-budget ablation argues against that lazy conclusion. On HotpotQA at $k=1$, the authors vary the repair loop budget. The average Judge-EM rises from 29% with no repair loop to 64% after one loop, 68% after two, and 70% after three. Across the tested backbones, the gain from zero to three loops ranges from +32 to +48 percentage points, but most of the improvement arrives immediately after the first repair step.

| Loop budget | Average Judge-EM on HotpotQA $k=1$ |
| --- | --- |
| 0 | 29% |
| 1 | 64% |
| 2 | 68% |
| 3 | 70% |
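The marginal gain per loop makes the shape of this curve explicit. The calculation below uses only the four averages reported above; the variable names are illustrative.

```python
# Average Judge-EM (%) on HotpotQA k=1 by repair-loop budget, as reported.
judge_em = {0: 29, 1: 64, 2: 68, 3: 70}

# Percentage-point gain contributed by each additional loop.
marginal = {n: judge_em[n] - judge_em[n - 1] for n in (1, 2, 3)}

total_gain = judge_em[3] - judge_em[0]
first_loop_share = marginal[1] / total_gain   # fraction of the total gain from loop 1

print(marginal, total_gain, round(first_loop_share, 2))
```

The first loop contributes 35 of the 41 points, about 85% of the total, while the second and third add 4 and 2 points respectively.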

This is better read as an efficiency result than as an agentic-depth result. SEAL-RAG is not succeeding because it wanders through a long chain of self-reflection. It succeeds because the first structured gap is often enough: identify the missing bridge, issue a targeted micro-query, replace the wrong slot.

That matters for production. Long reflective loops create unstable latency and messy observability. A one-or-two-step repair process with a fixed final evidence budget is much easier to monitor, price, and explain.

The qualitative case studies support the same mechanism. In one example, the question asks for the Olympic host city in the year Blur released Parklife. The system extracts the release year 1994, flags the missing Olympic host entity, queries “1994 Olympic Games host city,” retrieves the 1994 Winter Olympics page, and replaces redundant album detail with the answer-bearing evidence. In another example, a comparison question requires birth dates for Margaret Atwood and Sofia Coppola. SEAL-RAG identifies the missing birth-date qualifier for Coppola and retrieves the relevant biography.

These examples are not separate proof. They are mechanism illustrations. Their purpose is to show how the controller turns “the answer is missing” into a specific repair action instead of a broad query rewrite.

What this means for enterprise RAG design

The direct paper result is about multi-hop QA benchmarks. The business inference is broader but should be stated carefully.

SEAL-RAG suggests that enterprise RAG teams should manage context slots as scarce evidence capacity. This applies most clearly when questions require connecting entities, dates, relationships, or attributes across documents. Think contract review, technical troubleshooting, compliance research, scientific literature review, procurement analysis, and internal policy QA. These workflows often fail not because no relevant document exists, but because the system retrieves plausible fragments without assembling the right chain.

A practical implementation inspired by SEAL-RAG would not necessarily copy every component. It would adopt the operating discipline:

| Design question | Additive RAG instinct | SEAL-RAG-inspired discipline |
| --- | --- | --- |
| What happens when the first retrieval is insufficient? | Retrieve more and append. | Diagnose the missing entity, relation, or qualifier. |
| How is context quality measured? | Similarity score or general relevance. | Gap closure, corroboration, novelty, and redundancy. |
| What happens when new evidence arrives? | It joins the prompt. | It must beat a current passage to enter the fixed buffer. |
| What does the generator see? | A growing pile of possibly relevant text. | A bounded evidence set curated for the question. |
| What is the cost profile? | Variable and expansion-prone. | More predictable final generation cost. |

For business users, the strongest implication is not “use SEAL-RAG tomorrow.” It is this: stop evaluating RAG only by whether it retrieves something relevant somewhere. Evaluate whether the final evidence packet is sufficient, compact, and non-redundant.

A compliance assistant does not need twenty passages that all mention the same regulation. It needs the controlling provision, the relevant exception, the date or jurisdiction qualifier, and perhaps one corroborating source. A technical-support bot does not need five forum posts with similar symptoms. It needs the product version, the error condition, the root cause, and the fix. A research assistant does not need a mountain of abstracts. It needs the exact study, method, finding, and limitation that answer the question.

The pattern is simple: when the task has a chain, the evidence packet should reflect the chain.

What the paper directly shows, and what Cognaptus infers

It is worth separating evidence from interpretation.

| Layer | Claim | Support level |
| --- | --- | --- |
| Directly shown by the paper | SEAL-RAG improves Judge-EM and evidence precision over Basic RAG, Self-RAG, CRAG, and Adaptive-$k$ in the tested HotpotQA and 2WikiMultiHopQA settings. | Direct benchmark evidence. |
| Directly shown by the paper | Replacement under fixed $k$ can outperform additive correction and passive pruning for multi-hop QA. | Direct comparison under a shared environment. |
| Directly shown by the paper | The first repair loop delivers most of the measured improvement in the HotpotQA $k=1$ ablation. | Direct ablation evidence. |
| Cognaptus inference | Enterprise RAG systems should treat final context as a curated evidence packet, not an expandable memory buffer. | Strong practical inference from the mechanism and results. |
| Cognaptus inference | Legal, compliance, technical-support, and research workflows may benefit from explicit gap diagnosis and replacement policies. | Plausible but domain-dependent. |
| Still uncertain | Whether the same gains hold in messy enterprise corpora with weak metadata, inconsistent aliases, scanned PDFs, tables, or confidential documents. | Requires domain-specific validation. |

This distinction is not academic hair-splitting. It prevents the usual slide-deck disease: a benchmark result becomes a universal product claim by lunchtime.

The paper gives a strong design signal. It does not remove the need to test on your own corpus, your own query distribution, and your own failure costs.

The boundary: not every task wants replacement

SEAL-RAG is built for fixed-budget precision in multi-hop reasoning. That is not the same as universal retrieval.

The authors identify several limitations that matter for deployment.

First, the controller depends on being able to name the missing information. Missing entities, missing relations, and missing qualifiers are categories that fit this approach well. Abstract gaps are harder. A query like “What was the general market sentiment around this strategic decision?” may not reduce cleanly into a micro-query for one missing relation. Sentiment, strategy, and institutional mood are not always ledger-friendly.

Second, fixed capacity is a deliberate trade-off. If the user genuinely asks for an exhaustive list—“show all relevant filings,” “list every supplier contract affected by this clause,” or “summarize all twenty studies in this review”—then replacement may be the wrong default. The system should accumulate, cluster, and summarize, not evict evidence just because $k$ is full.

Third, alias handling remains a hard problem. The paper’s failure case shows an entity-linking mismatch around Apple-related aliases. In enterprise corpora, aliases are worse: product nicknames, old company names, internal project codes, abbreviations, spelling variants, and OCR errors. An entity ledger without strong normalization can become confidently confused. Very elegant. Very dangerous.
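A normalization step in front of the ledger is the usual first defense. The sketch below is a deliberately simple illustration: the `ALIASES` table and its entries are hypothetical and are not taken from the paper's failure case, and real corpora would need far more than a lookup table.

```python
import re
from typing import Dict

# Hypothetical alias table mapping normalized surface forms to one canonical name.
ALIASES: Dict[str, str] = {
    "apple inc": "Apple Inc.",
    "apple computer": "Apple Inc.",
    "aapl": "Apple Inc.",
}

def normalize(name: str) -> str:
    """Map a surface form to its canonical entity name, or return it unchanged."""
    key = re.sub(r"[^a-z0-9 ]", "", name.lower())   # drop punctuation
    key = re.sub(r"\s+", " ", key).strip()          # collapse whitespace
    return ALIASES.get(key, name.strip())

canonical = [normalize(n) for n in ["Apple Inc.", "Apple  Computer", "AAPL", "Apple Records"]]
```

The last input shows the risk in both directions: “Apple Records” is a different entity and correctly stays separate here, but a looser matching rule would have merged it, and a stricter one would have split the first three.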

Fourth, the evaluation uses LLM-based judging. The paper applies a strict protocol and statistical tests, but high-stakes business settings should still include human evaluation, audit trails, and task-specific acceptance tests. A clean benchmark win is a reason to prototype, not a reason to remove oversight.

These boundaries do not weaken the paper’s main idea. They clarify where the idea should be used. SEAL-RAG is best understood as a controller pattern for precision-sensitive, multi-hop evidence assembly under budget. It is not an all-purpose replacement for search, summarization, discovery, or exhaustive review.

The larger lesson: RAG needs evidence operations

The fashionable RAG question is often: which embedding model, which vector database, which reranker, which context length?

Those choices matter. But this paper points to a less glamorous layer: evidence operations.

Evidence operations asks what happens after retrieval but before generation. Is the evidence sufficient? Which entity is missing? Which relation is unsupported? Which passage is redundant? Which candidate deserves a slot? Which passage should be evicted? When should the system stop?

Most production RAG failures live in this layer. The retriever returns something plausible. The generator sounds confident. The interface looks finished. But the system never assembled the actual evidence chain. It retrieved around the answer instead of into the answer.

SEAL-RAG’s value is that it gives this middle layer a concrete shape. The entity ledger makes evidence state visible. The sufficiency gate turns “seems enough” into a decision. The micro-query policy turns missing information into targeted search. The replacement rule turns context management into a budgeted optimization problem.

That is not as glamorous as announcing a bigger model. It is more useful.

Conclusion: throwing things away is an intelligence feature

The paper’s title says “Replace, Don’t Expand,” and for once the slogan is not doing all the work. SEAL-RAG demonstrates a specific mechanism: fixed-budget evidence assembly through entity-aware gap diagnosis and replacement. The evidence suggests that, for multi-hop QA, this mechanism improves both answer correctness and evidence precision under strict context budgets.

The uncomfortable lesson is that many RAG systems are too polite to delete. They retrieve a weak passage, discover a better one, and then keep both because throwing things away feels risky. But in reasoning systems, clutter is not neutral. A distractor consumes attention. A duplicate occupies a slot. A broad rewrite can bury the bridge fact under topical fog.

In enterprise AI, the future of RAG will not be won only by larger context windows. It will be won by systems that know what evidence they have, what evidence they lack, and what evidence no longer deserves to stay.

SEAL-RAG is a useful reminder that intelligence is not just the ability to collect information. Sometimes it is the ability to throw the wrong information away.



[1] Moshe Lahmy and Roi Yozevitch, “Replace, Don’t Expand: Mitigating Context Dilution in Multi-Hop RAG via Fixed-Budget Evidence Assembly,” arXiv:2512.10787, https://arxiv.org/html/2512.10787