Opening — Why this matters now

RAG systems are having an identity crisis. On paper, retrieval-augmented generation is supposed to ground large language models in facts. In practice, when queries require multi-hop reasoning, most systems panic and start hoarding context like it’s a survival skill. Add more passages. Expand the window. Hope the model figures it out.

It doesn’t.

What actually happens—especially under real-world latency and token budgets—is context dilution: relevant evidence gets buried under polite but useless paragraphs, and the model’s reasoning collapses. The paper “Replace, Don’t Expand: Mitigating Context Dilution in Multi-Hop RAG via Fixed-Budget Evidence Assembly” takes direct aim at this failure mode and proposes something refreshingly disciplined: stop adding documents, and start replacing them.

Background — Context windows are not junk drawers

Classic RAG follows a simple rhythm: retrieve top‑k documents, stuff them into the prompt, generate an answer. This works passably for single-hop questions. But multi-hop queries—where one fact is needed to discover the next—expose a structural weakness. If the initial retrieval misses a bridge entity, generation fails.
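
To make that rhythm concrete, here is a minimal sketch of the classic single-shot pipeline; the `retriever` and `llm` objects are hypothetical stand-ins for illustration, not any specific library.

```python
# Minimal sketch of classic single-shot RAG: retrieve once, stuff, generate.
# `retriever` and `llm` are hypothetical stand-ins for illustration only.

def basic_rag(question: str, retriever, llm, k: int = 5) -> str:
    passages = retriever.search(question, top_k=k)      # one retrieval pass
    context = "\n\n".join(p.text for p in passages)     # stuff everything into the prompt
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm.generate(prompt)                         # no second chance if a bridge hop is missing
```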

Recent fixes like Self‑RAG and CRAG introduce feedback loops: critique the answer, re-query, append more context. Adaptive‑k takes a different route, pruning a large retrieved pool down to a smaller subset. Despite surface differences, these approaches share an assumption that more context is either harmless or actively helpful.
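
The additive pattern is easy to caricature in code. The sketch below assumes hypothetical `critique` and `rewrite_query` helpers; the only point it makes is that every failed check grows the evidence list.

```python
# Caricature of an append-only feedback loop (Self-RAG / CRAG flavour).
# `critique` and `rewrite_query` are hypothetical helpers; the key detail
# is that `evidence` only ever grows, never shrinks.

def additive_rag(question, retriever, llm, max_rounds: int = 3):
    evidence = retriever.search(question, top_k=5)
    for _ in range(max_rounds):
        answer = llm.answer(question, evidence)
        if critique(question, answer, evidence):        # answer looks supported?
            return answer
        # The only repair move is "fetch more and append".
        evidence += retriever.search(rewrite_query(question, answer), top_k=3)
    return llm.answer(question, evidence)
```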

Empirically, this assumption is wrong. Long-context models struggle to ignore irrelevant text. Precision collapses as k grows. The system technically “has the answer,” but cannot see it clearly enough to use it.

Analysis — What SEAL‑RAG actually changes

The core contribution of this paper is not a new retriever, extractor, or verifier. It is a controller—SEAL‑RAG—that treats the evidence window as a fixed-capacity resource rather than an expandable buffer.

SEAL‑RAG operates through a loop the authors call Search → Extract → Assess → Loop (a rough code sketch follows the list):

  1. Search initializes a fixed top‑k evidence set.
  2. Extract converts unstructured text into an explicit entity ledger—entities, relations, and qualifiers that are actually supported by the evidence.
  3. Assess checks whether the ledger is sufficient to answer the question.
  4. Loop triggers targeted micro‑queries only for what is missing.
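
Here is that control flow in sketch form, with the caveat that the helper functions (`extract_ledger`, `find_gaps`, `micro_query`, `replace_lowest_utility`) are illustrative names rather than the authors' implementation.

```python
# Hedged sketch of the Search -> Extract -> Assess -> Loop controller.
# All helper names are illustrative assumptions, not the paper's code.

def seal_rag(question, retriever, llm, k: int = 5, max_rounds: int = 3):
    evidence = retriever.search(question, top_k=k)            # Search: fixed-size window
    for _ in range(max_rounds):
        ledger = extract_ledger(llm, evidence)                # Extract: entities, relations, qualifiers
        gaps = find_gaps(llm, question, ledger)               # Assess: what is still missing?
        if not gaps:
            break                                             # ledger is sufficient; stop looping
        for gap in gaps:                                      # Loop: targeted micro-queries per gap
            candidates = retriever.search(micro_query(gap), top_k=3)
            evidence = replace_lowest_utility(evidence, candidates, ledger, gaps, k)
    return llm.answer(question, evidence)
```

Note that the window holds k passages throughout: each retrieval round changes what is in it, never how big it is.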

Here is the key shift: when new evidence arrives, something else must leave.

Instead of appending passages, SEAL‑RAG scores candidates with an entity‑first utility function that balances:

  • Gap coverage (does this close a missing hop?)
  • Corroboration (does it confirm uncertain facts?)
  • Novelty (is it non-redundant?)
  • Redundancy penalties (is this just saying the same thing again?)

Low‑utility passages are evicted. Context size stays constant. Information density increases.
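
A minimal sketch of what that replacement step could look like follows; the weights and component scorers (`covers_gap`, `corroborates`, `novelty`, `redundancy`) are assumptions for illustration, not values from the paper.

```python
# Illustrative entity-first utility score plus fixed-budget eviction.
# Weights and component scorers are assumed for this sketch only.

def utility(passage, ledger, gaps, window) -> float:
    return (
        2.0 * covers_gap(passage, gaps)          # gap coverage: does it close a missing hop?
        + 1.0 * corroborates(passage, ledger)    # corroboration: does it confirm uncertain facts?
        + 0.5 * novelty(passage, window)         # novelty: does it add non-redundant information?
        - 1.0 * redundancy(passage, window)      # redundancy penalty: restating what is already there
    )

def replace_lowest_utility(window, candidates, ledger, gaps, k):
    # Score incumbents and candidates together and keep the best k:
    # admitting a new passage always means evicting an old one.
    pool = list(window) + list(candidates)
    pool.sort(key=lambda p: utility(p, ledger, gaps, window), reverse=True)
    return pool[:k]
```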

Findings — Precision beats accumulation

The results are uncomfortable reading for anyone who believes in brute-force context expansion.

HotpotQA (multi-hop, k = 1–3)

| Method | k | Accuracy (Judge‑EM) | Evidence Precision |
|---|---|---|---|
| Basic RAG | 1 | ~40% | ~85% |
| Self‑RAG | 1 | ~50–60% | ~70% |
| CRAG | 1 | ~55–58% | ~40–55% |
| SEAL‑RAG | 1 | 62–73% | 87–91% |

With only one evidence slot, additive methods routinely retrieve the wrong document and then proudly keep it. SEAL‑RAG replaces it.

2WikiMultiHopQA (k = 5, the dilution test)

| Method | Accuracy | Precision |
|---|---|---|
| Basic RAG | ~62% | ~34% |
| Self‑RAG | ~60% | ~45–63% |
| CRAG | ~55–64% | 11–22% |
| Adaptive‑k | ~41–66% | 26–86% |
| SEAL‑RAG | 74% | 96% |

CRAG’s precision collapse is particularly instructive: adding more context actively harms reasoning. SEAL‑RAG maintains precision precisely because it refuses to grow the window.

Implications — Controllers matter more than models

This paper quietly reframes how practitioners should think about RAG systems:

  • Context windows are scarce cognitive resources, not free memory.
  • Recall without precision is useless—the model cannot reason over noise.
  • Repair beats selection: pruning a bad pool is inferior to fetching what is missing.
  • Predictable cost profiles matter: fixed‑k replacement keeps latency and token usage bounded.

For production systems—customer support, compliance, research assistants—SEAL‑RAG’s discipline is arguably more valuable than another marginally larger model.

Conclusion — Less context, better answers

The most interesting thing about SEAL‑RAG is how unglamorous it is. No new training. No architectural bravado. Just a refusal to believe that accumulation equals intelligence.

By replacing instead of expanding, SEAL‑RAG turns multi-hop RAG from a cluttered inbox into a curated brief. And in doing so, it exposes an uncomfortable truth: most RAG failures are not retrieval failures—they are control failures.

Cognaptus: Automate the Present, Incubate the Future.