Opening — Why this matters now
RAG systems are having an identity crisis. On paper, retrieval-augmented generation is supposed to ground large language models in facts. In practice, when queries require multi-hop reasoning, most systems panic and start hoarding context like it’s a survival skill. Add more passages. Expand the window. Hope the model figures it out.
It doesn’t.
What actually happens—especially under real-world latency and token budgets—is context dilution: relevant evidence gets buried under polite but useless paragraphs, and the model’s reasoning collapses. The paper “Replace, Don’t Expand: Mitigating Context Dilution in Multi-Hop RAG via Fixed-Budget Evidence Assembly” takes direct aim at this failure mode and proposes something refreshingly disciplined: stop adding documents, and start replacing them.
Background — Context windows are not junk drawers
Classic RAG follows a simple rhythm: retrieve top‑k documents, stuff them into the prompt, generate an answer. This works passably for single-hop questions. But multi-hop queries—where one fact is needed to discover the next—expose a structural weakness. If the initial retrieval misses a bridge entity, generation fails.
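To fix ideas, here is that rhythm as a minimal sketch; `search` and `generate` are hypothetical stand-ins for any retriever and LLM client, not components from the paper:

```python
from typing import Callable, List

def basic_rag(
    question: str,
    search: Callable[[str, int], List[str]],  # hypothetical retriever: (query, k) -> passages
    generate: Callable[[str], str],           # hypothetical LLM call: prompt -> answer
    k: int = 5,
) -> str:
    """Classic RAG rhythm: retrieve top-k, stuff the prompt, generate."""
    passages = search(question, k)                   # retrieve
    context = "\n\n".join(passages)                  # stuff
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)                          # generate
```

Nothing in this sketch checks whether the k passages are any good, which is exactly the weakness multi-hop queries exploit.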
Recent fixes like Self‑RAG and CRAG introduce feedback loops: critique the answer, re-query, append more context. Adaptive‑k takes a different route, pruning a large retrieved pool down to a smaller subset. Despite surface differences, these approaches share an assumption that more context is either harmless or actively helpful.
Empirically, this assumption is wrong. Long-context models struggle to ignore irrelevant text, and precision collapses as k grows: if a two-hop question needs exactly two gold passages, a ten-passage window caps evidence precision at 20% even when retrieval succeeds. The system technically “has the answer,” but cannot see it clearly enough to use it.
Analysis — What SEAL‑RAG actually changes
The core contribution of this paper is not a new retriever, extractor, or verifier. It is a controller—SEAL‑RAG—that treats the evidence window as a fixed-capacity resource rather than an expandable buffer.
SEAL‑RAG operates through a loop the authors call Search → Extract → Assess → Loop, sketched in code after the list:
- Search initializes a fixed top‑k evidence set.
- Extract converts unstructured text into an explicit entity ledger—entities, relations, and qualifiers that are actually supported by the evidence.
- Assess checks whether the ledger is sufficient to answer the question.
- Loop triggers targeted micro‑queries only for what is missing.
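Condensed to code, the control flow might look like the sketch below. Every callable (`search`, `extract`, `assess`, `micro_query`, `replace`) is a hypothetical stand-in for a paper component, and the `Ledger` fields are illustrative; this sketches the loop’s shape, not the authors’ implementation:

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

@dataclass
class Ledger:
    """Entity ledger: what the current evidence set actually supports."""
    entities: Set[str] = field(default_factory=set)
    relations: Set[Tuple[str, str, str]] = field(default_factory=set)  # (head, rel, tail)
    missing: Set[str] = field(default_factory=set)     # hops still unresolved
    uncertain: Set[str] = field(default_factory=set)   # facts needing corroboration

def seal_rag(question: str, search, extract, assess, micro_query, replace,
             k: int = 3, max_rounds: int = 4) -> List[str]:
    evidence = search(question, k)          # Search: initialize the fixed top-k window
    for _ in range(max_rounds):
        ledger = extract(evidence)          # Extract: build the entity ledger
        if assess(question, ledger):        # Assess: ledger sufficient to answer?
            break
        for gap in sorted(ledger.missing):  # Loop: one targeted micro-query per gap
            candidate = micro_query(gap)
            evidence = replace(evidence, candidate, ledger)  # swap in, never append
    return evidence
```

The point to notice is what the loop never does: it never appends. Every admission goes through `replace`.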
Here is the key shift: when new evidence arrives, something else must leave.
Instead of appending passages, SEAL‑RAG scores candidates with an entity‑first utility function that balances:
- Gap coverage (does this close a missing hop?)
- Corroboration (does it confirm uncertain facts?)
- Novelty (is it non-redundant?)
- Redundancy penalties (is this just saying the same thing again?)
Low‑utility passages are evicted. Context size stays constant. Information density increases.
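A minimal sketch of that scoring-and-eviction step, continuing the loop sketch above. The weights, the toy `entities_of` extractor, and the exact definitions of each term are assumptions for illustration, not the paper’s utility function:

```python
import re
from typing import List, Set

def entities_of(passage: str) -> Set[str]:
    # Toy entity spotter (capitalized tokens); a real system would run NER.
    return set(re.findall(r"\b[A-Z][a-z]+\b", passage))

def utility(passage: str, window: List[str], ledger) -> float:
    ents = entities_of(passage)
    others = set().union(*(entities_of(p) for p in window if p is not passage))
    gap_coverage  = len(ents & ledger.missing)    # closes a missing hop?
    corroboration = len(ents & ledger.uncertain)  # confirms an uncertain fact?
    novelty       = len(ents - others)            # adds anything new?
    redundancy    = len(ents & others)            # repeats what the window holds?
    return 1.0 * gap_coverage + 0.5 * corroboration + 0.3 * novelty - 0.4 * redundancy

def replace(window: List[str], candidate: str, ledger) -> List[str]:
    """Fixed-budget assembly: a candidate enters only by evicting the
    lowest-utility incumbent, so the window never grows."""
    weakest = min(window, key=lambda p: utility(p, window, ledger))
    if utility(candidate, window, ledger) > utility(weakest, window, ledger):
        return [p for p in window if p is not weakest] + [candidate]
    return window
```

The design choice worth copying is the hard budget: `replace` can decline a candidate, but it can never grow the window.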
Findings — Precision beats accumulation
The results are uncomfortable reading for anyone who believes in brute-force context expansion.
HotpotQA (multi-hop, k = 1–3)
| Method | k | Accuracy (Judge‑EM) | Evidence Precision |
|---|---|---|---|
| Basic RAG | 1 | ~40% | ~85% |
| Self‑RAG | 1 | ~50–60% | ~70% |
| CRAG | 1 | ~55–58% | ~40–55% |
| SEAL‑RAG | 1 | 62–73% | 87–91% |
With only one evidence slot, additive methods routinely retrieve the wrong document and then proudly keep it. SEAL‑RAG replaces it.
2WikiMultiHopQA (k = 5, the dilution test)
| Method | Accuracy | Precision |
|---|---|---|
| Basic RAG | ~62% | ~34% |
| Self‑RAG | ~60% | ~45–63% |
| CRAG | ~55–64% | 11–22% |
| Adaptive‑k | ~41–66% | 26–86% |
| SEAL‑RAG | 74% | 96% |
CRAG’s precision collapse is particularly instructive: adding more context actively harms reasoning. SEAL‑RAG maintains precision precisely because it refuses to grow the window.
Implications — Controllers matter more than models
This paper quietly reframes how practitioners should think about RAG systems:
- Context windows are scarce cognitive resources, not free memory.
- Recall without precision is useless—the model cannot reason over noise.
- Repair beats selection: pruning a bad pool is inferior to fetching what is missing.
- Predictable cost profiles matter: fixed‑k replacement keeps latency and token usage bounded.
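On the last point, a back-of-envelope comparison makes the cost argument concrete; the numbers are made up for illustration:

```python
# Illustrative numbers only: ~200 tokens per passage, 4 refinement rounds,
# 2 new candidates admitted per round, fixed window of k = 5.
PASSAGE_TOKENS, ROUNDS, NEW_PER_ROUND, K = 200, 4, 2, 5

additive_prompt = (K + ROUNDS * NEW_PER_ROUND) * PASSAGE_TOKENS  # window keeps growing
fixed_k_prompt = K * PASSAGE_TOKENS                              # window never grows

print(additive_prompt)  # 2600 tokens by the final round
print(fixed_k_prompt)   # 1000 tokens every round, bounded in advance
```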
For production systems—customer support, compliance, research assistants—SEAL‑RAG’s discipline is arguably more valuable than another marginally larger model.
Conclusion — Less context, better answers
The most interesting thing about SEAL‑RAG is how unglamorous it is. No new training. No architectural bravado. Just a refusal to believe that accumulation equals intelligence.
By replacing instead of expanding, SEAL‑RAG turns multi-hop RAG from a cluttered inbox into a curated brief. And in doing so, it exposes an uncomfortable truth: most RAG failures are not retrieval failures—they are control failures.
Cognaptus: Automate the Present, Incubate the Future.