Opening — Why this matters now
Multi-hop reasoning has quietly become one of the most expensive habits in modern AI systems. Every additional hop—every “and then what?”—typically triggers another retrieval, another prompt expansion, another LLM call. Accuracy improves, yes, but so does the bill.
CompactRAG enters this conversation with a refreshingly unfashionable claim: most of this cost is structural, not inevitable. If you stop forcing LLMs to repeatedly reread the same knowledge, multi-hop reasoning does not have to scale linearly in tokens—or in money.
Background — The inefficiency tax of iterative RAG
Classic RAG pipelines were never designed for compositional reasoning. When multi-hop benchmarks like HotpotQA and 2WikiMultiHopQA arrived, the community responded by stacking loops:
- Retrieve → reason → retrieve again
- Repeat until the answer stabilizes—or the context window explodes
Methods such as Self-Ask, IRCoT, and Iter-RetGen improved reasoning transparency, but at a steep price. Token usage scales with hop depth. Entity grounding degrades as later sub-questions fall back on pronouns instead of explicit names. Latency becomes unpredictable. In production settings, these are not minor inconveniences; they are deployment blockers.
Analysis — What CompactRAG actually changes
CompactRAG’s core move is an architectural change, not algorithmic flair.
1. Knowledge is compacted before inference
Instead of retrieving raw passages at inference time, CompactRAG preprocesses the corpus offline into atomic QA pairs. Each pair represents a single, minimal fact:
| Raw Corpus | Atomic QA |
|---|---|
| Paragraphs with overlap | One fact, one answer |
| Redundant context | Zero redundancy |
| Document-level | Fact-level |
This step runs once per corpus; its cost is amortized across every subsequent query.
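As a rough sketch of what that one-time pass could look like, here is an illustrative offline compaction loop. The OpenAI-style client, model name, prompt wording, and output schema are all assumptions for illustration, not the paper's implementation.

```python
# Offline, one-time pass: turn raw passages into atomic QA pairs.
# Client, model name, prompt wording, and schema are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Decompose the passage into atomic question-answer pairs. "
    "Each pair must state exactly one fact, name entities explicitly, "
    "and avoid pronouns. Return a JSON list of {\"q\": ..., \"a\": ...} objects.\n\n"
    "Passage:\n"
)

def compact_passage(passage: str) -> list[dict]:
    """One LLM call per passage, run once offline and cached."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable reader; stronger readers yield a cleaner QA base
        messages=[{"role": "user", "content": PROMPT + passage}],
    )
    return json.loads(resp.choices[0].message.content)  # assumes the model returns bare JSON

def build_qa_base(corpus: list[str]) -> list[dict]:
    """Cost scales with corpus size and is paid exactly once."""
    qa_base = []
    for passage in corpus:
        qa_base.extend(compact_passage(passage))
    return qa_base  # index these pairs (e.g. dense retrieval over the questions) for online use
```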
2. Online reasoning is LLM-light by design
At inference time, the LLM is called exactly twice:
- Question decomposition — break a complex query into dependency-ordered sub-questions
- Final synthesis — integrate resolved facts into an answer
Everything in between—retrieval, entity grounding, disambiguation—is handled by lightweight models:
- A RoBERTa-based Answer Extractor for span selection
- A Flan-T5-based Question Rewriter to eliminate entity drift
No iterative prompting. No growing context windows. No surprise token spikes.
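Put together, the online path might look like the following sketch: two chat-model calls bracketing a loop of lightweight components. The Hugging Face checkpoints are stand-ins for the paper's extractor and rewriter, and the `retrieve` and `call_llm` helpers are hypothetical hooks you would supply.

```python
# Online path: exactly two chat-model calls; everything in between is lightweight.
# Checkpoints, prompt wording, and the injected helpers are illustrative assumptions.
from transformers import pipeline

extractor = pipeline("question-answering", model="deepset/roberta-base-squad2")  # stand-in RoBERTa extractor
rewriter = pipeline("text2text-generation", model="google/flan-t5-base")         # stand-in Flan-T5 rewriter

def answer(query: str, retrieve, call_llm) -> str:
    """retrieve(sub_q) -> text from the atomic QA base; call_llm(prompt) -> str."""
    # LLM call #1: decompose into dependency-ordered sub-questions
    plan = call_llm(f"Break this question into ordered sub-questions, one per line:\n{query}")
    sub_questions = [q.strip() for q in plan.splitlines() if q.strip()]

    facts = []
    prev_answer = None
    for sub_q in sub_questions:
        if prev_answer:
            # Flan-T5 rewrites the sub-question to name the previous answer explicitly (no pronouns)
            prompt = f"Rewrite the question, replacing references with '{prev_answer}': {sub_q}"
            sub_q = rewriter(prompt, max_new_tokens=64)[0]["generated_text"]
        context = retrieve(sub_q)                             # lookup over the offline QA base
        result = extractor(question=sub_q, context=context)   # RoBERTa span selection, no LLM call
        prev_answer = result["answer"]
        facts.append(f"{sub_q} -> {prev_answer}")

    # LLM call #2: synthesize the final answer from the resolved facts
    return call_llm("Answer the original question using these facts:\n"
                    + "\n".join(facts) + f"\n\nQuestion: {query}")
```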
Findings — Accuracy without the token hemorrhage
Across HotpotQA, 2WikiMultiHopQA, and MuSiQue, CompactRAG delivers a rare combination: competitive accuracy with dramatically lower cost.
Token efficiency snapshot
| Method | Tokens / Query |
|---|---|
| IRCoT | ~10.2K |
| Iter-RetGen | ~4.7K |
| CompactRAG | ~1.9K |
Accuracy remains within striking distance of heavier baselines—and improves further when the offline QA base is built with stronger readers like GPT-4.
The implication is uncomfortable for brute-force scaling strategies: better structure beats bigger prompts.
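To make the gap concrete, here is a back-of-envelope projection using the per-query figures above; the $1 per million tokens price is an illustrative assumption, not a quoted rate.

```python
# Rough cost projection from the per-query token figures in the table above.
# PRICE_PER_MTOK is an illustrative assumption, not a real quote.
PRICE_PER_MTOK = 1.00
TOKENS_PER_QUERY = {"IRCoT": 10_200, "Iter-RetGen": 4_700, "CompactRAG": 1_900}

queries = 1_000_000
for method, tokens in TOKENS_PER_QUERY.items():
    cost = tokens * queries / 1_000_000 * PRICE_PER_MTOK
    print(f"{method:>12}: ${cost:,.0f} for {queries:,} queries")

# At this assumed rate: roughly $10,200 (IRCoT), $4,700 (Iter-RetGen),
# and $1,900 (CompactRAG) per million queries, about a 5x gap at any volume.
```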
Implications — Why this matters beyond benchmarks
CompactRAG quietly reframes how we should think about RAG systems in production:
- Cost predictability: Fixed LLM calls mean stable inference budgets
- Scalability: Offline preprocessing scales with data size, not query volume
- Model agnosticism: No reliance on hidden activations or proprietary signals
For enterprises deploying RAG at scale, this is less a research tweak and more a budgeting strategy.
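The budgeting point can be reduced to a toy cost model: a one-time offline term that grows with corpus size plus a flat per-query term, so spend rises linearly and predictably with traffic. All constants below are placeholders, not figures from the paper.

```python
# Toy cost model: offline compaction is paid once per corpus,
# online cost is a fixed two-call budget per query. All constants are placeholders.
def total_cost(num_queries: int,
               corpus_passages: int = 100_000,
               offline_cost_per_passage: float = 0.0005,
               online_cost_per_query: float = 0.002) -> float:
    offline = corpus_passages * offline_cost_per_passage  # grows with data size only
    online = num_queries * online_cost_per_query          # flat per query: two LLM calls, bounded prompts
    return offline + online

for q in (10_000, 100_000, 1_000_000):
    print(f"{q:>9,} queries -> ${total_cost(q):,.0f}")
```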
Conclusion — The case for boring efficiency
CompactRAG is not flashy. It does not introduce exotic planners or clever uncertainty heuristics. Instead, it does something rarer in modern AI research: it respects operational reality.
By separating knowledge preparation from reasoning execution, it shows that multi-hop QA does not have to be a token bonfire. As LLM costs come under sharper scrutiny, architectures like this may age better than yet another iterative loop.
Cognaptus: Automate the Present, Incubate the Future.