Opening — Why this matters now

Multi-hop reasoning has quietly become one of the most expensive habits in modern AI systems. Every additional hop—every “and then what?”—typically triggers another retrieval, another prompt expansion, another LLM call. Accuracy improves, yes, but so does the bill.

CompactRAG enters this conversation with a refreshingly unfashionable claim: most of this cost is structural, not inevitable. If you stop forcing LLMs to repeatedly reread the same knowledge, multi-hop reasoning does not have to scale linearly in tokens—or in money.

Background — The inefficiency tax of iterative RAG

Classic RAG pipelines were never designed for compositional reasoning. When multi-hop benchmarks like HotpotQA and 2WikiMultiHopQA arrived, the community responded by stacking loops:

  • Retrieve → reason → retrieve again
  • Repeat until the answer stabilizes—or the context window explodes

Methods such as Self-Ask, IRCoT, and Iter-RetGen improved reasoning transparency, but at a steep price. Token usage scales with hop depth. Entity grounding degrades as pronouns creep in. Latency becomes unpredictable. In production settings, these are not minor inconveniences—they are deployment blockers.

Analysis — What CompactRAG actually changes

CompactRAG’s core move is architectural restructuring, not algorithmic flair.

1. Knowledge is compacted before inference

Instead of retrieving raw passages at inference time, CompactRAG preprocesses the corpus offline into atomic QA pairs. Each pair represents a single, minimal fact:

Raw Corpus               | Atomic QA
-------------------------|----------------------
Paragraphs with overlap  | One fact, one answer
Redundant context        | Zero redundancy
Document-level           | Fact-level

This step is done once; its cost is amortized across every subsequent query.
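To make the offline step concrete, here is a minimal sketch of what compaction could look like, assuming an OpenAI-compatible client and a sentence-transformer encoder. The prompt wording, model choices, and helper names (compact_passage, build_qa_index) are illustrative placeholders, not CompactRAG’s published implementation.

```python
# Offline compaction sketch: turn raw passages into atomic QA pairs once,
# then embed the questions for retrieval. Prompt text, model names, and
# helper names are illustrative assumptions, not the paper's implementation.
import json

from openai import OpenAI
from sentence_transformers import SentenceTransformer

client = OpenAI()                                    # assumes OPENAI_API_KEY is set
embedder = SentenceTransformer("all-MiniLM-L6-v2")   # any sentence encoder would do

def compact_passage(passage: str) -> list[dict]:
    """One offline LLM call per passage; this cost is paid once, not per query."""
    prompt = (
        "Rewrite the passage below as a list of atomic question-answer pairs. "
        "Each pair must state exactly one minimal fact. Return a JSON array of "
        "objects with 'question' and 'answer' keys.\n\nPassage:\n" + passage
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)

def build_qa_index(corpus: list[str]):
    """Flatten every passage into atomic QA pairs and embed their questions."""
    pairs = [qa for passage in corpus for qa in compact_passage(passage)]
    vectors = embedder.encode([qa["question"] for qa in pairs])
    return pairs, vectors    # in practice, load these into a vector store
```

Because this runs once per corpus rather than once per query, a stronger (and pricier) reader can be swapped in here without touching per-query cost.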

2. Online reasoning is LLM-light by design

At inference time, the LLM is called exactly twice:

  1. Question decomposition — break a complex query into dependency-ordered sub-questions
  2. Final synthesis — integrate resolved facts into an answer

Everything in between—retrieval, entity grounding, disambiguation—is handled by lightweight models:

  • A RoBERTa-based Answer Extractor for span selection
  • A Flan-T5-based Question Rewriter to eliminate entity drift

No iterative prompting. No growing context windows. No surprise token spikes.
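The fixed two-call budget is easiest to see in code. Below is a minimal sketch of the online path, assuming the atomic QA index from the previous sketch is exposed through a retrieve(sub_question) callable. The checkpoints (deepset/roberta-base-squad2, google/flan-t5-base), the prompts, and the function names are stand-ins for illustration, not the exact components the authors use.

```python
# Online inference sketch: exactly two LLM calls per query, with retrieval,
# span extraction, and question rewriting handled by small local models.
# Checkpoints, prompts, and helper names are assumptions for illustration.
from openai import OpenAI
from transformers import pipeline

client = OpenAI()
extractor = pipeline("question-answering", model="deepset/roberta-base-squad2")  # RoBERTa span extractor
rewriter = pipeline("text2text-generation", model="google/flan-t5-base")         # Flan-T5 rewriter

def llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def answer(question: str, retrieve) -> str:
    # LLM call 1: decompose into dependency-ordered sub-questions, one per line.
    subqs = llm(
        "Decompose the question into ordered sub-questions, one per line, where "
        "later sub-questions may refer to earlier answers.\n\n" + question
    ).splitlines()

    facts = []
    for subq in subqs:
        if facts:
            # Flan-T5 rewrites the sub-question so it is self-contained,
            # grounding any pronoun or reference in the previous answer.
            subq = rewriter(
                "Rewrite this question so it is self-contained, given that the "
                f"previous answer was '{facts[-1]}': {subq}"
            )[0]["generated_text"]
        context = " ".join(retrieve(subq))                  # top-k atomic QA pairs as text
        span = extractor(question=subq, context=context)    # RoBERTa selects the answer span
        facts.append(span["answer"])

    # LLM call 2: synthesize the final answer from the resolved facts.
    return llm(
        "Answer the original question using these resolved facts.\n"
        "Question: " + question + "\nFacts: " + "; ".join(facts)
    )
```

Whatever the hop depth, the LLM sees the original question once for decomposition and a short list of resolved facts at the end; the hops themselves never inflate a prompt.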

Findings — Accuracy without the token hemorrhage

Across HotpotQA, 2WikiMultiHopQA, and MuSiQue, CompactRAG delivers a rare combination: competitive accuracy with dramatically lower cost.

Token efficiency snapshot

Method       | Tokens / Query
-------------|---------------
IRCoT        | ~10.2K
Iter-RetGen  | ~4.7K
CompactRAG   | ~1.9K

Accuracy remains within striking distance of heavier baselines—and improves further when the offline QA base is built with stronger readers like GPT-4.

The implication is uncomfortable for brute-force scaling strategies: better structure beats bigger prompts.

Implications — Why this matters beyond benchmarks

CompactRAG quietly reframes how we should think about RAG systems in production:

  • Cost predictability: Fixed LLM calls mean stable inference budgets
  • Scalability: Offline preprocessing scales with data size, not query volume
  • Model agnosticism: No reliance on hidden activations or proprietary signals

For enterprises deploying RAG at scale, this is less a research tweak and more a budgeting strategy.
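A back-of-envelope calculation makes the budgeting point concrete. Using the per-query token figures reported above and an assumed blended price of $0.002 per 1K tokens (a placeholder, not a quoted rate), monthly spend at one million queries would look roughly like this:

```python
# Back-of-envelope budget sketch using the per-query token figures above.
# The price per 1K tokens is an assumed placeholder, not a quoted rate.
PRICE_PER_1K_TOKENS = 0.002       # assumed blended $/1K tokens
QUERIES_PER_MONTH = 1_000_000

tokens_per_query = {"IRCoT": 10_200, "Iter-RetGen": 4_700, "CompactRAG": 1_900}

for method, tokens in tokens_per_query.items():
    monthly = tokens / 1000 * PRICE_PER_1K_TOKENS * QUERIES_PER_MONTH
    print(f"{method}: ~${monthly:,.0f} per month")
# Under these assumptions: IRCoT ~$20,400, Iter-RetGen ~$9,400, CompactRAG ~$3,800.
```

The absolute numbers depend entirely on the assumed price, but the ratio, and the fact that it stays flat as hop depth grows, is the part that matters for planning.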

Conclusion — The case for boring efficiency

CompactRAG is not flashy. It does not introduce exotic planners or clever uncertainty heuristics. Instead, it does something rarer in modern AI research: it respects operational reality.

By separating knowledge preparation from reasoning execution, it shows that multi-hop QA does not have to be a token bonfire. As LLM costs come under sharper scrutiny, architectures like this may age better than yet another iterative loop.

Cognaptus: Automate the Present, Incubate the Future.