Ask a normal enterprise RAG system a simple factual question, and it behaves politely enough. Retrieve a few passages. Hand them to the model. Generate an answer. Fine.
Ask it a question that requires two or three steps, and the machine starts developing expensive habits.
It retrieves, reasons, retrieves again, expands the prompt, reasons again, rewrites a query, retrieves more evidence, and then asks the LLM to stitch the mess together. The architecture looks intellectually serious. The invoice looks even more serious.
CompactRAG, introduced in the 2026 paper “CompactRAG: Reducing LLM Calls and Token Overhead in Multi-Hop Question Answering,” takes aim at that habit.[1] Its central claim is refreshingly unsentimental: multi-hop reasoning does not necessarily require the LLM to participate in every hop. Some of the work currently done online, repeatedly, and expensively can be moved offline, compacted into smaller knowledge units, and reused.
That sounds simple. It is also the point. A large portion of RAG cost is not caused by reasoning itself. It is caused by letting the system rediscover, reread, and re-ground the same information at inference time.
The paper’s practical value is not that it invents a magical new reasoning engine. It does something more useful: it changes where the cost sits.
The old RAG bargain: better reasoning, worse cost behavior
Standard RAG was built around a straightforward bargain. When the model needs factual grounding, retrieve relevant context and let the LLM answer with the retrieved evidence in view. For single-hop questions, this bargain is tolerable.
Multi-hop QA breaks the calm.
A multi-hop question does not ask for one isolated fact. It asks the system to connect facts. For example: identify the person who did something, then find where that person was born; identify two related entities, then compare their dates; resolve an intermediate entity, then use that entity to retrieve the final answer.
The usual response has been iterative RAG. Methods such as Self-Ask, IRCoT, and Iter-RetGen alternate between retrieval and reasoning. The LLM decomposes or reasons over the question, retrieves evidence, updates its reasoning, retrieves again, and eventually produces a final answer.
This design has an obvious appeal. Each step can use the previous step’s output. The system appears to reason in a controlled chain rather than trying to solve the whole problem at once.
It also creates three operational problems.
First, token use grows with the number of hops. Each retrieval step adds more context. Each reasoning step consumes another prompt and another generation. The system does not just think harder; it pays harder.
Second, latency becomes less predictable. A question that looks innocent may trigger several rounds of retrieval and generation. For an internal demo, this is mildly annoying. For a production assistant serving thousands of queries, it is budgeting with a blindfold.
Third, entity grounding can drift. A decomposed question like “Who discovered penicillin?” can be followed by “Where was he born?” Humans know “he” refers to Alexander Fleming. Retrieval systems do not enjoy pronouns nearly as much. If the next retrieval query loses the entity, the system may search the wrong neighborhood and return a confident answer from the wrong evidence. Charming, in the way only automated errors can be.
CompactRAG’s contribution starts from a blunt observation: if repeated online LLM calls are the problem, stop designing the pipeline as if every hop deserves one.
CompactRAG’s core move: prepare knowledge before the user asks
CompactRAG separates the system into two stages.
The first stage happens offline. The raw corpus is processed once by an LLM and converted into an atomic QA knowledge base. Instead of storing and retrieving long raw passages, the system creates small question-answer pairs, each intended to represent a minimal factual unit.
This is not merely compression in the ZIP-file sense. It is semantic restructuring. A paragraph may contain several facts, overlapping descriptions, repeated names, and background context. CompactRAG asks an LLM reader to turn that material into fact-level QA pairs. The shift in representation looks like this:
| Raw corpus form | CompactRAG offline form |
|---|---|
| Long passages with overlapping context | Atomic QA pairs |
| Document-level retrieval units | Fact-level retrieval units |
| Redundant background repeated across passages | Smaller semantic units |
| Implicit entity references | Entity-explicit questions and answers |
The paper’s offline prompt instructs the reader model to generate independent atomic facts and QA pairs, use explicit entity names, avoid pronouns or vague references, and avoid adding information not present in the passage. The goal is to create retrieval units that look more like the questions the system will later ask.
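To make those prompt constraints concrete, here is a minimal sketch of what an atomic QA record and an "entity-explicit" check might look like. The `AtomicQA` class, its `source_id` field, and the pronoun word list are assumptions made for illustration; the paper enforces these rules through its prompt, and nothing here claims to be its implementation.

```python
import re
from dataclasses import dataclass

# Hypothetical word list: a crude proxy for "pronouns or vague references".
VAGUE_REFERENCES = {"he", "she", "it", "they", "him", "her", "them", "his", "its", "their"}


@dataclass(frozen=True)
class AtomicQA:
    question: str
    answer: str
    source_id: str  # hypothetical field: the passage this pair was derived from


def vague_terms(text: str) -> set[str]:
    """Return the pronouns from VAGUE_REFERENCES that appear as whole words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return VAGUE_REFERENCES.intersection(words)


def is_entity_explicit(pair: AtomicQA) -> bool:
    """Accept a pair only if neither the question nor the answer contains a listed pronoun."""
    return not vague_terms(pair.question) and not vague_terms(pair.answer)


good = AtomicQA("Who discovered penicillin?", "Alexander Fleming", "doc-1")
bad = AtomicQA("Where was he born?", "Darvel, Scotland", "doc-1")

print(is_entity_explicit(good))  # True
print(is_entity_explicit(bad))   # False
```

A word list like this will miss plenty of vague references and flag some harmless ones. Its only purpose is to show that "entity-explicit" can be turned into a testable property of the knowledge base rather than a hope about the prompt.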
This is the first important mechanism. CompactRAG does not try to make the online model smarter by feeding it more text. It tries to make the retrieved knowledge less wasteful before the online model sees anything.
For enterprise RAG, this is the part that should make architects pause. Many company knowledge bases are relatively stable: manuals, policies, technical documentation, procurement rules, compliance notes, product FAQs, historical ticket archives. If those sources are queried repeatedly, then paying a one-time preprocessing cost may be more rational than paying for repeated raw-passage retrieval and LLM reasoning on every user query.
The phrase “one-time” should not be romanticized. Corpora change. Documents expire. Policies get rewritten by committees with suspicious enthusiasm. But the economic idea is clear: if a knowledge base is reused often enough, offline compaction can be amortized.
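The amortization argument reduces to one line of arithmetic. The sketch below assumes a hypothetical offline construction cost of 5 million tokens, a figure invented purely for illustration, and uses the per-query token counts the paper reports for Iter-RetGen (4.7K) and CompactRAG (1.9K). It also treats all tokens as equally priced, which real deployments will not.

```python
import math


def break_even_queries(offline_tokens: int, baseline_per_query: int, compact_per_query: int) -> int:
    """Smallest query count at which offline cost plus compact inference
    is no more expensive (in tokens) than running the baseline."""
    saving = baseline_per_query - compact_per_query
    if saving <= 0:
        raise ValueError("no per-query saving, so the offline cost never amortizes")
    return math.ceil(offline_tokens / saving)


# offline_tokens is a hypothetical number, not taken from the paper.
queries = break_even_queries(offline_tokens=5_000_000,
                             baseline_per_query=4_700,
                             compact_per_query=1_900)
print(queries)  # 1786
```

Every corpus refresh resets part of the offline cost, so in practice the break-even count should be recomputed per refresh cycle rather than once.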
Online inference becomes a two-call pipeline
Once the atomic QA base exists, CompactRAG handles each user query online with a fixed number of main LLM calls: two.
The first LLM call decomposes the user’s complex question into dependency-ordered sub-questions. The second LLM call synthesizes the final answer after the sub-questions have been resolved.
Everything between those two calls is handled by retrieval and lightweight modules.
The online flow looks like this:
User multi-hop question
↓
LLM call 1: decompose into dependency-ordered sub-questions
↓
Retrieve atomic QA pairs for each sub-question
↓
RoBERTa-based Answer Extractor selects answer spans
↓
Flan-T5-small Sub-Question Rewriter preserves entity grounding
↓
LLM call 2: synthesize final answer from resolved evidence
The important phrase is “fixed number.” In CompactRAG, the main LLM is invoked twice regardless of the number of reasoning hops. That is the design contrast with iterative RAG, where deeper reasoning normally means more model calls.
This does not mean computation disappears. Retrieval still happens. Lightweight models still run. Offline QA construction still costs tokens. But the expensive online LLM loop is no longer allowed to expand freely.
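The control flow is easier to see in code than in prose. The following is a minimal sketch under loud assumptions: the LLM, retriever, extractor, and rewriter are all toy stubs, the prompt prefixes are invented, and each sub-question is assumed to depend only on the answer to the one before it. The only point it demonstrates is that the main LLM is called twice no matter how many sub-questions sit in between.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CountingLLM:
    """Wraps a text-in, text-out callable and counts main-LLM invocations."""
    fn: Callable[[str], str]
    calls: int = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return self.fn(prompt)


def answer_question(question, llm, retrieve, extract, rewrite):
    # LLM call 1: decompose into ordered sub-questions, one per line.
    sub_questions = llm(f"DECOMPOSE: {question}").splitlines()
    resolved, previous_answer = [], None
    for sub_q in sub_questions:
        if previous_answer is not None:
            sub_q = rewrite(sub_q, previous_answer)  # lightweight module, not the main LLM
        candidates = retrieve(sub_q)                  # atomic QA pairs
        previous_answer = extract(sub_q, candidates)  # lightweight module, not the main LLM
        resolved.append((sub_q, previous_answer))
    # LLM call 2: synthesize the final answer from the resolved evidence.
    evidence = "\n".join(f"{q} -> {a}" for q, a in resolved)
    return llm(f"SYNTHESIZE: {question}\n{evidence}")


# Toy stand-ins, all hypothetical.
KB = {
    "Who discovered penicillin?": "Alexander Fleming",
    "Where was Alexander Fleming born?": "Darvel, Scotland",
}


def fake_llm(prompt: str) -> str:
    if prompt.startswith("DECOMPOSE"):
        return "Who discovered penicillin?\nWhere was he born?"
    # Synthesis stub: return the answer on the last evidence line.
    return prompt.splitlines()[-1].split(" -> ")[1]


def retrieve(sub_q):
    return [(sub_q, KB[sub_q])] if sub_q in KB else []


def extract(sub_q, candidates):
    return candidates[0][1] if candidates else ""


def rewrite(sub_q, entity):
    return sub_q.replace(" he ", f" {entity} ")


llm = CountingLLM(fake_llm)
answer = answer_question("Where was the discoverer of penicillin born?",
                         llm, retrieve, extract, rewrite)
print(answer, llm.calls)  # Darvel, Scotland 2
```

Adding a third or fourth sub-question to the decomposition lengthens the loop but leaves `llm.calls` at 2, which is the property the paper's design is built around.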
For business use, that distinction matters more than a benchmark leaderboard position. A system with predictable inference cost can be planned, priced, monitored, and scaled. A system whose cost grows quietly with query complexity is harder to manage, especially when users do what users always do: ask vague questions and then blame the system for taking them literally.
The answer extractor and rewriter are not decoration
The paper’s two lightweight online modules are easy to skip over, but they are where CompactRAG protects itself from becoming a brittle decomposition trick.
The Answer Extractor is based on RoBERTa-base. Given a sub-question and retrieved candidate QA pairs, it predicts the answer span inside the retrieved evidence. Its job is not to generate a fluent answer. Its job is to identify the relevant entity or fact under retrieval noise.
The Sub-Question Rewriter uses Flan-T5-small. Its job is to rewrite ambiguous sub-questions by inserting the resolved entity from a previous step. If the previous step resolved to Albert Einstein, “Where was he born?” becomes “Where was Albert Einstein born?” The example is simple because the failure mode is simple. Ambiguity is cheap to create and expensive to debug.
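The rewriter's contract can be shown with a deliberately dumb stand-in. The sketch below is a regex substitution, not Flan-T5-small and not the paper's method; the function name and pronoun lists are assumptions. It only illustrates the input and output the real module is expected to honor: an ambiguous sub-question plus a resolved entity go in, an entity-explicit sub-question comes out.

```python
import re

# Hypothetical pronoun lists. "her" is left out on purpose because it is
# ambiguous between object and possessive, which a regex cannot resolve.
PRONOUN = re.compile(r"\b(he|she|they|him|them|it|his|their|its)\b", re.IGNORECASE)
POSSESSIVES = {"his", "their", "its"}


def rewrite_sub_question(sub_question: str, entity: str) -> str:
    """Replace pronouns with the resolved entity, adding 's for possessives."""
    def substitute(match: re.Match) -> str:
        word = match.group(0).lower()
        return f"{entity}'s" if word in POSSESSIVES else entity

    return PRONOUN.sub(substitute, sub_question)


print(rewrite_sub_question("Where was he born?", "Albert Einstein"))
# Where was Albert Einstein born?
print(rewrite_sub_question("What was his first job?", "Albert Einstein"))
# What was Albert Einstein's first job?
```

The places where a rule like this breaks (ambiguous "her", multiple candidate entities, references such as "the company") are presumably the reason a small learned model is used instead.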
These modules matter because CompactRAG removes the LLM from the middle of the reasoning chain. If the system no longer asks a large model to manage every hop, it needs smaller modules to preserve continuity. Otherwise, the decomposition stage can produce sub-questions that are individually grammatical but operationally useless.
The paper’s ablation study supports this interpretation. Removing the rewriter lowers accuracy across all three benchmarks. Removing both the extractor and rewriter lowers it further.
| Configuration | HotpotQA Acc | 2WikiMultiHopQA Acc | MuSiQue Acc | Likely purpose of test |
|---|---|---|---|---|
| CompactRAG full | 70.4 | 53.2 | 41.2 | Main system configuration |
| Without rewriter | 63.2 | 48.8 | 35.8 | Ablation: isolates entity-rewriting value |
| Without extractor and rewriter | 58.4 | 44.2 | 32.6 | Ablation: tests whether raw decomposed sub-questions are enough |
The result is not subtle. The rewriter is not a cosmetic text-cleaning module. It is part of the grounding mechanism. The extractor and rewriter together make it possible to reduce LLM calls without simply throwing away multi-hop discipline.
This is the right way to read the ablation: it does not prove that these are the best possible lightweight modules. It shows that, in these experiments, CompactRAG’s efficiency depends on replacing repeated LLM reasoning with some form of local entity-preserving machinery. Remove that machinery, and the architecture starts paying for its simplicity in accuracy.
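The size of that payment can be read straight off the ablation table. The snippet below only subtracts the accuracy figures quoted above; the dictionary keys are labels chosen here for convenience.

```python
ACCURACY = {
    "full": {"HotpotQA": 70.4, "2WikiMultiHopQA": 53.2, "MuSiQue": 41.2},
    "no_rewriter": {"HotpotQA": 63.2, "2WikiMultiHopQA": 48.8, "MuSiQue": 35.8},
    "no_extractor_no_rewriter": {"HotpotQA": 58.4, "2WikiMultiHopQA": 44.2, "MuSiQue": 32.6},
}


def accuracy_drop(config: str, benchmark: str) -> float:
    """Accuracy points lost relative to the full system."""
    return round(ACCURACY["full"][benchmark] - ACCURACY[config][benchmark], 1)


for config in ("no_rewriter", "no_extractor_no_rewriter"):
    drops = {b: accuracy_drop(config, b) for b in ACCURACY["full"]}
    print(config, drops)
# no_rewriter {'HotpotQA': 7.2, '2WikiMultiHopQA': 4.4, 'MuSiQue': 5.4}
# no_extractor_no_rewriter {'HotpotQA': 12.0, '2WikiMultiHopQA': 9.0, 'MuSiQue': 8.6}
```

The table has no extractor-only ablation, so the extractor's contribution on its own cannot be isolated from these numbers. What can be said is that removing it on top of the rewriter costs a further 3.2 to 4.8 points.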
The main result: competitive accuracy at much lower token cost
CompactRAG is evaluated on HotpotQA, 2WikiMultiHopQA, and the answerable subset of MuSiQue. The authors sample 250 development-set questions from each benchmark, preserve the original distribution of question types and difficulty, and compare against Vanilla RAG, Self-Ask, IRCoT, and Iter-RetGen under shared retrieval settings.
The headline number is token use.
| Method | Token / sample |
|---|---|
| Vanilla RAG | 2.7K |
| Self-Ask | 6.9K |
| IRCoT | 10.2K |
| Iter-RetGen | 4.7K |
| CompactRAG | 1.9K |
CompactRAG uses about 1.9K tokens per query during inference. This is lower than Vanilla RAG’s 2.7K and far lower than the iterative baselines: 4.7K for Iter-RetGen, 6.9K for Self-Ask, and 10.2K for IRCoT.
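Expressed as relative savings, the gap is easier to compare across baselines. The calculation below uses only the per-sample figures from the table. Because those figures are rounded to the nearest 0.1K, the percentages are approximate.

```python
TOKENS_K = {
    "Vanilla RAG": 2.7,
    "Self-Ask": 6.9,
    "IRCoT": 10.2,
    "Iter-RetGen": 4.7,
    "CompactRAG": 1.9,
}


def reduction_pct(baseline: str) -> float:
    """Percentage of the baseline's per-sample tokens that CompactRAG avoids."""
    return round((1 - TOKENS_K["CompactRAG"] / TOKENS_K[baseline]) * 100, 1)


savings = {name: reduction_pct(name) for name in TOKENS_K if name != "CompactRAG"}
print(savings)
# {'Vanilla RAG': 29.6, 'Self-Ask': 72.5, 'IRCoT': 81.4, 'Iter-RetGen': 59.6}
```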
Accuracy is more nuanced, and this is where the paper is more interesting than a simple “faster and better” story.
Against IRCoT and Iter-RetGen, and with either LLaMA-3.1-8B or GPT-4 as the reader that constructs the offline QA base, CompactRAG achieves:
| Method | HotpotQA F1 / Acc | 2WikiMultiHopQA F1 / Acc | MuSiQue F1 / Acc |
|---|---|---|---|
| IRCoT | 48.95 / 65.20 | 48.99 / 48.80 | 29.08 / 32.40 |
| Iter-RetGen | 52.24 / 72.40 | 59.73 / 61.20 | 32.42 / 40.00 |
| CompactRAG, LLaMA-3.1-8B Reading | 66.21 / 70.40 | 49.62 / 53.20 | 37.63 / 41.20 |
| CompactRAG, GPT-4 Reading | 69.54 / 77.20 | 55.67 / 57.20 | 42.34 / 43.60 |
On HotpotQA, CompactRAG with LLaMA-based offline reading trails Iter-RetGen slightly on LLM-judged accuracy, 70.40 versus 72.40, but it has much higher F1, 66.21 versus 52.24, while using far fewer tokens. With GPT-4 used for offline QA construction, CompactRAG reaches 77.20 accuracy.
On 2WikiMultiHopQA, Iter-RetGen remains stronger than CompactRAG in the LLaMA-reading setting: 61.20 accuracy versus 53.20. CompactRAG is not a universal accuracy winner. That should be said plainly, because pretending otherwise is how research summaries turn into vendor brochures. However, the GPT-4-reading variant narrows the gap, reaching 57.20 accuracy.
On MuSiQue, CompactRAG performs best among the listed systems in both LLaMA-reading and GPT-4-reading configurations: 41.20 and 43.60 accuracy respectively, compared with 40.00 for Iter-RetGen and 32.40 for IRCoT.
So the correct interpretation is not “CompactRAG dominates every baseline.” It does not.
The better interpretation is this: CompactRAG offers a favorable accuracy-cost tradeoff. It gives competitive multi-hop QA performance while sharply reducing online token consumption. In some settings it wins outright; in others it trades a little accuracy for a large efficiency gain. That trade is often exactly what production systems care about, assuming the lost accuracy is acceptable for the task.
Research papers chase best scores. Businesses chase usable systems. These are related activities, but they are not the same sport.
The GPT-4 offline variant reveals where quality enters the system
One of the most useful parts of the paper is the comparison between two CompactRAG variants: one where the atomic QA base is built using LLaMA-3.1-8B, and another where it is built using GPT-4.
The online pipeline follows the same CompactRAG design in both variants; what changes is the quality of the offline reader. The GPT-4-reading variant improves results across all three benchmarks:
| Benchmark | LLaMA-reading Acc | GPT-4-reading Acc | Interpretation |
|---|---|---|---|
| HotpotQA | 70.40 | 77.20 | Better offline QA construction improves downstream reasoning |
| 2WikiMultiHopQA | 53.20 | 57.20 | Gains appear, but still trail Iter-RetGen (61.20) |
| MuSiQue | 41.20 | 43.60 | Compact QA quality helps on harder composed questions |
This is not just an upper-bound curiosity. It tells us where the system’s bottleneck may sit.
If the atomic QA base is noisy, incomplete, or poorly grounded, online retrieval cannot magically recover all missing structure. CompactRAG shifts work offline, but the offline work has to be good. A badly compacted knowledge base is not efficient. It is just wrong faster.
For enterprise adoption, this suggests a concrete design principle: treat the offline QA construction stage as an asset-building process, not a disposable preprocessing script. It needs evaluation, versioning, refresh logic, and probably human review for high-stakes domains.
In other words, CompactRAG does not eliminate quality assurance. It moves quality assurance to a place where it can be done repeatedly, systematically, and before the user is staring at a loading spinner.
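Refresh logic is the least glamorous item on that list and the easiest to sketch. The following assumes a hypothetical bookkeeping scheme in which each document's content hash is stored when its QA pairs are generated; the function names and document IDs are invented. The paper does not describe how it handles updates, so this is one plausible pattern, not its method.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable content hash used to detect that a document has changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_refresh(current_docs: dict[str, str], indexed: dict[str, str]):
    """current_docs maps doc_id -> text; indexed maps doc_id -> fingerprint
    recorded at the last compaction. Returns (regenerate, retire) lists."""
    regenerate = sorted(doc_id for doc_id, text in current_docs.items()
                        if indexed.get(doc_id) != fingerprint(text))
    retire = sorted(doc_id for doc_id in indexed if doc_id not in current_docs)
    return regenerate, retire


indexed = {
    "policy-a": fingerprint("v1"),
    "policy-b": fingerprint("old wording"),
    "policy-c": fingerprint("withdrawn"),
}
current = {"policy-a": "v1", "policy-b": "new wording", "policy-d": "brand new"}

print(plan_refresh(current, indexed))
# (['policy-b', 'policy-d'], ['policy-c'])
```

Only the documents in `regenerate` need another pass through the offline reader, and QA pairs derived from anything in `retire` should be dropped. That keeps the refresh cost proportional to churn rather than to corpus size.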
What each experiment actually supports
The paper includes main benchmark comparisons, token-efficiency analysis, ablations, and additional appendix token plots. These do not all support the same kind of claim.
| Evidence | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Main benchmark table | Main evidence and comparison with prior work | CompactRAG is competitive on multi-hop QA while using fewer online tokens | That CompactRAG is always more accurate than iterative RAG |
| Token / sample comparison | Efficiency evidence | Fixed two-call design substantially lowers inference token use | Total lifecycle cost under every corpus-update schedule |
| Cumulative token plots | Deployment-style cost analysis | Offline preprocessing can be amortized as query volume grows | Exact break-even point for arbitrary enterprise corpora |
| Per-query token plots | Robustness/sensitivity of efficiency behavior | CompactRAG has lower and more stable per-query token consumption | That latency will always be lower in every hardware setup |
| Extractor / rewriter ablation | Ablation | Local grounding modules are important for preserving accuracy | That RoBERTa and Flan-T5-small are uniquely optimal choices |
| GPT-4-reading variant | Upper-bound / quality sensitivity test | Better offline QA generation improves downstream performance | That every organization should use GPT-4 for preprocessing |
This distinction matters because the business reader should not walk away with the cartoon version of the result.
The paper directly shows lower inference token consumption and competitive benchmark performance under the authors’ experimental setup. Cognaptus can infer that this architecture may be attractive for high-volume enterprise QA over stable corpora. But the paper does not prove a universal cost advantage for every deployment. If the corpus changes hourly, or if the organization asks only a few multi-hop queries per month, the offline compaction cost may not amortize well.
That boundary does not weaken the paper. It makes the paper usable.
The business value is cost predictability, not just lower cost
The obvious business implication is cheaper RAG. That is true, but incomplete.
The more important implication is cost predictability.
A production AI system is not evaluated only by average token usage. It is evaluated by whether its operating cost can be forecast, whether latency is stable, whether failures can be diagnosed, and whether the architecture can be maintained without turning into a procedural swamp.
CompactRAG improves the shape of the system in four ways.
| Technical design choice | Operational consequence | ROI relevance |
|---|---|---|
| Offline atomic QA construction | Reusable compact knowledge layer | Converts repeated inference cost into amortizable preparation cost |
| Two fixed online LLM calls | More predictable token budget | Easier pricing and capacity planning |
| Lightweight extractor and rewriter | Less dependence on repeated LLM reasoning | Lower online serving cost and potentially lower latency |
| Entity-explicit rewriting | Reduced entity drift across hops | Fewer wrong answers caused by ambiguous sub-questions |
For a company building an internal analyst assistant, legal search tool, technical support copilot, procurement Q&A system, or research knowledge assistant, this is the kind of architecture that deserves attention. Not because it is glamorous. It is not. It is closer to good database design wearing an LLM jacket.
But glamour is a poor deployment criterion. Stable cost curves usually age better.
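A toy cost model shows why the shape of the curve matters more than the average. Every number below is hypothetical and chosen only to illustrate the two growth patterns: an iterative pipeline that re-sends accumulated evidence on each hop, and a two-call pipeline where only the short resolved QA evidence grows. None of these parameters come from the paper.

```python
def iterative_tokens(hops: int, prompt_tokens: int = 800, evidence_tokens: int = 600) -> int:
    # Hypothetical: one LLM call per hop, each re-sending the prompt
    # plus all evidence gathered so far.
    return sum(prompt_tokens + evidence_tokens * step for step in range(1, hops + 1))


def compact_tokens(hops: int, decompose_tokens: int = 500,
                   synth_tokens: int = 500, qa_tokens: int = 150) -> int:
    # Hypothetical: two LLM calls in total; only compact QA evidence scales with hops.
    return decompose_tokens + synth_tokens + qa_tokens * hops


workload = [1, 2, 2, 3, 4]  # hypothetical hop counts for five queries

iterative = [iterative_tokens(h) for h in workload]
compact = [compact_tokens(h) for h in workload]

print(iterative, max(iterative) - min(iterative))
# [1400, 3400, 3400, 6000, 9200] 7800
print(compact, max(compact) - min(compact))
# [1150, 1300, 1300, 1450, 1600] 450
```

Under these assumptions the iterative spread between the cheapest and most expensive query is 7,800 tokens, while the two-call spread is 450. A capacity planner can budget for the second curve; the first one requires guessing how hard users' questions will be.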
Where CompactRAG fits best
CompactRAG is most attractive when four conditions hold.
First, the corpus is large enough and queried often enough that offline preprocessing can be amortized. A small internal FAQ with low query volume probably does not need this machinery. A large technical-documentation base queried thousands of times per week is a better candidate.
Second, the corpus is relatively stable. CompactRAG can handle updates in principle, but frequent document churn complicates the economics. If the knowledge base changes constantly, the system needs incremental regeneration, stale-QA detection, and version control.
Third, questions often require entity chaining. CompactRAG is designed for multi-hop reasoning. If users mostly ask single-hop lookup questions, simpler RAG may be sufficient.
Fourth, the organization can evaluate the compacted QA base. This is crucial. The offline layer becomes a derived knowledge asset. If the generation process omits important details, creates awkward QA pairs, or loses domain-specific nuance, the online pipeline inherits those errors.
The best fit is not “every RAG system.” The best fit is high-volume, multi-hop, knowledge-intensive QA over corpora that can be periodically compacted and tested.
Where the paper leaves uncertainty
CompactRAG’s limitations are not unusual, but they matter for implementation.
The evaluation uses sampled benchmark subsets of 250 questions per dataset. That is reasonable given LLM inference cost, but it still leaves open how performance scales across broader real-world query distributions.
The corpora in the experiments are benchmark contexts, not messy enterprise repositories full of PDFs, slide decks, duplicated pages, outdated policy fragments, and table-heavy documents. Anyone who has worked with internal knowledge bases knows that “retrieve from corpus” becomes much less elegant once the corpus includes scanned annexes and meeting notes named “final_final_v3_updated_real.docx.”
The paper also shows that offline reader quality matters. The GPT-4-reading variant improves accuracy, which is useful evidence but also a reminder: the system’s knowledge compaction stage is not free from model-quality constraints. A weaker reader may produce weaker atomic QA units. A stronger reader may cost more upfront.
Finally, the token metric focuses on inference consumption. For deployment, total cost should include offline preprocessing, embedding, storage, refresh cycles, local model serving, monitoring, and evaluation. CompactRAG’s economic argument is strongest when query volume is high enough to spread the offline cost across many interactions.
None of these boundaries undermine the central result. They define the deployment envelope.
The real lesson: stop treating online reasoning as the only place intelligence can live
CompactRAG’s deeper message is architectural. RAG systems do not need to place all intelligence inside the online LLM call. Some intelligence can live in the corpus representation. Some can live in lightweight modules. Some can live in retrieval design. The LLM should not be treated as the only component allowed to think.
This is a useful correction to a common misconception: that multi-hop RAG must repeatedly call an LLM because each hop is inherently a reasoning event. CompactRAG suggests a more disciplined view. Some hops require reasoning. Some require retrieval. Some require entity grounding. Some require rewriting a sloppy sub-question so that the retriever does not wander into the woods.
When those jobs are separated, the system becomes less theatrical and more efficient.
That is the part enterprise AI teams should notice. The future of RAG may not be one giant model doing everything in a longer prompt. It may be a pipeline where expensive reasoning is reserved for the places where it is actually needed.
How radical. A system design principle from before everyone decided prompt length was a personality trait.
Conclusion: boring efficiency is still efficiency
CompactRAG is not a claim that multi-hop QA is solved. It is not a universal replacement for iterative RAG. It is not proof that every enterprise should immediately convert its document base into atomic QA pairs and celebrate with a dashboard.
Its contribution is more specific and more useful.
The paper shows that a two-call multi-hop RAG pipeline, supported by offline atomic QA construction and lightweight entity-preserving modules, can achieve competitive benchmark performance while reducing online token consumption substantially. It reframes the cost problem from “how do we make every reasoning hop cheaper?” to “why are we paying the LLM to handle every hop in the first place?”
For businesses, that reframing is the value. Not because it guarantees lower cost in every setting, but because it offers a practical design pattern: move reusable work offline, keep online inference lean, and protect entity grounding with smaller specialized components.
Multi-hop reasoning does not have to be a token bonfire. Sometimes it only looks that way because the architecture keeps handing the matchbox to the most expensive model in the room.
Cognaptus: Automate the Present, Incubate the Future.
[1] Hao Yang, Zhiyu Yang, Xupeng Zhang, Wei Wei, Yunjie Zhang, and Lin Yang, “CompactRAG: Reducing LLM Calls and Token Overhead in Multi-Hop Question Answering,” arXiv:2602.05728, 2026. https://arxiv.org/html/2602.05728