TL;DR for operators
A standard RAG system often retrieves the most individually relevant chunks. That is useful until the question needs several different pieces of evidence that must work together. Then the system may return five near-duplicates of the most obvious fact and miss the less obvious fact that actually completes the answer. Excellent. We have reinvented the meeting where everyone brings the same slide.
The paper behind this article, Reinforcing Compositional Retrieval, argues that retrieval should be treated as coordinated selection: the next item retrieved should depend on what has already been retrieved, because the value of a context set is collective, not merely additive.[1] Its method, RCR, models retrieval as a sequential decision process, uses a tri-encoder retriever to represent the query, already-selected examples, and candidates separately, then trains the policy first with coverage-oriented supervision and later with reinforcement learning.
What the paper directly shows: on compositional semantic parsing benchmarks such as GeoQuery and COVR-10, RCR improves exact-match performance over top-$k$ and several sequential retrieval baselines, with ablations showing that both the tri-encoder architecture and reinforcement learning stage matter.
What Cognaptus infers for business use: retrieval systems serving compound enterprise questions should be evaluated on evidence coverage, complementarity, and answer sufficiency—not only nearest-neighbour similarity. In compliance, finance, research, technical support, and business intelligence, the expensive failure is rarely “we retrieved nothing.” It is “we retrieved something plausible and incomplete.”
What remains uncertain: the paper is not a production enterprise RAG benchmark. It uses semantic parsing and in-context examples, not messy SharePoint folders, drifting policy documents, or multilingual vendor PDFs written by people who think version control is a lifestyle choice. The lesson transfers as a design principle; the reported numbers do not transfer automatically.
Search fails when the parts compete
Search is a deceptively simple word. A user asks a question, the system retrieves relevant material, the model writes an answer. That is the cheerful diagram. Reality is less polite.
Ask a RAG system, “Which supplier is exposed to both the new EU reporting requirement and the component recall mentioned in last quarter’s engineering review?” The system now needs two classes of evidence: regulatory exposure and product/component linkage. A single vector query may overweight one side. It might return several policy documents about EU reporting, all semantically close to the question, while missing the engineering review that makes the answer operationally useful.
This is not hallucination in the dramatic sense. The model is not necessarily inventing wildly. It is being grounded in an evidence set that is locally sensible and globally insufficient. That distinction matters. Many RAG teams still diagnose answer failure at the generation layer—prompt, model, temperature, response format—when the more boring culprit sits upstream: the retrieved context never contained the necessary composition.
Classical RAG already made the case for connecting language models to non-parametric memory rather than relying only on model weights.[2] The next layer of difficulty is not whether external memory helps. It does. The harder question is whether the retrieval layer can assemble the right combination of evidence when the answer depends on interaction among parts.
The misconception: better top-$k$ is not the same as better context
A common instinct is to increase $k$. If five chunks miss the answer, retrieve twenty. If twenty are noisy, add a reranker. If that still fails, upgrade embeddings. Each step may help. None changes the basic assumption: candidates are still mostly judged as individually relevant items.
Compositional retrieval challenges that assumption. The value of a retrieved item depends on what is already in the context. A document that is redundant at step four may have been useful at step one. A document with modest standalone similarity may be essential because it covers the missing structural piece. The retrieval objective therefore shifts from “Which chunks are closest to the query?” to “Which sequence of chunks produces a useful context set?”
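The paper formalises this by treating retrieval as conditional selection. In notation assumed here, with $x$ the query and $z_1, \dots, z_k$ the retrieved items, the probability of a context set factorises into step-wise conditionals rather than a product of independent relevance scores:

$$
p(z_1, \dots, z_k \mid x) = \prod_{i=1}^{k} p(z_i \mid x, z_1, \dots, z_{i-1})
$$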
That small change carries a large operational implication. Retrieval is no longer a static ranking problem. It becomes a policy: given the question and the current partial context, choose the next piece that adds the most useful missing information.
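To make the contrast concrete, here is a minimal sketch of static top-$k$ ranking next to step-wise conditional selection. Everything in it is an assumption for illustration: the chunk names, the similarity scores, the facet tags, and above all `conditional_score`, which is a hand-written stand-in for what RCR learns as a policy.

```python
from typing import Dict, List, Set

# Toy corpus: similarity scores and facet tags are made up for illustration.
CHUNKS: Dict[str, dict] = {
    "policy_a": {"similarity": 0.92, "facets": {"eu_reporting"}},
    "policy_b": {"similarity": 0.90, "facets": {"eu_reporting"}},
    "policy_c": {"similarity": 0.88, "facets": {"eu_reporting"}},
    "eng_review": {"similarity": 0.61, "facets": {"component_recall"}},
    "supplier_map": {"similarity": 0.55, "facets": {"supplier_link"}},
}


def top_k(k: int) -> List[str]:
    """Static ranking: every chunk is scored against the query alone."""
    ranked = sorted(CHUNKS, key=lambda name: CHUNKS[name]["similarity"], reverse=True)
    return ranked[:k]


def conditional_score(name: str, selected: List[str]) -> float:
    """Hypothetical scoring rule, not the paper's learned policy: similarity
    plus one point per facet that the selected chunks do not yet cover."""
    covered: Set[str] = set().union(*(CHUNKS[s]["facets"] for s in selected))
    return CHUNKS[name]["similarity"] + len(CHUNKS[name]["facets"] - covered)


def sequential_select(k: int) -> List[str]:
    """Pick one chunk per step, rescoring the rest against the current state."""
    selected: List[str] = []
    for _ in range(min(k, len(CHUNKS))):
        remaining = [name for name in CHUNKS if name not in selected]
        selected.append(max(remaining, key=lambda name: conditional_score(name, selected)))
    return selected


print(top_k(3))              # ['policy_a', 'policy_b', 'policy_c']
print(sequential_select(3))  # ['policy_a', 'eng_review', 'supplier_map']
```

The static ranker returns three policy documents. The sequential one returns one policy document and then goes looking for what is missing, which is the whole argument in eleven lines of logic.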
This is why the paper’s framing is stronger than the usual “decompose the question” advice. Decomposition is one way to expose missing parts. But RCR goes further: it models dependency among selected examples themselves. The system is not merely breaking the question apart; it is learning how the pieces should accumulate.
The mechanism: a tri-encoder remembers what has already been chosen
RCR uses a tri-encoder architecture. One encoder represents the query. Another represents the context already selected. A third represents candidate examples. That separation is not decorative engineering. It lets the model score candidates based on both the original task and the current retrieval state.
A single encoder over a concatenated query-and-context could blur these roles. A bi-encoder can compare query and candidate efficiently but struggles to represent how previously selected items change the utility of the next one. The tri-encoder gives retrieval a memory of its own sequence without asking a large language model to rescore everything at every step.
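The three roles can be sketched without any neural network at all. The version below is only a shape, not the paper's model: the encoders are bag-of-words counters rather than trained networks, and the way the three vectors are combined (query match minus overlap with the selected context, weighted by a hypothetical `redundancy_weight`) is a hand-set assumption that resembles maximal marginal relevance. In RCR the encoders and their interaction are learned.

```python
from collections import Counter
from typing import Dict, List

Vector = Dict[str, float]


def encode_query(text: str) -> Vector:
    """Stand-in for the query encoder."""
    return {tok: float(n) for tok, n in Counter(text.lower().split()).items()}


def encode_candidate(text: str) -> Vector:
    """Stand-in for the candidate encoder."""
    return {tok: float(n) for tok, n in Counter(text.lower().split()).items()}


def encode_context(selected: List[str]) -> Vector:
    """Stand-in for the context encoder: mean token counts of selected items."""
    if not selected:
        return {}
    total: Counter = Counter()
    for text in selected:
        total.update(text.lower().split())
    return {tok: n / len(selected) for tok, n in total.items()}


def dot(a: Vector, b: Vector) -> float:
    return sum(value * b.get(tok, 0.0) for tok, value in a.items())


def tri_score(query: str, selected: List[str], candidate: str,
              redundancy_weight: float = 1.0) -> float:
    """Score a candidate against both the query and the retrieval state."""
    q, s, c = encode_query(query), encode_context(selected), encode_candidate(candidate)
    return dot(q, c) - redundancy_weight * dot(s, c)


query = "eu reporting component recall supplier"
first = "eu reporting requirement"
duplicate = "eu reporting deadline"
complement = "component recall engineering review"

# With nothing selected, the two candidates tie on query match alone.
print(tri_score(query, [], duplicate), tri_score(query, [], complement))
# Once `first` is in the context, the near-duplicate loses its appeal.
print(tri_score(query, [first], duplicate), tri_score(query, [first], complement))
```

The point of the sketch is the signature: a bi-encoder has no `selected` argument to condition on, so it cannot notice that the second policy document stopped being useful the moment the first one arrived.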
The process has two training stages.
First, supervised fine-tuning constructs sequential training examples using local structure coverage. In the semantic parsing setting, the target output is a logical form. The method selects examples that cover useful program structures, then prioritises examples that add structures not yet covered. This creates a retrieval curriculum: pick something useful, then pick something complementary, then pick what remains missing.
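The curriculum described above behaves like a greedy set-cover pass. The sketch below assumes each example has already been reduced to a set of structure labels; the labels and example names are invented here, and the paper's actual notion of local structures is richer than a flat set of strings.

```python
from typing import Dict, List, Set


def coverage_order(target: Set[str], candidates: Dict[str, Set[str]],
                   max_steps: int) -> List[str]:
    """Greedy ordering: at each step take the example that covers the most
    target structures still uncovered. Ties go to the alphabetically first name."""
    uncovered = set(target)
    pool = dict(candidates)
    order: List[str] = []
    while uncovered and pool and len(order) < max_steps:
        best = max(sorted(pool), key=lambda name: len(pool[name] & uncovered))
        gain = pool[best] & uncovered
        if not gain:  # nothing left in the pool adds coverage
            break
        order.append(best)
        uncovered -= gain
        del pool[best]
    return order


# Hypothetical structure labels for a GeoQuery-style target program.
target_structures = {"answer", "largest", "state", "borders", "river"}
example_pool = {
    "ex1": {"answer", "largest", "state"},
    "ex2": {"answer", "largest", "city"},
    "ex3": {"borders", "state"},
    "ex4": {"river", "answer"},
}

print(coverage_order(target_structures, example_pool, max_steps=4))
# ['ex1', 'ex3', 'ex4']
```

Note that `ex2` never appears: it looks relevant on its own, but everything useful in it is already covered by `ex1`. That ordering is what becomes the supervised target sequence.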
Second, reinforcement learning refines the retriever using a task-specific reward. The paper uses local structural similarity between generated and reference programs, then applies Group Relative Policy Optimization (GRPO) to improve the retrieval policy. The key point is not that GRPO is fashionable this quarter. The key point is that the retriever is optimised against downstream usefulness, not merely against similarity labels.
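The core of GRPO is easy to show in isolation: sample a group of retrieval sequences for the same query, score each with the downstream reward, and normalise each reward against its own group. The sketch below covers only that advantage step. The reward values are invented, the use of population standard deviation is an assumption (implementations differ), and the policy-gradient update, clipping, and any KL term are omitted.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalise each reward against the mean and spread of its own group.
    No separate value network is needed: the group is the baseline."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four retrieval sequences sampled for one query,
# e.g. structural similarity between generated and reference programs.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Sequences that beat their siblings get a positive advantage and are reinforced; those that trail get a negative one. If every sequence in the group earns the same reward, all advantages are zero and the policy learns nothing from that query, which is one reason a sensible supervised starting point matters.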
| Component | What it changes | Why it matters |
|---|---|---|
| Sequential retrieval | Each retrieval step depends on previous selections | Reduces redundant context and encourages complementarity |
| Tri-encoder policy | Separates query, selected context, and candidate representation | Makes inter-example dependency easier to model |
| Coverage-based SFT | Initializes retrieval around missing structural parts | Gives RL a sane starting point, rather than asking it to discover everything from chaos |
| GRPO refinement | Aligns retrieval with downstream generation reward | Optimizes for answer-producing context, not just attractive neighbours |
This is the article’s central mechanism: the retriever learns that context quality is a property of a set, and that sets are built one decision at a time.
The evidence says the retrieval policy matters, not just the generator
The experiments focus on compositional generalization in semantic parsing. GeoQuery asks natural-language geography questions that must be translated into logical forms. COVR-10 is a synthetic benchmark built around compositional visual-style reasoning structures. These are not enterprise document QA tasks, but they are useful stress tests because the target output depends on recombining familiar structures in unfamiliar ways.
The paper reports exact-match accuracy across multiple GeoQuery splits and COVR-10. RCR reaches 78.21 on the GeoQuery i.i.d. split, 59.64 on Template 1, 33.83 on Template 2, 48.48 on Template 3, and 36.18 average on COVR-10. More important than any single number is the pattern: the full RCR system outperforms most top-$k$ and sequential retrieval baselines, while ablations show that removing either supervised initialization or reinforcement learning weakens the method.
The RL comparison is especially informative. On GeoQuery, GRPO performs best across the reported splits compared with several alternative advantage estimators. For example, on Template 2, RCR without RL scores 28.39, while GRPO reaches 33.83, a gain of 5.44 points. On Template 1, the gap is 58.73 versus 59.64, or 0.91 points. The gains are uneven, which is exactly what one should expect if reinforcement learning is polishing a retrieval policy rather than magically replacing the need for good initial coverage.
The more revealing result is that direct RL without supervised initialization is weaker. That matters for operators because it punctures a convenient fantasy: reward optimization alone does not rescue a poorly structured retrieval system. The retriever needs a useful prior over what “coverage” means before downstream rewards can refine it. Otherwise, the system spends too much time learning not to trip over the furniture.
Multi-hop RAG has the same disease in a less controlled body
Although RCR is tested on semantic parsing, its diagnosis aligns with broader RAG evidence. MultiHop-RAG was built precisely because standard RAG benchmarks often underrepresent questions requiring multiple supporting documents, and its authors found existing RAG systems inadequate on multi-hop queries.[3] EfficientRAG similarly starts from the observation that multi-hop questions strain conventional RAG and proposes iterative retrieval without repeated LLM calls at every step.[4]
Question decomposition work takes a related path: split complex questions into subqueries, retrieve for each, merge candidates, then rerank to recover precision.[5] That approach is closer to many enterprise deployments because it can be added around existing retrievers. But it introduces its own risk. If decomposition loses a key entity, the retrieval chain can fail before the generator ever sees the right evidence. The “lost-in-retrieval” literature makes this failure explicit: sub-question decomposition can omit key entities, degrading retrieval and breaking the reasoning chain.[6]
RCR sits in this family but emphasizes a different control point. It is less about writing better subquestions and more about learning better context composition. In business language, it is the difference between assigning tasks to departments and managing the dependencies between their deliverables. The first is decomposition. The second is coordination. Only one of them prevents the final answer from arriving with three copies of the same spreadsheet and no legal review.
The business value is fewer missing pieces, not prettier answers
For enterprise AI, the attractive promise is not “more intelligent RAG.” That phrase now means everything and therefore almost nothing. The practical promise is narrower: fewer incomplete answers on compound questions.
Consider four common enterprise patterns.
| Workflow | Why standard retrieval struggles | What compositional retrieval suggests |
|---|---|---|
| Compliance review | Evidence may span regulation, policy, jurisdiction, product, and date | Optimize retrieval for coverage across required evidence categories |
| Financial analysis | The answer may require linking filings, market data, segment notes, and prior guidance | Penalize redundant retrieval and reward complementary source selection |
| Technical support | A fix may depend on product version, environment, error trace, and known incident history | Retrieve stepwise so later retrieval targets what earlier retrieval did not cover |
| Research synthesis | Multiple studies may address different mechanisms, populations, or methods | Build context sets around argumentative completeness, not keyword overlap |
This does not mean every query deserves a sequential retriever. Most enterprise questions are boring, and boring is where money is made. “What is the leave policy?” should not trigger a five-step retrieval ballet. But compound, high-stakes, multi-document questions should be routed differently.
The operational design pattern is straightforward:
- Classify the query as single-hop, multi-hop, comparative, temporal, or synthesis-oriented.
- Use simple retrieval for simple queries.
- Use decomposition, iterative retrieval, or compositional retrieval for compound queries.
- Evaluate the retrieved context before generation: does it cover the necessary entities, time periods, jurisdictions, or evidence types?
- Measure answer failure by missing evidence categories, not only answer correctness after the fact.
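The first three steps of that pattern amount to a router. A minimal sketch follows, assuming a keyword heuristic for classification; the cue lists and the `simple` and `compound` pipeline names are hypothetical, and a production classifier would more likely be a small trained model or an LLM call.

```python
import re

# Hypothetical cue patterns, checked in order; first match wins.
CUES = [
    ("comparative", r"\b(compare|compared|versus|difference between)\b"),
    ("synthesis", r"\b(summari[sz]e|overview|synthesi[sz]e)\b"),
    ("multi-hop", r"\b(both|as well as|mentioned in)\b"),
    ("temporal", r"\b(last quarter|last year|since|before|after)\b|\b(19|20)\d{2}\b"),
]


def classify_query(question: str) -> str:
    """Label a question as one of the five query types, defaulting to single-hop."""
    text = question.lower()
    for label, pattern in CUES:
        if re.search(pattern, text):
            return label
    return "single-hop"


def route(question: str) -> str:
    """Send single-hop questions to plain retrieval, everything else to the
    decomposition, iterative, or compositional pipeline."""
    return "simple" if classify_query(question) == "single-hop" else "compound"


for q in (
    "What is the leave policy?",
    "Which supplier is exposed to both the new EU reporting requirement "
    "and the component recall mentioned in last quarter's engineering review?",
):
    print(classify_query(q), "->", route(q))
```

The heuristic is crude on purpose. The design choice that matters is that routing happens before retrieval, so the leave-policy question never pays for the five-step ballet.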
The uncomfortable part is evaluation. Most organisations still test RAG by reading sample answers and deciding whether they “look right.” This is charming, in the same way inspecting a bridge by tapping it with a shoe is charming. For compositional workflows, teams need retrieval-level diagnostics: coverage, redundancy, missing-hop rate, evidence diversity, and whether retrieved chunks collectively support the answer.
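Two of those diagnostics, coverage and redundancy, fit in a few lines once each retrieved chunk has been tagged. The sketch below assumes such tags exist (producing them is the real work) and uses mean pairwise Jaccard overlap as the redundancy measure, which is one reasonable choice among several rather than a standard.

```python
from itertools import combinations
from typing import List, Set


def coverage(required: Set[str], chunk_facets: List[Set[str]]) -> float:
    """Share of required evidence categories present in the retrieved set."""
    if not required:
        return 1.0
    covered = set().union(*chunk_facets) & required
    return len(covered) / len(required)


def missing(required: Set[str], chunk_facets: List[Set[str]]) -> Set[str]:
    """Required categories that no retrieved chunk supplies."""
    return required - set().union(*chunk_facets)


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def redundancy(chunk_facets: List[Set[str]]) -> float:
    """Mean pairwise Jaccard overlap; 1.0 means every chunk says the same thing."""
    pairs = list(combinations(chunk_facets, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical tags for three retrieved chunks.
required = {"regulation", "component", "supplier"}
retrieved = [{"regulation"}, {"regulation"}, {"component"}]

print(round(coverage(required, retrieved), 2))  # 0.67
print(missing(required, retrieved))             # {'supplier'}
print(round(redundancy(retrieved), 2))          # 0.33
```

Run before generation, a check like this turns "the answer looked wrong" into "the supplier evidence was never retrieved", which is a far more useful thing to put in a ticket.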
What the paper directly shows, and what it does not
A disciplined reading separates evidence from extrapolation.
| Layer | Directly shown by the paper | Cognaptus inference | Boundary |
|---|---|---|---|
| Retrieval objective | Sequential, dependency-aware retrieval improves performance on tested compositional semantic parsing tasks | Enterprise RAG should treat context assembly as a coordination problem | Not yet proven on real corporate document stores |
| Model architecture | Tri-encoder retrieval helps model query, selected examples, and candidates separately | Retrieval systems may need explicit state, not just stronger embeddings | Architecture may be unnecessary for simpler QA |
| Training strategy | Coverage-based SFT plus RL outperforms weaker ablations | Reward-driven retrieval works best when initialized with domain-relevant structure | Requires a measurable notion of coverage or downstream reward |
| Practical efficiency | The method uses a small number of retrieval steps and discusses linear scaling in steps | Selective routing can contain cost | Dynamic corpora and frequent index updates remain harder |
The limitation section of the paper is unusually relevant for deployment. The authors note that the retriever cannot dynamically incorporate new entries after training in the same way a standard index can. They also note that more retrieval steps increase cost and instability due to compounding errors. These are not small details. In an enterprise setting, documents change constantly. Contracts are amended. Policies are superseded. Someone uploads final_final_v7.pdf because civilisation remains fragile.
So the business lesson is not “train RCR and replace your vector database.” It is: stop assuming top-$k$ similarity is the correct abstraction for compound questions. Use RCR as evidence that retrieval policies can be trained around context complementarity, then decide how much of that idea your system can afford.
A practical adoption path: start with diagnostics before architecture
The lowest-risk implementation is not to build a tri-encoder retriever on Monday. It is to instrument current RAG failures.
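Start by sampling failed answers and labeling the defect behind each one. Most rows below are retrieval defects; the last is a generation failure, included so that it is not misread as a retrieval problem: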
| Failure type | Diagnostic question |
|---|---|
| Missing entity | Did the retrieved context omit one of the entities required by the question? |
| Missing relation | Did it retrieve the entities but miss the relationship connecting them? |
| Redundant context | Are the top chunks mostly paraphrases of the same fact? |
| Temporal mismatch | Did it retrieve correct evidence from the wrong period? |
| Source-type imbalance | Did it over-retrieve policies while missing transactions, tickets, filings, or emails? |
| Composition failure | Was all evidence present, but the model failed to combine it correctly? |
Only after that diagnosis should architecture change. If failures are mostly redundancy and missing categories, use query decomposition plus reranking. If failures are multi-step and dependency-heavy, test iterative retrieval. If the workflow has stable task structure and measurable downstream outcomes, consider training a compositional retriever or policy layer.
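That decision rule can be run as a tally over labeled failures. In the sketch below, the grouping of failure types into families is an assumption made for illustration (for instance, filing temporal mismatch under coverage), and the function only reports the dominant family rather than making the architecture decision for anyone.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical grouping of the failure labels into three families.
FAMILY = {
    "missing entity": "coverage",
    "redundant context": "coverage",
    "source-type imbalance": "coverage",
    "temporal mismatch": "coverage",
    "missing relation": "dependency",
    "composition failure": "generation",
}

NEXT_STEP = {
    "coverage": "query decomposition plus reranking",
    "dependency": "iterative retrieval",
    "generation": "inspect the generator, not the retriever",
}


def dominant_family(labels: Iterable[str]) -> Tuple[str, str]:
    """Return the most frequent failure family and the suggested next step."""
    counts = Counter(FAMILY[label] for label in labels)
    if not counts:
        raise ValueError("no labeled failures to tally")
    family, _ = counts.most_common(1)[0]
    return family, NEXT_STEP[family]


sample = [
    "redundant context",
    "missing entity",
    "missing relation",
    "redundant context",
    "composition failure",
]
print(dominant_family(sample))
# ('coverage', 'query decomposition plus reranking')
```

Training a compositional retriever is deliberately absent from `NEXT_STEP`. It depends on stable task structure and a measurable reward, neither of which a failure tally can tell you.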
That last condition matters. RCR’s reward depends on structural similarity between generated and reference programs. Enterprise workflows rarely have such clean labels. Some do: SQL generation, code retrieval, form completion, compliance checklist mapping, contract clause extraction, claims adjudication. Others do not. For open-ended strategy questions, the reward signal is messier and may require human review, weak supervision, or rubric-based evaluation. Nobody escapes evaluation. They merely postpone it until the system embarrasses them in production.
The future of RAG is not bigger context alone
Longer context windows are useful. Better embeddings are useful. Rerankers are useful. But none of them fully solves the coordination problem. A larger context window can hold more documents; it does not guarantee the system selected the right mixture. Bigger buckets do not make the ingredients better. They just make the soup harder to audit.
Compositional retrieval points toward a more mature architecture: retrieval systems that know what has already been covered, what remains missing, and whether the next document adds new evidence or merely flatters the similarity score. This is less glamorous than agentic fireworks. It is also closer to what production AI needs.
The paper’s real contribution is not that it makes RAG “smarter” in some vague conference-demo sense. It changes the unit of retrieval quality from the individual chunk to the assembled context. That is the right unit for many business questions because decisions are rarely made from one perfect paragraph. They are made from fragments that must be brought into the same room, forced to acknowledge each other, and arranged into something defensible.
RAG began as a way to connect language models to external knowledge. Compositional retrieval asks the next, less comfortable question: once the model can retrieve knowledge, can it retrieve the right combination of knowledge?
For operators, that is where the work begins.
Cognaptus: Automate the Present, Incubate the Future.
1. Quanyu Long, Jianda Chen, Zhengyuan Liu, Nancy F. Chen, Wenya Wang, and Sinno Jialin Pan, “Reinforcing Compositional Retrieval: Retrieving Step-by-Step for Composing Informative Contexts,” arXiv:2504.11420, 2025, https://arxiv.org/abs/2504.11420.
2. Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela, “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” arXiv:2005.11401, 2020, https://arxiv.org/abs/2005.11401.
3. Yixuan Tang and Yi Yang, “MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries,” arXiv:2401.15391, 2024, https://arxiv.org/abs/2401.15391.
4. Ziyuan Zhuang, Zhiyang Zhang, Sitao Cheng, Fangkai Yang, Jia Liu, Shujian Huang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, and Qi Zhang, “EfficientRAG: Efficient Retriever for Multi-Hop Question Answering,” arXiv:2408.04259, 2024, https://arxiv.org/abs/2408.04259.
5. Paul J. L. Ammann, Jonas Golde, and Alan Akbik, “Question Decomposition for Retrieval-Augmented Generation,” arXiv:2507.00355, 2025, https://arxiv.org/abs/2507.00355.
6. Rongzhi Zhu, Xiangyu Liu, Zequn Sun, Yiwei Wang, and Wei Hu, “Mitigating Lost-in-Retrieval Problems in Retrieval Augmented Multi-Hop Question Answering,” arXiv:2502.14245, 2025, https://arxiv.org/abs/2502.14245.