A RAG system usually fails in one of two annoyingly familiar ways.

It retrieves documents that are factually relevant but gives the model no clue about the task’s decision boundary. Or it retrieves labelled examples that show the decision pattern but are too parochial to help when the topic drifts. One source knows the world. The other knows the exam rubric. Naturally, many systems pick one and then pretend the compromise was strategy.

HF-RAG, proposed in HF-RAG: Hierarchical Fusion-based RAG with Multiple Sources and Rankers, is a small but useful reminder that retrieval architecture matters before the LLM ever begins to sound clever.1 The paper’s central move is not to invent a new generator, train another adapter, or baptise a new agentic loop. It asks a cleaner question: if labelled examples and unlabelled documents are both useful, how do we combine them without letting incompatible retrieval scores sabotage the context?

The answer is a two-stage fusion recipe. First, use several rankers within each source and fuse their outputs. Then, standardise the resulting labelled and unlabelled scores before merging the two lists. In other words: do not just throw every ingredient into the pot. Rank within each pantry. Calibrate across pantries. Then feed the model.

The real problem is not retrieval volume; it is score comparability

RAG discussions often obsess over how much context to retrieve. More passages, bigger windows, longer prompts, grander invoices. HF-RAG focuses on a more basic operational problem: two useful retrieval sources may not speak the same scoring language.

The paper treats two sources as complementary:

Source What it contributes What it can miss
Labelled examples Task semantics: similar claims, evidence patterns, and labels such as support, refute, or not-enough-information Broader factual coverage when the domain shifts
Unlabelled corpus documents Factual grounding from a wider knowledge source The task-specific mapping from evidence to verdict

In their experiments, the labelled source is the FEVER training set: claim-evidence-label examples. The unlabelled source is the 2018 Wikipedia dump used as the document collection for FEVER. The task is fact verification, where a model must classify a claim as supported, refuted, or not sufficiently evidenced.

This distinction matters because labelled retrieval and unlabelled retrieval are not merely two document collections. They are two different kinds of signal. A labelled example can teach the model how the task behaves. An unlabelled article can supply the facts that the labelled set never contained. In a stable domain, the labelled examples may carry a lot of weight. Under domain shift, they may become a beautifully organised memory of the wrong neighbourhood.

The paper’s motivating example is a climate claim about global warming and polar bear extinction. A labelled FEVER example about brown bears nearing extinction does not directly prove the claim, but it gives the model a task-shaped analogy. A Wikipedia passage about global warming provides broader factual context. Neither source is enough by itself. Together, they form a more useful context—provided the system can decide how much of each to include.

That last clause is where the engineering begins.

HF-RAG fuses rankers before it fuses sources

The first layer of HF-RAG is intra-source rank fusion. For each source—labelled and unlabelled—the system retrieves candidates using multiple rankers:

  • BM25, a sparse lexical retriever;
  • Contriever, a dense bi-encoder;
  • ColBERT, a late-interaction dense retriever;
  • MonoT5, a cross-encoder reranking pipeline using BM25 as the initial ranker.

Each retriever returns its own top candidates. HF-RAG then combines these ranker outputs separately within the labelled source and within the unlabelled source using Reciprocal Rank Fusion, or RRF.

The practical idea is simple. A document that appears near the top across several different retrieval models deserves attention. A document loved by only one ranker may still matter, but it should not automatically dominate the context because one scoring function had a moment.

The paper expresses this as an RRF-style score over each source:

$$ \theta_C(d)=\sum_{\theta \in \Theta}\frac{1}{\operatorname{rank}(L^{C,\theta}_k,d)} $$

Here, $C$ is the source, either labelled or unlabelled; $\Theta$ is the set of retrievers; and $L^{C,\theta}_k$ is the top-$k$ list returned by retriever $\theta$ for source $C$. If a document does not appear in a retriever’s list, it receives a very large rank, which makes its contribution small.

This first stage answers one problem: single-ranker brittleness. Sparse retrieval catches exact lexical overlap. Dense retrieval can catch semantic similarity. Late interaction and cross-encoder reranking bring their own biases and strengths. HF-RAG does not ask the team to crown one universal winner. Sensible. Universal winners in retrieval are mostly a procurement fantasy.

But after this stage, the system still has two fused lists: one labelled, one unlabelled. RRF has made rankers comparable within each source. It has not made sources comparable with each other.

Z-scores are the handshake between labelled and unlabelled evidence

HF-RAG’s second layer is inter-source fusion. This is where the paper earns its keep.

The labelled and unlabelled lists are non-overlapping. That makes another round of RRF unsuitable: RRF works by aggregating ranks for shared candidates across lists, but the candidates here come from different collections. A FEVER labelled example and a Wikipedia paragraph are not the same object. Their scores also live on different distributions.

HF-RAG therefore standardises the fused score distribution within each source using a z-score:

$$ \phi(\theta_C(d))=\frac{\theta_C(d)-\mu_C}{\sigma_C} $$

The merged context is then selected by taking the top-$k$ documents after putting both source-specific score distributions on this common scale.

This is the paper’s mechanism-first lesson. The clever part is not “use labelled and unlabelled data”, because many teams already know both are useful. The clever part is recognising that usefulness does not imply direct comparability. A high score in the labelled list and a high score in the unlabelled list may not mean the same thing. Z-score standardisation turns each document’s score into a relative measure: how strong is this candidate compared with other candidates from its own source?

That does not make z-scores magical. They are a calibration heuristic, not a divine revelation from the statistics department. The paper motivates the method through score-distribution assumptions used in information retrieval and preference comparison settings, but it does not prove that every production corpus will behave nicely. Still, the operational point is strong: source fusion needs a common scale, and a fixed source ratio is a blunt substitute.

This is not RAG-Fusion, and not Fusion-in-Decoder

A likely misreading is to file HF-RAG under the general “fusion” bucket and move on. That would be lazy, though pleasantly on-brand for many architecture diagrams.

HF-RAG is not RAG-Fusion in the query-diversification sense. RAG-Fusion methods typically generate query variants, retrieve multiple result lists, and merge them to improve coverage or diversity. HF-RAG does not primarily modify the query. Its focus is on combining multiple rankers and multiple evidence sources.

It is also not Fusion-in-Decoder. FiD-style approaches combine retrieved passages inside the decoder, often with training or task-specific adaptation. HF-RAG’s fusion happens before generation and is designed as an inference-time retrieval pipeline. That makes it attractive for teams that cannot, or do not want to, keep fine-tuning every time their domain shifts.

The distinction is not academic bookkeeping. It changes the business interpretation:

Reader belief Correction Why it matters
“Fusion means generating more query variants.” HF-RAG fuses rankers and sources, not query rewrites. The lever is retrieval architecture, not prompt expansion.
“Fusion means the generator learns how to combine evidence.” HF-RAG combines evidence before the generator sees it. Deployment can be training-free.
“Just choose the best retriever on validation.” HF-RAG shows multiple rankers can outperform single-ranker selection. Validation winners can be brittle under domain shift.
“Use a fixed labelled/unlabelled ratio.” Z-score fusion adapts the mixture claim by claim. Fixed ratios can underperform when domains shift.

That is the real business shape of the paper: less romance about smarter generators, more discipline about what gets placed in front of them.

The main evidence: macro-F1 improves, especially under domain shift

The experiments use FEVER as the in-domain dataset and Climate-FEVER and SciFact as out-of-domain evaluations. The models are trained or configured around FEVER and then tested on climate-related and scientific claims. The paper evaluates with macro-F1, which is appropriate for a three-class fact-verification setup where class balance and minority performance matter.

The main results compare HF-RAG against supervised fine-tuning baselines, zero-shot LLM prediction, labelled-only RAG, unlabelled-only RAG, per-source RRF variants, a fixed-ratio labelled/unlabelled ablation, and an oracle-style single-source/single-ranker bound called RAG-OptSel.

A compact reading of the results is below.

Setting HF-RAG macro-F1 Most relevant comparator Interpretation
FEVER / Llama 0.5744 0.5468 RAG-OptSel bound HF-RAG beats even the test-label-assisted single-source/single-ranker bound.
FEVER / Mistral 0.5628 0.5584 RAG-OptSel bound Gain is small but still positive against the bound.
Climate-FEVER / Llama 0.4838 0.4815 LU-RAG-$\alpha$ HF-RAG narrowly leads; this is not the dramatic case.
Climate-FEVER / Mistral 0.5019 0.5249 U-RAG-RRF HF-RAG does not win here; unlabelled multi-ranker fusion is stronger.
SciFact / Llama 0.4320 0.4012 U-RAG-RRF Clearer gain under stronger domain shift.
SciFact / Mistral 0.4341 0.4246 RAG-OptSel bound HF-RAG again exceeds the single-source/single-ranker bound.

The most important result is not that HF-RAG wins every cell. It does not. Climate-FEVER with Mistral is the obvious exception: U-RAG-RRF reaches 0.5249, while HF-RAG reaches 0.5019. The grown-up interpretation is that source fusion is useful but not automatically superior to the best single-source multi-ranker setup in every condition.

The more interesting pattern is where HF-RAG helps most. SciFact is farther from FEVER than Climate-FEVER is, both in domain and claim style. On SciFact, HF-RAG produces the clearest improvements: 0.4320 with Llama and 0.4341 with Mistral. That supports the paper’s claim that combining task-shaped labelled examples with broader unlabelled knowledge can improve out-of-domain generalisation.

The supervised fine-tuning baselines also provide context. RoBERTa, LoRA, and CORRECT remain below the HF-RAG results across the reported datasets. This does not mean fine-tuning is obsolete. Please do not print that on a slide. It means that, for this fact-verification setup, retrieval-side fusion offered better macro-F1 than the selected parameter-updating baselines and labelled-only prompt-encoder approach.

The ablation says fixed source recipes are too stiff

The key ablation is LU-RAG-$\alpha$, which mixes labelled and unlabelled candidates using a fixed proportion controlled by $\alpha$. The parameter is optimised using FEVER training data.

This ablation tests whether HF-RAG’s z-score fusion is actually useful, or whether the system could simply decide in advance to take, say, some labelled examples and some unlabelled documents. The answer is fairly unkind to fixed recipes. HF-RAG outperforms LU-RAG-$\alpha$ across the reported table.

That matters because fixed ratios are tempting in production. They are easy to explain, easy to configure, and easy to forget until they quietly become wrong. The problem is that the right mixture is query- and domain-dependent. A claim close to the labelled training distribution may benefit from examples that encode the task’s label logic. A claim from a distant scientific domain may need more external factual grounding.

HF-RAG’s z-score fusion is not “adaptive” in the sense of learning a policy from feedback. There is no reward loop. No bandit updates. No agent discovering wisdom through invoices. But it does adapt at inference time to the relative score distributions produced for a given claim. That is enough to make it more flexible than a static labelled/unlabelled split.

The robustness tests support the mechanism, but do not expand the thesis

The paper’s figures beyond the main table serve two different purposes.

First, Figure 3 is a comparison between retrieval quality and downstream fact-verification performance for unlabelled RAG variants. It uses nDCG@10 as the retrieval metric and F1 as the task metric. The purpose is not to prove that retrieval quality perfectly determines generation quality. It does not. The purpose is narrower: to show that multi-ranker fusion improves retrieval quality across the three datasets and that this tends to accompany downstream F1 gains.

That supports HF-RAG’s first stage. If better fused retrieval produces stronger candidate contexts, then RRF is not just decorative plumbing.

Second, Figure 4a is a sensitivity test on SciFact across different numbers of retrieved examples. HF-RAG is reported as more stable than the labelled-only and unlabelled-only variants and their per-source RRF versions. The practical implication is modest but valuable: teams may face less tuning fragility around context size. The boundary is equally important: the paper tests a small set of context sizes, not every prompt budget, corpus shape, or model context window.

Third, Figure 4b examines the relative proportion of labelled and unlabelled items in HF-RAG’s final context when $k=10$. The authors interpret the pattern as domain-sensitive: Climate-FEVER, being closer to FEVER in claim length and linguistic style, leans more on labelled examples; SciFact, being less aligned with FEVER, relies more on external knowledge. This figure is best read as explanatory evidence for the mechanism, not as a universal law of labelled/unlabelled balance.

Here is the clean evidence map:

Test Likely purpose What it supports What it does not prove
Table 1 macro-F1 Main evidence HF-RAG is competitive and often best across FEVER, Climate-FEVER, and SciFact. That it will dominate every domain or every generator.
LU-RAG-$\alpha$ Ablation Z-score fusion is better than a fixed labelled/unlabelled proportion in this setup. That z-scores are always the best calibration method.
RAG-OptSel Comparison bound HF-RAG can beat the best single-ranker/single-source configuration selected with test labels. That it always beats multi-ranker single-source systems.
Figure 3 nDCG/F1 Mechanism support Better retrieval from ranker fusion tends to align with downstream gains. That retrieval metrics fully predict generation outcomes.
Figure 4a context size Robustness/sensitivity test HF-RAG is relatively stable over tested context sizes on SciFact. That context size no longer needs tuning.
Figure 4b source mix Exploratory mechanism evidence HF-RAG changes labelled/unlabelled composition by dataset. That the observed proportions generalise to all corpora.

This is the difference between reading the paper and merely admiring the abstract from a safe distance.

The business value is retrieval resilience, not “better RAG” in the abstract

For enterprise systems, the useful lesson is not “use HF-RAG exactly as written.” The useful lesson is that many RAG stacks are under-designed at the retrieval layer.

A common enterprise setup has both curated labelled cases and large unlabelled document stores. In legal review, there may be previous matter examples plus policy documents. In compliance, there may be labelled decisions plus regulation text. In customer support, there may be resolved tickets plus a knowledge base. In finance, there may be analyst-labelled cases plus filings and market commentary.

HF-RAG suggests a practical pathway:

  1. Build a labelled exemplar index for task semantics.
  2. Build an unlabelled knowledge index for factual grounding.
  3. Query both sources with multiple retrievers.
  4. Fuse rankers within each source using a rank-based method such as RRF.
  5. Standardise source-specific fused scores before merging.
  6. Log the final source proportions as a drift and diagnosis signal.

That last point is underrated. If a system suddenly starts leaning heavily on unlabelled documents, maybe the incoming requests have drifted away from the labelled training distribution. If it leans too heavily on labelled examples, maybe the corpus lacks fresh factual coverage. The retrieval mix becomes a monitoring surface, not just a hidden implementation detail.

The ROI pathway is therefore not only higher benchmark F1. It is reduced dependence on one retriever, reduced pressure to fine-tune for every domain shift, and a more inspectable retrieval pipeline. The model still has to reason over the context. The evidence still has to be governed. But the input diet becomes less brittle.

Where teams should be careful

HF-RAG is attractive because it is training-free at inference time. That does not make it free.

Running four retrievers across two sources has latency and infrastructure cost. The paper does not provide a production latency or cost analysis. A practical version would likely parallelise retrieval, use cheaper first-stage retrievers, reserve cross-encoder reranking for smaller candidate pools, and cache aggressively for repeated queries.

The method also assumes that score standardisation is meaningful over the retrieved lists. Very small candidate sets, highly skewed score distributions, near-duplicates, or source-specific boilerplate can distort z-scores. In production, robust scaling, deduplication, and source-level quality filters may matter as much as the fusion formula. Naturally, these are the parts that do not look glamorous in a benchmark table and then decide whether the system survives contact with users.

There is also a prompt-governance issue. Labelled examples contain labels. If the generator treats labels as verdict templates rather than analogical guidance, it may over-copy decision patterns from superficially similar cases. For fact verification, the prompt should clearly separate labelled exemplars from unlabelled evidence and require the model to ground the final decision in the retrieved support or refutation evidence.

Finally, the paper’s evidence is bounded. It studies fact verification across FEVER, Climate-FEVER, and SciFact; uses two generators, LLaMA 2.0 70B and Mistral 7B; reports macro-F1; and evaluates a particular set of retrievers and sources. It does not establish performance for open-ended enterprise question answering, regulated workflows, multimodal retrieval, private document permissions, adversarial retrieval, or live production drift. Those are not footnotes. They are the difference between a good research mechanism and a deployable system.

The takeaway: fusion is a calibration problem before it is a generation problem

HF-RAG is useful because it shifts attention from the model’s final answer to the retrieval decisions that shape the answer before generation begins.

Its contribution is not that labelled examples are useful. We knew that. It is not that unlabelled corpora are useful. We knew that too. The contribution is the hierarchy: fuse diverse rankers within each source, then make labelled and unlabelled evidence comparable before selecting the final context.

That mechanism explains the results better than a generic “RAG improves with more evidence” story. More evidence can help. It can also confuse, bias, duplicate, or overfill the prompt. HF-RAG’s lesson is narrower and sharper: when evidence comes from different sources, the system needs a fair way to compare it.

In enterprise terms, this is a recipe for retrieval resilience. Not a universal fix. Not a product category. Not yet another excuse to draw a box called “agent.” Just a disciplined way to stop making RAG choose between the case file and the library.

Sometimes that is enough. The best systems are often not the ones with the most elaborate reasoning loop. They are the ones that feed the model the least foolish context.

Cognaptus: Automate the Present, Incubate the Future.


  1. Payel Santra, Madhusudan Ghosh, Debasis Ganguly, Partha Basuchowdhuri, and Sudip Kumar Naskar, “HF-RAG: Hierarchical Fusion-based RAG with Multiple Sources and Rankers,” arXiv:2509.02837, 2025. ↩︎