Retrieval‑augmented generation tends to pick a side: either lean on labeled exemplars (ICL/L‑RAG) that encode task semantics, or on unlabeled corpora (U‑RAG) that provide broad knowledge. HF‑RAG argues we shouldn’t choose. Instead, it proposes a hierarchical fusion: (1) fuse multiple rankers within each source, then (2) fuse across sources by putting scores on a common scale. The result is a simple, training‑free recipe that improves fact verification and, crucially, generalizes better out‑of‑domain.
TL;DR (for busy execs)
- Problem: L‑RAG and U‑RAG offer complementary value but their scores aren’t directly comparable; single‑ranker retrieval under‑utilizes signal diversity.
- Idea: Fuse rankers per source with Reciprocal Rank Fusion (RRF), then fuse across sources using z‑score standardization of fused scores.
- Why it works: RRF reduces single‑model brittleness; z‑scores remove source‑dependent score bias so evidence from labels and from the open corpus can be fairly compared.
- Results: Consistent macro‑F1 gains vs. best single ranker/source; stronger OOD performance (e.g., scientific claims) and stable performance across context sizes.
- So what: For production RAG, replace “pick one retriever + one source” with many‑rankers × two‑sources + standardize. It’s low‑risk, training‑free, and usually an upgrade.
What HF‑RAG actually does
Stage 1 — Intra‑source fusion (diversity before depth):
- For each source (Labeled FEVER train; Unlabeled Wikipedia), retrieve top‑k with multiple rankers: BM25 (sparse), Contriever (bi‑encoder), ColBERT (late interaction), MonoT5 (cross‑encoder reranker).
- Fuse the per‑ranker lists into a single list per source using RRF, which scores each document by summing $1/(k_0 + \text{rank})$ over the lists it appears in (with a small smoothing constant $k_0$, commonly 60), so documents ranked highly across many lists rise to the top.
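A minimal sketch of this intra‑source step, assuming each ranker returns an ordered list of document IDs; the function name, the constant `k0`, and the toy lists are illustrative rather than taken from the paper:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k0=60):
    """Reciprocal Rank Fusion: each document's score is the sum of
    1 / (k0 + rank) over every ranker's list it appears in (rank is 1-based)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:                      # one ranked list per ranker
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k0 + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Fuse three rankers' top-k lists for one source (e.g., the labeled index)
fused_labeled = rrf_fuse([
    ["d3", "d1", "d7"],   # BM25
    ["d1", "d3", "d9"],   # Contriever
    ["d1", "d7", "d3"],   # ColBERT
])
```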
Stage 2 — Inter‑source fusion (apples to apples):
- The two fused lists still aren't directly comparable: their RRF score distributions have different means and spreads.
- Standardize each source's scores to zero mean and unit variance via the z‑score $z = (s - \mu)/\sigma$, then merge the two lists and take the top‑k as the final context fed to the LLM.
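A companion sketch of the inter‑source step, assuming each source's fused list is a list of `(doc_id, rrf_score)` pairs; the function name and the toy scores are illustrative:

```python
import statistics

def zscore_merge(fused_per_source, top_k=10):
    """Standardize each source's RRF scores to zero mean / unit variance,
    then merge all (doc_id, z, source) triples and keep the global top-k."""
    merged = []
    for source, ranked in fused_per_source.items():    # ranked: [(doc_id, rrf_score), ...]
        scores = [s for _, s in ranked]
        mu = statistics.mean(scores)
        sigma = statistics.pstdev(scores) or 1.0       # guard against zero variance
        merged += [(doc_id, (s - mu) / sigma, source) for doc_id, s in ranked]
    merged.sort(key=lambda t: t[1], reverse=True)
    return merged[:top_k]

# Toy example: after z-scoring, scores from both sources live on the same scale
context = zscore_merge({
    "labeled":   [("ex12", 0.049), ("ex03", 0.033)],
    "unlabeled": [("w7", 0.046), ("w2", 0.031), ("w9", 0.016)],
}, top_k=3)
```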
Why not just take a fixed mix (e.g., 70% unlabeled, 30% labeled)?
- HF‑RAG tried that too (LU‑RAG‑α). Z‑scores win because the best mix is claim‑dependent and the distributions shift by domain—standardization adapts automatically.
How this differs from popular alternatives
- RAG‑Fusion / query diversification: modifies the query to induce diverse lists, then merges. HF‑RAG focuses on ranker diversity and source comparability without changing queries.
- FiD / Fusion‑in‑Decoder: merges at decoding/training time. HF‑RAG is inference‑only—no new training loops or fine‑tuning.
Results that matter for practitioners
- Beats the “best single combo” bound. HF‑RAG outperforms the oracle that picks the best one of {4 rankers × 2 sources} on the test set, showing real complementarity.
- OOD generalization: Gains are largest on SciFact (scientific claims), a domain far from FEVER’s training style—evidence that z‑score cross‑source fusion counters overfitting to labeled exemplars.
- Stability to context size: Performance stays strong as you vary k (e.g., 5→20), reducing the tuning burden for production.
- Adaptive source proportion: On Climate‑FEVER (closer to FEVER), the final context leans more on labeled exemplars; on SciFact, it leans more on unlabeled Wikipedia—without any manual knob‑twiddling.
A compact scorecard
| Setting | Takeaway for teams |
|---|---|
| In‑domain (FEVER) | HF‑RAG > single‑ranker L‑/U‑RAG and > per‑source RRF; easy wins without training. |
| OOD (Climate‑FEVER) | Similar‑style claims: HF‑RAG taps labeled exemplars more heavily; sustained gains. |
| OOD (SciFact) | Style/lexicon shift: HF‑RAG relies more on unlabeled corpora; biggest relative gains. |
| Ablation (LU‑RAG‑α) | Fixed mix underperforms z‑scores; distribution shift makes a fixed α brittle. |
Where this plugs into your stack
- Keep your retrievers plural. Run at least one sparse (BM25), one dense (bi‑encoder), one late‑interaction (ColBERT), and a cross‑encoder reranker. Cost can be managed by using rerankers only on the union of top‑N from the fast models.
- Maintain two indices:
  - Labeled index (ICL memories): claim–evidence–label tuples (or exemplar snippets) from your supervised datasets.
  - Unlabeled index (knowledge): your wiki/docsite/data‑lake.
- Fuse per source with RRF, then z‑score across sources. Ship the top‑k merged items to the generator with a slim, role‑aware prompt: labeled items provide schema/label priors; unlabeled items provide factual support (see the sketch after this list).
- Log per‑request proportions. Track how many labeled vs. unlabeled items land in final contexts; it's a great drift signal.
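A minimal sketch of the last two steps, assuming the merged context items are dicts with `source` and `text` fields; the prompt wording and the `labeled_share` metric name are illustrative choices, not prescribed by the paper:

```python
def build_prompt(claim, merged_context):
    """Assemble a slim, role-aware prompt (labeled exemplars = label priors,
    unlabeled passages = factual support) and return the labeled share for drift logging."""
    exemplars = [d for d in merged_context if d["source"] == "labeled"]
    passages = [d for d in merged_context if d["source"] == "unlabeled"]

    prompt = "Labeled exemplars (claim, evidence, verdict):\n"
    prompt += "\n".join(e["text"] for e in exemplars)
    prompt += "\n\nRetrieved evidence:\n"
    prompt += "\n".join(p["text"] for p in passages)
    prompt += (f"\n\nClaim: {claim}\n"
               "Summarize the relevant evidence first, then give a verdict: "
               "SUPPORTS, REFUTES, or NOT ENOUGH INFO.")

    labeled_share = len(exemplars) / max(len(merged_context), 1)   # per-request drift signal
    return prompt, labeled_share
```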
Implementation notes & edge cases
- Score standardization: If a source's fused list has fewer than 3 items, fall back to robust scaling (median/IQR) or z‑scores with a small‑sample variance correction; otherwise you risk extreme z‑values (see the fallback sketch after this list).
- De‑duplication across sources: Different sources may contain near‑duplicates; apply MinHash/SimHash to avoid wasting context tokens.
- Guardrails in fact‑checking: Because labeled exemplars carry labels (support/refute/NEI), ensure prompts separate evidence from verdict—make the LLM argue from evidence first, then decide.
- Latency budget: Parallelize retrievers; use async fan‑out → union → rerank; cache fused results for hot claims (keyed by a claim hash or UUID).
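For the small‑sample standardization edge case above, a sketch of one possible fallback: switch from z‑scores to median/IQR scaling when a source's fused list is too short (the threshold is an illustrative choice):

```python
import statistics

def standardize(scores, min_n_for_z=3):
    """z-score when the fused list is long enough; otherwise median/IQR robust
    scaling so a tiny sample cannot produce extreme standardized values."""
    if len(scores) >= min_n_for_z:
        mu, sigma = statistics.mean(scores), statistics.pstdev(scores) or 1.0
        return [(s - mu) / sigma for s in scores]
    if len(scores) < 2:                      # a single score carries no spread information
        return [0.0 for _ in scores]
    med = statistics.median(scores)
    q1, _, q3 = statistics.quantiles(scores, n=4)
    iqr = (q3 - q1) or 1.0
    return [(s - med) / iqr for s in scores]
```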
Strategic take: what this implies for enterprise RAG
- Stop chasing one perfect retriever. Diversity + fusion is the new baseline; think ensemble IR.
- Treat labeled and unlabeled as complementary views. Labeled gives task semantics; unlabeled gives world context. Standardization is the missing handshake.
- Better OOD = lower maintenance. If your domains rotate (compliance today, science tomorrow), HF‑RAG’s adaptive mix is insurance against rebuild churn.
Cognaptus: Automate the Present, Incubate the Future