When RAG Needs Provenance, Not Just Recall: Traceable Answers Across Fragmented Knowledge

RAG has a public-relations problem. It promises grounded answers, then quietly assumes that “grounded” means “retrieved from somewhere nearby.”

That assumption is convenient. It is also the kind of convenience that creates compliance incidents, medical confusion, and internal knowledge assistants that cite the wrong document with absolute confidence. A retrieval-augmented system can answer from evidence and still choose the wrong evidence. It can cite something real and still fail provenance.

The paper behind this article, Traceable Cross-Source RAG for Chinese Tibetan Medicine Question Answering, studies that less glamorous but more operationally important failure mode.¹ Its domain is Chinese Tibetan medicine. Its broader lesson is enterprise RAG: when knowledge is fragmented across sources with different authority, density, language style, and evidentiary role, retrieval is not just a similarity problem. It is a governance problem wearing a vector database costume.

The paper’s central question is simple: how should a RAG system answer when the relevant knowledge is split across an encyclopedia, classical medical texts, and clinical papers?

The wrong answer is also simple: merge everything into one index and let the retriever sort it out.

That is how many enterprise RAG systems are built. It is also how the trouble starts.

The first failure is density bias, not hallucination

Most discussions of RAG failure begin with hallucination. That is understandable, but slightly too theatrical. In many practical systems, the first failure happens earlier: the retriever selects plausible but epistemically weaker evidence.

In the paper’s Tibetan-medicine setting, the three knowledge bases play different roles:

Knowledge base	Typical role	Retrieval risk
Encyclopedia entries	Dense summaries and accessible definitions	Easy to match, likely to dominate retrieval
Classical texts	Doctrinal and conceptual authority	Harder language, less direct matching
Clinical papers	Empirical or modern clinical evidence	Longer, more fragmented, uneven matching

The key asymmetry is not merely topic coverage. It is texture. Encyclopedia passages are short and information-dense. Vector retrievers like short, dense, semantically obvious chunks. They do not automatically know that a classical source may be more authoritative for a doctrinal question, or that a clinical paper may be more appropriate for an evidence-based question.

So the system can retrieve “relevant” text while still failing at source selection.

This is a useful correction to the usual business belief about RAG. The problem is not only whether the system finds related text. The problem is whether it finds the right kind of related text.

In legal, finance, compliance, insurance, healthcare, procurement, and internal-policy systems, the same pattern appears constantly. A short FAQ can beat a policy manual. A slide deck can beat a contract. A product page can beat a technical standard. The retriever is not malicious. It is merely indifferent to institutional authority. A deeply relatable flaw, though not one we should deploy into production unsupervised.

DAKS treats retrieval as source-budget allocation

The paper’s first method, DAKS, addresses this problem before the final evidence list is even formed. Instead of retrieving from a single merged corpus, the system keeps the knowledge bases partitioned and performs lightweight probe retrieval within each one.

For each knowledge base, it observes score patterns such as peak relevance, score concentration, margin, and a coverage proxy. It then combines these with an authority prior and allocates a retrieval budget across sources.

The important move is conceptual: DAKS treats retrieval as budgeted resource selection.

That matters because source-level decisions should not be smuggled into chunk-level similarity scores. A chunk can be highly matchable because it is dense, not because it is the best evidence. A source can be authoritative even if its language is harder to match. DAKS creates a place in the architecture where those differences can be expressed.
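As a rough illustration of budgeted source selection, here is a minimal sketch. It is not the paper's formula: the feature definitions in `probe_features`, the unweighted sum, the log-prior term, the softmax, and the names `allocate_budget`, `total_k`, and `min_per_kb` are all assumptions made for this example. It only shows where probe statistics and an authority prior could meet before any chunk is selected.

```python
import math


def probe_features(scores, threshold=0.5):
    """Summarize one KB's probe scores (assumed to be similarities in [0, 1])."""
    ranked = sorted(scores, reverse=True)
    if not ranked:
        return {"peak": 0.0, "concentration": 0.0, "margin": 0.0, "coverage": 0.0}
    total = sum(ranked)
    runner_up = ranked[1] if len(ranked) > 1 else 0.0
    return {
        "peak": ranked[0],                                        # best probe hit
        "concentration": ranked[0] / total if total > 0 else 0.0,  # mass on top hit
        "margin": ranked[0] - runner_up,                          # gap to second hit
        "coverage": sum(s >= threshold for s in ranked) / len(ranked),
    }


def allocate_budget(probe_scores, authority_prior, total_k=10, min_per_kb=1):
    """Split total_k retrieval slots across KBs (hypothetical scoring rule).

    authority_prior values must be positive because their log is taken.
    """
    kbs = list(probe_scores)
    remaining = total_k - min_per_kb * len(kbs)
    if remaining < 0:
        raise ValueError("total_k is too small for the per-KB floor")

    # Assumed utility: sum of probe features plus log authority prior.
    utility = {
        kb: sum(probe_features(probe_scores[kb]).values()) + math.log(authority_prior[kb])
        for kb in kbs
    }
    norm = sum(math.exp(u) for u in utility.values())
    weight = {kb: math.exp(utility[kb]) / norm for kb in kbs}

    # Every KB gets the floor, then a proportional share of what is left.
    budget = {kb: min_per_kb + int(remaining * weight[kb]) for kb in kbs}
    # Slots lost to rounding down go to the highest-weight KBs.
    leftover = total_k - sum(budget.values())
    for kb in sorted(kbs, key=weight.get, reverse=True)[:leftover]:
        budget[kb] += 1
    return budget
```

The per-KB floor is a design choice worth noting: it keeps a hard-to-match source from being starved entirely, which is the failure the section describes.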

The paper’s routing results are modest but meaningful:

Method	PrimaryAcc	Top2Hit	Encyclopedia dominance rate	AuthCov
Uniform per-KB budget	0.500	0.680	0.415	0.180
Merged KB	0.480	0.640	0.485	0.140
DAKS	0.560	0.700	0.362	0.230

This table should not be read as “DAKS solves routing.” It does not. Primary-source accuracy rises to 0.560, which is useful but not magical. The stronger interpretation is that explicit source routing reduces the encyclopedia’s ability to crowd out other sources: the encyclopedia dominance rate falls from 0.485 under the merged KB to 0.362 under DAKS, while AuthCov rises from 0.140 to 0.230.

That is the operational lesson. A RAG system does not need perfect source routing for source routing to be valuable. It needs enough source awareness to prevent the easiest-to-retrieve source from becoming the de facto truth authority.

For business systems, that is already a large improvement over the usual “one index to rule them all” approach.

The second failure is noisy fusion

Better routing is necessary, but not sufficient. Once evidence candidates arrive from multiple knowledge bases, the system still has to decide how to combine them.

The naive solution is concatenation: take retrieved passages from several sources, place them into the prompt, and let the model decide. This sounds democratic. It is mostly just passing responsibility downstream.

The paper argues that naive multi-source fusion creates two problems.

First, it introduces noise. More passages do not automatically mean better grounding. They may dilute the signal.

Second, passage order matters. Long-context models are sensitive to where evidence appears. Relevant material can be underused if it sits in an inconvenient position. This is not a philosophical defect. It is a prompt-layout problem with business consequences.

The paper’s second method, alignment graph-guided fusion, tries to organize the evidence before it reaches the generator. It builds a bipartite graph between chunks and typed entities such as diseases, symptoms, drugs, and formulas. This graph supports three functions:

finding bridge evidence across knowledge bases;
scoring chunks using entity overlap and graph proximity;
packing evidence under a token budget while ensuring required source coverage.
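The first two functions in the list above can be sketched with plain dictionaries. This is an illustration only: the `Chunk` type, the two-hop proximity term, and the `hop_weight` parameter are assumptions for the example, not the paper's scoring function, and entity extraction is taken as already done.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    kb: str
    entities: frozenset  # typed entities as (type, name) pairs


def build_alignment_graph(chunks):
    """Store the bipartite graph from the entity side: entity -> chunk ids."""
    entity_to_chunks = defaultdict(set)
    for chunk in chunks:
        for entity in chunk.entities:
            entity_to_chunks[entity].add(chunk.chunk_id)
    return entity_to_chunks


def bridge_entities(chunks, entity_to_chunks):
    """Entities mentioned by chunks from at least two different KBs."""
    kb_of = {c.chunk_id: c.kb for c in chunks}
    return {
        entity
        for entity, ids in entity_to_chunks.items()
        if len({kb_of[i] for i in ids}) > 1
    }


def graph_score(chunk, query_entities, entity_to_chunks, chunks_by_id, hop_weight=0.5):
    """Entity overlap plus a discounted two-hop proximity term (assumed form)."""
    shared = chunk.entities & query_entities
    overlap = len(shared) / len(query_entities) if query_entities else 0.0

    # Two hops: chunk -> shared entity -> neighbouring chunk.
    neighbours = set()
    for entity in chunk.entities:
        neighbours |= entity_to_chunks.get(entity, set())
    neighbours.discard(chunk.chunk_id)

    # Proximity: share of neighbours that mention a query entity.
    near = sum(1 for n in neighbours if chunks_by_id[n].entities & query_entities)
    proximity = near / len(neighbours) if neighbours else 0.0
    return overlap + hop_weight * proximity
```

Under this scoring, a chunk that never mentions the query entity can still earn credit if it shares an entity with a chunk that does, which is one plausible reading of "bridge evidence".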

This last step is easy to underestimate. Evidence packing is not just compression. In this paper, it is a control mechanism. The system first satisfies coverage constraints across required knowledge bases, then fills remaining budget by score while applying diversity caps.

That is the difference between “here are some passages” and “here is an evidence set designed to support cross-source verification.”
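The two-phase packing described above (coverage first, then score-ordered fill with diversity caps) can be sketched as a greedy procedure. The `Candidate` type, the `max_per_kb` cap, and the greedy order are assumptions for illustration; the paper's exact packing rule may differ.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    chunk_id: str
    kb: str
    score: float
    tokens: int


def pack_evidence(candidates, required_kbs, token_budget, max_per_kb=3):
    """Greedy coverage-aware packing under a token budget (illustrative)."""
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    packed, chosen, per_kb = [], set(), {}
    used = 0

    def try_add(c):
        nonlocal used
        if c.chunk_id in chosen or used + c.tokens > token_budget:
            return False
        if per_kb.get(c.kb, 0) >= max_per_kb:  # diversity cap per source
            return False
        packed.append(c)
        chosen.add(c.chunk_id)
        used += c.tokens
        per_kb[c.kb] = per_kb.get(c.kb, 0) + 1
        return True

    # Phase 1: coverage. Take the best chunk that fits from each required KB.
    for kb in required_kbs:
        for c in ranked:
            if c.kb == kb and try_add(c):
                break

    # Phase 2: fill the remaining budget in score order.
    for c in ranked:
        try_add(c)
    return packed
```

Running the coverage phase first means a lower-scoring chunk from a required source claims its tokens before higher-scoring chunks from an already-covered source can exhaust the budget.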

The main evidence: provenance improves, but components depend on each other

The paper evaluates on a 500-query benchmark balanced across four question types: definitions, classical principles, clinical evidence, and cross-KB synthesis. The system uses three separate knowledge bases and generates all answers with openPangu-Embedded-7B. The authors report metrics for faithfulness, context precision and recall, answer relevance, citation correctness, and cross-KB evidence coverage.

The most important metric for the paper’s thesis is CrossEv@5: whether the top five evidence chunks cover all required knowledge bases for cross-source questions.
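Read literally, that definition is a per-query indicator. A minimal sketch follows; averaging the indicator over queries is an assumption about how the reported score is aggregated, and the function names are made up for this example.

```python
def cross_ev_at_k(ranked_kbs, required_kbs, k=5):
    """Return 1 if the top-k evidence chunks cover every required KB, else 0.

    ranked_kbs lists the source KB of each evidence chunk in rank order.
    """
    return int(set(required_kbs) <= set(ranked_kbs[:k]))


def mean_cross_ev(queries, k=5):
    """Average the indicator over (ranked_kbs, required_kbs) pairs."""
    hits = [cross_ev_at_k(ranked, required, k) for ranked, required in queries]
    return sum(hits) / len(hits) if hits else 0.0
```

The metric does not reward relevance by itself. A list of five strong encyclopedia chunks scores zero on a question that also requires a classical text.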

The end-to-end results are:

Method	Faithfulness	Context precision	Context recall	Answer relevance	CrossEv@5	Citation correctness
Single-KB encyclopedia	0.654	0.415	0.834	0.902	0.720	0.756
Single-KB classics	0.837	0.223	0.734	0.833	0.750	0.791
Single-KB clinical papers	0.810	0.245	0.705	0.805	0.680	0.680
Merged KB	0.785	0.195	0.750	0.795	0.650	0.720
Naive multi-KB concat	0.750	0.180	0.745	0.780	0.620	0.700
DAKS only	0.820	0.235	0.760	0.825	0.720	0.750
GraphFusion only	0.630	0.306	0.409	0.408	0.650	0.650
DAKS + GraphFusion	0.805	0.265	0.720	0.810	0.780	0.760

The headline is not that every metric improves. Several do not: the single-KB encyclopedia baseline still leads on context precision, context recall, and answer relevance, and single-KB classics leads on faithfulness and citation correctness. What the full system does have is the best CrossEv@5 at 0.780, compared with 0.650 for the merged KB baseline and 0.620 for naive multi-KB concatenation. It also keeps faithfulness (0.805) and citation correctness (0.760) competitive.

The more interesting result is the interaction between components. DAKS alone beats the merged KB baseline on every reported metric. GraphFusion alone does not: without good routing it drops to 0.408 on answer relevance and 0.409 on context recall, even though graph-based evidence logic is supposed to help.

That is an important warning. Graph-guided fusion is not a decorative module that can be bolted onto bad retrieval. It depends on the candidate pool. If upstream routing has already chosen weak or mismatched evidence, the graph has less useful material to organize.

This is the paper’s best business-relevant point: provenance systems are pipelines, not plugins.

A compliance citation layer cannot rescue a sloppy index. A knowledge graph cannot fully repair source selection bias. A long-context model cannot be expected to infer institutional authority from a pile of mixed passages. The architecture must preserve evidence quality at every stage.

The evidence-level tests show what the graph is actually doing

The paper separately evaluates evidence fusion on cross-KB queries. This is best read as an ablation and mechanism test, not a second thesis.

Evidence fusion method	EvRecall@5	EvNDCG@5	CrossEv@5	Encyclopedia dominance rate
Naive concat, fixed order	0.834	0.542	0.315	0.485
Score-only rerank, no graph	0.846	0.612	0.392	0.442
Graph-support fusion	0.851	0.625	0.582	0.365
Graph retrieval + fusion	0.872	0.693	0.645	0.341

Here the alignment graph’s role becomes clearer. It improves evidence ranking and cross-KB coverage while reducing encyclopedia dominance. The largest movement is in CrossEv@5: graph-support fusion raises it from 0.392 under score-only reranking to 0.582, and graph retrieval plus fusion raises it further to 0.645.
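For readers who want the size of those movements, the arithmetic is simple. The values below are copied from the evidence-level table; the helper `gain` is just a convenience for this example.

```python
# CrossEv@5 values copied from the evidence-level table above.
cross_ev = {
    "Naive concat, fixed order": 0.315,
    "Score-only rerank, no graph": 0.392,
    "Graph-support fusion": 0.582,
    "Graph retrieval + fusion": 0.645,
}


def gain(metric, baseline, method):
    """Return (absolute, relative) change of method over baseline."""
    base, new = metric[baseline], metric[method]
    return new - base, (new - base) / base


for method in ("Graph-support fusion", "Graph retrieval + fusion"):
    absolute, relative = gain(cross_ev, "Score-only rerank, no graph", method)
    print(f"{method}: {absolute:+.3f} absolute, {relative:+.1%} relative")
```

Against score-only reranking, the absolute gains are 0.190 and 0.253, much larger than the corresponding changes in EvRecall@5 (0.005 and 0.026).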

That does not prove the generated answers are universally better. It supports a narrower claim: graph-guided fusion helps construct evidence lists that cover the required sources more reliably.

For an enterprise reader, that distinction matters. Evidence coverage is not the same thing as final answer quality. But in audited workflows, evidence coverage is often a prerequisite. A system that cannot present the right kinds of evidence cannot be trusted to justify its answer, even when the answer happens to be correct.

What this means for enterprise RAG design

The paper is about Tibetan medicine, but the design pattern travels well.

Many organizations have knowledge environments that look structurally similar:

Enterprise setting	Fragmented sources	Typical failure
Legal operations	Contracts, policy memos, regulatory text, case notes	Short internal summaries outrank binding documents
Finance and audit	Accounting policies, transaction records, controls, external standards	Convenient explanations outrank formal control evidence
Healthcare administration	Clinical guidelines, payer rules, patient instructions, research summaries	Patient-facing text outranks clinical authority
Customer support	FAQs, product manuals, release notes, ticket histories	Popular support snippets outrank current documentation
Procurement	Supplier contracts, pricing sheets, compliance requirements	Searchable commercial documents outrank contractual constraints

In each case, the retrieval problem is not just semantic. It is institutional. Sources have roles. Some define rules. Some summarize rules. Some provide evidence. Some are outdated but easy to match. Some are authoritative but linguistically awkward.

A mature RAG architecture should therefore include at least four controls:

Control	Technical equivalent in the paper	Business purpose
Keep sources partitioned	Separate KB indexing	Prevent source identity from disappearing inside one vector pool
Score sources before chunks dominate	DAKS routing	Reduce density-driven bias
Use structured links across sources	Alignment graph	Connect terminology, entities, and evidence across document families
Pack evidence with coverage constraints	Coverage-aware evidence packing	Make citations auditable, not merely available

The ROI case is not “better chatbot answers.” That is too soft.

The practical value is cheaper diagnosis of answer failures. When a RAG answer is wrong, the organization needs to know whether the failure came from missing documents, wrong source selection, poor chunk ranking, weak cross-source alignment, bad prompt packing, or generator misuse. A flat index hides those failure modes. A source-aware architecture exposes them.
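One way to make those failure modes inspectable is to log what each stage saw and walk the stages in order. The sketch below is hypothetical throughout: the `AnswerTrace` fields, the reviewer-supplied gold sets, and the check order are assumptions for illustration, and it leaves out cross-source alignment, which would need graph-level logging.

```python
from dataclasses import dataclass


@dataclass
class AnswerTrace:
    gold_chunk_ids: set   # chunks a reviewer says should support the answer
    gold_kbs: set         # sources a reviewer says the answer requires
    indexed_ids: set      # every chunk id present in any index
    routed_kbs: set       # sources that received retrieval budget
    candidate_ids: list   # retrieved chunk ids, in rank order
    packed_ids: list      # chunk ids placed in the prompt
    cited_ids: list       # chunk ids the generator actually cited


def diagnose(trace, k=5):
    """Return the earliest pipeline stage at which the gold evidence was lost."""
    if not trace.gold_chunk_ids <= trace.indexed_ids:
        return "missing documents"
    if not trace.gold_kbs <= trace.routed_kbs:
        return "wrong source selection"
    if not trace.gold_chunk_ids & set(trace.candidate_ids[:k]):
        return "poor chunk ranking"
    if not trace.gold_chunk_ids & set(trace.packed_ids):
        return "bad prompt packing"
    if not trace.gold_chunk_ids & set(trace.cited_ids):
        return "generator misuse"
    return "no stage-level failure found"
```

With a single merged index there is no `routed_kbs` to log, so the second check cannot even be asked. That is the sense in which a flat index hides failure modes.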

That is the quiet business value of provenance: it turns RAG from a black box with citations into a system whose mistakes can be traced.

What the paper directly shows, and what it only suggests

The paper directly shows that, in its constructed Chinese Tibetan-medicine benchmark, explicit KB routing and graph-guided evidence fusion improve cross-KB evidence coverage compared with merged retrieval and naive concatenation. It also shows that source-aware routing reduces encyclopedia dominance and that graph-guided methods improve evidence-level coverage and ranking on cross-KB questions.

Cognaptus infers a broader design principle: enterprise RAG systems should treat source authority, source coverage, and evidence packing as explicit architectural decisions. Similarity search alone should not decide what counts as evidence.

What remains uncertain is the size of the benefit outside this setting. The benchmark has 500 queries. The domain is specialized. The language is Chinese. The knowledge bases are cleanly partitioned into three categories. The generator is openPangu-Embedded-7B. Automatic metrics are judged by GLM-4.7. The authors do not use a train/dev/test split, and all metrics are computed on the full query set with fixed hyperparameters.

These are not fatal limitations. They are interpretation boundaries.

The paper should not be read as proof that DAKS plus graph fusion is a universal RAG recipe. It is better read as evidence for an architectural claim: when evidence sources differ in authority and style, provenance must be designed upstream, not patched onto the final answer.

The uncomfortable lesson: citations are not enough

Many RAG systems now produce citations. That sounds reassuring until we ask a more serious question: citations to what?

A citation to an easy summary is not the same as a citation to the governing policy. A citation to an encyclopedia entry is not the same as a citation to clinical evidence. A citation to a document chunk is not useful if the system cannot explain why that source was selected over another.

This paper’s contribution is not that it makes RAG more complicated. RAG in serious domains was already complicated. The paper simply refuses to hide the complexity behind a merged index and a confident answer.

For enterprise AI, that is the right instinct. The future of reliable RAG will not be won by dumping more text into longer prompts. It will be won by systems that know the difference between relevance, authority, coverage, and justification.

The retriever should find. The router should choose. The graph should connect. The packer should preserve evidence structure. The generator should answer within those constraints.

That sounds less magical than “ask your documents anything.”

Good. Magic is a poor compliance strategy.

Cognaptus: Automate the Present, Incubate the Future.

¹ Fengxian Chen, Zhilong Tao, Jiaxuan Li, Yunlong Li, and Qingguo Zhou, “Traceable Cross-Source RAG for Chinese Tibetan Medicine Question Answering,” arXiv:2602.05195, 2026. https://arxiv.org/html/2602.05195