Evidence is not context.

That is the small, expensive misunderstanding behind many enterprise RAG systems. A user asks a question, the system retrieves semantically similar chunks, the model reads them, and the answer arrives with a tone that suggests the matter has been settled. Very reassuring. Sometimes even correct.

But in the situations where RAG is supposed to be most useful — compliance reviews, financial analysis, legal memos, medical evidence summaries, internal strategy briefings — the problem is often not that the system has too little relevant material. The problem is that the relevant material disagrees, overlaps, dates badly, or supports several competing interpretations at once.

A new paper, Entropic Claim Resolution: Uncertainty-Driven Evidence Selection for RAG, by Davide Di Gioia, argues that RAG systems are optimizing the wrong intermediate objective [1]. They ask, “What is most relevant to the query?” The more serious question is, “Which piece of evidence would reduce uncertainty over the possible answers?”

That sounds like an academic distinction until a system retrieves five highly relevant passages that all repeat the same weak claim and ignores the one boring table row that would actually decide the issue. Relevance is a popularity contest. Evidence selection should be a diagnostic procedure. The paper’s contribution is to turn that diagnostic instinct into an inference-time controller: Entropic Claim Resolution, or ECR.

The useful way to read the paper is not as another RAG benchmark story. It is not mainly saying, “Here is a retriever with better scores.” It is saying: retrieval should be treated as upstream candidate generation, while the core reasoning loop should decide which atomic claim is worth evaluating next. That loop is driven by expected entropy reduction over competing answer hypotheses.

In plainer language: stop feeding the model more text that sounds related. Feed it the claim that would most change what it should believe.

The failure mode is redundant relevance, not missing relevance

The paper names the central failure mode epistemic collapse. In ordinary relevance-driven RAG, retrieval tends to collect passages that are similar to the query and to one another. This works for simple factoid questions. It becomes brittle when the corpus contains ambiguity, contradiction, or multiple plausible answers.

Suppose an analyst asks:

“Did the company’s margin improvement come from pricing power or cost reduction?”

A relevance-based system may retrieve several management-comments passages about “operational discipline,” “pricing environment,” and “improved gross margin.” They all look relevant. They may even all be useful. But if they do not separate the two causal explanations, the model is still stuck with a pile of context and no decision rule. The answer becomes a polished synthesis of unresolved uncertainty. In enterprise settings, that is how hallucination learns to wear a tie.

ECR reframes the task. Instead of treating retrieval as “find similar documents,” it treats answer generation as a probabilistic process over competing hypotheses.

At the beginning, the system constructs a hypothesis space:

| Candidate answer hypothesis | Example in a business query |
| --- | --- |
| $a_1$ | Margin improved mainly because prices increased |
| $a_2$ | Margin improved mainly because costs fell |
| $a_3$ | Margin improved because of a mixed or ambiguous effect |

The system then maintains a probability distribution over these hypotheses. The uncertainty in that distribution is measured by Shannon entropy:

$$ H(A \mid X_{eval}) = -\sum_{a \in A} P(a \mid X_{eval}) \log_2 P(a \mid X_{eval}) $$

Here, $A$ is the answer hypothesis space, and $X_{eval}$ is the set of evidence claims already evaluated. High entropy means the system still has multiple live explanations. Low entropy means probability mass has concentrated enough to justify a more decisive answer.
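To make the entropy formula concrete, here is a minimal Python sketch that computes $H$ in bits for a distribution over three hypotheses. The function name and the two example distributions are illustrative assumptions, not values from the paper.

```python
import math

def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 contribute nothing, so they are skipped.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Three live hypotheses with equal weight: maximum uncertainty for |A| = 3.
uniform = [1 / 3, 1 / 3, 1 / 3]

# Hypothetical belief after evidence has concentrated mass on one hypothesis.
concentrated = [0.96, 0.02, 0.02]

h_uniform = shannon_entropy(uniform)
h_concentrated = shannon_entropy(concentrated)
print(f"uniform:      {h_uniform:.3f} bits")
print(f"concentrated: {h_concentrated:.3f} bits")
```

The uniform case equals $\log_2 3$, the largest value three hypotheses can produce. The concentrated case is well under one bit, which is the regime in which a decisive answer starts to look defensible.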

The core shift is this: ECR does not ask which claim is most semantically similar to the query. It asks which claim has the highest Expected Entropy Reduction. In effect:

$$ EER(c \mid X_{eval}) = H(A \mid X_{eval}) - \mathbb{E}_{X_c}[H(A \mid X_{eval} \cup \{X_c\})] $$

The selected claim is the one expected to reduce answer uncertainty the most.

That mechanism makes a claim useful only if it distinguishes among hypotheses. A passage that supports every answer equally is not very valuable, even if it is full of familiar keywords. A small claim that sharply separates one explanation from another is valuable, even if it would not win a cosine-similarity beauty pageant.
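The contrast between a redundant claim and a discriminative one can be sketched numerically. The example below assumes a simple binary verification outcome (the claim is either supported or not) and hand-picked likelihoods per hypothesis. Both are assumptions made for illustration, not the paper's posterior model.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def posterior(prior, likelihood):
    """Bayes update: multiply prior by likelihood, then renormalize."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    z = sum(joint)
    return [j / z for j in joint]

def expected_entropy_reduction(prior, p_support):
    """EER for a claim with a binary outcome.

    p_support[i] is the assumed probability that the claim verifies as
    supported if hypothesis i is the true answer.
    """
    expected_h = 0.0
    for outcome_lik in (p_support, [1 - p for p in p_support]):
        p_outcome = sum(p * l for p, l in zip(prior, outcome_lik))
        if p_outcome > 0:
            expected_h += p_outcome * entropy(posterior(prior, outcome_lik))
    return entropy(prior) - expected_h

prior = [1 / 3, 1 / 3, 1 / 3]

# Equally likely under every hypothesis: relevant-sounding, but uninformative.
redundant = [0.8, 0.8, 0.8]

# Likely under a_1, unlikely under a_2: this claim separates the explanations.
discriminative = [0.9, 0.1, 0.5]

eer_redundant = expected_entropy_reduction(prior, redundant)
eer_discriminative = expected_entropy_reduction(prior, discriminative)
print(f"redundant claim EER:      {eer_redundant:.4f}")
print(f"discriminative claim EER: {eer_discriminative:.4f}")
```

Under these assumptions the redundant claim leaves the posterior equal to the prior whatever the outcome, so its expected reduction is zero. The discriminative claim shifts mass in either outcome, so its expected reduction is positive.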

ECR is a controller placed after retrieval, not a magic retriever

A common misreading would be to treat ECR as a new retrieval model. That is not quite right.

The paper assumes that upstream retrieval has already produced a high-recall candidate pool. ECR works on that pool. It is an inference-time controller sitting between candidate generation and answer synthesis.

The rough pipeline looks like this:

Query
  ↓
Multi-strategy retrieval (vector / graph / claim retrieval)
  ↓
Candidate atomic claims
  ↓
ECR controller
  - initialize answer hypotheses
  - estimate entropy
  - choose next claim by expected entropy reduction
  - update posterior probabilities
  - stop at epistemic sufficiency or expose unresolved conflict
  ↓
Answer synthesis or ambiguity report

This placement matters. It means ECR does not eliminate the need for good retrieval. If the upstream retriever never finds the decisive claim, entropy control cannot summon it from the void. The paper is explicit that ECR assumes a candidate claim set has been generated with enough recall. The algorithm then asks which of those claims should be evaluated first, next, and finally.

That is a more modest claim than “we solved RAG.” It is also more useful.

Enterprise RAG failures are often not caused by one missing module. They come from badly ordered evidence: redundant passages are processed early, discriminative claims are buried, contradictions are blended, and the system stops because a token budget or iteration budget says so. ECR replaces that procedural guesswork with a decision rule.

The stopping rule is the quiet engineering contribution

Most RAG systems stop for embarrassingly practical reasons:

  • they hit top-$k$;
  • they run out of context budget;
  • an agent loop reaches a fixed iteration cap;
  • a prompt says something comforting about confidence.

ECR introduces a mathematical stopping condition: stop when entropy over the answer hypotheses falls below a predefined threshold $\epsilon$, subject to coherence.

In simple form:

$$ H(A \mid X_{eval}) \leq \epsilon $$

This is a definition of “enough evidence.” Not perfect evidence. Not metaphysical truth. Enough evidence under the current hypothesis space and evaluated claims.

That last phrase is important. The paper’s framework is bounded by the hypotheses it constructs and the candidate claims it receives. But within that bounded world, the system can say something much cleaner than “the model seems confident.” It can say: the remaining uncertainty over competing answers is below threshold.
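A stripped-down version of such a loop might look like the following. It is a sketch under stated assumptions, not the paper's algorithm: each candidate claim carries a hypothetical `p_support` likelihood vector and a precomputed `observed` verification outcome, selection is greedy by expected entropy reduction, and the coherence condition is left out entirely.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def update(prior, likelihood):
    joint = [p * l for p, l in zip(prior, likelihood)]
    z = sum(joint)
    return [j / z for j in joint]

def eer(prior, p_support):
    expected_h = 0.0
    for lik in (p_support, [1 - p for p in p_support]):
        p_outcome = sum(p * l for p, l in zip(prior, lik))
        if p_outcome > 0:
            expected_h += p_outcome * entropy(update(prior, lik))
    return entropy(prior) - expected_h

def resolve(prior, claims, epsilon=0.3):
    """Greedily evaluate claims until entropy <= epsilon or candidates run out."""
    belief, remaining, used = list(prior), list(claims), []
    while remaining and entropy(belief) > epsilon:
        best = max(remaining, key=lambda c: eer(belief, c["p_support"]))
        remaining.remove(best)
        lik = best["p_support"] if best["observed"] else [1 - p for p in best["p_support"]]
        belief = update(belief, lik)
        used.append(best["id"])
    reason = "epistemic sufficiency" if entropy(belief) <= epsilon else "candidates exhausted"
    return belief, used, reason

# Hypothetical candidate pool: c1 is redundant, the others discriminate.
claims = [
    {"id": "c1", "p_support": [0.8, 0.8, 0.8], "observed": True},
    {"id": "c2", "p_support": [0.95, 0.05, 0.3], "observed": True},
    {"id": "c3", "p_support": [0.9, 0.1, 0.2], "observed": True},
    {"id": "c4", "p_support": [0.85, 0.1, 0.1], "observed": True},
]
prior = [1 / 3, 1 / 3, 1 / 3]
belief, used, reason = resolve(prior, claims)
print(used, reason, [round(b, 3) for b in belief])
```

In this toy pool the loop stops on the entropy threshold before it ever evaluates the redundant claim `c1`. The stop reason is tied to the belief state, not to a top-$k$ or iteration cap.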

The paper also adds a coherence-aware term for contradictions. If a new claim would complete an explicit contradiction pair with an already evaluated claim, ECR can prioritize surfacing it. The point is not to average contradictions away. It is to notice that the corpus itself may be incoherent.
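One way to picture a coherence-aware score is an additive bonus on top of expected entropy reduction. The additive form, the function names, and the EER values below are assumptions made for illustration; the article does not give the paper's exact expression. The weight is written as `lam`, matching the $\lambda$ coherence bonus the paper sweeps in its appendix.

```python
def selection_score(eer_value, claim_id, evaluated_ids, contradiction_pairs, lam=0.05):
    """EER plus a bonus when the claim completes a known contradiction pair."""
    completes_pair = any(
        (claim_id == a and b in evaluated_ids) or (claim_id == b and a in evaluated_ids)
        for a, b in contradiction_pairs
    )
    return eer_value + (lam if completes_pair else 0.0)

def pick_next(eer_by_claim, evaluated_ids, contradiction_pairs, lam):
    """Return the id of the candidate with the highest combined score."""
    return max(
        eer_by_claim,
        key=lambda cid: selection_score(
            eer_by_claim[cid], cid, evaluated_ids, contradiction_pairs, lam
        ),
    )

# Hypothetical state: c_base was already evaluated, and c_twin contradicts it.
eer_by_claim = {"c_other": 0.04, "c_twin": 0.01}
evaluated = {"c_base"}
pairs = [("c_base", "c_twin")]

with_bonus = pick_next(eer_by_claim, evaluated, pairs, lam=0.05)
entropy_only = pick_next(eer_by_claim, evaluated, pairs, lam=0.0)
print(with_bonus, entropy_only)
```

With the bonus switched off, the controller prefers the claim with the larger entropy reduction and the contradiction stays buried. With the bonus on, the twin is pulled forward so the conflict is seen.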

This is where the paper becomes more interesting than ordinary uncertainty calibration. Many uncertainty methods estimate whether an answer is likely to be wrong after generation. ECR tries to guide evidence selection before synthesis. It is not merely checking whether the answer is shaky. It is choosing the next claim that would make the answer less shaky — or reveal that shakiness cannot be resolved.

The architecture is claim-centric because chunks are too blunt

ECR works over atomic claims, not raw document chunks. That is not a cosmetic design choice.

Chunks are convenient for retrieval systems and irritating for reasoning systems. A chunk may contain several claims, some relevant, some irrelevant, some stale, some contradictory. Treating it as a single evidence unit invites exactly the kind of blending that enterprise RAG is supposed to avoid.

The paper integrates ECR into a broader claim-centric CSGR++ / HyRAG-style architecture. The surrounding system includes structured ingestion, entity-attribute-value storage, claim extraction, vector indices, graph expansion, temporal filtering, confidence updates, and reverse verification. Much of this architecture is supporting infrastructure rather than the conceptual core of ECR. Still, it tells us what kind of system ECR expects to live inside.

| System layer | Operational role | Why it matters for ECR |
| --- | --- | --- |
| Atomic claim extraction | Converts documents, tables, and summaries into smaller evidence units | ECR can select evidence at the level where uncertainty is actually resolved |
| Provenance and structural metadata | Tracks source, row, entity, temporal key, and support/contradiction signals | Verification can use more than prompt-based LLM judgment |
| Multi-strategy retrieval | Builds the candidate pool using vector, graph, and claim retrieval | ECR depends on high recall before it can make good selections |
| Entropic claim resolver | Chooses claims by expected entropy reduction and stops by sufficiency | Controls evidence order and termination |
| Reverse verification | Checks numeric grounding and claim support after synthesis | Adds a second guardrail, especially for structured data |

This matters for business implementation. You cannot simply paste “use entropy” into a prompt and expect ECR-like behavior. The system needs an evidence layer that preserves claim identity and provenance. It needs candidate hypotheses. It needs a way to update belief after each claim. It needs contradiction signals. Otherwise, “entropy-aware RAG” becomes one more architectural slogan wandering the conference circuit looking for a dashboard.
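As a rough picture of what "preserving claim identity and provenance" could mean in practice, here is a minimal sketch of a claim record. The field names follow the metadata listed above (source, row, entity, temporal key, contradiction signals), but the class, the helper, the file name, and the sample claims are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class AtomicClaim:
    claim_id: str
    text: str
    source: str                        # document or table the claim came from
    row: Optional[int]                 # table row, if the source is tabular
    entity: str
    temporal_key: str                  # e.g. a fiscal period
    contradicts: Tuple[str, ...] = ()  # ids of claims this one explicitly conflicts with

def contradiction_pairs(claims):
    """Return each explicit contradiction as a sorted id pair, without duplicates."""
    known = {c.claim_id for c in claims}
    pairs = set()
    for c in claims:
        for other in c.contradicts:
            if other in known:
                pairs.add(tuple(sorted((c.claim_id, other))))
    return sorted(pairs)

claims = [
    AtomicClaim("c1", "Unit cost fell 4% in Q3", "expenses.csv", 12, "unit_cost", "Q3", ("c2",)),
    AtomicClaim("c2", "Unit cost rose 2% in Q3", "summary.txt", None, "unit_cost", "Q3", ("c1",)),
    AtomicClaim("c3", "Average price rose 1% in Q3", "sales.csv", 7, "avg_price", "Q3"),
]
pairs = contradiction_pairs(claims)
print(pairs)
```

Because each record keeps its own source and row, a reviewer can trace a flagged conflict back to the two places that disagree. A blended chunk would hide that.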

The controlled experiment tests the mechanism, not end-user magic

The paper’s evaluation is best read in layers. The controlled claims-only harness is the main test of the mechanism. It fixes the dataset, query set, candidate claim pool, and posterior model. Only the claim-selection policy changes.

That design is narrow by intention. It strips away upstream retrieval quality and downstream generation quality so the paper can ask a clean question: does ECR reduce hypothesis entropy better than relevance-only selection or random selection?

The controlled dataset contains six business-style CSV tables — sales, customers, expenses, inventory, HR, and marketing — with 80 templated evaluation queries. For each query, the harness constructs three mutually exclusive hypotheses, giving an initial entropy of $\log_2 3 \approx 1.585$ bits. Each method receives the same top-20 candidate claims.
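That starting value follows if the three hypotheses begin with equal probability, which the reported figure implies:

$$ H(A) = -\sum_{i=1}^{3} \tfrac{1}{3} \log_2 \tfrac{1}{3} = \log_2 3 \approx 1.585 \text{ bits} $$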

The results are not subtle:

| Policy | Claims used | Final entropy | Entropy drop per claim | Claims-to-collapse |
| --- | --- | --- | --- | --- |
| Retrieval-only | 15.0 | 1.585 | 0.0000 | 16.0 |
| ECR | 5.0 | 0.2129 | 0.2744 | 5.0 |
| Random | 5.0 | 1.243 | 0.0684 | 6.0 |

In this harness, retrieval-only uses three times as many claims as ECR and does not reduce entropy under the posterior model. Random selection helps a little, which is what randomness does occasionally when it stops embarrassing itself. ECR collapses uncertainty below the $\epsilon = 0.3$ threshold using five claims.
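The per-claim figures can be checked with simple arithmetic. The sketch below assumes that "entropy drop per claim" means initial entropy minus final entropy, divided by claims used. The article does not state that definition, but it reproduces the reported values.

```python
# Initial entropy for three equally weighted hypotheses, as rounded in the text.
initial_entropy = 1.585

# (claims used, final entropy) as reported in the results table.
results = {
    "Retrieval-only": (15, 1.585),
    "ECR": (5, 0.2129),
    "Random": (5, 1.243),
}

drop_per_claim = {
    policy: round((initial_entropy - final) / n_claims, 4)
    for policy, (n_claims, final) in results.items()
}

for policy, drop in drop_per_claim.items():
    print(f"{policy}: {drop:.4f} bits per claim")
```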

The key interpretation is not “ECR is always cheaper by exactly this amount.” The harness is controlled and synthetic enough that the number should not be treated as a universal performance estimate. The stronger point is structural: when the evaluation metric is uncertainty resolution, relevance-only retrieval can be redundant even when it retrieves plenty of material.

The multi-seed robustness test supports the same mechanism. Retrieval-only and ECR are deterministic in the frozen setup, while the random baseline varies across seeds. ECR remains at final entropy 0.2129 with five claims; the random baseline averages a final entropy of about 1.2628. This is a robustness check of the controlled comparison, not a second thesis.

HotpotQA shows discipline without a large accuracy penalty

The end-to-end evaluation moves from the controlled harness to a HotpotQA-style multi-hop QA benchmark with 300 questions. Here, a live language model is reintegrated into the pipeline. All methods share the same retriever, language model, candidate evidence pool, and decoding parameters. The variable is the inference-time evidence selection policy.

| Method | Average claims used | Exact Match | Token F1 | Evidence faithfulness |
| --- | --- | --- | --- | --- |
| Baseline RAG | 19.87 | 0.313 | 0.459 | 0.639 |
| Random control | 19.87 | 0.207 | 0.307 | 0.427 |
| ECR | 19.68 | 0.297 | 0.450 | 0.626 |

The clean reading is not that ECR beats a strong relevance baseline on a standard QA benchmark. It does not. Baseline RAG is slightly ahead on exact match, F1, and faithfulness. ECR is close, and far ahead of random selection.

That distinction matters. HotpotQA mostly rewards finding factual evidence for singular ground-truth answers. It is not designed to stress contradictory corpora or fundamental ambiguity. In that environment, relevance is already a strong heuristic. ECR’s achievement is that adding epistemic control does not badly damage ordinary QA performance.
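For readers less familiar with the two answer metrics, the sketch below shows the usual SQuAD-style way of computing exact match and token-level F1 against a single gold answer. The normalization steps are the conventional ones and are an assumption here; the article does not say which normalization the paper used.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

em = exact_match("The Eiffel Tower", "eiffel tower")
f1 = token_f1("Paris, France", "Paris")
print(em, round(f1, 3))
```

Both metrics reward matching one reference string. Neither gives credit for reporting that the evidence is split, which is why they say little about behavior under contradiction.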

For enterprise readers, that is the right bar. If an uncertainty-aware controller collapses on routine factual QA, nobody should deploy it. The paper shows near-parity under a relevance-friendly benchmark, then uses contradiction-focused tests to examine where ECR’s control logic should matter more.

The noise test is a robustness check, not proof of contradiction mastery

The noisy-evidence experiment replaces 40% of retrieved candidate claims with unrelated claims before evidence selection. This is closer to real deployment, where retrieved material can include irrelevant or misleading evidence. But the setup also has a complication: replacing candidates may remove gold evidence. So lower exact match partly reflects evidence loss, not just distractor filtering.
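The complication is easy to see in a small simulation. The sketch below replaces 40% of a candidate pool in place with unrelated items. The claim ids, the gold set, and the function are hypothetical, and the paper's actual sampling procedure may differ.

```python
import random

def inject_noise(candidates, distractors, rate=0.4, seed=0):
    """Replace a fraction of candidates, in place, with unrelated distractors."""
    rng = random.Random(seed)
    noisy = list(candidates)
    n_replace = round(rate * len(noisy))
    positions = rng.sample(range(len(noisy)), n_replace)
    replacements = rng.sample(distractors, n_replace)
    for pos, distractor in zip(positions, replacements):
        noisy[pos] = distractor
    return noisy

candidates = [f"c{i}" for i in range(20)]
gold = {"c3", "c11"}                       # hypothetical decisive claims
distractors = [f"noise{i}" for i in range(50)]

noisy = inject_noise(candidates, distractors, rate=0.4, seed=0)
n_noise = sum(item.startswith("noise") for item in noisy)
surviving_gold = gold & set(noisy)
print(n_noise, sorted(surviving_gold))
```

Because replacement is blind to which claims are gold, some seeds will overwrite a decisive claim. In those cases no selection policy can recover the answer, which is the evidence-loss effect described above.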

| Method | EM, no noise | Faithfulness, no noise | EM, 40% noise | Faithfulness, 40% noise |
| --- | --- | --- | --- | --- |
| Baseline RAG | 0.323 | 0.660 | 0.167 | 0.345 |
| ECR | 0.307 | 0.657 | 0.163 | 0.331 |

The purpose of this ablation is modest: check whether ECR behaves predictably when candidate evidence is corrupted. It does. Both systems degrade. ECR stays close to baseline rather than amplifying the damage.

This should not be oversold. The test is not a clean demonstration that ECR filters arbitrary distractors better than a relevance baseline. The paper itself notes that distractor accumulation without evidence removal remains a separate regime for future work. Good. That sentence saves us from the usual benchmark inflation ceremony.

The contradiction test is where the paper earns its title

The most important evaluation is the offline contradiction-injection test. This test is not about ordinary answer accuracy. It is designed to isolate the controller’s behavior under structured contradiction.

The experiment injects paired contradiction twins into the candidate claim pool at rates $\alpha = 0.0$, $0.3$, and $0.5$. The setup is deterministic and offline: hypothesis initialization uses hashed embeddings, and verification uses a deterministic provenance proxy. This removes live LLM semantics from the loop and asks a narrower question: when the evidence contains explicit conflict, does the controller force a dominant answer or expose ambiguity?
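A toy version of twin injection is sketched below. It assumes that a contradiction twin is a synthetic negation added alongside the original claim, and that $\alpha$ is the fraction of claims that receive one. Both are assumptions made for illustration, as are the ids and claim texts.

```python
import random

def inject_contradiction_twins(claims, alpha, seed=0):
    """Add a negated twin for a fraction alpha of claims; return the pool and the pairs."""
    rng = random.Random(seed)
    n_twins = round(alpha * len(claims))
    targets = rng.sample(sorted(claims), n_twins)
    pool = dict(claims)
    pairs = []
    for claim_id in targets:
        twin_id = f"{claim_id}__neg"
        pool[twin_id] = f"NOT({claims[claim_id]})"
        pairs.append((claim_id, twin_id))
    return pool, pairs

claims = {f"c{i}": f"statement {i}" for i in range(10)}

pool_clean, pairs_clean = inject_contradiction_twins(claims, alpha=0.0)
pool_mid, pairs_mid = inject_contradiction_twins(claims, alpha=0.3)
pool_high, pairs_high = inject_contradiction_twins(claims, alpha=0.5)
print(len(pairs_clean), len(pairs_mid), len(pairs_high))
```

Keeping the pair list next to the pool matters: the controller can only prioritize a contradiction-completing claim if it knows which claims are twins of which.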

The result is the paper’s most business-relevant signal:

| Method | Contradiction rate $\alpha$ | Overconfident error | Ambiguity exposure | Mean entropy | Stop reason |
| --- | --- | --- | --- | --- | --- |
| Baseline RAG | 0.0 | 0.9933 | 0.0000 | | fixed budget |
| Baseline RAG | 0.3 | 0.9933 | 0.0000 | | fixed budget |
| Baseline RAG | 0.5 | 0.9933 | 0.0000 | | fixed budget |
| ECR | 0.0 | 0.9900 | 0.0100 | 0.226 | epistemic sufficiency |
| ECR | 0.3 | 0.0000 | 1.0000 | 1.496 | unresolved conflict |
| ECR | 0.5 | 0.0000 | 1.0000 | 1.458 | unresolved conflict |

One detail needs careful wording. The paper reports “OverconfErr” and “AmbExp” as the key signals under a deterministic offline answerer. The absolute values are less important than the phase transition: when contradiction is injected and the coherence term is active, ECR stops collapsing uncertainty and instead terminates with unresolved conflict. Baseline RAG remains fixed-budget and overconfident.

This is the practical point. In clean evidence, a system should converge. In contradictory evidence, a system should often refuse convergence. Not because it is timid. Because the evidence has not earned a single answer.

That behavior is especially relevant in domains where “decisive but wrong” is worse than “ambiguous but auditable.” Legal research, medical evidence review, credit analysis, risk reporting, and policy monitoring all contain cases where the correct output is not a synthetic answer. The correct output is a structured explanation of unresolved conflict.

RAG systems have been trained — culturally, not technically — to always answer. ECR pushes in the opposite direction: answer when entropy is low and coherence is acceptable; otherwise expose the competing hypotheses. Less heroic. More useful.

The appendix tests tuning fragility, not a new algorithm

The appendix adds a $\lambda$-sweep over the contradiction-aware coherence bonus. This is the term that nudges ECR to select contradiction-completing claims when such claims exist.

The sweep tests $\lambda \in \{0, 0.01, 0.025, 0.05, 0.1\}$ under contradiction rates $\alpha \in \{0.0, 0.3, 0.5\}$. The pattern is sharp: $\lambda = 0$ behaves like entropy-only control and can still converge under contradiction injection, while any tested non-zero value surfaces explicit contradictions and prevents epistemic collapse. The paper sets $\lambda = 0.05$ as the default.

This is a robustness test. Its purpose is to show that contradiction-aware behavior does not depend on delicately tuning the coherence bonus. The result is useful because fragile hyperparameters are how many elegant research ideas become production support tickets.

The appendix also reports a minimal offline sanity test: a single claim and its explicit synthetic negation. ECR evaluates both, flags unresolved conflict, and refuses to emit a dominant hypothesis. That is not glamorous, but sanity tests rarely are. Their job is to prevent clever systems from failing kindergarten logic.

What the paper directly shows, and what businesses should infer

The paper directly shows three things.

First, in a controlled claims-only harness, ECR reduces hypothesis entropy far more effectively than relevance-only or random claim selection under the paper’s posterior model. This is the cleanest evidence for the mechanism.

Second, in a HotpotQA-style benchmark, ECR performs close to a relevance-based baseline while outperforming random selection. This suggests that epistemic control can be added without destroying ordinary multi-hop QA performance, although it does not prove superior accuracy on standard QA.

Third, under offline structured contradiction, ECR with a non-zero coherence term shifts from forced convergence to ambiguity exposure. That is the most distinctive behavior and the most relevant to high-stakes deployment.

The business inference is not “replace every retriever with ECR.” It is this:

| Technical contribution | Operational consequence | ROI relevance | Boundary |
| --- | --- | --- | --- |
| Select claims by expected entropy reduction | Prioritize evidence that changes the answer distribution | Fewer redundant context tokens and clearer reasoning traces | Requires a good candidate claim pool |
| Stop at epistemic sufficiency | Avoid arbitrary top-$k$ or agent-loop stopping | Better latency control and auditability | Depends on threshold design and hypothesis quality |
| Surface unresolved contradiction | Avoid confident synthesis from incoherent evidence | Lower risk in legal, medical, financial, and compliance workflows | May produce non-decisive outputs that users must accept |
| Claim-level provenance | Make evidence selection and verification inspectable | Easier review, debugging, and governance | Requires ingestion discipline and metadata maintenance |

For enterprise teams, ECR points toward a more serious architecture for RAG: one where retrieval, uncertainty modeling, and verification are separate stages with explicit responsibilities. The model is not simply “given context.” It is guided through a sequence of evidence decisions.

That is a better mental model for business systems. A RAG assistant answering a compliance question should behave less like a search intern with a highlighter and more like an analyst deciding which exhibit would settle a dispute.

Where implementation gets difficult

The paper’s limitations are not decorative. They affect whether the method works outside a controlled setting.

The first boundary is hypothesis space coverage. ECR operates over an explicitly constructed set of answer hypotheses. If the true answer is absent, the system may confidently converge to the best available wrong answer. This is not a minor footnote. It is the central failure mode of any method that reasons over a bounded hypothesis set.
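The coverage problem can be shown with a few lines of arithmetic. In the sketch below the hypothesis set holds only two candidates and the evidence fits both of them poorly, as it would if the true answer were a third explanation nobody listed. The likelihood values are invented for illustration.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def bayes_update(prior, likelihood):
    joint = [p * l for p, l in zip(prior, likelihood)]
    z = sum(joint)
    return [j / z for j in joint]

# Only two hypotheses are represented; the true explanation is not among them.
belief = [0.5, 0.5]

# Each observed claim is poorly explained by both hypotheses (low likelihoods),
# but slightly less poorly by the first one.
weak_fit = [0.3, 0.1]

for _ in range(3):
    belief = bayes_update(belief, weak_fit)

final_entropy = entropy(belief)
print([round(b, 3) for b in belief], round(final_entropy, 3))
```

Normalization hides the poor fit. Because probability mass has nowhere else to go, three weakly explained claims are enough to push entropy under a 0.3-bit threshold, and the system would report sufficiency for an answer that is merely the least bad option.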

The second boundary is candidate recall. ECR assumes upstream retrieval has already found enough relevant and discriminative claims. If the decisive claim is missing, ECR can only rank the evidence it has. A better controller cannot compensate for an evidence pool that never included the key fact.

The third boundary is claim extraction quality. Atomic claims are powerful because they make evidence selectable. They are risky because bad extraction can distort what the evidence says. Tables, dates, qualifiers, and numerical conditions are especially dangerous. “Revenue increased” and “revenue increased excluding one-time effects” are not the same claim, though many systems would happily blur them together and call it semantic understanding.

The fourth boundary is user tolerance for ambiguity. ECR’s refusal to converge under contradiction is a feature. But some business users will still demand a single answer. Product design must make ambiguity useful: show the competing hypotheses, the decisive missing evidence, the contradiction pairs, and the recommended next verification step. Otherwise, “unresolved conflict” will feel like a system failure rather than an honest result.

The fifth boundary is evaluation coverage. The paper’s HotpotQA experiment is useful but not where ECR’s advantage should be largest. The contradiction tests are more aligned with the method’s purpose, but they are offline and controlled. The next research step should involve ambiguity-heavy, conflict-heavy, provenance-sensitive benchmarks where relevance-driven RAG is structurally disadvantaged.

The better enterprise question is not “how much context?”

The industry has spent a long time treating context size as a substitute for reasoning discipline. When answers fail, the reflex is familiar: retrieve more, expand the window, add another agent step, ask the model to reflect, then hope the bill looks intentional.

ECR suggests a different question:

What evidence would most reduce uncertainty over the competing answers?

That question changes the architecture. It makes atomic claims more important than chunks. It makes provenance more important than fluency. It makes stopping rules more important than vibes. It also makes contradictions first-class evidence rather than inconvenient texture to be smoothed over during synthesis.

The paper does not prove that ECR is a universal replacement for relevance-based RAG. It supports a narrower and more interesting claim: relevance is not the same thing as evidentiary value. In clean factual QA, relevance may be enough. In messy enterprise corpora, relevance can become a trap because it retrieves what sounds connected, not what resolves the decision.

For Cognaptus readers building business automation systems, that distinction is practical. A useful AI system should not merely answer from documents. It should know when a document claim changes the answer, when a claim repeats what is already known, and when the evidence is too contradictory to justify synthesis.

That is not bigger RAG. It is better epistemic accounting.

And, yes, it makes many current RAG demos look slightly unserious. They retrieve five agreeable passages, generate a confident paragraph, and call the result grounded. Grounded in what, exactly — semantic resemblance and optimism?

ECR gives us a sharper standard: retrieve broadly, reason over claims, select evidence by information gain, stop only when uncertainty is sufficiently low, and expose contradiction when coherence fails.

That is the right direction. Not because entropy is fashionable. Because businesses do not need more relevant noise. They need systems that know what evidence would actually change the answer.

Cognaptus: Automate the Present, Incubate the Future.


1. Davide Di Gioia, “Entropic Claim Resolution: Uncertainty-Driven Evidence Selection for RAG,” arXiv:2603.28444v1, 30 March 2026, https://arxiv.org/abs/2603.28444