The Retriever Found Similar Things. The Evidence Was Elsewhere.

TL;DR for operators

The current enterprise RAG conversation still has a charmingly stubborn misconception: if the model hallucinates, buy better embeddings, increase the context window, add an agent, and hope the PowerPoint becomes true.

The two papers here point in a less theatrical direction. One paper, Non-negative Elastic Net Decoding for Information Retrieval, argues that dense retrieval has a structural weakness: it scores each candidate independently, so it can retrieve several similar items instead of the complementary set actually needed to answer the query.¹ The other, Agentic Hybrid RAG for Evidence-Grounded Muon Collider Analysis, shows what happens when retrieval is treated as a full evidence workflow: sparse and dense retrieval are fused, queries are decomposed under constraints, evidence is deduplicated and budgeted, and answers are judged for coverage, hallucination, and abstention.²

The business lesson is simple enough to be annoying: RAG is not a search box with a chatbot glued to the end. It is an evidence supply chain.

That supply chain has five jobs:

Supply-chain job	What goes wrong when ignored	What operators should measure
Select complementary evidence	The retriever returns five near-duplicates and omits the one premise the answer depends on	Completeness, multi-source coverage, redundancy
Preserve exact terminology	Dense search misses acronyms, part numbers, legal clauses, formulas, or domain tags	Lexical hit rate, named-entity recall, source-level recall
Expand queries carefully	The agent wanders into related-but-wrong material	Drift rate, subquery quality, evidence relevance
Ground generation	The model treats retrieved text as decorative garnish	Citation fidelity, unsupported-claim rate
Evaluate the end product	Good retrieval metrics hide bad answers	Key-point coverage, abstention accuracy, answer correctness

The punchline: the useful future of RAG is not “more semantic similarity.” It is controlled evidence assembly.

Why this matters now

Enterprise AI is moving from “answer my question” demos into operational systems: compliance assistants, financial research copilots, technical support agents, engineering knowledge bases, procurement reviewers, medical literature tools, litigation support systems, policy trackers, and internal decision-support platforms.

These are not simple question-answering systems. They often need to assemble evidence from several places: a policy exception here, a contract clause there, a product manual footnote, a risk memo, a spreadsheet definition, and some regulatory guidance that nobody remembered existed until the audit.

The common architecture is retrieval-augmented generation, or RAG. In theory, RAG grounds the model in external documents. In practice, many systems are merely grounded in whatever the retriever happened to fetch before lunch.

That distinction matters. A language model can only reason over the evidence it receives. If retrieval returns redundant, incomplete, stale, or semantically plausible but operationally wrong material, the generator may still produce a beautifully formatted answer. This is the dangerous part. Bad evidence does not always look bad after a fluent model has ironed its shirt.

The two papers are useful together because they sit at different layers of the same problem. The first asks: what if the scoring rule inside retrieval is itself wrong for multi-evidence tasks? The second asks: what does a practical evidence-grounded RAG workflow need once retrieval is embedded in a scientific analysis system?

Together, they form a logic chain:

1. Similarity-only retrieval can over-select redundant material.
2. Joint set selection can recover complementary evidence more directly.
3. Domain RAG needs both semantic and lexical retrieval signals.
4. Agentic query expansion helps only when constrained.
5. The answer must be evaluated as an evidence product, not as a vibes product.

A revolutionary finding, apparently: evidence systems should be evaluated on evidence.

The first link: similarity is not the same as usefulness

Dense retrieval usually works like this: encode a query into a vector, encode each document or chunk into vectors, score each candidate by inner product or cosine similarity, and return the top-k.

In simplified form:

$$ \text{score}(q, d_i) = q^\top d_i $$

This is efficient, scalable, and often very good. It is also myopic. Each document is scored independently against the query. The retriever does not ask, “Do these retrieved items collectively cover the information need?” It asks, “Which individual items look closest to the query?”

That is often fine when the user needs one fact from one document. It is much less fine when the answer requires a set of complementary items.
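A minimal sketch of that independent scoring rule, using a made-up three-item corpus with two-dimensional vectors. The document names and numbers are illustrative assumptions, not data from either paper. The query needs both the policy and its exception, yet top-2 returns the policy and its near-duplicate:

```python
from typing import Dict, List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    """Inner product, matching score(q, d_i) = q^T d_i."""
    return sum(x * y for x, y in zip(a, b))


def dense_top_k(
    query: List[float], corpus: Dict[str, List[float]], k: int
) -> List[Tuple[str, float]]:
    """Score every document independently against the query, keep the k best."""
    scored = [(doc_id, dot(query, vec)) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy data: the query mixes two aspects (axis 0 and axis 1).
query = [0.8, 0.6]
corpus = {
    "policy_v1": [1.0, 0.0],          # covers aspect 0
    "policy_v1_dup": [0.99, -0.14],   # near-duplicate of policy_v1
    "exception_clause": [0.0, 1.0],   # the only item covering aspect 1
}

top2 = dense_top_k(query, corpus, k=2)
print([doc_id for doc_id, _ in top2])  # ['policy_v1', 'policy_v1_dup']
```

Nothing in the scoring function is wrong. It simply never looks at what has already been selected.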

The first paper’s core argument is that dense retrieval can be structurally biased toward redundancy. If several documents or tools are semantically similar, they may all score highly, even when only one is useful and another necessary but less superficially similar item is missed. This problem becomes especially visible in tool retrieval and multi-hop retrieval, where the goal is not to retrieve “similar things” but to retrieve “the complete set of things needed to solve the task.”

The authors propose Non-negative Elastic Net decoding, or NNN decoding. Instead of scoring each document independently, NNN decoding treats retrieval as a joint reconstruction problem. The query embedding is reconstructed as a sparse non-negative combination of corpus embeddings:

$$ \min_{w \ge 0} \frac{1}{2}\|q - Xw\|_2^2 + \lambda_1\|w\|_1 + \frac{\lambda_2}{2}\|w\|_2^2 $$

Here, $q$ is the query embedding, $X$ is the corpus embedding matrix, and $w$ assigns non-negative coefficients to candidate documents. The selected documents are those with nonzero or high coefficients.

That shift is small in wording and large in meaning. Dense retrieval asks which documents are closest to the query one by one. NNN decoding asks which set of documents collectively explains the query.

This creates a built-in anti-redundancy mechanism. If one document already explains a part of the query, another highly correlated document may add little residual value. The decoder can suppress it and allocate weight to a complementary item instead. It is not diversity as corporate decoration. It is diversity because the math needs enough different pieces to reconstruct the signal.
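To make the suppression effect concrete, here is a small sketch that minimizes the non-negative elastic net objective with plain cyclic coordinate descent. The solver choice, the toy corpus, and the values of `lam1` and `lam2` are assumptions for illustration; the paper's own optimizer and hyperparameters may differ.

```python
from typing import Dict, List


def nn_elastic_net(
    q: List[float],
    docs: Dict[str, List[float]],
    lam1: float = 0.05,
    lam2: float = 0.001,
    sweeps: int = 200,
) -> Dict[str, float]:
    """Minimize 0.5*||q - Xw||^2 + lam1*||w||_1 + 0.5*lam2*||w||^2 subject to w >= 0."""
    w = {doc_id: 0.0 for doc_id in docs}
    residual = list(q)  # equals q - Xw; w starts at zero
    for _ in range(sweeps):
        for doc_id, x in docs.items():
            norm_sq = sum(xi * xi for xi in x)
            # Correlation with the residual, adding back this document's own share.
            rho = sum(xi * ri for xi, ri in zip(x, residual)) + w[doc_id] * norm_sq
            # Closed-form coordinate update, clipped at zero for non-negativity.
            new_w = max(0.0, rho - lam1) / (norm_sq + lam2)
            delta = new_w - w[doc_id]
            if delta != 0.0:
                residual = [ri - delta * xi for ri, xi in zip(residual, x)]
                w[doc_id] = new_w
    return w


# Hypothetical toy data: one near-duplicate pair plus one complementary item.
query = [0.8, 0.6]
corpus = {
    "policy_v1": [1.0, 0.0],
    "policy_v1_dup": [0.99, -0.14],
    "exception_clause": [0.0, 1.0],
}

weights = nn_elastic_net(query, corpus)
print([doc_id for doc_id, value in weights.items() if value > 0.0])
# ['policy_v1', 'exception_clause']
```

On this toy data the near-duplicate ends with zero weight because `policy_v1` has already explained its share of the query, while `exception_clause` is kept because nothing else covers that direction. A larger `lam2` tends to spread weight across correlated items, so the anti-redundancy behavior depends on tuning.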

The authors support this with theory and experiments. They prove that, under their setup, any query correctly handled by dense retrieval can also be handled by NNN decoding with suitable hyperparameters, and that there are corpora where NNN decoding succeeds while dense retrieval fails. They then test the method on tool-retrieval and multi-hop retrieval benchmarks, reporting improvements in recall and especially completeness. Completeness is the right metric here because partial recovery is not enough when the task needs every required item.
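The gap between recall and completeness is easy to see in code. The sketch below assumes completeness means "every required item appears in the top-k"; the paper's exact definition may differ in detail, and the tool names are invented.

```python
from typing import List, Set


def recall_at_k(retrieved: List[str], required: Set[str], k: int) -> float:
    """Fraction of required items found in the top-k."""
    return len(set(retrieved[:k]) & required) / len(required)


def complete_at_k(retrieved: List[str], required: Set[str], k: int) -> bool:
    """True only if every required item is in the top-k."""
    return required <= set(retrieved[:k])


# Hypothetical task that needs three tools; the retriever found two of them.
retrieved = ["tool_a", "tool_b", "tool_x"]
required = {"tool_a", "tool_b", "tool_c"}

print(round(recall_at_k(retrieved, required, k=3), 2))  # 0.67
print(complete_at_k(retrieved, required, k=3))          # False
```

A recall of 0.67 looks respectable on a dashboard. The task still cannot be completed.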

For operators, this is the important translation:

A RAG system can retrieve relevant chunks and still fail because it did not retrieve the necessary set.

That single sentence explains a depressing amount of enterprise AI disappointment.

The second link: production RAG needs more than better vectors

The second paper moves from retrieval mechanism to workflow design. Its domain is muon-collider scientific literature, which is nicely inconvenient in all the ways enterprise knowledge systems are inconvenient. The knowledge is distributed across subfields. The terminology is technical. Acronyms matter. Evidence is fragmented across papers. Some questions require synthesis. Some should not be answered from the available corpus at all.

In other words, it is not a toy RAG setup. Excellent.

The authors build an agentic hybrid RAG framework for evidence-grounded scientific question answering in muon-collider research. The retrieval corpus contains 215 publications segmented into 5,813 indexed chunks. The evaluation uses a retrieval benchmark and a separate answer-generation benchmark, which is an important design choice because retrieval performance and answer quality are not the same thing. This point should be printed on a sticker and attached to every RAG dashboard.

Their retrieval backbone combines sparse and dense retrieval.

Sparse retrieval, using BM25, preserves exact technical terms: acronyms such as BIB, MDI, VBS, and aQGC; named concepts; and domain-specific expressions. Dense retrieval captures paraphrase and conceptual similarity, such as matching “beam-induced background” with “backgrounds from muon decays.” The system fuses sparse and dense rankings using weighted reciprocal-rank fusion:

$$ \text{RRF}(d) = \frac{w_{\text{dense}}}{k + r_{\text{dense}}(d)} + \frac{w_{\text{sparse}}}{k + r_{\text{sparse}}(d)} $$

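Here, $r_{\text{dense}}(d)$ and $r_{\text{sparse}}(d)$ are the ranks of chunk $d$ in the dense and sparse result lists, $k$ is a smoothing constant, and $w_{\text{dense}}$ and $w_{\text{sparse}}$ weight the two signals. The paper’s default configuration uses a dense weight of 0.9 and a sparse weight of 0.1 after optimization on its benchmark. That result is domain-specific, not a universal law. The important point is not “always use 90/10.” The important point is that dense retrieval was strongest, but a modest lexical component still helped preserve sensitivity to exact scientific terminology.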
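The fusion rule is short enough to sketch directly. The 0.9 and 0.1 weights come from the paper's reported default; the constant `k=60` is a common RRF convention assumed here, and treating a chunk that is absent from one list as contributing nothing from that list is also an assumption. Chunk IDs are invented.

```python
from typing import Dict, List


def weighted_rrf(
    dense_ranking: List[str],
    sparse_ranking: List[str],
    w_dense: float = 0.9,
    w_sparse: float = 0.1,
    k: int = 60,  # assumed smoothing constant, not taken from the paper
) -> List[str]:
    """Fuse two ranked lists with weighted reciprocal-rank fusion."""
    scores: Dict[str, float] = {}
    for weight, ranking in ((w_dense, dense_ranking), (w_sparse, sparse_ranking)):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


dense = ["c1", "c2", "c3"]    # semantic neighbors of the query
sparse = ["c3", "c4", "c1"]   # exact-term matches, e.g. an acronym hit

print(weighted_rrf(dense, sparse))  # ['c1', 'c3', 'c2', 'c4']
```

Even at a 10% weight, the lexical list lifts `c3` above `c2`. That is the "modest lexical component" doing its job.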

This is highly relevant to business systems. Enterprises are full of exact terms that embeddings may politely blur: invoice codes, policy IDs, SKU names, chemical identifiers, internal project names, regulatory article numbers, contract definitions, ticket labels, product versions, and the sacred spreadsheet tab called “FINAL_v7_USE_THIS_ONE.”

Dense retrieval is good at meaning. Business operations often also require exactness.

The paper then adds agentic query decomposition. The system tags the query by domain, classifies its intent, and generates a limited number of retrieval-oriented subqueries. These subqueries are not supposed to answer the question. They are supposed to retrieve supporting evidence. Precise fact queries get fewer expansions; reasoning and synthesis questions get broader decompositions along mechanisms, motivations, limitations, or domain boundaries.

That constraint matters. Uncontrolled agentic retrieval is a wonderful way to expand a narrow question into a guided tour of adjacent nonsense. The paper’s agentic layer is not a free-roaming intern with a browser. It is a controlled expansion module sitting on top of the same hybrid retrieval backbone.
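One way to picture the brakes is a small gate between the subquery generator and the retriever. This is a sketch under stated assumptions: the intent labels, the per-intent caps, and the function name are hypothetical, and the paper's actual limits are not reproduced here.

```python
from typing import Dict, List

# Hypothetical expansion budgets per query intent.
MAX_SUBQUERIES: Dict[str, int] = {"fact": 1, "reasoning": 3, "synthesis": 4}


def constrain_subqueries(intent: str, original: str, candidates: List[str]) -> List[str]:
    """Drop duplicates and restatements of the original, then enforce the cap."""
    cap = MAX_SUBQUERIES.get(intent, 1)  # unknown intents get the narrowest budget
    seen = {original.strip().lower()}
    kept: List[str] = []
    for candidate in candidates:
        key = candidate.strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(candidate.strip())
        if len(kept) == cap:
            break
    return kept


original = "Can we refund an order after 60 days?"
candidates = [
    "refund window policy",
    "Refund window policy",                    # duplicate, different casing
    "can we refund an order after 60 days?",   # restates the original
    "exceptions to refund window",
    "approval threshold for late refunds",
    "history of the returns department",       # plausible drift
]

print(constrain_subqueries("fact", original, candidates))
# ['refund window policy']
print(constrain_subqueries("reasoning", original, candidates))
# ['refund window policy', 'exceptions to refund window', 'approval threshold for late refunds']
```

A cap does not judge relevance, so it only limits how far drift can spread. Checking that each subquery stays on topic would need a separate step.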

The retrieved chunks are then merged, deduplicated, and constrained to a fixed evidence budget before answer generation. The generator is instructed to ground answers in the provided evidence, cite support, and abstain when the evidence is insufficient.
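The merge, deduplicate, and budget step could look like the sketch below. Keeping each chunk's best score across subqueries is an assumed merge rule, since the source does not spell out the paper's exact rule; the chunk IDs and scores are invented.

```python
from typing import Dict, List, Tuple


def assemble_evidence(
    result_lists: List[List[Tuple[str, float]]], budget: int
) -> List[str]:
    """Merge results from all subqueries, dedupe by chunk ID, keep the top `budget`."""
    best: Dict[str, float] = {}
    for results in result_lists:
        for chunk_id, score in results:
            if chunk_id not in best or score > best[chunk_id]:
                best[chunk_id] = score  # assumed rule: keep the highest score seen
    ranked = sorted(best.items(), key=lambda pair: pair[1], reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:budget]]


from_original = [("c1", 0.9), ("c2", 0.7)]
from_subquery_1 = [("c2", 0.8), ("c3", 0.6)]
from_subquery_2 = [("c4", 0.5), ("c1", 0.4)]

evidence = assemble_evidence([from_original, from_subquery_1, from_subquery_2], budget=3)
print(evidence)  # ['c1', 'c2', 'c3']
```

The budget is what keeps expansion honest: six retrieved rows become four distinct chunks, and only three reach the generator.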

This is the operational version of the first paper’s mechanism-level concern. The problem is not merely “find similar text.” The problem is “assemble a usable evidence set under constraints.”

The chain: from retrieval scoring to evidence operations

The papers are not saying the same thing. They are more useful than that.

One paper attacks the scoring rule. The other designs a controlled RAG pipeline. Together, they suggest a layered architecture for evidence-grounded AI.

Logic-chain step	Mechanism-level paper	Workflow-level paper	Business interpretation
1. Retrieval can be incomplete even when results look relevant	Dense top-k scoring can select redundant, correlated items	Scientific questions often require cross-document evidence aggregation	“Looks relevant” is not the same as “contains the missing premise”
2. Retrieval should select sets, not isolated hits	NNN decoding jointly reconstructs the query from corpus embeddings	Evidence is aggregated from original and decomposed subqueries	Measure evidence-set completeness, not just top-1 similarity
3. Domain retrieval needs multiple signals	The first paper focuses on dense embeddings and joint decoding	Hybrid RRF combines semantic and lexical retrieval	Exact terms and conceptual matches both matter
4. Expansion must be controlled	The first paper reduces redundancy through joint selection	The second caps and classifies subqueries	Agentic retrieval should have brakes, not just a steering wheel
5. Generation must be evaluated downstream	The first paper evaluates retrieval recall and completeness	The second evaluates answer correctness, key-point coverage, hallucination, and abstention	Retrieval metrics are not business outcome metrics

This is the structure operators should care about. The first paper says: the retriever may be mathematically optimized for the wrong behavior. The second says: even when retrieval is stronger, the answer pipeline still needs grounding, evaluation, and abstention.

That last point is especially important. In the muon-collider RAG paper, hybrid retrieval performs best on chunk-level retrieval metrics. But the corresponding Hybrid RAG baseline does not outperform Vanilla RAG on answer generation. The agentic hybrid approach performs best overall on answer quality, with stronger key-point coverage and a lower hallucination rate than Vanilla RAG in the reported benchmark.

Translation: better retrieval scores do not automatically become better answers. There is a whole messy middle where evidence has to be organized, deduplicated, budgeted, cited, and used. Apparently pipelines matter. DevOps people may now enjoy a quiet moment of vindication.

The business misconception: “We need a better embedding model”

Sometimes yes. Often no. Frequently, “better embeddings” are the most convenient answer because they allow everyone to avoid redesigning the workflow.

The papers suggest a different diagnostic.

When a RAG system fails, ask where the evidence supply chain broke:

Failure symptom	Likely retrieval-chain issue	Useful fix
Answer cites several similar chunks but misses a key exception	Redundant top-k retrieval	Complementarity-aware retrieval, MMR, joint decoding, evidence coverage checks
Answer misses exact IDs, acronyms, clauses, or names	Dense retrieval over-generalized	Add sparse retrieval, metadata filters, exact-match boosts
Answer contains correct facts but incomplete reasoning	Query not decomposed into evidence needs	Controlled subquery generation by intent
Answer uses weak evidence confidently	Generator not constrained by evidence quality	Grounded generation instructions, citation checks, unsupported-claim evaluation
System answers unanswerable questions	No abstention layer	Unanswerable benchmark cases and abstention scoring
Retrieval dashboard looks good but users distrust answers	Evaluation stops before generation	Answer-level rubrics and task-specific correctness checks

This reframes RAG from a model-selection problem into an operating model. The question is not just “Which embedding model should we use?” It is:

What evidence units are required to answer this class of questions?
How do we avoid retrieving five versions of the same premise?
Which exact terms must never be blurred into semantic mush?
When should the system decompose a query?
How do we prevent decomposition from drifting?
What evidence budget is available to the generator?
What claims require citations?
What should the system refuse to answer?

That is less glamorous than “agentic AI.” It is also more likely to work.

What the papers show — and what they do not

It is worth separating the evidence from the interpretation.

The NNN decoding paper shows that independent dense scoring has a structural limitation for retrieving complementary sets. It proposes a joint decoder based on non-negative elastic net regression, proves a theoretical separation under its formulation, and reports experimental gains on tool-retrieval and multi-hop retrieval benchmarks. Its strongest business relevance is not that every company should implement NNN decoding tomorrow morning. The relevant lesson is that retrieval scoring rules encode assumptions about what counts as success. Dense top-k assumes independent relevance. Many enterprise tasks require set completeness.

The agentic hybrid RAG paper shows that, in a technical scientific domain, hybrid retrieval can outperform sparse-only and dense-only retrieval on their retrieval benchmark, and that controlled agentic evidence expansion can improve answer-level performance. It also shows that retrieval metrics alone are insufficient because the best retrieval backbone does not automatically produce the best generated answers. Its strongest business relevance is the workflow pattern: hybrid retrieval, constrained decomposition, evidence aggregation, grounded generation, and answer-level evaluation.

Neither paper proves that there is one universal RAG architecture. They do not eliminate the need for domain-specific tuning. They do not solve all hallucination. They do not make agents magically trustworthy. They do not give managers permission to write “fully autonomous research analyst” on a roadmap and then flee the room.

They do something more useful: they identify where the naive architecture breaks.

A practical framework: the evidence assembly maturity model

For operators, the combined lesson can be turned into a maturity model.

Level 1: Similarity retrieval

The system retrieves top-k chunks by vector similarity and passes them to a model.

This is the default demo architecture. It can work for simple lookups. It is fragile for multi-step questions, exact terminology, and regulated use cases.

Typical metric: top-k similarity or rough retrieval hit rate.

Typical failure: “The answer sounded right but missed the relevant exception.”

Level 2: Hybrid retrieval

The system combines dense semantic retrieval with sparse lexical retrieval, metadata filters, or symbolic constraints.

This is where the system starts respecting the fact that business language contains exact artifacts. A contract clause is not just a vibe. A chemical code is not a synonym party.

Typical metrics: precision@k, recall@k, MRR, source-level recall, entity recall.

Typical failure: “We found the right document but not the right evidence fragment.”

Level 3: Complementarity-aware retrieval

The system optimizes not just for individually relevant chunks but for evidence-set coverage. This can involve diversity-aware reranking, MMR-style approaches, joint decoding, coverage constraints, or task-specific completeness metrics.

This is the level most RAG systems quietly need and rarely admit.

Typical metrics: completeness@k, redundancy rate, required-evidence coverage, multi-hop success.

Typical failure: “The system retrieved relevant evidence, but not all the evidence needed to answer.”
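For teams not ready for joint decoding, MMR-style reranking is the cheapest step up from Level 2. The sketch below uses maximal marginal relevance, which is a different technique from NNN decoding. The toy vectors and the `lam` trade-off of 0.5 are illustrative assumptions.

```python
from typing import Dict, List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def mmr_select(
    query: List[float], corpus: Dict[str, List[float]], k: int, lam: float = 0.5
) -> List[str]:
    """Greedy MMR: trade relevance to the query against similarity to items already chosen."""
    selected: List[str] = []
    remaining = list(corpus)
    while remaining and len(selected) < k:

        def mmr_score(doc_id: str) -> float:
            relevance = dot(query, corpus[doc_id])
            redundancy = max(
                (dot(corpus[doc_id], corpus[s]) for s in selected), default=0.0
            )
            return lam * relevance - (1.0 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: one near-duplicate pair plus one complementary item.
query = [0.8, 0.6]
corpus = {
    "policy_v1": [1.0, 0.0],
    "policy_v1_dup": [0.99, -0.14],
    "exception_clause": [0.0, 1.0],
}

print(mmr_select(query, corpus, k=2))  # ['policy_v1', 'exception_clause']
```

Plain top-2 by inner product would return both policy items here. The redundancy penalty is what lets the exception clause through.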

Level 4: Controlled agentic expansion

The system decomposes complex questions into targeted subqueries, but does so under explicit rules. It limits the expansion budget, classifies query type, prevents unsupported inventions, and routes subqueries through the same governed retrieval layer.

This is where agents become useful rather than decorative.

Typical metrics: subquery relevance, drift rate, incremental evidence gain, duplicate retrieval rate.

Typical failure: “The agent expanded the query into a neighboring topic and came back very confident.”

Level 5: Grounded answer operations

The system evaluates final answers against reference criteria, required key points, unsupported claims, citation fidelity, and abstention behavior. Retrieval is no longer treated as the whole product.

This is where RAG becomes operationally measurable.

Typical metrics: answer correctness, key-point coverage, hallucination rate, citation support, abstention accuracy, escalation rate.

Typical failure: “The retriever benchmark was green, but the business process still failed.”

The uncomfortable truth is that many organizations are still around Level 1 while presenting themselves as Level 4 because an agent icon appears somewhere in the UI. A small animated sparkle does not count as evidence governance.

What this means for enterprise AI design

The combined conclusion is straightforward: build RAG as an evidence supply chain.

That means designing the system around evidence movement, not model magic.

A serious enterprise RAG design should include at least five control surfaces.

1. Evidence-unit design

Before choosing retrieval algorithms, define what the system retrieves.

Chunks are not neutral. A chunk can be too small to contain a complete premise or too large to be useful as evidence. Scientific papers, contracts, manuals, meeting notes, code repositories, and case files need different segmentation strategies.

The muon-collider paper works at chunk level while preserving metadata for source traceability. That matters because the system must retrieve evidence fragments without losing document provenance.

Enterprise equivalent: preserve source, section, version, author, effective date, jurisdiction, confidentiality label, and document lineage. Otherwise your chatbot becomes an unusually fluent amnesia machine.
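As a sketch of what "preserving provenance" means at the data-structure level, each chunk can carry its lineage with it. The field names and the example values below are hypothetical; real schemas will vary by document type.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EvidenceChunk:
    """A retrievable evidence unit that keeps its provenance attached."""
    chunk_id: str
    text: str
    source: str            # document title or identifier
    section: str
    version: str
    effective_date: date
    jurisdiction: str
    confidentiality: str   # e.g. "public", "internal", "restricted"


def citation(chunk: EvidenceChunk) -> str:
    """Render a traceable citation from the chunk's metadata."""
    return (
        f"{chunk.source}, section {chunk.section}, "
        f"v{chunk.version}, effective {chunk.effective_date.isoformat()}"
    )


chunk = EvidenceChunk(
    chunk_id="refund-policy-0007",
    text="Refunds above the approval threshold require manager sign-off.",
    source="Refund Policy",
    section="4.2",
    version="3",
    effective_date=date(2024, 1, 15),
    jurisdiction="EU",
    confidentiality="internal",
)

print(citation(chunk))
# Refund Policy, section 4.2, v3, effective 2024-01-15
```

Freezing the dataclass is a deliberate choice: evidence metadata should not be mutated on its way to the generator.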

2. Retrieval objective design

Do not optimize only for top-1 relevance if the task requires complete evidence.

The NNN paper’s use of completeness is a useful reminder. In tool retrieval and multi-hop retrieval, partial recovery is often failure. If the system needs three tools, retrieving two is not “mostly correct.” It is broken with better manners.

Enterprise equivalent: define required evidence sets for representative tasks. For a compliance answer, that may include the policy, the exception, the approval threshold, and the latest regulatory update. For technical support, it may include product version, known issue, workaround, and escalation condition.

3. Signal fusion

Dense retrieval is good at semantic matching. Sparse retrieval is good at exact terms. Metadata filters are good at hard constraints. Rules are good when the business process actually has rules, which occasionally happens despite everyone’s best efforts.

The muon-collider paper’s hybrid setup is not just a retrieval trick. It is an epistemic hedge. It says: no single matching signal deserves a monopoly.

Enterprise equivalent: combine semantic retrieval with exact-match boosts, metadata filters, access controls, recency constraints, and domain ontologies where appropriate.
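A rough sketch of two of those signals working together: an access-control filter applied as a hard constraint, then an exact-match boost for identifiers found in the query. The ID pattern, the boost size, and the candidate schema are all hypothetical.

```python
import re
from typing import Dict, List, Set

# Hypothetical identifier shape such as "POL-117"; real ID formats will differ.
ID_PATTERN = re.compile(r"\b[A-Z]{2,}-\d+\b")


def filter_and_boost(
    query: str,
    candidates: List[Dict],
    allowed_labels: Set[str],
    boost: float = 0.2,  # assumed boost size, to be tuned per corpus
) -> List[str]:
    """Drop chunks the user may not see, then boost chunks containing exact query IDs."""
    query_ids = set(ID_PATTERN.findall(query))
    scored = []
    for cand in candidates:
        if cand["label"] not in allowed_labels:
            continue  # hard constraint: filtered out before ranking
        exact_hit = any(identifier in cand["text"] for identifier in query_ids)
        scored.append((cand["id"], cand["score"] + (boost if exact_hit else 0.0)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [chunk_id for chunk_id, _ in scored]


candidates = [
    {"id": "c1", "score": 0.80, "label": "public", "text": "General refund guidance."},
    {"id": "c2", "score": 0.70, "label": "internal", "text": "POL-117: refunds over 500 need approval."},
    {"id": "c3", "score": 0.95, "label": "restricted", "text": "Board memo on POL-117."},
]

print(filter_and_boost("What does POL-117 say about refunds?", candidates, {"public", "internal"}))
# ['c2', 'c1']
```

The highest-scoring chunk never appears because the user is not cleared for it, and the exact ID match outranks the semantically closer but generic passage.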

4. Controlled expansion

Agentic query decomposition is useful when the original query bundles several evidence needs. But expansion should be typed, budgeted, and auditable.

The paper’s decomposition process classifies query intent and limits subqueries. Precise fact questions get narrow expansions. Reasoning questions get mechanism and limitation angles. Broad synthesis questions are split by domain or process boundary.

Enterprise equivalent: a procurement policy question, a legal risk question, and a customer-support troubleshooting question should not use the same expansion strategy. One wants exactness. One wants authority. One wants a diagnostic tree. Treating all three as “ask the vector database harder” is adorable, in the way a paper umbrella is adorable during a typhoon.

5. Answer-level evaluation

The generator is where evidence becomes a business-facing claim. That is where evaluation must land.

The muon-collider paper evaluates Good rate, Satisfactory-or-Better rate, key-point coverage, hallucination, and abstention. The exact metrics will differ by business domain, but the principle travels well.
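A sketch of how such answer-level metrics might be aggregated once each answer has been judged. The record schema and the metric definitions below are assumptions for illustration, not the paper's exact rubric, and the judging step itself (human or model) is out of scope here.

```python
from typing import Dict, List


def answer_metrics(records: List[Dict]) -> Dict[str, float]:
    """Aggregate judged answers into coverage, hallucination, and abstention metrics."""
    answerable = [r for r in records if r["answerable"]]
    total_points = sum(r["key_points_total"] for r in answerable)
    covered_points = sum(r["key_points_covered"] for r in answerable)

    attempted = [r for r in records if not r["abstained"]]
    hallucinated = sum(1 for r in attempted if r["hallucinated"])

    # Correct behavior: abstain exactly when the question is unanswerable.
    correct_abstention = sum(
        1 for r in records if r["abstained"] == (not r["answerable"])
    )
    return {
        "key_point_coverage": covered_points / total_points if total_points else 0.0,
        "hallucination_rate": hallucinated / len(attempted) if attempted else 0.0,
        "abstention_accuracy": correct_abstention / len(records) if records else 0.0,
    }


# Hypothetical judged records: two answerable questions, two unanswerable ones.
records = [
    {"answerable": True, "abstained": False, "key_points_total": 4, "key_points_covered": 3, "hallucinated": False},
    {"answerable": True, "abstained": False, "key_points_total": 2, "key_points_covered": 1, "hallucinated": True},
    {"answerable": False, "abstained": True, "key_points_total": 0, "key_points_covered": 0, "hallucinated": False},
    {"answerable": False, "abstained": False, "key_points_total": 0, "key_points_covered": 0, "hallucinated": True},
]

metrics = answer_metrics(records)
print({name: round(value, 2) for name, value in metrics.items()})
# {'key_point_coverage': 0.67, 'hallucination_rate': 0.67, 'abstention_accuracy': 0.75}
```

None of these numbers can be computed from retrieval logs alone, which is the point.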

Enterprise equivalent:

Domain	Evaluation should include
Compliance	correct policy application, citation to current source, exception handling, abstention on unsupported cases
Customer support	resolution accuracy, product-version match, escalation appropriateness, unsafe-instruction avoidance
Finance	source freshness, numerical consistency, assumption disclosure, unsupported-claim rate
Legal operations	jurisdiction match, clause support, privilege/access constraints, refusal on insufficient evidence
Engineering knowledge	version match, dependency awareness, reproducible steps, link to source issue or commit

Retrieval metrics are necessary. They are not sufficient. If the final answer is what the business uses, evaluate the final answer.

A radical thought. Someone alert the dashboard.

The strategic implication: evidence coverage is the moat

A lot of AI vendors are converging on the same surface features: chat over documents, agentic workflows, citations, connectors, and dashboards. Differentiation will not come from saying “we use RAG.” That phrase is already well on its way to becoming wallpaper.

The real differentiation will come from evidence coverage and evidence control:

Can the system retrieve complementary evidence rather than redundant chunks?
Can it handle exact terms and semantic paraphrases?
Can it decompose complex queries without drifting?
Can it preserve source traceability?
Can it abstain when evidence is insufficient?
Can it explain which evidence was used and which was missing?
Can it be evaluated against business-specific answer rubrics?

This is where retrieval becomes operational infrastructure. Not a feature. Infrastructure.

For business leaders, the procurement question should shift from “Does your system have RAG?” to “Show me how your system assembles, controls, and evaluates evidence.”

For technical leaders, the architecture question should shift from “Which embedding model?” to “Which retrieval objective matches the task?”

For risk leaders, the governance question should shift from “Does the model cite sources?” to “Are the cited sources sufficient, current, authorized, and actually supportive of the claim?”

Because a citation is not grounding. A citation is an invitation to check whether grounding happened. Sometimes it did. Sometimes the model just stapled a source to a sentence and hoped nobody was in a reading mood.

A better operating question

The combined value of these papers is not that they give one final answer to retrieval. They do not. One proposes a mechanism-level alternative to independent dense scoring. The other proposes and evaluates a system-level scientific RAG workflow. They are complementary because they expose the same deeper principle from different angles:

Retrieval should be designed around the evidence required to complete the task, not around the items most similar to the query.

That principle should change how companies build and evaluate AI systems.

A weak RAG system says:

“Here are the chunks that looked most similar.”

A stronger RAG system says:

“Here is the evidence set required to support this answer, here is how it was assembled, here is what each source contributes, and here is where the evidence is insufficient.”

The second system is harder to build. Naturally. Useful things often are.

But that is the direction enterprise AI has to move if it wants to graduate from demos to decisions. Similarity is a retrieval signal. It is not an operating model. The operating model is evidence assembly: select, diversify, verify, cite, abstain, and evaluate.

The retriever finding similar things was never the whole job.

The evidence was elsewhere.

Cognaptus: Automate the Present, Incubate the Future.

¹ Koki Okajima, Yasutoshi Ida, Tsukasa Yoshida, and Yasuaki Nakamura, “Non-negative Elastic Net Decoding for Information Retrieval,” arXiv:2606.17910, 2026. https://arxiv.org/abs/2606.17910
² Ruobing Jiang, Dawei Fu, Cheng Jiang, Tianyi Yang, Zijian Wang, Youpeng Wu, Yong Ban, Yajun Mao, and Qiang Li, “Agentic Hybrid RAG for Evidence-Grounded Muon Collider Analysis,” arXiv:2606.10381, 2026. https://arxiv.org/abs/2606.10381
