Don’t Average the Needle: Spectral Retrieval and the RAG Evidence Problem

Enterprise search has a very old habit wearing a very modern jacket: it averages.

A policy document becomes one vector. A runbook becomes one vector. A postmortem full of operational detail becomes one vector. Then a RAG system asks that one vector whether the document is relevant. This is convenient, fast, and usually defensible — until the relevant answer is a narrow paragraph hiding inside a large document. At that point, the retrieval system is no longer searching for evidence. It is asking a crowd to speak for the witness.

That is the practical problem behind Andrea Morandi’s paper, Spectral Retrieval: Multi-Scale Sinc Convolution over Token Embeddings for Localized Retrieval in LLM Multi-Agent Systems.¹ The paper’s contribution is not another grand claim that retrieval needs to be “more semantic,” which is the sort of phrase that makes everyone nod and nobody deploy faster. Its more useful claim is narrower: a dense retriever may fail not because the encoder is weak, but because the aggregation rule is too blunt.

The paper proposes Spectral Retrieval, a plug-in second-stage re-ranker. It keeps a fast first-stage dense retriever, then re-ranks candidate documents using per-token document embeddings passed through multi-scale sinc convolution along the token axis. In simpler terms: instead of asking whether the whole document average resembles the query, it asks whether the query matches a document at the right textual scale — token, phrase, paragraph-like span, or document-wide mean.

This matters because many RAG failures are not failures of language understanding. They are failures of evidence placement.

Mean pooling is efficient because it forgets where the evidence was

The usual single-vector dense retrieval recipe is clean. Encode the query into a vector. Encode each document into a vector. Rank by cosine similarity. For document embeddings, that often means pooling token embeddings into one representation, commonly by averaging. The system becomes scalable because each document has exactly one searchable point in vector space.

The cost is spatial amnesia.

A document may contain one highly relevant passage surrounded by routine background. Mean pooling blends that relevant passage with everything around it. If the relevant passage is short, its signal contributes only a small part of the final document vector. A document whose entire vocabulary feels generally related may outrank a document containing the exact operational fact needed by the agent.

This is not an exotic edge case. It is the shape of many business corpora:

Corpus type	Where the important evidence often lives	Why mean pooling can under-rank it
Security runbooks	One procedure inside a broad incident-response document	General security vocabulary dominates the document average
Legal or compliance policies	A narrow clause, exception, or reporting threshold	The surrounding policy language looks similar across many documents
Technical manuals	A specific parameter, warning, or troubleshooting step	Most pages share product and system terminology
Postmortems	A few root-cause paragraphs inside a long narrative	Operational boilerplate dilutes the decisive event detail
Customer-support knowledge bases	One fix among related symptoms	Similar cases share surface vocabulary but differ in the crucial remedy

The reader misconception worth removing is simple: dense retrieval failure is not always an encoder-quality problem. It may be an aggregation problem. The same token embeddings can contain useful local evidence, while the document-level pooling rule quietly buries it. Spectral Retrieval is aimed precisely at that gap.

Spectral Retrieval changes the scale of comparison, not the encoder

The mechanism is easier to understand if we begin with two endpoints.

At one endpoint, there is ordinary mean pooling. Every token contributes to the document representation. This is stable and cheap, but it blurs local relevance.

At the other endpoint, there is per-token maximum similarity, similar in spirit to late-interaction methods such as ColBERT. Instead of asking whether the document average matches the query, the model asks whether any token position strongly matches. This exposes localized evidence, but it requires storing and scoring per-token embeddings.

Spectral Retrieval sits between these endpoints. For each candidate document, it takes the per-token embedding sequence and smooths it along the token axis using sinc kernels at multiple scales. Small scales preserve local evidence. Larger scales blend over wider spans. The widest endpoint approximates mean pooling. The narrowest endpoint is the identity operation, which recovers raw per-token comparison.

The scoring rule is then direct:

1. Use a fast first-stage retriever to collect candidate documents.
2. For each candidate, apply sinc convolution over the document’s token embeddings at a grid of scales.
3. Re-normalize the smoothed token representations.
4. Compute the maximum cosine similarity to the query over positions and scales.
5. Use that spectral score to re-rank the candidate pool.
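To make the scoring steps concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the Hamming-windowed kernel, the scale grid `(2.0, 4.0, 8.0)`, the zero padding at document edges, and every function name are choices made for this example.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def sinc_kernel(scale):
    """Hamming-windowed sinc low-pass kernel (an assumed kernel form).

    Larger `scale` smooths over more tokens; scale=1.0 collapses to a
    delta, i.e. the identity operation up to floating-point error.
    """
    half_width = int(np.ceil(3 * scale))
    offsets = np.arange(-half_width, half_width + 1)
    kernel = np.sinc(offsets / scale) * np.hamming(offsets.size)
    return kernel / kernel.sum()


def smooth_tokens(tokens, kernel):
    """Convolve every embedding dimension along the token axis.

    `tokens` has shape (num_tokens, dim). Edges are implicitly zero-padded
    and the output keeps the original number of token positions.
    """
    half_width = (kernel.size - 1) // 2
    full = np.apply_along_axis(
        lambda column: np.convolve(column, kernel, mode="full"), 0, tokens
    )
    return full[half_width:half_width + tokens.shape[0]]


def spectral_score(query, tokens, scales=(2.0, 4.0, 8.0)):
    """Max cosine similarity to the query over positions and scales."""
    q = l2_normalize(query)
    # Explicit endpoints: raw tokens (identity) and the document mean.
    views = [tokens, tokens.mean(axis=0, keepdims=True)]
    views += [smooth_tokens(tokens, sinc_kernel(s)) for s in scales]
    # Re-normalize each smoothed view, then take the max over positions.
    return max(float((l2_normalize(view) @ q).max()) for view in views)


def rerank(query, candidates):
    """Order first-stage candidates (doc_id -> token matrix) by spectral score."""
    scores = {doc_id: spectral_score(query, toks) for doc_id, toks in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Calling `rerank(query_vector, {"doc_a": tokens_a, "doc_b": tokens_b})` returns the candidate ids ordered best first. Because the raw tokens and the document mean are both included as views, the score in this sketch can never fall below either the per-token maximum or the mean-pool cosine for the same document.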

The important phrase is candidate pool. The paper does not propose running this expensive multi-scale scoring over the entire corpus. It is a second-stage re-ranker. The first-stage approximate nearest-neighbor system still does the broad search. Spectral Retrieval only tries to improve the ordering of documents already retrieved.

That keeps the proposal operationally plausible. It also defines its biggest boundary: if the first-stage retriever fails to include the relevant document in the candidate pool, Spectral Retrieval cannot resurrect it. A re-ranker is not a retrieval miracle worker. It is a judge, not a search party.

The endpoint guarantee is useful, but it is not a ranking guarantee

The paper’s cleanest theoretical point is its endpoint recovery guarantee. When the scale grid includes the identity endpoint and an explicit mean-pool endpoint, the spectral score for a document is no lower than either the per-token MaxSim-style score or the mean-pool score.
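Written out, the guarantee is a one-line consequence of taking a maximum over a set that contains both endpoints. The notation below is chosen for this article and may not match the paper's symbols: $h_i$ is the embedding of token $i$, $\bar{h}$ is the mean-pooled document vector, $\tilde{h}_i^{(\sigma)}$ is the re-normalized smoothed embedding at position $i$ and scale $\sigma$, and $S$ is the scale grid.

```latex
s_{\mathrm{spec}}(q, D)
  = \max_{\sigma \in S} \; \max_{1 \le i \le N} \cos\!\big(q, \tilde{h}_i^{(\sigma)}\big)
  \;\ge\; \max\!\Big( \max_{1 \le i \le N} \cos(q, h_i), \; \cos(q, \bar{h}) \Big)
```

The inequality holds whenever $S$ contains the identity scale, where $\tilde{h}_i^{(\sigma)}$ is just the normalized $h_i$, and the mean-pool scale, where every position carries the normalized $\bar{h}$. A maximum over the full grid cannot be smaller than a maximum over two of its members.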

That guarantee is useful because it says the scoring family contains both extremes. The system can notice a sharp local match and can also preserve document-wide similarity. The multi-scale sweep is not forced to choose one resolution globally.

But this needs careful interpretation. The guarantee is pointwise over scores. It does not mean the final ranking will always dominate mean pooling or per-token scoring. Ranking is comparative across documents. A non-relevant document with one accidental high-similarity token can also receive a strong spectral score. The same sensitivity that finds the needle can also admire a shiny piece of metal.

That is not a defect unique to this paper. It is a known risk of max-style late interaction. The business translation is straightforward: if strict max produces too many suspicious outliers, replace it with a top-$k$ mean, percentile cap, or downstream cross-encoder check. In other words, the mechanism is promising, but the production version should have a moderation layer over its enthusiasm. Retrieval systems, like junior analysts, sometimes need to be told that one exciting sentence is not the whole case.
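A small sketch shows why softening the max changes the outcome. It assumes a hypothetical array `sims` holding one document's cosine similarities over positions and scales; the function names and the similarity values are invented for illustration.

```python
import numpy as np


def max_score(sims):
    """Strict max: a single high similarity decides the score."""
    return float(np.max(sims))


def topk_mean_score(sims, k=3):
    """Mean of the k largest similarities, so one outlier cannot dominate."""
    flat = np.ravel(sims)
    k = min(k, flat.size)
    # np.partition places the k largest values in the last k slots.
    top = np.partition(flat, flat.size - k)[-k:]
    return float(top.mean())


# Made-up similarity profiles for two documents.
outlier_doc = np.array([0.9, 0.1, 0.1, 0.1, 0.1, 0.1])  # one accidental spike
span_doc = np.array([0.7, 0.7, 0.7, 0.1, 0.1, 0.1])     # a consistent short span

print(max_score(outlier_doc) > max_score(span_doc))              # True
print(topk_mean_score(outlier_doc) < topk_mean_score(span_doc))  # True
```

Under strict max the accidental spike wins with 0.9 against 0.7. Under a top-3 mean the outlier document drops to roughly 0.37 while the span document stays near 0.7. The price is that a genuine one-token needle is penalized too, so `k` is a tuning decision rather than a free improvement.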

The synthetic tests explain the mechanism, not the whole market

The paper’s synthetic benchmark is best read as a mechanism test. It creates documents with random token embeddings, plants a single relevant token with controlled cosine similarity to the query, and asks whether the retriever can find the document containing that spike. The purpose is not to simulate enterprise documents. The purpose is to isolate the exact condition under which mean pooling should fail: a narrow localized signal diluted by surrounding noise.
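The shape of that experiment can be reproduced in a few lines. The sketch below is a reconstruction for illustration only: the corpus size, document length, embedding dimension, and random seed are assumptions rather than the paper's settings, and it compares mean pooling against the per-token endpoint alone instead of the full multi-scale score.

```python
import numpy as np


def unit(x):
    """Normalize vectors along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def planted_token(query, cosine, rng):
    """Unit vector whose cosine similarity to the unit `query` is `cosine`."""
    r = rng.normal(size=query.shape)
    r -= (r @ query) * query  # keep only the part orthogonal to the query
    return cosine * query + np.sqrt(1.0 - cosine**2) * unit(r)


def single_spike_trial(cosine, n_docs=200, doc_len=100, dim=64, seed=0):
    """Return the target document's rank (0 = best) under each scoring rule."""
    rng = np.random.default_rng(seed)
    query = unit(rng.normal(size=dim))
    docs = unit(rng.normal(size=(n_docs, doc_len, dim)))  # random unit tokens
    target = int(rng.integers(n_docs))
    docs[target, int(rng.integers(doc_len))] = planted_token(query, cosine, rng)

    mean_scores = unit(docs.mean(axis=1)) @ query  # mean-pool cosine per doc
    token_scores = (docs @ query).max(axis=1)      # best single token per doc

    def rank_of_target(scores):
        return int((scores > scores[target]).sum())

    return rank_of_target(mean_scores), rank_of_target(token_scores)


mean_rank, token_rank = single_spike_trial(cosine=0.9)
print("mean-pool rank:", mean_rank, "per-token rank:", token_rank)
```

Sweeping `cosine` over several values and averaging over many seeds gives a recall curve in the same spirit as the table that follows. The exact numbers will differ from the paper's because the settings here are invented.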

The result is stark.

Planted cosine	MeanCos Recall@10	Spectral Recall@10	Interpretation
0.30	0.015	0.020	Both are effectively near chance; the spike is below the useful token-level noise threshold.
0.45	0.010	0.040	Spectral begins to move, but the signal is still weak.
0.60	0.035	1.000	Once the spike clears the token-level distractor tail, spectral retrieval separates it cleanly.
0.75	0.015	1.000	Mean pooling still fails because the spike remains diluted in the document mean.
0.90	0.025	1.000	Spectral stays saturated; the local evidence is easy once the method can see locally.

The key lesson is not merely “Spectral wins.” The key lesson is why it wins. Mean pooling has access to the planted signal only through the document average. A one-token spike inside a document contributes a small fraction of the mean. Spectral Retrieval, by including the identity endpoint, can inspect the token-level evidence directly.

The paper also runs a width sweep: instead of one relevant token, the planted signal spans multiple adjacent positions. This is a different test with a different purpose. It asks when mean pooling starts to catch up as relevance becomes less needle-like and more paragraph-like.

At planted cosine 0.45, the reported numbers show Spectral Retrieval reaching Recall@10 of 1.000 once the spike width reaches 3 tokens, while MeanCos is still at 0.100. MeanCos rises gradually: 0.200 at width 5, 0.620 at width 10, 0.960 at width 20, and finally 1.000 at width 30. That is exactly the mechanism story the paper wants to tell. Mean pooling improves when relevance occupies enough of the document. Spectral Retrieval improves earlier because it can match at the scale where the signal is concentrated.

This width test is more important than it may first appear. In the single-spike experiment, the spectral gain comes largely from including the per-token endpoint. In the width experiment, intermediate scales have a clearer role: they aggregate a local span without smearing it across the whole document. That is the bridge from toy spike to business paragraph.

LIMIT-small tests the aggregation rule under a real encoder

The paper then evaluates on LIMIT-small, a public benchmark designed to expose limitations of embedding-based retrieval. The setup uses a frozen sentence-transformers/all-mpnet-base-v2 encoder. No task-specific fine-tuning is added. The baseline ranks by mean-pooled document vectors. Spectral Retrieval consumes the same per-token embeddings and changes the aggregation rule.

This design matters. If the encoder changed, the result could be credited to representation learning. Here, the core comparison is cleaner: same underlying encoder, different scoring rule.

The headline results are large:

Metric	MeanCos baseline	Spectral	What the metric says
Recall@1	0.044	0.356	Spectral surfaces relevant items much earlier.
Recall@2	0.078	0.651	The top of the ranking changes substantially.
Recall@5	0.181	0.816	Local attribute matching becomes much easier.
Recall@10	0.329	0.899	The headline top-10 gain is large and consistent with the rest of the table.
Recall@20	0.565	0.953	Spectral remains ahead deeper into the list.
Success@2	0.005	0.506	Strictly retrieving both relevant items near the top improves sharply.
Success@5	0.036	0.731	The two-hit requirement shows the baseline’s weakness clearly.
Success@10	0.124	0.836	Spectral improves both-relevant retrieval, not just partial recall.
MRR	0.219	0.794	Relevant items move much closer to rank one.
MAP	0.166	0.734	Ranking quality improves across relevant items.
NDCG@10	0.194	0.789	The top-10 ordering becomes much more useful.

This is the paper’s strongest empirical section, but it must be read with its caveat attached. LIMIT-small is tiny: the paper notes a 46-document corpus and sets the candidate pool to cover the whole corpus. That means the experiment removes the first-stage recall problem. Spectral Retrieval is not being tested as a production two-stage system over millions of documents. It is being tested as an aggregation rule when all relevant documents are available for re-ranking.

That does not make the result unimportant. It makes the result specific. The LIMIT-small section supports the claim that local evidence can be recovered from per-token embeddings without retraining the encoder. It does not prove that the full pipeline will maintain the same gains when a first-stage ANN retriever retrieves only the top 100 candidates from a large corpus.
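The gap between the Recall@k and Success@k rows is easier to read with the definitions in hand. The sketch below follows this article's reading of the two metrics, where Success@k requires every relevant document for a query to appear in the top k. The function names and the toy ranking are illustrative, not taken from the benchmark's code.

```python
def recall_at_k(ranking, relevant, k):
    """Fraction of the relevant documents that appear in the top k."""
    hits = len(set(ranking[:k]) & set(relevant))
    return hits / len(relevant)


def success_at_k(ranking, relevant, k):
    """1.0 only if every relevant document appears in the top k, else 0.0."""
    return float(set(relevant) <= set(ranking[:k]))


# Toy query with two relevant documents, as in a two-hit setting.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

print(recall_at_k(ranking, relevant, k=2))   # 0.5
print(success_at_k(ranking, relevant, k=2))  # 0.0
print(success_at_k(ranking, relevant, k=4))  # 1.0
```

A system can post respectable recall by finding one of the two relevant items per query while still scoring near zero on success, which is the pattern the baseline column shows.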

A useful way to classify the paper’s evidence is this:

Evidence component	Likely purpose	What it supports	What it does not prove
Endpoint derivation	Main mechanism	The scoring family contains both per-token and mean-pool behavior.	Ranking dominance over all baselines.
Synthetic single-spike benchmark	Main evidence for localization	Mean pooling can structurally miss a narrow signal that token-level scoring recovers.	Performance on natural corpora.
Width sweep	Robustness/sensitivity around signal width	Intermediate scales matter when evidence spans several positions.	Optimal scale grids for production corpora.
LIMIT-small with all-mpnet-base-v2	Real-encoder evaluation	With the encoder held fixed, changing only the aggregation rule can sharply improve localized retrieval.	Large-scale RAG performance under candidate-pool constraints.
Latency and storage discussion	Implementation detail and deployment framing	The method belongs as a second-stage re-ranker and shares late-interaction storage costs.	A measured production latency benchmark.
Multi-agent security vignette	Exploratory application	Why role-specific agents may benefit from sharper retrieval windows.	End-to-end agent performance or debate-round reduction.

This distinction is not pedantry. It prevents a good paper from being turned into a bad sales deck.

The business value is sharper context selection, not magical RAG accuracy

For enterprise RAG, Spectral Retrieval is interesting because it targets a common operational pain: the answer is in the corpus, but retrieval hands the model the wrong surrounding context.

In a single-agent system, this causes familiar failures. The model receives a broadly related document, misses the decisive paragraph, and writes a plausible answer that cites the wrong evidence. In a multi-agent system, the problem compounds. A security agent, operations agent, and compliance agent may all retrieve the same generic incident documents because their mean-pooled vectors look similar, even though each agent needs a different subspan.

Spectral Retrieval’s business relevance follows a practical path:

1. Business documents often contain localized facts inside broad documents.
2. Mean-pooled dense retrieval can under-rank those documents because local facts are diluted.
3. Per-token embeddings preserve more location-specific evidence.
4. Multi-scale scoring can expose matches at token, short-span, or wider-span resolution.
5. Agents receive tighter context windows before reasoning or debate begins.
6. Better context selection may reduce irrelevant context scanning, citation confusion, and agent disagreement caused by retrieval noise.

Only the first four steps are directly shown by the paper. Steps five and six are reasonable business inferences, especially for RAG and multi-agent workflows, but they remain unproven as end-to-end deployment results.

The most natural near-term use case is not replacing every vector search pipeline. It is selective re-ranking for high-value queries where localized evidence matters: compliance review, security triage, technical support escalation, internal audit, legal research, incident postmortem search, and regulated customer-service workflows.

The ROI logic is also not “use Spectral because it is more accurate.” That is too lazy. The sharper claim is this: if a workflow currently spends expensive model tokens and analyst time on broad, noisy retrieval results, then a second-stage re-ranker that improves localized evidence ranking may reduce downstream review cost. The retrieval bill rises; the reasoning and human-verification bill may fall. Whether the trade is attractive depends on corpus size, query volume, latency budget, token cost, and the cost of wrong evidence.

The cost profile looks like late interaction, because it is a cousin of late interaction

The paper is clear that Spectral Retrieval needs per-token document embeddings. That is the central operational cost. Stored as fp16, the paper estimates per-token embeddings at 2 × N × d bytes per document, and gives an example of roughly 15 TB for a 50-million-document corpus with mean length 200 and embedding dimension 768. Compression methods similar to those used in ColBERTv2 can reduce that footprint, but the method is still far heavier than ordinary single-vector dense retrieval.
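The storage figure is simple arithmetic and worth checking by hand. The helper below is an illustrative sketch: the function name is invented, sizes use decimal units (1 TB = 10^12 bytes), and the single-vector comparison is computed here for contrast rather than quoted from the paper.

```python
def index_bytes(n_docs, vectors_per_doc, dim, bytes_per_value=2):
    """Raw embedding storage in bytes; fp16 uses 2 bytes per value."""
    return n_docs * vectors_per_doc * dim * bytes_per_value


per_token = index_bytes(n_docs=50_000_000, vectors_per_doc=200, dim=768)
single_vector = index_bytes(n_docs=50_000_000, vectors_per_doc=1, dim=768)

print(f"{per_token / 1e12:.2f} TB")      # 15.36 TB
print(f"{single_vector / 1e9:.1f} GB")   # 76.8 GB
print(per_token // single_vector)        # 200
```

Uncompressed, the per-token index is larger than the single-vector index by exactly the mean document length, here a factor of 200. That ratio, more than the absolute number, is what a budget owner will ask about.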

The compute side is less frightening if the system is built as intended. Spectral Retrieval re-ranks only first-stage candidates. The paper describes the second-stage complexity as scaling with candidate count, number of scales, document length, and embedding dimension. It also points to GPU-batched 1D convolution and FFT-based convolution for very long documents.

But there is a gap between theoretical plausibility and measured production readiness. In the LIMIT-small experiment, the paper reports an unoptimized CPU setting that is far slower than the baseline. The paper argues that GPU batching and normal candidate-pool sizing would bridge much of the gap. That is plausible engineering, not a completed benchmark.

For business adoption, that means Spectral Retrieval should be tested as a targeted re-ranking layer, not installed indiscriminately across all retrieval calls. A sensible deployment pattern would be:

Pipeline stage	Practical role	Why Spectral fits or does not fit
Fast ANN retrieval	Broad candidate recall	Keep existing single-vector search; Spectral does not replace it.
Spectral re-ranking	Promote candidates with localized evidence	Best fit for documents where relevant spans are short relative to document length.
Optional cross-encoder	Validate top survivors more deeply	Useful when false positives from max-style scoring are expensive.
MMR or diversity step	Reduce duplicate context	Helps avoid giving agents near-identical evidence packets.
LLM answer or agent debate	Reason over retrieved evidence	Benefits only if retrieval actually improves the supplied context.
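The first two rows of this pattern compose as a short function. This is a hedged sketch: `first_stage` and `rerank_score` are hypothetical callables standing in for an ANN index and a spectral scorer, the document ids and scores are made up, and the optional cross-encoder and diversity stages are left out.

```python
def two_stage_search(query, first_stage, rerank_score, pool_size=100, top_n=5):
    """Fetch a candidate pool, then reorder it with a second-stage score."""
    candidates = list(first_stage(query, pool_size))
    ranked = sorted(
        candidates,
        key=lambda doc_id: rerank_score(query, doc_id),
        reverse=True,
    )
    return ranked[:top_n]


# Toy stand-ins: a fixed first-stage ordering and made-up re-ranker scores.
ann_order = ["doc_generic_1", "doc_generic_2", "doc_with_needle", "doc_generic_3"]
local_scores = {
    "doc_generic_1": 0.41,
    "doc_generic_2": 0.38,
    "doc_with_needle": 0.93,
    "doc_generic_3": 0.35,
}


def toy_first_stage(query, pool_size):
    return ann_order[:pool_size]


def toy_rerank_score(query, doc_id):
    return local_scores[doc_id]


wide = two_stage_search("q", toy_first_stage, toy_rerank_score, pool_size=4, top_n=2)
narrow = two_stage_search("q", toy_first_stage, toy_rerank_score, pool_size=2, top_n=2)
print(wide)    # ['doc_with_needle', 'doc_generic_1']
print(narrow)  # ['doc_generic_1', 'doc_generic_2']
```

With a pool of four, the re-ranker promotes the document containing the needle to the top. With a pool of two, that document never reaches the re-ranker, so no second-stage score can recover it. Pool size is therefore a recall parameter, not only a latency parameter.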

The least attractive use cases are also clear: short FAQ-style documents, repetitive templates, queries so short that token-level noise dominates, and corpora where the first-stage retriever already fails to retrieve the relevant candidates. In those settings, Spectral Retrieval may add cost without much signal.

The failure modes are not footnotes; they are deployment design requirements

The paper’s limitations section is unusually useful because it names risks that production teams can actually monitor.

The first is index size. Per-token storage is the entry fee. Teams already struggling with vector index cost will not enjoy discovering that every token now wants its own seat at the table.

The second is the candidate-pool ceiling. Because Spectral Retrieval is a re-ranker, first-stage recall determines its maximum possible success. This is likely the most important next experiment: run the method on larger corpora where the candidate pool is limited, not conveniently equal to the full corpus.

The third is Goodhart-style outlier risk. A max over positions and scales may reward a single spurious high-similarity token inside an irrelevant document. If that happens often, strict max should be softened. Top-$k$ mean, percentile caps, or cross-encoder validation become not optional refinements but control mechanisms.

The fourth is encoder dependence. Spectral Retrieval changes the aggregation rule, not the underlying representation. If the encoder’s per-token geometry does not express the needed distinction, the spectral score cannot invent it. It can expose local signal already present in the token embeddings. It cannot manufacture semantic resolution from a representation that never encoded it.

The fifth is benchmark coverage. The paper does not run BEIR, MS MARCO, HotpotQA, LIMIT-full, or end-to-end multi-agent evaluations. This does not invalidate the current results, but it narrows them. The paper is best read as a mechanism and early evidence paper, not as a final retrieval leaderboard paper.

The real lesson: retrieval needs resolution control

Spectral Retrieval is useful because it turns a vague complaint — “our RAG system missed the relevant passage” — into a sharper engineering question: at what resolution did retrieval compare the query to the document?

Mean pooling compares at the document level. Per-token MaxSim compares at the token level. Spectral Retrieval asks the comparison to happen across a scale grid. That is the conceptual move. The sinc kernel is the implementation detail; the business lesson is resolution control.

For Cognaptus-style business automation, this matters most where AI agents are expected to act on narrow evidence inside messy documents. A compliance agent should not need to read an entire policy because the relevant clause was averaged into invisibility. A security triage agent should not receive the same generic incident context as the operations agent merely because the document-level vectors share broad vocabulary. A support agent should not cite the wrong page because the right fix was one paragraph too small to dominate the mean.

The paper does not solve retrieval. It does something more modest and more useful: it shows that one important class of retrieval failure can come from the scale at which evidence is aggregated. Once that is visible, the engineering agenda becomes concrete. Store token-level evidence where the use case justifies it. Re-rank only where localized facts matter. Add controls for outlier tokens. Test candidate-pool recall before celebrating. Then measure whether the improved retrieval actually reduces downstream reasoning cost and evidence-review burden.

Dense retrieval has spent years making documents easy to average. Spectral Retrieval is a reminder that evidence is often not average-shaped.

Cognaptus: Automate the Present, Incubate the Future.

¹ Andrea Morandi, “Spectral Retrieval: Multi-Scale Sinc Convolution over Token Embeddings for Localized Retrieval in LLM Multi-Agent Systems,” arXiv:2605.24764, 2026, https://arxiv.org/abs/2605.24764.
