How Ultra-Large Context Windows Challenge RAG

TL;DR for operators

Ultra-large context windows are not a ceremonial funeral for retrieval-augmented generation. They are a price renegotiation.

If your task is to analyse a bounded, self-contained document set — a contract bundle, diligence folder, policy manual, code repository, or technical appendix — a long-context model may now be the cleaner first option. The main benefit is not that it “knows more”. It is that it can inspect more of the original evidence without depending on a retriever to guess which passages matter.

But if your corpus is large, live, permissioned, multilingual, noisy, or constantly changing, RAG remains structurally useful. It filters, routes, logs, updates, enforces access control, and keeps the prompt from becoming an expensive landfill. Charming, but not optional.

The real architecture shift is this: RAG is moving from “retrieve five chunks and pray” toward orchestration. Long context becomes the reader. Retrieval becomes the logistics layer. The winning system is not RAG versus long context. It is knowing when to read everything, when to retrieve selectively, and when to route between both.

The familiar problem is not memory; it is evidence handling

Picture a legal associate, a finance analyst, or a product manager with a folder full of documents and a question that sounds deceptively simple: “What changed?” “Where is the risk?” “Which clause contradicts the latest policy?” “Does this customer qualify?”

Traditional RAG answers that question by chopping the source material into pieces, embedding those pieces, retrieving a few that appear relevant, and passing them to the language model. This made sense because earlier models had small context windows. The model could not read the room, so we built a librarian.
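The chop, embed, retrieve, and pass sequence can be sketched in a few lines. This is a minimal illustration rather than anyone's production pipeline: the word-count "embedding" stands in for a learned embedding model, and every function name here is hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 40) -> list[str]:
    """Chop the source into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase word counts, not a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, passages: list[str]) -> str:
    """Pass only the retrieved passages to the model."""
    return "Context:\n" + "\n---\n".join(passages) + f"\n\nQuestion: {query}"


clauses = [
    "The supplier must deliver goods within 30 days.",
    "Payment is due 60 days after invoice.",
    "Either party may terminate with notice.",
]
print(build_prompt("When is payment due?", retrieve("When is payment due?", clauses, k=1)))
```

Everything the model later says depends on what `retrieve` returns. If the right clause is ranked below `k`, the model never sees it.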

Ultra-large context windows change the labour division. Gemini 2.5 Pro, announced in March 2025, shipped with a one-million-token context window, with a two-million-token window described as coming soon.¹ That scale makes it plausible to feed a model whole books, large codebases, long policy sets, or substantial case files. Not always cheaply. Not always wisely. But plausibly.

That matters because many RAG failures are not generation failures. They are evidence-selection failures. The model may be perfectly capable of answering if it sees the right passage. The system fails because the retriever never brings that passage to the table. In business terms, the intern did not misread the contract. The intern was handed the wrong page.

What the paper actually shows: long context wins when the evidence is coherent

The most useful contribution of “Long Context vs. RAG for LLMs: An Evaluation and Revisits” is not a slogan about who wins. It is the paper’s attempt to explain why earlier studies reached inconsistent conclusions about long context and RAG.²

The authors review prior comparisons and then run a more careful evaluation. They filter out questions that could be answered from the model’s internal knowledge, compare retrieval methods, and examine how dataset construction affects results. That last point is easy to miss and expensive to ignore. A long-context benchmark is only meaningful if the “context” is actually relevant context, not a haystack inflated with synthetic noise so that everyone can admire the size of the haystack.
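The filtering step is easy to picture in code. The sketch below is an assumption about how such a filter could work, not the paper's actual procedure: `answer_without_context` is a hypothetical callable wrapping a model call with no documents attached, and the containment match is a deliberately crude stand-in for a proper answer metric.

```python
from typing import Callable


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so comparisons are less brittle."""
    return " ".join(text.lower().split())


def filter_closed_book(
    questions: list[dict],
    answer_without_context: Callable[[str], str],
) -> list[dict]:
    """Keep only questions the model cannot already answer from memory.

    Each item is a dict with 'question' and 'answer' keys (assumed schema).
    """
    kept = []
    for item in questions:
        guess = answer_without_context(item["question"])
        # Crude match (assumption): gold answer appears inside the guess.
        if normalise(item["answer"]) not in normalise(guess):
            kept.append(item)
    return kept


def fake_model(question: str) -> str:
    """Stand-in for a real model: it only 'knows' one fact."""
    return "Paris" if "France" in question else "I do not know."


benchmark = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What notice period does clause 9 require?", "answer": "30 days"},
]
print(filter_closed_book(benchmark, fake_model))
```

Only the second question survives, because it is the only one that actually tests whether the system can use supplied evidence.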

Their central result is nuanced. Long context generally performs better on question-answering benchmarks where the answer lives inside dense, self-contained material such as Wikipedia-style articles or books. Summarisation-based retrieval can be competitive with long context. Conventional chunk-based RAG is weaker. But RAG still shows advantages in dialogue-based and more general question settings, where information is fragmented and the task is less like reading one coherent dossier.

So the practical reading is not “long context beats RAG”. It is:

Situation	What the evidence suggests	Business interpretation	Boundary
Dense, self-contained documents	Long context tends to perform well	Feed the model the whole relevant bundle when cost permits	Requires bounded input and reliable context use
Conventional chunk-based RAG	Often underperforms stronger long-context setups	Naive chunking is a weak default	Better retrieval design can change the result
Summarisation-based retrieval	Can approach long-context performance	Pre-compression may be valuable when full reading is too expensive	Summaries can omit legally or operationally critical detail
Dialogue or fragmented information	RAG can retain advantages	Retrieval helps when evidence is scattered across events or records	Depends heavily on indexing and metadata quality
Very large or live corpora	Long context alone is impractical	Retrieval remains necessary as selection and governance infrastructure	Requires continual maintenance, not just embeddings

There is a useful discomfort here. The paper does not let either camp enjoy a clean victory lap. Long-context advocates must admit that bigger windows do not remove cost, latency, or attention problems. RAG advocates must admit that many deployed RAG systems are elaborate wrappers around mediocre chunk retrieval. Everyone gets a participation trophy, but only after reading the error analysis.

The wrong takeaway is “RAG is dead”

The likely misconception is obvious because the industry keeps trying to turn architecture into a boxing match. Bigger context arrives, and suddenly RAG is obsolete. Then a production team prices the inference bill, adds access controls, handles fresh documents, debugs citations, and quietly returns to retrieval. A moving ceremony, really.

Long context challenges RAG in one specific way: it weakens RAG’s monopoly over external knowledge injection. Previously, if the evidence exceeded the model’s input window, retrieval was not a design choice. It was the entrance fee. As context windows expand, some use cases no longer need a small-chunk retriever at all. The model can simply read the relevant file set.

But that only removes one job from RAG. It does not remove the other jobs.

RAG is not just “more tokens for the model”. In mature systems, retrieval also performs scoping, filtering, freshness management, source attribution, access control, cost reduction, and audit support. These are not decorative enterprise anxieties. They are the boring reasons systems survive procurement.

The better replacement for “RAG is dead” is: RAG is no longer the only way to fit evidence into the prompt. That is a smaller claim. It also happens to be true.

Bigger windows reduce retrieval miss, but they do not guarantee attention

A larger context window solves one problem directly: the system can include more source material. That reduces the chance that a relevant passage is excluded before the model begins reasoning. In domains where missing evidence is the dominant failure mode, this is powerful.

The mechanism is straightforward. Chunk-based retrieval converts a question into a similarity search problem. That works when the query and the evidence use similar language. It is fragile when the same idea appears under different terms, when the relevant answer requires multiple passages, or when a passage only becomes relevant after another passage has been read. Legal, medical, engineering, and finance documents specialise in this sort of unpleasantness. Naturally.

Long context changes the failure mode. Instead of asking, “Did retrieval fetch the right chunk?”, we ask, “Can the model identify and use the right evidence inside a large input?” That is better for some tasks, but it is not free.

The “Lost in the Middle” literature showed that models can use information less reliably when the relevant content sits in the middle of a long input rather than near the beginning or end.³ Newer long-context models are better, but the underlying lesson remains: context length is capacity, not comprehension. A warehouse is not a filing system.

For operators, this means prompt construction still matters. Document order, section labels, tables of contents, metadata, summaries, and question-specific preambles are not cosmetic. They are ways of making the model’s attention less wasteful. If you dump a million tokens into the model and call that architecture, congratulations: you have invented expensive fog.
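A sketch of what deliberate prompt construction might look like, under stated assumptions: the `Doc` fields, the label format, and the choice to repeat the question at the end are all illustrative, not a recommended standard.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    title: str
    date: str  # ISO date, so string order matches chronological order
    text: str


def build_reading_packet(question: str, docs: list[Doc]) -> str:
    """Assemble a long-context prompt with a preamble, contents, and labels."""
    ordered = sorted(docs, key=lambda d: d.date)  # stable chronological order
    parts = [
        "QUESTION: " + question,
        "Cite documents by their bracketed number.",
        "CONTENTS:",
    ]
    parts += [f"[{i}] {d.title} ({d.date})" for i, d in enumerate(ordered, 1)]
    for i, d in enumerate(ordered, 1):
        parts.append(f"=== DOCUMENT [{i}]: {d.title} | dated {d.date} ===")
        parts.append(d.text)
    # Restating the question at the end keeps it away from the middle of the input.
    parts.append("QUESTION (repeated): " + question)
    return "\n".join(parts)


bundle = [
    Doc("Amendment 2", "2024-06-01", "The fee is 12%."),
    Doc("Master agreement", "2021-01-15", "The fee is 10%."),
]
print(build_reading_packet("What changed?", bundle))
```

None of this makes the model smarter. It simply gives the model's attention fewer ways to get lost.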

RAG’s old chunk habit is the weak link

The strongest critique of RAG is not that retrieval is obsolete. It is that many RAG systems retrieve badly.

Short chunks were a rational adaptation to short-context models. They made passages small enough to embed, rank, and fit into prompts. But short chunks also strip evidence from its surrounding structure. A paragraph from a contract may depend on definitions twenty pages earlier. A policy exception may only make sense inside an approval workflow. A financial footnote may refer to a table, a schedule, and a prior-year restatement. Chunking can turn coherent evidence into confetti.

This is why some newer RAG work does not try to defeat long-context models by pretending nothing changed. It absorbs the lesson. LongRAG, for example, pairs a long-context reader with longer retrieval units, grouping related documents into larger units rather than retrieving tiny passages.⁴ In its Wikipedia setting, the paper reports cutting the number of retrieval units from about 22 million to roughly 600 thousand while improving retrieval performance. That is precisely the direction enterprise teams should notice: less microscopic search, more coherent evidence packaging.
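The idea of larger retrieval units can be illustrated with a simple grouping pass. This is a simplified sketch and not LongRAG's own algorithm: it merges adjacent passages that share a group label up to a word budget, and both the `group` label and the `max_words` parameter are assumptions.

```python
def group_into_units(
    passages: list[tuple[str, str]], max_words: int = 4000
) -> list[dict]:
    """Merge consecutive (group, text) passages into larger retrieval units.

    Passages are assumed to arrive in original document order. A unit
    closes when the group label changes or the word budget would overflow.
    """
    units: list[dict] = []
    for group, text in passages:
        n = len(text.split())
        last = units[-1] if units else None
        if last and last["group"] == group and last["words"] + n <= max_words:
            last["texts"].append(text)
            last["words"] += n
        else:
            units.append({"group": group, "texts": [text], "words": n})
    return units


passages = [
    ("msa", "Definitions: Fee means the amount in Schedule 1."),
    ("msa", "Clause 4: The Fee is payable quarterly."),
    ("sow", "Deliverables are listed in Annex A."),
]
for unit in group_into_units(passages, max_words=50):
    print(unit["group"], unit["words"], len(unit["texts"]))
```

With a generous budget, the definition and the clause that relies on it travel together, which is the whole point.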

OP-RAG makes a related point from another angle. It argues that order matters, and that preserving the original sequence of retrieved chunks can improve long-context answer generation.⁵ This should not be surprising. Documents are not bags of sentences. They are structured arguments, procedures, records, and exceptions. Retrieval that destroys order and then asks a model to reason over the debris has only itself to blame.
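Order preservation is a small change in code. The sketch below assumes chunks are indexed by their position in the source document and that relevance scores are already computed. The function name is hypothetical, and this is an illustration of the idea rather than the paper's implementation.

```python
def order_preserving_top_k(scores: list[float], k: int) -> list[int]:
    """Pick the k highest-scoring chunk indices, then restore document order.

    scores[i] is the relevance score of the i-th chunk in the original text.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)  # the extra step: sort by position, not by score


scores = [0.1, 0.9, 0.3, 0.8, 0.2]
print(order_preserving_top_k(scores, k=3))  # [1, 2, 3]
```

A plain relevance ranking would present these chunks as 1, 3, 2. The order-preserving version presents them as they appear in the document, so a definition still precedes the clause that uses it.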

The operational implication is simple: if RAG survives, it survives by becoming less naive. Bigger retrieval units, order preservation, query-aware summarisation, metadata filtering, hierarchical indexes, and routing logic matter more as context windows grow. The retriever’s job is no longer to make a short-context model barely functional. Its job is to prepare a high-quality reading packet.

The cost question turns architecture into routing

Long context often buys accuracy by spending tokens. RAG often buys efficiency by risking omission. The interesting architecture is the one that can decide which trade-off is acceptable for the current query.

Self-Route is a useful example because it frames the problem as routing rather than ideology.⁶ The paper finds that long context can outperform RAG when sufficiently resourced, but RAG remains much cheaper. It then proposes routing queries between RAG and long-context processing based on model self-reflection, aiming to preserve much of the long-context performance while reducing cost.
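The routing idea can be sketched as a two-step fallback. This is a hedged approximation, not the paper's code: `retrieve` and `llm` are hypothetical callables, the prompt wording is invented for illustration, and the exact-match check on the sentinel word is a simplification.

```python
from typing import Callable

UNANSWERABLE = "unanswerable"


def self_route(
    query: str,
    retrieve: Callable[[str], list[str]],
    full_context: str,
    llm: Callable[[str], str],
) -> tuple[str, str]:
    """Try the cheap RAG path first; fall back to long context if the model declines."""
    chunks = retrieve(query)
    rag_prompt = (
        "Answer using only the passages below. If they are insufficient, "
        f"reply exactly '{UNANSWERABLE}'.\n\n"
        + "\n\n".join(chunks)
        + f"\n\nQuestion: {query}"
    )
    answer = llm(rag_prompt)
    if answer.strip().lower() != UNANSWERABLE:
        return answer, "rag"
    # The model judged the retrieved evidence insufficient: read everything.
    return llm(f"{full_context}\n\nQuestion: {query}"), "long_context"


def fake_llm(prompt: str) -> str:
    """Stand-in model: answers only when the key fact is in the prompt."""
    return "30 days" if "notice of 30 days" in prompt else UNANSWERABLE


document = "Clause 1: Fees. Clause 9: Termination requires notice of 30 days."
print(self_route("What notice is required?", lambda q: ["Clause 1: Fees."], document, fake_llm))
```

The expensive path runs only for queries where the cheap path admits defeat, which is where the cost saving comes from.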

That is closer to how production systems should behave. Not every question deserves a million-token reading session. Some questions are lookup tasks. Some are synthesis tasks. Some require freshness. Some require exhaustive review. Some require a traceable answer from a narrow permissioned source. Treating them all the same is not architecture; it is billing malpractice.

A practical routing layer might look like this:

Query type	Preferred first move	Why
Exact lookup in a large live corpus	RAG	Fast, cheap, fresh, auditable
Review of a bounded document bundle	Long context	Avoids retrieval miss and preserves structure
Multi-document synthesis with known sources	Hybrid	Retrieve/select first, then read larger packets
Dialogue history or fragmented records	RAG or hybrid	Evidence is distributed and often metadata-dependent
High-stakes review requiring traceability	Hybrid	Combine source control, retrieval logs, and broad reading
Exploratory analysis over static files	Long context	Lets the model discover connections retrieval may miss
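The table above can be expressed as a small rule-based router. Treat this as one possible encoding under assumptions: the `QueryProfile` fields, the rule order, and the one-million-token default are illustrative choices, and a real system would likely learn or tune these rules.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    RAG = "rag"
    LONG_CONTEXT = "long_context"
    HYBRID = "hybrid"


@dataclass
class QueryProfile:
    corpus_tokens: int
    corpus_is_live: bool
    needs_audit_trail: bool
    is_lookup: bool
    evidence_is_fragmented: bool = False


def choose_route(p: QueryProfile, window_tokens: int = 1_000_000) -> Route:
    """Pick a first move for a query. Rules are checked in priority order."""
    fits = p.corpus_tokens <= window_tokens
    if p.needs_audit_trail:
        return Route.HYBRID  # source control and logs plus broad reading
    if p.is_lookup and (p.corpus_is_live or not fits):
        return Route.RAG  # fast, cheap, fresh
    if p.evidence_is_fragmented:
        return Route.RAG  # the table allows hybrid here as well
    if fits and not p.corpus_is_live:
        return Route.LONG_CONTEXT  # bounded, static bundle: read it all
    return Route.HYBRID  # too large or live for one read: select, then read


contract_review = QueryProfile(
    corpus_tokens=300_000, corpus_is_live=False,
    needs_audit_trail=False, is_lookup=False,
)
print(choose_route(contract_review))  # Route.LONG_CONTEXT
```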

This is where the business relevance becomes concrete. The choice is not a philosophical preference for “retrieval” or “context”. It is a cost-risk decision. What is more expensive: reading too much, or missing the one passage that matters?
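One rough way to put numbers on that question is an expected-cost comparison: token spend plus the probability of a miss multiplied by what a miss costs. Every figure below is hypothetical, including the token price, the miss rates, and the cost of a missed passage. The sketch only shows how the comparison flips as the stakes change.

```python
def expected_cost(
    tokens: int,
    price_per_million_tokens: float,
    miss_probability: float,
    cost_of_miss: float,
) -> float:
    """Inference spend plus the expected cost of missing the key passage."""
    inference = tokens / 1_000_000 * price_per_million_tokens
    return inference + miss_probability * cost_of_miss


PRICE = 2.0  # hypothetical price per million input tokens

for cost_of_miss in (5.0, 50.0):
    # Hypothetical: RAG reads 4k tokens but misses 5% of the time;
    # long context reads 800k tokens and misses 1% of the time.
    rag = expected_cost(4_000, PRICE, 0.05, cost_of_miss)
    long_context = expected_cost(800_000, PRICE, 0.01, cost_of_miss)
    cheaper = "RAG" if rag < long_context else "long context"
    print(f"cost of miss {cost_of_miss}: cheaper option is {cheaper}")
```

With these made-up inputs, RAG is cheaper when a miss is a minor annoyance, and long context is cheaper once a miss becomes expensive. The useful part is not the numbers but the habit of estimating them per query type.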

What Cognaptus would infer for enterprise AI

The paper directly shows that long context can outperform conventional RAG in several text QA settings, especially when the evidence is coherent and sufficiently relevant. It also shows that retrieval method matters: summarisation-based retrieval and better context handling can narrow the gap.

Cognaptus would infer three business design principles from that evidence.

First, stop treating vector search as the default centre of every enterprise AI system. For bounded document workflows, long-context reading may be simpler, more accurate, and easier to explain. Contract review, board-pack analysis, policy comparison, technical due diligence, and codebase inspection are obvious candidates.

Second, redesign RAG around evidence quality rather than chunk count. The question should not be “How many chunks do we retrieve?” It should be “What is the smallest coherent evidence packet that preserves the reasoning structure?” That packet may be a section, a document group, a timeline, a summary plus source excerpts, or a permission-filtered bundle.

Third, build routing early. A production assistant should know when to retrieve, when to read broadly, when to ask for source narrowing, and when to refuse because the evidence base is stale or inaccessible. This is not a glamorous feature. It is the difference between a demo and a system that does not embarrass itself in front of auditors.

Where the result does not travel cleanly

The paper’s boundary conditions matter. Its analysis focuses on text-based long contexts. That does not automatically settle multimodal enterprise workflows involving video, audio, scanned documents, diagrams, or mixed data systems. Multimodal long-context retrieval is a different swamp, with nicer marketing decks.

The evaluation also depends on available benchmarks. Many long-context benchmarks still struggle with context relevance, dataset size, and synthetic construction. If a benchmark creates length by adding irrelevant noise, it may reward different behaviour from a real enterprise task where every page is dense, privileged, and potentially important.

Finally, the business environment adds constraints the benchmark does not fully model: data freshness, role-based access, retention policy, cost ceilings, latency targets, source attribution, and integration with operational systems. A long-context model can read a large file. It cannot by itself decide who is allowed to see the file, whether the file is current, or whether the answer should be logged for compliance review. That dull machinery is where many AI systems either become useful or become a procurement anecdote.
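That dull machinery tends to sit in front of the model, whichever reading strategy follows. A minimal sketch, assuming a hypothetical `Record` schema with role lists and review dates, might gate documents before any prompt is assembled and keep an audit trail of each decision:

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Record:
    doc_id: str
    allowed_roles: frozenset[str]
    last_reviewed: date
    text: str


def gate(
    records: list[Record],
    user_roles: set[str],
    today: date,
    max_age_days: int = 365,
) -> tuple[list[Record], list[tuple[str, str]]]:
    """Return records the user may read, plus an audit log of every decision."""
    cutoff = today - timedelta(days=max_age_days)
    allowed: list[Record] = []
    audit: list[tuple[str, str]] = []
    for r in records:
        if not (r.allowed_roles & user_roles):
            audit.append((r.doc_id, "denied: role"))
        elif r.last_reviewed < cutoff:
            audit.append((r.doc_id, "skipped: stale"))
        else:
            allowed.append(r)
            audit.append((r.doc_id, "included"))
    return allowed, audit


records = [
    Record("policy-7", frozenset({"hr", "legal"}), date(2025, 1, 10), "Leave policy."),
    Record("board-3", frozenset({"exec"}), date(2025, 2, 1), "Board minutes."),
    Record("policy-2", frozenset({"hr"}), date(2020, 5, 5), "Old travel policy."),
]
readable, log = gate(records, {"hr"}, today=date(2025, 6, 1))
print([r.doc_id for r in readable], log)
```

Only the documents that pass the gate should ever reach the context window, whether that window holds five chunks or a million tokens.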

RAG gets promoted, not buried

Ultra-large context windows challenge RAG by removing its weakest justification: “we need retrieval because the model cannot fit the evidence.” Sometimes the model now can. That is a real shift.

But the stronger version of RAG was never just about fitting text into a prompt. It was about selecting, governing, updating, and grounding evidence. As context windows expand, retrieval becomes less like a workaround and more like an orchestration layer. Smaller role, higher standards. Painful, but character-building.

For AI teams, the sensible strategy is not to pick a tribe. Use long context when the evidence is bounded and coherent. Use RAG when the corpus is large, live, fragmented, or permissioned. Use hybrid routing when accuracy, cost, and auditability all matter at once, which is to say: in most serious businesses.

The context window got bigger. The architecture problem got more interesting. Sorry about that.

Cognaptus: Automate the Present, Incubate the Future.

¹ Google, “Gemini 2.5: Our most intelligent AI model,” Google Blog, March 25, 2025. https://blog.google/innovation-and-ai/models-and-research/google-deepmind/gemini-model-thinking-updates-march-2025/
² Xinze Li, Yixin Cao, Yubo Ma, and Aixin Sun, “Long Context vs. RAG for LLMs: An Evaluation and Revisits,” arXiv:2501.01880, 2024. https://arxiv.org/abs/2501.01880
³ Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang, “Lost in the Middle: How Language Models Use Long Contexts,” arXiv:2307.03172, 2023. https://arxiv.org/abs/2307.03172
⁴ Ziyan Jiang, Xueguang Ma, and Wenhu Chen, “LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs,” arXiv:2406.15319, 2024. https://arxiv.org/abs/2406.15319
⁵ Tan Yu et al., “In Defense of RAG in the Era of Long-Context Language Models,” arXiv:2409.01666, 2024. https://arxiv.org/abs/2409.01666
⁶ Zhuowan Li et al., “Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach,” arXiv:2407.16833, 2024. https://arxiv.org/abs/2407.16833
