GraphRAG Without the Drag: Scaling Knowledge-Augmented LLMs to Web-Scale

TL;DR for operators

GraphRAG usually sounds like a clean enterprise promise: put your knowledge into a graph, attach it to a language model, and enjoy more grounded answers. The less glamorous truth is that someone has to build the graph. At web scale, that “someone” is usually an LLM being asked to extract triples from millions or billions of passages, which is a fine idea if the procurement team has recently discovered oil under the server room.

The paper behind this article, Millions of GeAR-s: Extending GraphRAG to Millions of Documents, tackles that cost bottleneck directly.¹ Instead of extracting triples from every passage in FineWeb-10BT before retrieval, the system starts with ordinary retrieval, asks a model to extract only query-relevant “proximal” triples from the passages it has already found, maps those triples to Wikidata triples using sparse retrieval, expands through the external graph, and then retrieves more passages from the expanded graph path.

The reported LiveRAG preliminary results are useful but uneven: correctness reaches 0.875714, while faithfulness is 0.529335. That gap matters. It suggests the system can often land on good answers, but the grounding trail is not yet reliable enough to treat the graph layer as a fully trusted evidence mechanism.

For business use, the takeaway is architectural rather than celebratory. Online graph augmentation can make GraphRAG more feasible over huge corpora because it avoids full-corpus LLM extraction. But the shortcut creates a new risk: loose alignment between text triples and external KG triples can drift off-topic. In the paper’s own case study, geoduck reproduction slides toward Pacific oyster literature, and hot tub heater troubleshooting slides into unrelated geography, biology, and publication metadata. Elegant? No. Instructive? Absolutely.

The practical message: use GraphRAG selectively where multi-hop retrieval really matters; validate alignment as a first-class system component; and do not confuse “we used a knowledge graph” with “the answer is grounded.” The graph can be a map. It can also be a very confident detour.

The real bottleneck is not retrieval; it is graph construction

The usual GraphRAG pitch begins with retrieval. A user asks a question, the system retrieves relevant documents, a graph helps discover multi-hop evidence, and an LLM writes the answer. This is a reasonable story. It is also slightly too neat, which is how many AI architecture diagrams earn their living.

The harder part sits one layer below the diagram: how do the documents become graph nodes and edges in the first place?

State-of-the-art GraphRAG systems often rely on triples extracted from the source passages. A triple is an atomic subject-predicate-object fact, such as:

(“University of Southampton”, “founded in”, “1862”)

Once the system has many such triples, it can connect facts that share entities, search over reasoning paths, and retrieve evidence that ordinary keyword or vector retrieval might miss. This is especially useful for multi-hop questions, where the answer is not sitting in one obvious passage.
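A minimal sketch of that connection step, assuming a hypothetical `Triple` type and two extra facts added purely for illustration: indexing triples by entity is what lets one fact lead to another.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def index_by_entity(triples):
    """Map each entity (subject or object) to the triples that mention it."""
    index = defaultdict(list)
    for t in triples:
        index[t.subject].append(t)
        index[t.obj].append(t)
    return index


def two_hop(index, start):
    """Return (first, second) triple pairs that chain through a shared entity."""
    paths = []
    for first in index[start]:
        # The entity at the far end of the first hop.
        bridge = first.obj if first.subject == start else first.subject
        for second in index[bridge]:
            if second != first:
                paths.append((first, second))
    return paths


triples = [
    Triple("University of Southampton", "founded in", "1862"),
    # The next two triples are illustrative additions, not from the paper.
    Triple("University of Southampton", "located in", "Southampton"),
    Triple("Southampton", "country", "United Kingdom"),
]
index = index_by_entity(triples)
paths = two_hop(index, "University of Southampton")
for first, second in paths:
    print(first, "->", second)
```

The only two-hop path here runs through the shared entity "Southampton". That is the kind of bridge a single-passage retriever has no reason to find.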

That mechanism works cleanly when the corpus is manageable. The paper notes that prior GraphRAG systems have usually been tested on collections of up to hundreds of thousands of passages. LiveRAG changes the scale: the challenge uses FineWeb-10BT chunks, pushing the problem toward millions of passages. At that size, running an LLM across the entire corpus to extract triples becomes the operational choke point.

The contribution of this paper is therefore not “GraphRAG, but bigger.” Bigger is not an architecture. Bigger is a bill.

The contribution is a workaround: keep the graph advantage, but avoid building a full passage-derived graph offline. The system adapts GeAR, a graph-enhanced agent for RAG, so that it can operate with an external knowledge graph—Wikidata—rather than a complete triple index extracted from FineWeb itself.

The mechanism: retrieve first, align only what the query touches

The adapted system makes a deliberate compromise. It does not try to understand the whole corpus in advance. It waits until a question arrives, retrieves initial passages, extracts local facts from those passages, and then uses those facts as bridges into Wikidata.

The pipeline can be read as five movements:

Stage	What the system does	Why it exists	Operational risk
Baseline retrieval	Uses dense and sparse retrieval over FineWeb passages, combined with Reciprocal Rank Fusion	Gets an initial evidence set without requiring graph construction	May miss multi-hop evidence
Reading into triples	Uses Falcon3-10B-Instruct to extract query-relevant proximal triples from retrieved passages	Turns text into structured hints	Extracted triples may be incomplete
External KG linking	Maps proximal triples to Wikidata triples using sparse similarity	Avoids offline LLM extraction over the entire corpus	Sparse matching can link to the wrong topic
Graph expansion	Expands through Wikidata using diverse triple beam search	Finds more distant reasoning paths	Expansion can amplify initial misalignment
Passage return and answer	Maps graph-expanded triples back to passages, filters irrelevant passages, and answers with passages plus triple memory	Converts graph movement back into text evidence	Final answer may be correct but weakly faithful
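The first row of the table leans on Reciprocal Rank Fusion, which is simple enough to sketch. The version below is a generic illustration, not the paper's code: the constant `k=60` is a commonly used default assumed here, and the passage IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of passage IDs; k=60 is a common default, assumed here."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), breaking ties by passage ID for determinism.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


dense = ["p3", "p1", "p7"]    # hypothetical dense-retriever ranking
sparse = ["p1", "p9", "p3"]   # hypothetical sparse-retriever ranking
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

Passages that both retrievers rank highly rise to the top, and neither retriever's raw scores need to be calibrated against the other's.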

This is the paper’s core idea. The graph is not built from the web corpus. It is borrowed, queried, and loosely aligned only when needed.
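The graph expansion stage can be pictured as a beam search over triples that discourages every surviving path from descending from the same parent. The sketch below is an assumption-heavy illustration: the sibling-rank penalty is one common way to encourage diversity, not the paper's scoring function, and `graph`, `score`, and the toy triples are hypothetical.

```python
def diverse_beam_search(graph, seeds, score, beam_width=2, depth=1, penalty=0.5):
    """Extend triple paths, penalising lower-ranked siblings of the same parent path."""
    beams = [((seed,), score(seed)) for seed in seeds]
    for _ in range(depth):
        candidates = []
        for path, path_score in beams:
            tail_entity = path[-1][2]  # object of the last triple on the path
            neighbours = [t for t in graph.get(tail_entity, []) if t not in path]
            ranked = sorted(neighbours, key=score, reverse=True)
            for sibling_rank, triple in enumerate(ranked):
                # Best sibling pays nothing; each later sibling pays more.
                adjusted = path_score + score(triple) - penalty * sibling_rank
                candidates.append((path + (triple,), adjusted))
        if not candidates:
            break  # nothing left to expand
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # paths that could not extend are dropped
    return beams


# Toy graph: entity -> triples whose subject is that entity (all hypothetical).
graph = {
    "A": [("A", "r", "a1"), ("A", "r", "a2")],
    "B": [("B", "r", "b1")],
}
relevance = {
    ("q", "mentions", "A"): 1.0, ("q", "mentions", "B"): 1.0,
    ("A", "r", "a1"): 0.9, ("A", "r", "a2"): 0.8, ("B", "r", "b1"): 0.75,
}
seeds = [("q", "mentions", "A"), ("q", "mentions", "B")]


def score(triple):
    return relevance.get(triple, 0.0)


diverse = [path[-1][2] for path, _ in diverse_beam_search(graph, seeds, score)]
greedy = [path[-1][2] for path, _ in diverse_beam_search(graph, seeds, score, penalty=0.0)]
print(diverse, greedy)
```

With the penalty on, the beam keeps one path from each seed. With it off, both slots go to children of `A`. If the seed triples were linked to the wrong topic, either variant expands the mistake.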

That matters because most enterprise RAG systems face the same economic shape. They have too many documents to deeply pre-process, too many updates to keep a derived graph fresh, and too many domains where full schema design becomes a committee sport. Online alignment is attractive because it narrows the expensive work to the subset of evidence touched by the query.

But “online” does not mean free. It shifts cost from offline corpus processing into runtime retrieval, LLM-based reading, graph matching, expansion, filtering, and answer generation. This is not a magic reduction; it is a reallocation. The engineering question becomes whether the system spends more compute only when the question deserves it.
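A back-of-envelope way to see the reallocation, counting only LLM reading calls and assuming one call per passage. The corpus size, query volume, and passages read per query are hypothetical numbers, not figures from the paper.

```python
def offline_extraction_calls(num_passages):
    """One LLM extraction call per passage, done once before any query arrives."""
    return num_passages


def online_extraction_calls(num_queries, passages_read_per_query):
    """One LLM extraction call per retrieved passage, paid at query time."""
    return num_queries * passages_read_per_query


def break_even_queries(num_passages, passages_read_per_query):
    """Number of queries at which online reading has cost as many calls as offline."""
    return num_passages / passages_read_per_query


NUM_PASSAGES = 10_000_000   # hypothetical corpus size
TOP_K = 10                  # hypothetical passages read per query

print(offline_extraction_calls(NUM_PASSAGES))
print(online_extraction_calls(50_000, TOP_K))
print(break_even_queries(NUM_PASSAGES, TOP_K))
```

Under these assumptions the online approach stays cheaper until roughly a million queries. The sketch ignores graph matching, expansion, filtering, caching, and corpus updates, all of which move the break-even point in a real deployment.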

The authors make exactly that kind of practical adjustment. They set the maximum number of retrieval steps to two and observe that graph expansion gives limited benefit for simpler DataMorgana-generated questions. So the implementation skips Wikidata triples and graph expansion in the first iteration, using the fuller graph-enhanced pipeline only when additional reasoning steps are needed.

That is an important operational design choice. It says: do not make every question pay the multi-hop tax.
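One way to picture that control flow is sketched below, under stated assumptions: the two-step cap comes from the paper, but `retrieve`, `is_sufficient`, `graph_expand`, and `rewrite` are hypothetical callables supplied by the caller, not the authors' interfaces.

```python
MAX_STEPS = 2  # the paper caps retrieval at two steps


def answer_with_budget(query, retrieve, is_sufficient, graph_expand, rewrite):
    """Run base retrieval first; use graph expansion only from the second step on."""
    passages = []
    current_query = query
    steps_used = 0
    for step in range(1, MAX_STEPS + 1):
        steps_used = step
        found = retrieve(current_query)
        if step > 1:
            # Only later iterations pay for KG linking and expansion.
            found = found + graph_expand(current_query, found)
        for passage in found:
            if passage not in passages:
                passages.append(passage)
        if is_sufficient(query, passages):
            break
        current_query = rewrite(query, passages)
    return passages, steps_used


# Stub components, purely for illustration.
expansion_calls = []


def retrieve(q):
    return [f"base:{q}"]


def graph_expand(q, found):
    expansion_calls.append(q)
    return [f"graph:{q}"]


def rewrite(q, passages):
    return q + " (follow-up)"


simple = answer_with_budget("q1", retrieve, lambda q, p: True, graph_expand, rewrite)
hard = answer_with_budget("q2", retrieve, lambda q, p: len(p) >= 3, graph_expand, rewrite)
print(simple, hard, expansion_calls)
```

The simple question stops after one step and never touches the graph. Only the question that needed a second step pays for expansion.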

Why pseudo-alignment is both the trick and the trap

The paper’s most important word is not “graph.” It is “pseudo-aligning.”

In the original GeAR-style setting, passages and triples are explicitly associated: the triples are extracted from the passages, so the system knows which textual chunks support which graph facts. In this LiveRAG adaptation, that direct association is missing. The FineWeb chunks and Wikidata triples live in separate spaces. The system therefore creates a loose online association.

First, it retrieves passages relevant to the current query. Then Falcon3-10B-Instruct reads those passages and produces proximal triples—facts that appear useful for answering the question. Those proximal triples are then linked to the most similar Wikidata triples through sparse retrieval. The graph expands from those Wikidata triples. Finally, the expanded triples are mapped back to FineWeb passages, again using retrieval.
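The linking step is where surface similarity does the deciding. The sketch below uses plain token overlap as a crude stand-in for a sparse retriever, which is an assumption, not the paper's scorer. The two "KG" triples are invented for illustration and are not quoted from Wikidata.

```python
import re


def tokens(text):
    """Lowercase word tokens; a crude stand-in for a sparse retriever's analyser."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-set overlap between two strings."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def link_to_kg(proximal, kg_triples):
    """Return the KG triple whose text overlaps most with the proximal triple."""
    query_text = " ".join(proximal)
    return max(kg_triples, key=lambda t: jaccard(query_text, " ".join(t)))


proximal = ("Pacific geoduck", "reproduces by", "broadcast spawning")
toy_kg = [
    # Right animal, but filed under its scientific name: zero shared tokens.
    ("Panopea generosa", "taxon rank", "species"),
    # Wrong animal, but nearly identical surface form.
    ("Pacific oyster", "reproduces by", "broadcast spawning"),
]
linked = link_to_kg(proximal, toy_kg)
print(linked)
```

The on-topic proximal triple links to the oyster, because the correct entity shares no tokens with the query text. Everything downstream of this link inherits the error.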

This is clever because it avoids an enormous offline extraction job. It is fragile because every bridge is approximate.

The system is essentially saying:

“This passage suggests these facts. These facts resemble these Wikidata triples. These Wikidata triples lead to these other graph facts. These graph facts can retrieve these other passages. Now let us answer.”

That chain can work. It can also wobble.

The wobble is not a vague limitation. The paper gives concrete examples. For a question about whether frilled lizards and geoducks share reproductive characteristics, the proximal triples extracted from FineWeb are on-topic: Pacific geoducks, external fertilization, broadcast spawning, larval development. But after linking to Wikidata, the system drifts into Pacific oyster papers, egg consumption studies, marine seaweeds, hermit crab fertilization, and even “Pacific” as a place associated with Long Beach.

That is not a small formatting issue. It is a semantic tax.

The second example is even less graceful. A user asks why a hot tub heater’s high limit switch has to be reset after draining and refilling the spa. The proximal triples are practical and relevant: faulty parts, high limit switches tripping, thermostats, clogged vents, thermistors. The linked Wikidata triples then wander into thermoluminescence dosimeters, Bacillus subtilis spores, mineral deposits, nosocomial pathogens, geological surveys, water-power flow, and hot tub-associated dermatitis.

The retrieval did not simply fail. It failed with a veneer of structure.

This is why GraphRAG can be dangerous when sold as inherently more grounded. A graph can make reasoning paths more inspectable, but only if the edges correspond to the right evidence. A wrong graph edge is not better than a missing document. It is a more respectable-looking mistake.

The reported scores say “promising retrieval,” not “solved grounding”

The paper reports preliminary automatic evaluation results for the “Graph-Enhanced RAG” LiveRAG submission: correctness of 0.875714 and faithfulness of 0.529335.

Those two numbers should be read together, not separately.

Correctness measures whether the answer is judged right. Faithfulness measures whether the answer is properly supported by the retrieved evidence or context. A system can be correct for the wrong reasons, correct because the LLM already knows the answer, or correct while citing weakly relevant passages. This is not an academic nitpick. In enterprise settings, the difference between “right answer” and “auditable answer” is the difference between useful automation and a governance headache with a user interface.

Result	Likely purpose in the paper	What it supports	What it does not prove
Correctness = 0.875714	Main preliminary evidence from LiveRAG evaluation	The adapted pipeline can produce many correct answers under the challenge setup	That graph alignment is consistently reliable
Faithfulness = 0.529335	Main preliminary evidence, especially relevant to grounding	Grounding quality remains materially weaker than answer correctness	That the system is ready for high-trust deployment
DataMorgana question taxonomy	Evaluation construction detail	The authors tested across varied question and answer types, including multi-hop path-following and path-finding	That real enterprise query distributions are covered
Misalignment case study	Diagnostic evidence / failure analysis	Sparse text-to-KG linking can drift off-topic even when proximal triples are good	The frequency of these failures across all query types

The correctness score is encouraging. It suggests that the architecture is not merely decorative. The model, baseline retriever, graph expansion, and filtering can combine into useful answers.

The faithfulness score is the warning label. If the system is right but its evidence trail is unreliable, then the graph layer is helping retrieval more than it is guaranteeing grounding. That may still be valuable, but it changes how the system should be deployed.

For low-risk exploratory search, this may be acceptable. For regulated advisory, technical support, legal discovery, safety-critical operations, or financial decision workflows, it is not enough to be often correct. The system must also show its work without smuggling in a detour through Pacific oysters.

DataMorgana is evaluation scaffolding, not the business environment

The experiments use DataMorgana to construct a sample of questions. The paper follows the suggested methodology by splitting users into novice and expert groups with equal probability, then defining categories for question formulation, premise presence, and answer type.

The taxonomy includes concise questions, verbose questions, list-based questions, definition questions, opinion-seeking questions, hypotheticals, how-to questions, and yes/no questions. It also includes answer types such as factoid, multi-aspect, comparison, path-following, and path-finding.

This matters because GraphRAG’s value is not uniform across query types. If a question is simple and single-hop, baseline retrieval may already be enough. If a question requires following or discovering a path across entities, graph expansion becomes more relevant. The authors’ decision to add path-following and path-finding categories is therefore not cosmetic. It targets the kind of question where GraphRAG should earn its keep.

But this is still evaluation scaffolding. It is not a production workload study.

A corporate knowledge base will have its own query distribution: repetitive support questions, policy lookups, contract clause comparisons, product compatibility checks, compliance interpretations, postmortem searches, engineering dependency analysis, and user-specific context. Some will benefit from graph reasoning. Many will not. A system that applies graph expansion indiscriminately may burn latency and money on questions that could have been answered by a plain hybrid retriever.

The paper implicitly points toward a more practical design: route the query first, graph-expand only when the question structure justifies it, and measure faithfulness separately from correctness.

That last part deserves emphasis. If the dashboard only reports answer accuracy, the system may look healthier than it is. Faithfulness is the audit trail. Ignore it and the graph becomes expensive mood lighting.
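A dashboard that keeps the two measures apart can be very small. In the sketch below the `JudgedAnswer` record and its binary verdicts are simplifying assumptions. The challenge's own scores may be graded rather than binary, and the sample data is made up.

```python
from dataclasses import dataclass


@dataclass
class JudgedAnswer:
    correct: bool    # hypothetical judge verdict on the answer itself
    faithful: bool   # hypothetical judge verdict on evidence support


def summarise(answers):
    """Report correctness, faithfulness, and the share of correct-but-unsupported answers."""
    n = len(answers)
    if n == 0:
        raise ValueError("no answers to summarise")
    return {
        "correctness": sum(a.correct for a in answers) / n,
        "faithfulness": sum(a.faithful for a in answers) / n,
        "correct_but_unsupported": sum(a.correct and not a.faithful for a in answers) / n,
    }


sample = [
    JudgedAnswer(correct=True, faithful=True),
    JudgedAnswer(correct=True, faithful=False),
    JudgedAnswer(correct=True, faithful=False),
    JudgedAnswer(correct=False, faithful=False),
]
report = summarise(sample)
print(report)
```

The third figure is the one an accuracy-only dashboard hides: answers that were right without the retrieved evidence actually supporting them.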

The business value is selective graph augmentation, not graph maximalism

The commercial temptation is obvious. “GraphRAG at web scale” sounds like something one can put on a slide, ideally with arrows, gradients, and the word “enterprise” appearing several times in morally questionable places.

The useful business lesson is narrower and more valuable: graph augmentation should be selective, online, and validated.

This paper shows a path for organisations that cannot afford, justify, or maintain full LLM-based triple extraction over every document. Instead of pre-processing the entire corpus into a knowledge graph, they can build systems that use conventional retrieval first, extract structured hints only from retrieved evidence, and consult an external or internal graph only when the query requires multi-hop support.

For enterprise deployments, the same principle could apply beyond Wikidata:

Enterprise graph source	Possible use	Alignment concern
Product catalogue graph	Compatibility, variants, bundles, replacement parts	Product names and aliases drift across regions
Customer/account graph	Relationship-aware support and sales workflows	Privacy boundaries and stale account states
Policy/process graph	Compliance and procedure navigation	Policy clauses may map to wrong operational contexts
Code dependency graph	Incident analysis and technical troubleshooting	Symbol names may collide across repositories
Supplier and contract graph	Procurement risk and obligation lookup	Legal entity names and contract versions may misalign

The ROI story is not “graphs make LLMs smarter.” That is far too blunt.

The ROI story is:

Avoid full-corpus extraction where it is economically irrational.
Spend graph reasoning only on queries where ordinary retrieval is likely insufficient.
Improve recall for multi-hop questions.
Keep evidence filtering and alignment validation as core controls.
Track faithfulness as a production metric, not a research afterthought.

This turns GraphRAG from a grand infrastructure rebuild into a targeted retrieval enhancement. Less glamorous, more deployable. A terrible fate for a buzzword, but a useful one for operators.

The alignment layer deserves its own product requirements

The paper’s failure cases reveal where enterprise teams should focus their design reviews. The fragile part is not the LLM answer prompt. It is the alignment layer between text evidence and graph facts.

In a production system, that layer needs explicit requirements:

Requirement	Why it matters	Example control
Entity disambiguation	Similar words can refer to different entities	Require entity IDs, aliases, and domain constraints
Relation compatibility	Similar subjects with wrong predicates produce misleading paths	Score predicate-level match, not just text overlap
Evidence round-trip	Graph facts must map back to supporting passages	Reject graph expansions that cannot retrieve credible text evidence
Domain gating	External KGs contain irrelevant but textually similar facts	Restrict graph search by domain, source, or entity type
Faithfulness monitoring	Correct answers can hide weak evidence	Report evidence support separately from answer accuracy
Human-auditable traces	Operators need to inspect why the system retrieved a path	Store query, proximal triples, linked triples, retrieved passages, and filters
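The last row of the table, together with the evidence round-trip row, can be sketched as a record that is written once per query. The `AlignmentTrace` class, its field names, and the example values are hypothetical, meant only to show what "store the intermediate steps" could look like.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AlignmentTrace:
    query: str
    proximal_triples: list = field(default_factory=list)
    linked_triples: list = field(default_factory=list)
    retrieved_passage_ids: list = field(default_factory=list)
    filters_applied: list = field(default_factory=list)

    def unsupported_links(self, supported):
        """Linked triples that failed the evidence round-trip check."""
        return [t for t in self.linked_triples if tuple(t) not in supported]

    def to_json(self):
        """Serialise the trace so an operator can inspect it later."""
        return json.dumps(asdict(self), sort_keys=True)


trace = AlignmentTrace(
    query="Why must the high limit switch be reset after refilling the spa?",
    proximal_triples=[("high limit switch", "trips after", "refill")],
    linked_triples=[
        ("limit switch", "part of", "heater"),
        ("hot tub", "associated with", "dermatitis"),
    ],
    retrieved_passage_ids=["passage-001"],
    filters_applied=["relevance-filter"],
)
# Triples for which supporting text was found on the way back (toy data).
supported = {("limit switch", "part of", "heater")}
print(trace.unsupported_links(supported))
print(trace.to_json())
```

A rejected link is then a logged fact rather than a silent omission, which is what makes the drift in the paper's case study inspectable after the event.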

Sparse retrieval is a reasonable baseline. The authors say as much. It is simple, cheap, and surprisingly effective. But the case study shows the ceiling. Sparse similarity can connect by surface overlap while missing the pragmatic meaning of the question.
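Two of the controls from the table, predicate-level scoring and domain gating, can be bolted onto a surface-overlap scorer without much ceremony. The sketch below is illustrative only. The weights, type labels, and candidate triples are hypothetical, and in Wikidata the type would plausibly come from an "instance of" statement.

```python
import re


def tokens(text):
    """Lowercase word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(a, b):
    """Token-set overlap between two strings."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def gated_link_score(proximal, candidate, candidate_type, allowed_types,
                     predicate_weight=0.5):
    """Score a KG candidate field by field and reject off-domain entity types."""
    if candidate_type not in allowed_types:
        return 0.0  # domain gate: textually similar but outside the query's domain
    subject_score = overlap(proximal[0], candidate[0])
    predicate_score = overlap(proximal[1], candidate[1])
    object_score = overlap(proximal[2], candidate[2])
    entity_score = (subject_score + object_score) / 2
    return (1 - predicate_weight) * entity_score + predicate_weight * predicate_score


proximal = ("high limit switch", "trips after", "spa refill")
allowed = {"device component"}  # hypothetical type whitelist for this query

off_domain = gated_link_score(
    proximal, ("hot tub", "associated with", "dermatitis"), "disease", allowed)
in_domain = gated_link_score(
    proximal, ("limit switch", "part of", "heater"), "device component", allowed)
print(off_domain, in_domain)
```

The gate removes the dermatitis candidate outright. The in-domain candidate still scores low because its predicate does not match, which is the intended behaviour: weak links should look weak.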

This is why the paper’s conclusion calls for improved asymmetric semantic models that can operate in a shared semantic space for graph data and text. The word “asymmetric” is doing real work here. A text passage and a KG triple are not the same object. Matching them requires more than embedding both and hoping they become friends.

A passage is contextual, verbose, and often messy. A triple is compact, formal, and stripped of surrounding meaning. A good alignment model needs to understand that asymmetry rather than pretending it is just another nearest-neighbour search problem.

Where this approach applies—and where it should wait

This system is best understood as a research-grade adaptation for scaling GraphRAG under practical constraints. It is not a universal answer engine.

It applies most naturally when:

the corpus is too large for full offline triple extraction;
the organisation has or can access a useful graph-like knowledge source;
queries often require multi-hop reasoning;
latency budgets allow iterative retrieval;
answer auditability matters, but the system is not yet the final authority;
alignment can be monitored, sampled, and improved over time.

It is weaker when:

most queries are simple lookups;
the external graph is poorly aligned with the document domain;
entity names are ambiguous or overloaded;
the cost of a misleading evidence trail is high;
production teams cannot inspect intermediate triples and passage mappings;
faithfulness must be high from day one.

For Cognaptus-style automation work, the most relevant use case is not replacing a standard RAG pipeline wholesale. It is adding a graph-enhanced retrieval path for cases where baseline retrieval repeatedly fails: cross-document comparisons, entity relationship questions, process dependency checks, and investigations where the answer requires connecting facts rather than retrieving a single paragraph.

That means the right architecture is probably conditional:

1. Try hybrid retrieval.
2. Estimate whether the query is single-hop or multi-hop.
3. If multi-hop, extract proximal triples from the top passages.
4. Link to a controlled graph.
5. Expand cautiously.
6. Retrieve supporting passages again.
7. Filter aggressively.
8. Answer only with traceable evidence.
9. Log alignment failures as training data.
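The steps above can be sketched as a single function. Everything here is an assumption for illustration: the keyword-based `looks_multi_hop` router is deliberately crude, and the stage names in `steps` are hypothetical callables the caller would supply.

```python
def looks_multi_hop(query):
    """Crude, hypothetical heuristic: comparison or chained-relation cues."""
    cues = ("both", "compare", "share", "same", "which of", "related to")
    q = query.lower()
    return any(cue in q for cue in cues)


def answer(query, steps):
    """Run the conditional flow; `steps` maps stage names to caller-supplied callables."""
    passages = steps["hybrid_retrieve"](query)
    trace = {"query": query, "route": "single-hop", "failures": []}
    if looks_multi_hop(query):
        trace["route"] = "multi-hop"
        proximal = steps["extract_triples"](query, passages)
        linked = steps["link"](proximal)
        expanded = steps["expand"](linked)
        for triple in expanded:
            support = steps["retrieve_support"](triple)
            if support:
                passages = passages + support
            else:
                trace["failures"].append(triple)  # keep as alignment training data
    passages = steps["filter"](query, passages)
    if not passages:
        return None, trace  # no traceable evidence: decline to answer
    return steps["generate"](query, passages), trace


# Stub stages, purely for illustration.
steps = {
    "hybrid_retrieve": lambda q: ["p1"],
    "extract_triples": lambda q, ps: [("a", "r", "b")],
    "link": lambda ts: ts,
    "expand": lambda ts: ts + [("b", "r", "c")],
    "retrieve_support": lambda t: ["p2"] if t[2] == "b" else [],
    "filter": lambda q, ps: ps,
    "generate": lambda q, ps: f"answer from {len(ps)} passages",
}
multi_result, multi_trace = answer("Do X and Y share a trait?", steps)
single_result, single_trace = answer("What is X?", steps)
print(multi_result, multi_trace["failures"], single_trace["route"])
```

The single-hop question never reaches the graph stages, and the expanded triple that found no supporting text ends up in the failure log instead of the answer. A production router would need something better than keyword cues, but the shape of the control flow is the point.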

In other words, use the graph as a specialist, not as a lifestyle choice.

The paper’s quiet lesson: GraphRAG needs governance before grandeur

The paper is modest in the right way. It does not claim to solve web-scale GraphRAG. It explores how a state-of-the-art GraphRAG method can be adapted to millions of passages, reports promising preliminary performance, and then shows the mechanism breaking in a way that matters.

That is more useful than a clean victory lap.

The important shift is conceptual. Scaling GraphRAG is not just a matter of bigger indices, larger models, or more ambitious knowledge graphs. It is a matter of preserving trustworthy alignment between the question, the retrieved text, the extracted triples, the external graph, the expanded reasoning path, and the final answer.

Break that chain and the system may still sound intelligent. It may even be correct often enough to pass a casual demo. But the evidence trail becomes unstable, and unstable evidence is exactly where enterprise AI projects go to acquire risk committees.

The best reading of this paper is therefore practical: GraphRAG can be made cheaper by avoiding full offline extraction, but the price of that efficiency is a new alignment problem. Teams that understand the trade-off can build useful systems. Teams that only hear “web-scale GraphRAG” will build something with a very impressive architecture diagram and a troubling relationship with reality.

The graph is not the answer. The alignment is.

Cognaptus: Automate the Present, Incubate the Future.

¹ Zhili Shen, Chenxin Diao, Pascual Merita, Pavlos Vougiouklis, and Jeff Z. Pan, “Millions of GeAR-s: Extending GraphRAG to Millions of Documents,” arXiv:2507.17399, 2025.
