When it comes to retrieval-augmented generation (RAG), size matters—but not in the way you might think.
Most high-performing GraphRAG systems extract structured triples (subject, predicate, object) from texts using large language models (LLMs), then link them to form reasoning chains. But this method doesn’t scale: if your corpus contains millions of documents, pre-processing every one with an LLM becomes prohibitively expensive.
That’s the bottleneck the authors of “Millions of GeAR-s” set out to solve. And their solution is elegant: skip the LLM-heavy preprocessing entirely, and use existing knowledge graphs (like Wikidata) as a reasoning scaffold.
The Core Idea: Proxy Graph Reasoning
Instead of extracting triples from every document, their system (a modified version of GeAR) performs the following steps:
- Initial Retrieval: Combine BM25 and dense retrieval via Reciprocal Rank Fusion (RRF) to get the top passages relevant to the query (a sketch of the first four steps follows this list).
- Triple Extraction On-the-fly: Use an LLM (e.g., Falcon3B-Instruct) to extract triples from only these top passages.
- Wikidata Alignment: Match each extracted triple to a similar triple in Wikidata using sparse vector search. These matched triples become the graph backbone.
- Graph Expansion: Perform beam search over the aligned Wikidata triples to build multi-hop reasoning chains.
- Re-Retrieve: Use these reasoning chains to retrieve more distant but relevant passages, merging them back into the document pool.
- Iterate: If the query still can’t be answered, rewrite it with the LLM and repeat.
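To make the flow concrete, here is a minimal Python sketch of the first four steps. It is illustrative, not the paper’s implementation: `llm` is any text-in/text-out callable, while `sparse_index.search`, `neighbors`, and `score` are hypothetical stand-ins for a sparse triple index, a Wikidata adjacency lookup, and a chain scorer. The constant k=60 is the conventional RRF default, not a value reported in the paper.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60, top_n=10):
    """Reciprocal Rank Fusion: score(d) = sum over rankers of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:                       # one ranked list per retriever
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

EXTRACTION_PROMPT = (
    "Extract (subject, predicate, object) triples from the passage below.\n"
    "Return one triple per line, tab-separated.\n\nPassage:\n{passage}"
)

def extract_triples(llm, passages):
    """On-the-fly extraction: the LLM only ever sees the retrieved passages."""
    triples = []
    for passage in passages:
        for line in llm(EXTRACTION_PROMPT.format(passage=passage)).splitlines():
            parts = [p.strip() for p in line.split("\t")]
            if len(parts) == 3:
                triples.append(tuple(parts))
    return triples

def align_to_wikidata(sparse_index, triples, top_k=1):
    """Swap each extracted triple for its nearest Wikidata triple under a
    sparse (lexical) search; the matches become the graph backbone."""
    aligned = []
    for triple in triples:
        aligned.extend(sparse_index.search(" ".join(triple), top_k=top_k))
    return aligned

def beam_search_chains(seeds, neighbors, score, beam_width=4, hops=2):
    """Grow multi-hop chains over Wikidata, keeping only the top
    `beam_width` partial chains at each hop."""
    beams = [[t] for t in seeds]
    for _ in range(hops):
        candidates = [
            chain + [nxt]
            for chain in beams
            for nxt in neighbors(chain[-1][2])     # fan out from the tail entity
        ]
        if not candidates:
            break
        beams = sorted(candidates, key=score, reverse=True)[:beam_width]
    return beams
```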
All without running an LLM across the entire corpus. That’s the magic.
Why This Matters: RAG at Web Scale
Typical GraphRAG systems top out at a few hundred thousand passages. Here, the authors demonstrate scaling to millions of documents using a clever online alignment mechanism.
This isn’t just an engineering hack. It represents a shift in how we think about knowledge grounding. By using Wikidata as a semantic anchor, the system avoids expensive operations and taps into existing structured knowledge.
A nice touch is the agentic loop in GeAR: at each turn, the system determines whether it has enough evidence to answer the question, and if not, it decomposes and reformulates the query. It’s not just retrieval—it’s a controlled reasoning process.
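A minimal sketch of what that loop could look like, assuming three hypothetical LLM-backed helpers (`has_enough_evidence`, `answer`, `rewrite_query`) and a `retrieve` function wrapping the pipeline above; the paper’s actual control flow may differ:

```python
def agentic_answer(query, retrieve, llm, max_turns=3):
    """Retrieve, check evidence sufficiency, reformulate, repeat."""
    evidence = []
    for _ in range(max_turns):
        evidence.extend(retrieve(query))
        if llm.has_enough_evidence(query, evidence):
            return llm.answer(query, evidence)
        # Not enough evidence yet: decompose / reformulate and try again.
        query = llm.rewrite_query(query, evidence)
    return llm.answer(query, evidence)  # best effort once the turn budget ends
```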
The Cracks in the Armor
The tradeoff, of course, is alignment quality.
As Table 2 in the paper shows, semantic drift happens. For example:
- The system tries to answer a question about geoduck reproduction using Wikidata entries about oysters.
- A hot tub question gets linked to obscure heat-related journal papers.
This misalignment stems from the looseness of the sparse retrieval-based triple matching. There’s no guarantee that the linked triple shares the exact subject context. That’s a hard problem, and it limits the faithfulness of the final answer.
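A toy example shows why. Both “Wikidata” triples below are invented for illustration, and raw token overlap is a crude stand-in for sparse retrieval, but the failure mode is the same one the paper reports:

```python
# Toy illustration (not the paper's code) of how lexical triple matching
# drifts: the triple about the right topic but the wrong subject wins.

def sparse_overlap(query, triple):
    """Score a verbalized triple by raw token overlap with the query."""
    return len(set(query.lower().split()) & set(" ".join(triple).lower().split()))

query = "geoduck reproduction cycle"
candidates = [
    ("oyster", "reproduction", "life cycle stage"),   # hypothetical entries
    ("geoduck", "instance of", "species of clam"),
]
print(sorted(candidates, key=lambda t: sparse_overlap(query, t), reverse=True)[0])
# -> ('oyster', 'reproduction', 'life cycle stage'): predicate tokens outvote
#    the subject, so the wrong organism anchors the reasoning chain.
```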
The authors point to the fix future work should pursue: asymmetric semantic models that can represent both text passages and graph triples in a shared reasoning space. Until then, they accept some inaccuracy as the price of scalability.
Business Relevance
For any organization considering RAG at scale—say, enterprise search, legal document Q&A, or customer support across massive archives—this paper offers a critical insight:
Graph-enhanced reasoning doesn’t require graph-extracted corpora.
By piggybacking on existing graphs and using smart online alignment, you can get most of the reasoning benefits without paying all of the LLM costs.
That’s a serious value proposition.
Cognaptus: Automate the Present, Incubate the Future.