Paper: Open-Source Agentic Hybrid RAG Framework for Scientific Literature Review (Nagori et al., 2025)
One‑line: The authors wrap a hybrid RAG pipeline (Neo4j GraphRAG + FAISS VectorRAG) inside an agent (Llama‑3.3‑70B) that decides per query which retriever to use, then instruction‑tunes generation (Mistral‑7B) and quantifies uncertainty via bootstrapped evaluation. It’s open‑source and genuinely useful.
Why this paper matters (beyond research circles)
- Business pain: Knowledge workers drown in PDFs. Static “semantic search + summarize” tools miss citation structure and provenance; worse, they hallucinate under pressure.
- What’s new: Dynamic query routing between graph queries (Cypher over Neo4j) and semantic + keyword retrieval (FAISS + BM25 + rerank). Then DPO nudges the generator to prefer grounded answers.
- So what: For regulated sectors (healthcare, finance, legal), this is a pattern you can implement today for auditable reviews with traceable sources and tunable confidence bands.
The blueprint (concrete, reproducible)
Ingestion: Pull bibliometrics (DOI, title, abstract, year, authors, PDF URL, source) from PubMed, arXiv, Google Scholar. Deduplicate and filter by cosine similarity of TF‑IDF keywords (keep top‑quartile relevance).
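A minimal sketch of that dedupe-and-filter step (our reconstruction, not the paper's code; it assumes scikit-learn's TfidfVectorizer, record field names, and a 0.95 duplicate threshold):

```python
# Sketch: TF-IDF dedup + top-quartile relevance filter (field names and thresholds are assumptions).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def dedupe_and_filter(records, research_topic, dup_threshold=0.95):
    """records: list of dicts with 'title' and 'abstract'; research_topic: query string."""
    texts = [f"{r['title']} {r['abstract']}" for r in records]
    tfidf = TfidfVectorizer(stop_words="english")
    X = tfidf.fit_transform(texts + [research_topic])
    docs, topic = X[:-1], X[-1]

    # 1) Drop near-duplicates: keep the first copy of any pair above the similarity threshold.
    sims = cosine_similarity(docs)
    keep, seen = [], set()
    for i in range(len(records)):
        if i in seen:
            continue
        keep.append(i)
        seen.update(d for d in np.where(sims[i] > dup_threshold)[0] if d > i)

    # 2) Keep only the top quartile by cosine relevance to the research topic.
    rel = cosine_similarity(docs[keep], topic).ravel()
    cutoff = np.quantile(rel, 0.75)
    return [records[i] for i, score in zip(keep, rel) if score >= cutoff]
```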
Two stores, two superpowers:
- Neo4j Knowledge Graph (KG): Nodes = publication, author, year, database, keyword; Edges = authored_by, published_in, has_keyword, cites. Title/DOI/abstract live as attributes on the publication node.
- FAISS Vector Store (VS): Full‑text chunks (≈ 2,024 chars with 50‑char overlap) embedded with all‑MiniLM‑L6‑v2. Retrieval = BM25 (sparse) + semantic search, merged and then reranked with Cohere `rerank-english-v3.0`.
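A minimal sketch of that VS path (rank_bm25 + sentence-transformers + FAISS + Cohere rerank; the merge strategy, k values, and the `full_texts` placeholder are our guesses, not the authors' implementation):

```python
# Sketch: chunk -> embed (MiniLM) -> FAISS dense + BM25 sparse -> merge -> Cohere rerank.
import numpy as np
import faiss
import cohere
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

full_texts = ["..."]  # full-text strings from ingestion (assumed upstream)

def chunk_text(text, size=2024, overlap=50):
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

chunks = [c for doc in full_texts for c in chunk_text(doc)]
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb = np.asarray(encoder.encode(chunks, normalize_embeddings=True), dtype="float32")
index = faiss.IndexFlatIP(emb.shape[1])      # inner product == cosine on normalized vectors
index.add(emb)
bm25 = BM25Okapi([c.lower().split() for c in chunks])
co = cohere.Client()                          # assumes the API key is configured via environment

def hybrid_retrieve(query, k_dense=20, k_sparse=20, top_n=5):
    q = encoder.encode([query], normalize_embeddings=True).astype("float32")
    _, dense_ids = index.search(q, k_dense)
    sparse_ids = np.argsort(bm25.get_scores(query.lower().split()))[::-1][:k_sparse]
    # Merge the two candidate lists (order-preserving dedup), then let the reranker sort them out.
    candidates = list(dict.fromkeys(dense_ids[0].tolist() + sparse_ids.tolist()))
    reranked = co.rerank(model="rerank-english-v3.0", query=query,
                         documents=[chunks[i] for i in candidates], top_n=top_n)
    return [chunks[candidates[r.index]] for r in reranked.results]
```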
Agent & Generation:
- Router: Llama‑3.3‑70B‑versatile chooses GraphRAG (NL → Cypher with few‑shot schema exemplars) or VectorRAG (sparse+dense+rerank). Ten few‑shot exemplars calibrate tool choice and formatting.
- Writer: Mistral‑7B‑Instruct‑v0.3, instruction‑tuned and further DPO‑aligned on 15 preference pairs to bias toward context‑grounded statements.
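For the GraphRAG branch, the NL→Cypher step boils down to a schema-grounded few-shot prompt plus execution against Neo4j. A sketch, assuming the official neo4j Python driver and a generic `llm_complete()` wrapper standing in for Llama‑3.3; the schema string and exemplars below are illustrative, not the paper's:

```python
# Sketch: few-shot NL -> Cypher translation, then execution against Neo4j (labels are assumptions).
from neo4j import GraphDatabase

SCHEMA = ("(:Publication {title, doi, abstract})-[:authored_by]->(:Author {name}), "
          "(:Publication)-[:published_in]->(:Year {value}), "
          "(:Publication)-[:has_keyword]->(:Keyword {term}), "
          "(:Publication)-[:cites]->(:Publication)")

FEW_SHOT = """Q: Which papers did Jane Doe author in 2024?
Cypher: MATCH (p:Publication)-[:authored_by]->(:Author {name: 'Jane Doe'}),
              (p)-[:published_in]->(:Year {value: 2024}) RETURN p.title

Q: Does paper A cite paper B?
Cypher: MATCH (a:Publication {title: 'A'})-[:cites]->(b:Publication {title: 'B'}) RETURN count(*) > 0
"""

def graph_rag(question, driver, llm_complete):
    prompt = f"Schema: {SCHEMA}\n{FEW_SHOT}\nQ: {question}\nCypher:"
    cypher = llm_complete(prompt).strip()   # your Llama-3.3 wrapper (assumed); validate before running
    with driver.session() as session:
        return [record.data() for record in session.run(cypher)]

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
```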
Uncertainty: 12‑fold bootstrap over mixed KG/VS questions reports mean ± error bars for Faithfulness, Answer Relevance, Context Precision, Context Recall.
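If you already collect per-question scores (from RAGAS or your own judges), the bootstrap itself is a few lines. A sketch, where `faithfulness_scores` is an example array standing in for your per-question metric values:

```python
# Sketch: bootstrap mean +/- error bar for one metric (e.g., faithfulness) over a question set.
import numpy as np

def bootstrap_metric(scores, n_resamples=12, seed=0):
    """scores: per-question metric values in [0, 1]; resample with replacement n_resamples times."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_resamples)]
    return float(np.mean(means)), float(np.std(means))   # report as mean +/- spread across resamples

faithfulness_scores = [0.82, 0.91, 0.74, 0.88]            # example per-question scores
mean, err = bootstrap_metric(faithfulness_scores)
print(f"Faithfulness: {mean:.2f} ± {err:.2f}")
```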
When to go Graph vs. Vector (a practical playbook)
Query pattern | Use GraphRAG (Cypher) | Use VectorRAG (semantic) |
---|---|---|
Who/where/when with metadata constraints (authors, years, venues, co‑authorship, citation paths) | ✅ Directly answer via KG nodes/edges; multi‑hop joins are cheap | ❌ Often noisy; metadata buried in prose |
Conceptual / mechanism / methods inside full text | ❌ KG may not encode nuance | ✅ Chunks capture explanation detail |
Fact check like “Does Paper A cite Paper B?” | ✅ `MATCH (a)-[:cites]->(b)` | ⚠️ May miss if citation not in extracted chunk |
“What used method X and achieved outcome Y?” | ✅ If method/outcome modeled as keywords or nodes | ✅ If details live in methods sections |
Open‑ended synthesis (“themes across 2024–2025 MLLMs in healthcare”) | ⚠️ Good for scoping cohorts first | ✅ Better for passages + synthesis |
Routing heuristic (good enough to start):
- If the prompt mentions authors/years/venues/citations/keywords → GraphRAG.
- If it asks why/how/comparison/limitations → VectorRAG.
- For mixed queries, GraphRAG to scope, VectorRAG to explain; then merge.
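That heuristic fits in a dozen lines. A keyword-router sketch (the trigger lists are ours and should be tuned on your own query logs):

```python
# Sketch: keyword router approximating the heuristic above (trigger words are assumptions).
import re

GRAPH_TRIGGERS  = r"\b(author|authors|cite|cites|cited|citation|venue|journal|year|keyword|co-?author)\b"
VECTOR_TRIGGERS = r"\b(why|how|compare|comparison|limitation|limitations|mechanism|method|methods)\b"

def route(query: str) -> str:
    q = query.lower()
    graph = bool(re.search(GRAPH_TRIGGERS, q))
    vector = bool(re.search(VECTOR_TRIGGERS, q))
    if graph and vector:
        return "hybrid"        # scope with GraphRAG, explain with VectorRAG, then merge
    if graph:
        return "graph_rag"
    return "vector_rag"        # default: semantic retrieval is the safer fallback

print(route("Which 2024 papers cite Smith et al.?"))                  # -> graph_rag
print(route("Why do the surveyed models fail on rare diseases?"))     # -> vector_rag
```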
Results that matter (read as impact, not scoreboard)
Metric (↑ better) | Agentic + DPO vs Baseline | Takeaway for teams |
---|---|---|
VS Context Recall | +0.63 | Pulls in far more of the actually relevant passages; fewer “I didn’t see that in context” moments. |
Overall Context Precision | +0.56 | Less junk in the prompt window → cheaper, faster, crisper answers. |
VS Faithfulness | +0.24 | Fewer hallucinations; better quote‑to‑claim alignment. |
VS Precision / KG Answer Relevance | +0.12 each | More on‑target chunks and better fit to the question. |
Overall Faithfulness | +0.11 | Safer outputs for compliance reviews. |
KG Context Recall | +0.05 | Slight lift; suggests room to improve Cypher generation. |
A caveat worth noting: on KG‑specific questions, DPO showed small drops in Precision and Faithfulness versus the untuned agent, probably a side effect of the tiny preference set (15 pairs) and instruction bias. Fine‑tuning the Cypher translator would likely flip this.
What we’d keep, change, and challenge
Keep
- The two‑store architecture: relations in KG, semantics in vectors.
- Rerank after merge: reranking the merged sparse + dense candidates before they hit the LLM is a cheap, winning guardrail.
- Bootstrap uncertainty: treat eval like product telemetry, not a one‑off benchmark.
Change next
- Fine‑tune NL→Cypher on (query, Cypher) pairs; keep few‑shot as fallback.
- Add OCR for scanned PDFs; you’ll widen coverage and reduce VS blind spots.
- Expand DPO pairs and separate heads: one head for KG‑grounded answers, one for VS.
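On the DPO pairs: the data shape is simple. An illustrative preference record, using the common prompt/chosen/rejected convention of open-source DPO trainers rather than the paper's exact schema:

```python
# Sketch: one DPO preference pair biasing toward context-grounded answers (content is illustrative).
preference_pair = {
    "prompt": (
        "Context: [retrieved chunks about transformer-based sepsis prediction]\n"
        "Question: Which architectures did the 2024 cohort studies use?"
    ),
    "chosen": "Per the retrieved abstracts, both cohort studies used transformer encoders "
              "fine-tuned on EHR time series [source: chunk 3, chunk 7].",
    "rejected": "Most sepsis models use LSTMs and gradient boosting.",  # plausible but ungrounded
}
```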
Challenge
- Synthetic benchmark is necessary, but not sufficient. Add domain suites (e.g., trial registries, claims data abstracts, 10‑K risk factors). Include table/figure‑aware questions and multi‑hop chains mixing KG + VS.
Implementation notes we wish someone told us sooner
- Chunking: 2,024‑char chunks with 50‑char overlap strike a workable balance; monitor near‑duplicate chunks after rerank.
- Keyword filters: The TF‑IDF top‑quartile cosine filter keeps noise down early; still, log rejects for audit.
- Latency: a consumer CPU takes roughly 2 minutes per complex query; a GPU server brings that down to roughly 10 seconds. Budget UX accordingly.
- Observability: Log router decision, retriever hits, rerank scores, and grounding citations; expose them to users.
- Guardrails: Enforce “no cite, no claim” in prompts; penalize unsupported statements via DPO or a post‑hoc checker.
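A minimal post-hoc “no cite, no claim” checker, assuming answers cite sources as bracketed IDs and you re-embed claims and cited chunks with the same MiniLM model (the similarity threshold is a guess to calibrate on your own data):

```python
# Sketch: flag answer sentences that lack a citation or don't match their cited chunk.
import re
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def unsupported_claims(answer: str, sources: dict, min_sim: float = 0.45):
    """sources: {source_id: chunk_text}; the answer is expected to cite sources as [source_id]."""
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        cited = [c for c in re.findall(r"\[([^\]]+)\]", sentence) if c in sources]
        if not cited:
            flagged.append((sentence, "no citation"))
            continue
        claim_emb = encoder.encode(sentence, convert_to_tensor=True)
        best = max(util.cos_sim(claim_emb,
                                encoder.encode(sources[c], convert_to_tensor=True)).item()
                   for c in cited)
        if best < min_sim:
            flagged.append((sentence, f"weak support (cos={best:.2f})"))
    return flagged   # route these to a human reviewer or back to the writer for revision
```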
A 2‑week rollout plan (Cognaptus‑ready)
- Data taps: connectors to PubMed/arXiv/Scholar; nightly refresh; dedupe + TF‑IDF quartile cut.
- Graph: instantiate Neo4j schema (pub/author/year/keyword/database; cites/has_keyword/authored_by/published_in); a schema sketch follows this list.
- Vectors: MiniLM‑L6‑v2 embeddings into FAISS; add BM25 index; unify retriever interface.
- Router: Llama‑3.3‑70B with ~10 curated few‑shot routes; fall back to “scope with KG, explain with VS”.
- Generator: start with Mistral‑7B‑Instruct; layer in DPO on 50–100 high‑quality pairs from real stakeholder Q&A.
- Eval harness: bring the paper’s bootstrap script; track F/AR/CP/CR per domain; ship a weekly dashboard.
- Pilot: pick one vertical (e.g., clinical AI or fintech reg‑tech); define 25 canonical questions; iterate on miss cases.
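For the Graph step above, a sketch of the schema instantiation via the neo4j Python driver; the labels, properties, and the Database relationship name are assumptions layered on the node/edge list from the paper:

```python
# Sketch: bootstrap the KG schema and ingest one publication (labels/properties are assumptions).
from neo4j import GraphDatabase

SETUP = [
    "CREATE CONSTRAINT pub_doi IF NOT EXISTS FOR (p:Publication) REQUIRE p.doi IS UNIQUE",
    "CREATE CONSTRAINT author_name IF NOT EXISTS FOR (a:Author) REQUIRE a.name IS UNIQUE",
]

INGEST = """
MERGE (p:Publication {doi: $doi})
  SET p.title = $title, p.abstract = $abstract
MERGE (y:Year {value: $year})       MERGE (p)-[:published_in]->(y)
MERGE (d:Database {name: $source})  MERGE (p)-[:from_database]->(d)
FOREACH (name IN $authors |  MERGE (a:Author {name: name}) MERGE (p)-[:authored_by]->(a))
FOREACH (kw   IN $keywords | MERGE (k:Keyword {term: kw})  MERGE (p)-[:has_keyword]->(k))
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for stmt in SETUP:
        session.run(stmt)
    session.run(INGEST, doi="10.1234/example", title="...", abstract="...",
                year=2025, source="arXiv", authors=["Jane Doe"], keywords=["agentic RAG"])
```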
Build vs. buy (a frank, CFO‑friendly take)
- Buy a static RAG tool if: your questions are shallow, audit needs are light, and you mostly want summaries.
- Build this agentic hybrid if: you need traceability, policy‑driven routing, and measurable reliability. TCO is higher upfront, but search‑to‑insight cycle time and risk of wrong answers drop materially.
Integration fit for Cognaptus: Plug the router into our Insights content engine; preserve KG/VS provenance in published posts; expose a “Show my sources” toggle and confidence band. For clients, ship a managed Neo4j+FAISS stack with domain packs (e.g., healthcare, fintech, supply chain).
Quickstart (minimal pseudo‑workflow)
```
User Query → Router (Llama‑3.3) → {GraphRAG | VectorRAG}
  GraphRAG:  NLQ → Cypher (few‑shot) → Neo4j → rows → Writer (Mistral‑7B)
  VectorRAG: BM25 + MiniLM‑L6‑v2 → Merge → Cohere Rerank → top‑k → Writer
  Post‑hoc:  Faithfulness check + cite spans → Bootstrap logging
```
Bottom line
Agentic routing beats one‑size‑fits‑all RAG. This paper’s open stack (Neo4j + FAISS + Llama‑3.3 + Mistral‑7B + DPO) proves you can get higher recall, cleaner context, and fewer hallucinations—with uncertainty you can show the auditor. If your team publishes, audits, or decides off PDFs, this is the next architecture to pilot.
Code: github.com/Kamaleswaran-Lab/Agentic-Hybrid-Rag
Cognaptus: Automate the Present, Incubate the Future