TL;DR — L‑MARS replaces single‑pass RAG with a judge‑in‑the‑loop multi‑agent workflow that iteratively searches, checks sufficiency (jurisdiction, date, authority), and only then answers. On a 200‑question LegalSearchQA benchmark of current‑year questions, it reports major gains vs. pure LLMs, at the cost of latency. For regulated industries, the architecture—not just the model—does the heavy lifting.
What’s actually new here
Most legal QA failures aren’t from weak language skills—they’re from missing or outdated authority. L‑MARS tackles this with three design commitments:
- Agentic, targeted search across heterogeneous sources (web via Serper, local RAG, CourtListener case law) rather than one‑shot retrieval.
- A Judge Agent with a stop rule that refuses to answer until evidence is sufficient (authority, jurisdiction, temporal validity, contradiction scan).
- Dual operating modes: a low‑latency Simple Mode and a more accurate Multi‑Turn Mode with query decomposition and sufficiency loops.
This is less a model trick and more an operational blueprint for high‑stakes QA.
The workflow at a glance
Role/Step | What it does | Failure it prevents |
---|---|---|
Query Agent | Parses the user prompt into structure (issue type, entities, time window, jurisdiction); generates search intents. | Ambiguous scope, wrong jurisdiction. |
Search Agent | Runs agentic search: quick snippet triage → deep content extraction (top‑k), plus local BM25 and CourtListener case retrieval. | Shallow snippets, missing primary law. |
Judge Agent | Deterministic (T=0) checklist: authority quality (gov/court/edu), date recency, jurisdiction match, contradiction scan; decides SUFFICIENT/INSUFFICIENT and proposes refinements. | Answering on flimsy, outdated, or off‑jurisdiction sources. |
Summary Agent | Synthesizes the final answer after sufficiency is met; cites evidence. | Hallucinated conclusions, weak grounding. |
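
To make the control flow concrete, here's a minimal sketch of a judge‑gated loop in the spirit of the table above. The `run_search`, `judge_sufficiency`, and `summarize` callables are placeholders for your own search lanes and LLM calls, not L‑MARS's actual interfaces:

```python
def answer_with_judge_loop(question, run_search, judge_sufficiency, summarize, max_turns=4):
    """Iterate search -> judge -> refine until the judge declares the evidence sufficient.

    run_search, judge_sufficiency, and summarize are stand-ins for your own
    search lanes and LLM calls; the verdict dict mirrors the checklist fields.
    """
    evidence, queries = [], [question]
    for turn in range(max_turns):
        for q in queries:
            evidence.extend(run_search(q))                 # web / local RAG / case-law lanes
        verdict = judge_sufficiency(question, evidence)    # deterministic, T=0 checklist
        if verdict["sufficient"]:
            return summarize(question, evidence)           # answer only once sufficiency is met
        queries = verdict.get("refined_queries") or [question]  # otherwise refine and loop
    # Iteration cap hit: answer with what we have, but flag the gap instead of guessing silently.
    return summarize(question, evidence) + "\n\n[Caveat: evidence may be incomplete.]"
```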
Operating modes
Dimension | Simple Mode | Multi‑Turn Mode |
---|---|---|
Turn structure | Single pass | Iterative search → judge → refine loops |
Latency | Low | Higher |
Recall & grounding | Good | Best (keeps searching until sufficient) |
When to use | Triage, FAQs | High‑stakes questions, changing regs |
Evidence sources that matter
- Web (Serper): fast exploration via titles/snippets, then enhanced extraction of the best pages (snippet‑anchored windows to keep citations tight).
- Local RAG (BM25): rolling index of your private policy memos, playbooks, or annotated statutes.
- CourtListener API: primary case law for authoritative grounding when blogs and secondary summaries aren’t enough.
Design choice I love: the snippet‑anchored extraction window that pulls just enough surrounding context to cite precisely, without bloating tokens.
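
Replicating it is straightforward: locate the snippet inside the fetched page and keep only a bounded character window around it. A minimal sketch, assuming a fixed character radius and a fuzzy fallback that the paper may implement differently:

```python
import difflib

def snippet_anchored_window(page_text: str, snippet: str, radius: int = 1200) -> str:
    """Return a window of page_text centered on the search snippet.

    Falls back to fuzzy matching so minor snippet/page differences
    (ellipses, whitespace) don't break the anchor.
    """
    idx = page_text.find(snippet)
    if idx == -1:
        # Fuzzy fallback: anchor on the longest block the page shares with the snippet.
        matcher = difflib.SequenceMatcher(None, page_text, snippet)
        match = matcher.find_longest_match(0, len(page_text), 0, len(snippet))
        idx = match.a if match.size > 20 else 0   # no real match: fall back to the page head
    start = max(0, idx - radius)
    end = min(len(page_text), idx + len(snippet) + radius)
    return page_text[start:end]
```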
Do the numbers justify the complexity?
Reported results on LegalSearchQA (200 up‑to‑date questions):
Method | Accuracy | Uncertainty (U‑Score ↓) | LLM‑as‑Judge | Avg. Latency (s) |
---|---|---|---|---|
GPT‑4o (pure) | 0.89 | 0.55 | Moderate | 1.69 |
Claude‑4‑Sonnet (pure) | 0.88 | 0.62 | Moderate | 3.84 |
Gemini‑2.5‑Flash (pure) | 0.86 | 0.58 | Moderate | 1.87 |
L‑MARS (Simple) | 0.96 | 0.42 | High | 13.62 |
L‑MARS (Multi‑Turn) | 0.98 | 0.39 | High | 55.67 |
Interpreting U‑Score: it penalizes hedging (“may”), temporal vagueness (no dates), poor citation authority, weak jurisdiction specificity, and indecision. Lower is better, so the multi‑turn loop isn’t just more correct—it’s also more decisive with better citations.
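
The paper's exact scoring formula isn't reproduced here, but a toy penalty scorer shows the shape of the metric: each failure mode adds a penalty, and lower is better. The weights and keyword lists below are illustrative assumptions, not the benchmark's rubric:

```python
import re

HEDGES = ("may", "might", "possibly", "it depends")
AUTHORITATIVE_DOMAINS = (".gov", ".edu", "courtlistener.com")

def toy_uncertainty_score(answer: str, cited_urls: list[str], jurisdiction: str) -> float:
    """Illustrative U-Score-style heuristic: each failure mode adds a penalty; lower is better."""
    score = 0.0
    text = answer.lower()
    score += 0.1 * sum(text.count(h) for h in HEDGES)              # hedging language (crude substring count)
    if not re.search(r"\b(19|20)\d{2}\b", answer):                 # no year mentioned anywhere
        score += 0.2                                               # temporal vagueness
    if not any(dom in url for url in cited_urls for dom in AUTHORITATIVE_DOMAINS):
        score += 0.3                                               # weak citation authority
    if jurisdiction.lower() not in text:
        score += 0.2                                               # jurisdiction not named
    if "unclear" in text or "cannot determine" in text:
        score += 0.2                                               # indecision
    return score
```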
Trade‑off: Multi‑turn mode is ~10–30× slower than pure LLM responses. That’s acceptable for counsel‑facing tools and batch memo generation; for chat triage, start Simple and escalate when needed.
Why this matters beyond law
If your domain changes weekly (tax, benefits, cloud pricing, compliance, HR policy), static fine‑tunes drift and single‑pass RAG misses edge cases. A judge‑gated search loop generalizes:
- Healthcare ops: coverage criteria and prior auth rules change frequently; judge‑gated “sufficiency” avoids outdated guidance.
- FinServ compliance: product disclosure rules and sanctions lists; the judge enforces jurisdiction & date checks before answering.
- Procurement & InfoSec: policy exceptions and vendor clauses; contradiction scans catch conflicting sources.
Enterprise implementation notes (what to copy, what to adapt)
Copy verbatim
- Sufficiency checklists (authority, date, jurisdiction, contradictions) with a temperature‑0 Judge (a minimal sketch follows this list).
- Two‑tier search (fast snippets → deep extraction) to keep latency predictable.
- Snippet‑anchored windows to improve citation precision.
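
Here's the promised sketch of the sufficiency checklist as a structured, temperature‑0 judge call. `call_llm` is a placeholder for whatever chat-completion client you use, and the JSON fields are an assumed schema, not L‑MARS's exact prompt:

```python
import json

JUDGE_PROMPT = """You are a strict legal research judge. Given a QUESTION and EVIDENCE, check:
(1) authority quality (gov/court/edu preferred), (2) date recency,
(3) jurisdiction match, (4) contradictions between sources.
Reply with JSON: {"sufficient": bool, "failed_checks": [...], "refined_queries": [...]}"""

def judge_sufficiency(question: str, evidence: list[dict], call_llm) -> dict:
    """Run the deterministic sufficiency check (temperature=0 for reproducible verdicts)."""
    evidence_block = "\n\n".join(
        f"[{i}] {e['url']} ({e.get('date', 'undated')}): {e['excerpt']}"
        for i, e in enumerate(evidence)
    )
    raw = call_llm(
        system=JUDGE_PROMPT,
        user=f"QUESTION: {question}\n\nEVIDENCE:\n{evidence_block}",
        temperature=0,            # deterministic judging, per the checklist design
    )
    return json.loads(raw)        # {"sufficient": ..., "failed_checks": [...], "refined_queries": [...]}
```

Keeping the verdict structured and deterministic also makes the audit logging suggested below trivial to implement.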
Adapt to your stack
- Replace Serper with your search infra; swap BM25 for your vector store but keep a BM25 or keyword fallback for legalese and exact cites (see the hybrid‑retrieval sketch after this list).
- Add data‑residency lanes (EU/US) so the Judge can reject sources on compliance grounds.
- Log judge rationales and stopping decisions for audit trails.
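
For the keyword fallback, the simplest pattern is to run both lanes and interleave results, so exact statute numbers and case cites still surface when embeddings miss them. A sketch with a deliberately crude keyword scorer standing in for BM25 and a placeholder `vector_search`:

```python
from itertools import zip_longest

def keyword_search(query: str, docs: list[dict], k: int = 5) -> list[dict]:
    """Crude keyword lane (stand-in for BM25): rank docs by exact query-term hit counts."""
    terms = query.lower().split()
    scored = [(sum(d["text"].lower().count(t) for t in terms), d) for d in docs]
    return [d for score, d in sorted(scored, key=lambda pair: -pair[0]) if score > 0][:k]

def hybrid_retrieve(query: str, docs: list[dict], vector_search, k: int = 8) -> list[dict]:
    """Interleave vector and keyword lanes, de-duplicating by document id."""
    lanes = [vector_search(query, k=k), keyword_search(query, docs, k=k)]
    seen, merged = set(), []
    for pair in zip_longest(*lanes):                  # alternate lanes so neither dominates
        for doc in pair:
            if doc is not None and doc["id"] not in seen:
                seen.add(doc["id"])
                merged.append(doc)
    return merged[:k]
```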
Pitfalls
- Over‑conservative judges can loop too long. Use iteration caps and leniency ramps (be stricter in early loops, looser later) to stop on “good enough” (a small ramp sketch follows this list).
- Snippet drift: ensure your deep‑extraction window really surrounds the snippet anchor, not just a fuzzy match.
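
A leniency ramp can be as simple as a pass threshold that decays with each turn; the bounds below are illustrative, not tuned values:

```python
def pass_threshold(turn: int, max_turns: int = 4) -> float:
    """Strict early, lenient late: the required sufficiency score decays linearly per turn."""
    strict, lenient = 0.9, 0.6                        # illustrative bounds, tune to your risk tolerance
    step = (strict - lenient) / max(1, max_turns - 1)
    return max(lenient, strict - step * turn)

def should_stop(turn: int, sufficiency_score: float, max_turns: int = 4) -> bool:
    """Stop on 'good enough': either the ramped threshold is met or the iteration cap is hit."""
    return sufficiency_score >= pass_threshold(turn, max_turns) or turn + 1 >= max_turns
```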
Build‑vs‑Buy quick guide
If this is you… | Do this |
---|---|
You already have a legal content license & a case‑law API | Plug L‑MARS‑style Judge + agentic search into your stack; start with Simple Mode in prod, enable Multi‑Turn for escalations. |
You’re RAG‑first with vector search only | Add a keyword/BM25 lane and snippet‑anchored extraction; train your Judge to down‑rank forums for primary‑law queries. |
You operate in multiple jurisdictions | Encode jurisdiction in the Query Agent schema; blocklist irrelevant TLDs and prefer gov/edu/court domains. |
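
For the multi‑jurisdiction row above, a pre‑Judge domain filter keeps obviously off‑jurisdiction sources out of the evidence pool. The allow/block lists are illustrative placeholders; the gov/edu/court preference mirrors the authority heuristic described earlier:

```python
from urllib.parse import urlparse

PREFERRED_SUFFIXES = (".gov", ".edu")                       # authoritative / primary-law leaning
PREFERRED_HOSTS = ("courtlistener.com",)
BLOCKED_TLDS = {"us": (".co.uk", ".com.au", ".ca")}         # illustrative blocklist per jurisdiction

def filter_and_rank_sources(urls: list[str], jurisdiction: str) -> list[str]:
    """Drop off-jurisdiction TLDs, then sort preferred (gov/edu/court) domains first."""
    blocked = BLOCKED_TLDS.get(jurisdiction.lower(), ())
    kept = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        if any(host.endswith(tld) for tld in blocked):
            continue                                        # off-jurisdiction: reject outright
        preferred = host.endswith(PREFERRED_SUFFIXES) or any(h in host for h in PREFERRED_HOSTS)
        kept.append((0 if preferred else 1, url))
    return [url for _, url in sorted(kept, key=lambda pair: pair[0])]  # stable sort keeps lane order
```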
What I’m still watching
- Trajectory‑aware judges that consider prior iterations (to avoid refetching the same near‑duplicates).
- Cross‑jurisdiction scaling (EU + US + APAC) without slowing to a crawl.
- Latency budgets: dynamic switching between modes based on risk/urgency.
Bottom line
L‑MARS shows that architecture beats vibes: a judge‑gated, agentic‑search loop can systematically reduce hallucinations where it matters. If you answer questions where being “roughly right” is legally wrong, this blueprint deserves a pilot.
Cognaptus: Automate the Present, Incubate the Future